New configurations to improve reliability and
redundancy in high performance memory systems
J. Gracia-Morán1, P.J. Gil-Vicente1, D. Gil-Tomás1, L.J. Saiz-Adalid1
1 Instituto ITACA, Universitat Politècnica de València,
Camino de Vera s/n, 46022, Spain
{jgracia, pgil, dgil, ljsaiz}@itaca.upv.es
Abstract. New fault tolerance methods are needed to cope with the increasing fault rate in memory systems. Traditionally, Error Correction Codes (ECCs) have been used for this purpose, and they work well against single faults. However, the growing integration density of current deep submicron chips, together with the decreasing energy needed to provoke a Single Event Upset (SEU) in storage, has increased the occurrence of Multiple Cell Upsets (MCUs). Thus, new ECCs able to tolerate MCUs are needed. In this work, we summarize the different ECCs that we have proposed to tolerate MCUs.
1 Introduction
As memory area grows, so does its probability of suffering faults [1]. In particular, the energy needed to provoke a Single Event Upset (SEU) in storage has decreased, as can be seen in Fig. 1 (extracted from [2]). In turn, this energy reduction increases the occurrence of Multiple Cell Upsets (MCUs) [3][4].
Fig. 1. Simulated critical LET (Linear Energy Transfer) for unattenuated transient propagation and SEU threshold LET as a function of scaling for bulk and SOI CMOS technologies (extracted from [2]).
C. Fernandez-Llatas and M. Guillen (Eds.): Workshop on Innovation on Information and Communication Technologies (ITACA-WIICT 2017), ISBN 978-84-697-7327-7
Traditionally, Error Correction Codes (ECCs) have been used to protect data stored in memories, especially Single Error Correction (SEC) codes and Single Error Correction-Double Error Detection (SEC-DED) codes [5][6][7]. SEC codes can correct an error in a single memory cell, while SEC-DED codes can correct an error in a single memory cell and, in addition, detect errors in two independent cells.
In any case, extra bits must be added in order to carry out fault detection and correction. Currently, ECC memories add 8 redundant bits (also called code bits) per 64-bit data word, that is, 12.5% redundancy [8]. The problem arises when MCUs occur, as this number of code bits may not be sufficient to correct them.
In this work, we summarize the different ECCs that we have proposed to tolerate MCUs. Their common characteristic, together with the special memory configurations they use, is that they improve the reliability of memory systems while preserving, or even decreasing, the redundancy of current memory protection methods.
This paper is organized as follows. Section 2 summarizes how to improve reliability in DRAM storage systems. Section 3 introduces the basic properties of the new ECCs that improve reliability in DRAM memories. Finally, Section 4 concludes this work and presents some future work.
2 Improving reliability in DRAM devices
As just mentioned, Single Error Correction (SEC) codes and Single Error Correction-Double Error Detection (SEC-DED) codes have traditionally been used to protect memories. An important characteristic of these codes is that the redundancy they introduce decreases as the data word length grows, as can be seen in Table 1 (extracted from [9]). Table 1 also shows that, for the same data word length, a greater error coverage requires greater redundancy.
Table 1. Redundancy and coverage for common data word lengths (extracted from [9])

ECC                 Redundancy   Coverage
SEC (12, 8)¹        50%          Single Error Correction
SEC (21, 16)        31.25%       Single Error Correction
SEC (38, 32)        18.75%       Single Error Correction
SEC (71, 64)        10.94%       Single Error Correction
SEC-DED (13, 8)     62.50%       Single Error Correction, Double Error Detection
SEC-DED (22, 16)    37.50%       Single Error Correction, Double Error Detection
SEC-DED (39, 32)    21.88%       Single Error Correction, Double Error Detection
SEC-DED (72, 64)    12.50%       Single Error Correction, Double Error Detection
As mentioned previously, modern ECC memories introduce 12.5% redundancy (that is, 8 redundant bits per 64-bit data word). These memories are usually built with 4-bit or 8-bit DRAM chips; hence, to store a 64-bit data word plus 8 redundant bits, eighteen 4-bit or nine 8-bit DRAM chips are used.

¹ An (n, k) binary ECC encodes a k-bit input word into an n-bit output word. Thus, the number of redundant bits introduced can be calculated as (n − k).
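The redundancy figures in Table 1 can be checked against the standard single-error bound: a SEC code needs r parity bits such that 2^r ≥ k + r + 1, and SEC-DED adds one more parity bit. A minimal Python sketch:

```python
# Sketch: reproduce the redundancy figures of Table 1.
# For a SEC (Hamming) code, r parity bits must satisfy 2**r >= k + r + 1;
# a SEC-DED code adds one extra parity bit for double-error detection.

def sec_parity_bits(k: int) -> int:
    """Minimum parity bits r for single error correction of k data bits."""
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r

for k in (8, 16, 32, 64):
    r = sec_parity_bits(k)
    print(f"SEC ({k + r}, {k}): {100 * r / k:.2f}% redundancy")
    print(f"SEC-DED ({k + r + 1}, {k}): {100 * (r + 1) / k:.2f}% redundancy")
```

Running this reproduces the (12, 8), (21, 16), (38, 32) and (71, 64) SEC codes of Table 1, and the corresponding SEC-DED codes with one extra bit.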
However, this 12.5% redundancy is not enough to cope with MCUs, so some changes have to be introduced. For instance, some advanced methods able to tolerate MCUs, such as IBM's Chipkill error protection scheme [10], Hewlett-Packard's Advanced ECC [11] or Intel's Single Device Data Correction (SDDC) [12], interleave data and parity bits across different memory chips in order to spread errors over different ECC words.
Another change concerns the data word length. Currently, memory channels can be combined in order to feed multi-threaded applications running on multi-core processors with enough data. This can be implemented with the lockstep mode [13][14], which runs the same memory command on several channels at a time. In this way, it is possible to combine the data from these channels to form a longer data word and, as Table 1 shows, longer data words allow the redundancy used by the ECCs to be reduced.
In a previous work, we defined the FUEC (Flexible Unequal Error Control) methodology [16]. Using an algorithm based on this methodology, we are able to design ECCs with very low redundancy. When designing an ECC with the FUEC methodology, four parameters must be set:
- the data length (k);
- the encoded word length (n);
- the set of error vectors to be corrected; and
- the set of error vectors to be detected.
The number of redundant bits needed by an ECC depends on the size of the set of error vectors to be corrected, as well as on the size of the set of error vectors to be detected. Hence, by reducing these sets, we can reduce the number of extra bits needed. More information about this reduction can be found in [15].
These sets depend on the type of errors to detect and correct. The term random error commonly refers to one or more bits in error, distributed randomly in the encoded word (data bits plus parity bits generated by the ECC). Random errors can be single (only one bit affected) or multiple.
Single errors are the simplest ones. They are commonly produced by single event upsets (SEUs) in random access memories, or by single event transients (SETs) in combinational logic [17].
Multiple errors usually manifest as burst errors rather than as randomly distributed errors [18]. A burst error is a multiple error that spans l bits in a word [19], i.e. a group of contiguous bits in which, at least, the first and the last bits are in error. The span l is known as the burst length. The main physical causes of burst errors in the context of DRAM memories are diverse: high-energy cosmic particles that hit neighboring cells, crosstalk between neighboring cells, etc. [20].
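To illustrate how the size of these error vector sets grows, the burst model can be enumerated directly. The sketch below uses the standard count of (n − b + 1)·2^(b−2) bursts of length b ≥ 2 (first and last bits in error, middle bits free); the word length n = 140 is an arbitrary example, not a parameter from the paper:

```python
# Sketch: count the error vectors an ECC must handle in an n-bit word.
# A burst of length b has its first and last bits in error and the b - 2
# bits in between free, giving (n - b + 1) * 2**(b - 2) such vectors.
# Shrinking the target set (e.g. correcting bursts only up to length 3
# instead of 5) directly lowers the number of syndromes, and hence
# parity bits, that the code needs.

def burst_vectors(n: int, max_len: int) -> int:
    """Error vectors for single errors plus bursts up to max_len bits."""
    total = n  # single-bit errors (bursts of length 1)
    for b in range(2, max_len + 1):
        total += (n - b + 1) * 2 ** (b - 2)
    return total

n = 140  # an arbitrary encoded word length for illustration
for l in (3, 5, 7):
    print(f"bursts up to {l} bits: {burst_vectors(n, l)} correctable vectors")
```

The rapid growth with the maximum burst length shows why pruning uncommon error vectors pays off so quickly in redundant bits.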
3 New ECCs to improve memory reliability
We have designed a series of new ECCs that improve memory reliability. To do this, we have used:
- Interleaving. With this method, a multibit error is spread over different ECC words, so each ECC must correct a smaller number of erroneous bits. Thus, the combination of all the ECCs included in a memory system supports larger multibit errors.
- Lockstep multichannel operation to enlarge the data word size. As we have seen in Section 2, longer data words allow greater coverage and reduced redundancy.
- Efficient reduction of the space of error vectors. It is possible to design efficient, low-redundancy ECCs by reducing the number of error vectors. This is achieved by taking into account the physical distribution of the stored data: we have decreased the number of error vectors by eliminating those that represent very uncommon errors.
- The FUEC methodology, with an algorithm developed by the authors. Once the different design parameters of an ECC have been established, our algorithm is able to find a parity check matrix that defines this ECC. By using the FUEC methodology, this ECC will be very efficient in terms of area and/or speed.
Table 2 and Table 3 summarize these new ECCs. The next subsections describe the main characteristics of each one; a more detailed description can be found in [9][15].
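The role of the parity check matrix can be sketched with a toy example (a classic (7, 4) Hamming SEC code, not one of the authors' FUEC matrices): once H is fixed, encoding and correction reduce to XOR logic, and the syndrome of a received word equals the column of H at the erroneous position.

```python
# Sketch (toy example, not one of the authors' FUEC matrices):
# syndrome decoding with a (7, 4) Hamming SEC code.
# Column i of H is the binary expansion of i + 1, so a nonzero
# syndrome directly identifies the flipped bit.

H = [[1, 0, 1, 0, 1, 0, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]

def syndrome(word):
    """Compute H * word over GF(2)."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]

def correct(word):
    """Correct a single-bit error in a 7-bit word using its syndrome."""
    s = syndrome(word)
    if any(s):
        cols = [[row[i] for row in H] for i in range(7)]
        pos = cols.index(s)          # syndrome matches exactly one column
        word = word[:]
        word[pos] ^= 1               # flip the erroneous bit back
    return word

received = [0, 0, 0, 0, 1, 0, 0]     # all-zero codeword with bit 4 flipped
assert correct(received) == [0] * 7
```

A FUEC-generated code works on the same principle, but its H matrix is chosen so that the syndromes distinguish exactly the (reduced) sets of error vectors to be corrected and detected.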
Table 2. ECCs Fault Tolerance Capabilities Summary (I)

                         ECC1                          ECC3
Data Word Size           64                            128
Parity bits per ECC      8                             12
Redundancy               12.5%                         9.38%
Interleaving             Yes                           Yes
Memory Channels          2                             4
Correction capabilities  Complete 4-bit DRAM device;   Complete 8-bit DRAM device;
                         single bit errors;            single bit errors;
                         2- and 3-bit burst errors     2- to 5-bit burst errors
3.1 Error Correction Code 1 (ECC1)
The ECC1 configuration uses two ECCs, each with 64 data bits and 8 code bits. By using two memory channels, this memory configuration thus handles a 128-bit data word with 16 code bits, so the redundancy is 12.5%.
This ECC has been designed for memory DIMMs built with 4-bit DRAM devices. By interleaving data and code bits in pairs, this memory configuration is able to correct single random errors, as well as 2- and 3-bit burst errors. In addition, it can recover from the failure of a complete 4-bit DRAM device. As far as we know, this memory configuration presents the highest fault tolerance capabilities with the lowest redundancy for 128-bit data words.
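The effect of pair interleaving can be sketched as follows (the bit layout is a simplified assumption for illustration, not the exact ECC1 mapping): a failed 4-bit device corrupts four physically adjacent bits, but after de-interleaving, each of the two ECC words sees only a 2-bit burst, which it can correct on its own.

```python
# Sketch (assumed layout): bits of two ECC words are interleaved in
# pairs across the physical bit lanes. A complete 4-bit device failure
# then appears as a 2-bit adjacent error in each ECC word.

def deinterleave_pairs(bits):
    """Split a pair-interleaved bit stream back into its two ECC words."""
    word_a, word_b = [], []
    for i in range(0, len(bits), 4):
        word_a.extend(bits[i:i + 2])      # even pairs -> word A
        word_b.extend(bits[i + 2:i + 4])  # odd pairs  -> word B
    return word_a, word_b

# 16 physical bits; one 4-bit device (lanes 4..7) fails completely.
stream = [0] * 16
for i in range(4, 8):
    stream[i] = 1  # mark the bits corrupted by the device failure

a, b = deinterleave_pairs(stream)
print(a)  # the failure lands on 2 adjacent bits of word A
print(b)  # and on 2 adjacent bits of word B
```

Each word therefore only needs 2-bit burst correction to survive the whole-device failure, which matches the correction capabilities listed in Table 2.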
Table 3. ECCs Fault Tolerance Capabilities Summary (II)

                         ECC4                           ECC5                        ECC6
Data Word Size           128                            256                         256
Parity bits per ECC      8                              12                          10
Redundancy               6.25%                          4.7%                        3.9%
Interleaving             Yes                            Yes                         Yes
Memory Channels          8                              8                           8
Correction capabilities  Two adjacent 4-bit DRAM        Single 8-bit DRAM device;   Single 4-bit DRAM device;
                         devices or a single 8-bit      1 bit in error;             1 bit in error;
                         DRAM device; 1 bit in error;   2- to 5-bit burst errors    2- and 3-bit burst errors
                         2- to 7-bit burst errors
3.2 Error Correction Code 2 (ECC2)
The second ECC introduced (ECC2) also uses two ECCs, but now each one generates 8 code bits from a 128-bit data word. That is, the complete scheme uses a 256-bit data word and 16 code bits, and four memory channels are needed. In this way, the redundancy introduced is 6.25%, half that of ECC1.
This ECC has also been designed for memory DIMMs built with 4-bit DRAM devices. By interleaving data and code bits in pairs, it is possible to correct single random errors, as well as 2- and 3-bit burst errors. As in the case of ECC1, this memory configuration can also recover from the failure of a complete 4-bit DRAM device.
Another property of this code is that, since standard ECC memories provide 8 code bits per 64-bit data word, we now have 16 unused code bits. These bits can be used as spare bits, increasing the fault tolerance capabilities of the complete memory system. The use of spare bits is a known method employed by different memory protection schemes, such as the IBM Memory ProteXion method [21].
3.3 Error Correction Code 3 (ECC3)
In this case, we have increased the redundancy used by each ECC in order to employ 8-bit DRAM devices. Specifically, we now have two ECCs, each with a 128-bit data word and 12 code bits, resulting in 9.38% redundancy. As in the case of ECC2, we have also used four memory channels, but now data and code bits are interleaved in groups of 4 bits. With this configuration, the memory scheme is able to correct single bit random errors, as well as 2- to 5-bit burst errors. The failure of a complete 8-bit DRAM device is also supported.
As in the previous case, such low redundancy leaves some bits unused. Specifically, there are now 8 bits that can be used as spare bits, increasing the fault tolerance capabilities of the complete system.
3.4 Error Correction Code 4 (ECC4)
This code can be used with 4-bit or 8-bit DRAM devices. In this case, we have used eight memory channels. Combining four ECCs (each with a 128-bit data word and 8 code bits), we build a 512-bit data word with 32 code bits, giving 6.25% redundancy.
By interleaving data and code bits in pairs, this memory configuration can correct one random bit in error, as well as 2- to 7-bit burst errors. This memory scheme is also able to correct the failure of two adjacent 4-bit DRAM devices, or of a single 8-bit DRAM device. Lastly, there are also 32 unused bits, which can be employed as spare bits.
3.5 Error Correction Code 5 (ECC5)
This code has been designed for 8-bit DRAM memories. In this case, the redundancy is 4.7%, as the code uses 12 parity bits per 256 data bits. By combining two of these ECCs over eight memory channels, and interleaving bits in groups of 4, this memory protection scheme can correct all single errors, as well as 2- to 5-bit burst errors. The failure of a complete 8-bit DRAM device can also be recovered. In addition, 40 spare bits are available.
3.6 Error Correction Code 6 (ECC6)
This code, designed for 4-bit DRAM devices, uses two ECCs, each with a 256-bit data word and 10 code bits. As in the previous case, we have used eight memory channels. Combining the two ECCs, we build a 512-bit data word with 20 code bits, giving 3.9% redundancy, the lowest of the ECCs presented in this work.
Such low redundancy means that this design offers the greatest number of spare bits of all the ECCs introduced in this work: specifically, 44 spare bits.
With this memory scheme, we are able to correct one bit in error, 2- and 3-bit burst errors, and the complete failure of a single 4-bit DRAM device.
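The spare-bit counts quoted for ECC2 through ECC6 follow from simple arithmetic, assuming the standard DIMM budget stated in Section 2 (8 code bits per 64 data bits): the bits the DIMMs supply but the scheme does not use become spares. A quick sketch checking the figures:

```python
# Sketch: spare bits = code bits supplied by commodity ECC DIMMs
# (8 per 64 data bits) minus the code bits each scheme actually uses.
# The totals below are taken from Sections 3.2 to 3.6.

schemes = {
    # name: (total data bits, total code bits actually used)
    "ECC2": (256, 16),
    "ECC3": (256, 24),
    "ECC4": (512, 32),
    "ECC5": (512, 24),
    "ECC6": (512, 20),
}

for name, (data_bits, used) in schemes.items():
    available = data_bits // 64 * 8   # DIMM budget: 8 code bits per 64 data bits
    print(f"{name}: {available - used} spare bits")
```

This reproduces the 16, 8, 32, 40 and 44 spare bits stated in the text.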
4 Conclusions and Future work
As memory capacity increases, so does its fault rate, causing an increment in the number of SEUs and MCUs. As traditional ECCs cannot cope with this fault rate increase, new ECCs are needed.
By designing efficient ECCs, using various memory channels in lockstep mode, interleaving bits and applying our algorithm based on the FUEC methodology, it is possible to design advanced memory systems able to cope with the new fault rates. These new codes are able to correct MCUs and improve the redundancy and reliability of current commercial solutions. In addition, they are able to tolerate the failure of a whole memory chip.
In the future, we want to continue developing very low redundant ECCs, as well as
to study how to tolerate the failure of several DRAM devices.
References
1. J. Meza, Q. Wu, S. Kumar and O. Mutlu, "Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field", 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2015), pp. 415-426, June/July 2015.
2. P. E. Dodd, M. R. Shaneyfelt, J. A. Felix and J. R. Schwank, "Production and Propagation of Single-Event Transients in High-Speed Digital Logic ICs", IEEE Transactions on Nuclear Science, vol. 51, no. 6, pp. 3278-3284, December 2004.
3. R. C. Baumann, "Soft errors in advanced computer systems", IEEE Design & Test of Computers, vol. 22, no. 3, pp. 258-266, May/June 2005.
4. E. Ibe, H. Taniguchi, Y. Yahagi, K. Shimbo and T. Toba, "Impact of Scaling on Neutron-Induced Soft Error in SRAMs From a 250 nm to a 22 nm Design Rule", IEEE Transactions on Electron Devices, vol. 57, no. 7, pp. 1527-1538, July 2010.
5. E. Fujiwara, Code Design for Dependable Systems: Theory and Practical Application, Wiley-Interscience, 2006.
6. R. W. Hamming, "Error Detecting and Error Correcting Codes", Bell System Technical Journal, vol. 29, pp. 147-160, 1950.
7. C. L. Chen and M. Y. Hsiao, "Error-correcting codes for semiconductor memory applications: a state-of-the-art review", IBM Journal of Research and Development, vol. 28, no. 2, pp. 124-134, March 1984.
8. Hewlett-Packard Development Company, L.P., "Memory technology evolution: an overview of system memory technologies", Technology brief, 9th edition, December 2010.
9. J. Gracia-Morán, P. J. Gil-Vicente, D. Gil-Tomás and L. J. Saiz-Adalid, "… Correction Capabilities on Advanced ECC Memories by using Multichannel Configurations", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, May 2017.
10. T. J. Dell, "A white paper on the benefits of chipkill-correct ECC for PC server main memory", IBM Microelectronics Division, 1997.
11. Hewlett-Packard Development Company, L.P., "Memory technology evolution: an overview of system memory technologies", Technology brief, 9th edition, December 2010.
12. Intel Corporation, "x4 Single Device Data Correction (x4 SDDC) Implementation and Validation", Application Note AP-726, August 2002. Available online at http://www.ece.umd.edu/courses/enee759h.S2003/references/29227401.pdf.
13. Intel, "Independent Channel vs. Lockstep Mode – Drive your Memory Faster or Safer", July 2014. Available online at: https://software.intel.com/en-us/blogs/2014/07/11/independent-channel-vs-lockstep-mode-drive-you-memory-faster-or-safer.
14. J. Hoskins and B. …, "… Guide to Intel …".
15. J. Gracia-Morán, L. J. Saiz-Adalid, D. Gil-Tomás and P. J. Gil-Vicente, "Improving Reliability in High Performance DRAM Memories using new Error Correction Codes for 8 Memory Channels", IEEE Transactions on Emerging Topics in Computing, March 2017.
16. L. J. Saiz-Adalid et al., "Flexible Unequal Error Control Codes with Selectable Error Detection and Correction Levels", International Conference on Computer Safety, Reliability and Security (SAFECOMP 2013), pp. 178-189, September 2013.
17. "…", NASA Electronic Parts and Packaging (NEPP) Program, August 2009. Available online at https://nepp.nasa.gov/files/18365/Proton_RHAGuide_NASAAug09.pdf.
18. "…Locality-Aware Linear Coding to Correct Multi-bit …", November 2010.
19. E. Fujiwara, Code Design for Dependable Systems: Theory and Practical Application, Wiley-Interscience, 2006.
20. M. Greenberg, "Reliability, availability, and serviceability (RAS) for DDR DRAM interfaces", MemCon 2014. Available online at: http://www.memcon.com/pdfs/proceedings2014/NET105.pdf.
21. "…", IBM Redbooks, 2008. Available online at: http://www.gebruikteservers.nl/upload/1304416621.pdf.