# **DASIA 2008**

# NAND-Flash Memory Technology in Mass Memory Systems for Space Applications

Mike Cassel<sup>(1)</sup>, Dietmar Walter<sup>(1)</sup>, Hagen Schmidt<sup>(1)</sup>, Fritz Gliem<sup>(1)</sup>, Harald Michalik<sup>(1)</sup>, Michael Stähle<sup>(2)</sup>, Karl Vögele<sup>(2)</sup>, Peter Roos<sup>(3)</sup>

(1) IDA, TU Braunschweig, 38106 Braunschweig, Germany, <u>cassel@ida.ing.tu-bs.de</u>
(2) EADS Astrium GmbH, 88039 Friedrichshafen, Germany, <u>michael.staehle@astrium.eads.net</u>
(3) ESA/ESTEC D/TEC-EDD, 2200 AG Noordwijk, The Netherlands, <u>peter.roos@esa.int</u>

### ABSTRACT

In the frame of the ESA/ESTEC Safeguard Data Recorder study, a highly reliable NAND-Flash based memory system for use in space was developed. As for all space-borne semiconductor memories, specific measures against radiation induced data errors and malfunctions are mandatory in order to maintain the required high data integrity. For this purpose an Orthogonal Reed-Solomon Error Correction in combination with an Orthogonal Data Commutation is applied. Additionally, algorithms to detect and to cope with radiation induced Single Event Functional Interrupts are implemented. In a demonstration model these functions were realized in a Field Programmable Gate Array. An associated file system enables easy access to data structures of variable size. The demonstration model is ready for integration into the ESA/ESTEC "Aviation Lab". A representative version of the memory core is going to be flown on PROBA-2. Finally, NAND-Flash usage for telemetry buffer memories is discussed.

### 1. INTRODUCTION

Non-volatility of stored data is highly desirable for several space-borne memory applications. On ground, magnetic discs and Flash memories are currently the dominant non-volatile technologies. For space-borne memories, solid state data storage is clearly preferred. Flash memories are solid state and descend from the well established EEPROMs [1]. They use a dual-gate n-channel MOS structure with an additional floating gate (FG), which is embedded in the dielectric between the main gate and the p-type substrate. The combined charge of both gates controls the channel inversion. Charging and discharging of the FG is performed by Fowler-Nordheim tunneling between substrate and FG. A negative charge on the FG prevents the formation of a conductive channel and keeps the transistor in the off state, which represents the logical "0" state.

Flash devices offer the highest storage density of contemporary solid state memory devices, using only one transistor per bit in contrast to DRAMs using one transistor plus one capacitor.

The apparent Flash specific advantages:

- (i.) non-volatility in case of power loss
- (ii.) highest storage density: Single Level NAND-Flash: 8 Gbit / chip versus DDR2: 2 Gbit / chip

are accompanied by some Flash specific disadvantages:

- (iii.) access only to rather large data portions, e.g. pages of 4224 bytes each
- (iv.) only moderate write / read rate: 80 / 200 Mbps, compared to up to 6 Gbps for DDR2
- (v.) wear out limitation to $10^5$ write operations
- (vi.) necessity of bad block management

Accordingly, Flash devices are not well suited for memories with fast random access to small data portions (bytes, words of some bytes), such as processor memories, SAR corner turning memories, and so on.

However, Flash devices, in particular NAND-Flash devices, are well suited for large background memories with sequential data transfer, and in particular for applications with high data integrity requirements, such as the background storage of system data files. In preparation for such space applications, ESA/ESTEC commissioned Astrium/IDA to study the implementation aspects of a Flash-based non-volatile Safeguard Data Recorder (SGDR) with extremely high data integrity<sup>\*)</sup>.

The following primary aspects were treated:

- (a) response of 1Gbit...8Gbit Single Level NAND-Flash devices to space radiation
- (b) development and test of adequate countermeasures against radiation induced malfunctions
- (c) implementation and test of a demonstration model

### 2. ORGANIZATION OF NAND-FLASH DEVICES



Fig. 1: Principal Organization of a 4G NAND-Flash device

Fig. 1 displays the principal organization of NAND-Flash devices. The array of e.g. 4 Gbit is structured into 4k device blocks of 64 device pages each. Each device page contains 2k + 64 = 2112 bytes. Data Write and Data Read are executed page-wise. FG charging (= Data Write) is performed individually for each transistor, and simultaneously for all respective transistors of the addressed page. Discharging (= Erase) is performed collectively for all transistors of the addressed device block. The peripheral circuitry comprises a serial/parallel register of page size, an address register, a status register, a state machine to control the internal sequencing, and a high voltage pulser. The pulser delivers 20 V pulses in order to generate the high field strength of about 20 MV/cm needed for charge tunneling through the thin (about 8 nm) SiO<sub>2</sub> layer between FG and substrate.
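The organization described above can be checked with a few lines of arithmetic. The following is a minimal sketch; the linear page index and the function names are assumptions made for illustration only and do not reflect the address cycles of any particular device.

```python
# Geometry taken from the text: 4k blocks x 64 pages x (2k + 64) bytes.
BLOCKS_PER_DEVICE = 4096
PAGES_PER_BLOCK = 64
DATA_BYTES_PER_PAGE = 2048
SPARE_BYTES_PER_PAGE = 64
PAGE_BYTES = DATA_BYTES_PER_PAGE + SPARE_BYTES_PER_PAGE  # 2112 bytes


def split_page_address(page_index):
    """Split a (hypothetical) linear page index into (block, page in block)."""
    if not 0 <= page_index < BLOCKS_PER_DEVICE * PAGES_PER_BLOCK:
        raise ValueError("page index out of range")
    return divmod(page_index, PAGES_PER_BLOCK)


def pages_affected_by_erase(page_index):
    """Erase works on whole blocks: return all page indices sharing the block."""
    block, _ = split_page_address(page_index)
    first = block * PAGES_PER_BLOCK
    return range(first, first + PAGES_PER_BLOCK)


# Capacity of the main array, not counting the 64 spare bytes per page.
main_bits = BLOCKS_PER_DEVICE * PAGES_PER_BLOCK * DATA_BYTES_PER_PAGE * 8
print(main_bits // 2**30, "Gbit main array")   # 4 Gbit main array
print(split_page_address(130))                 # (2, 2)
```

The sketch makes the asymmetry explicit: a write touches one page of 2112 bytes, whereas an erase always affects the 64 pages of a complete block.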

### 3. RADIATION RESPONSE OF NAND-FLASH DEVICES

Radiation tests have been performed in order (i) to learn about the device specific failure mechanisms and to design and to test appropriate countermeasures, (ii) to assess the efficiency of the error correction scheme and (iii) to assess both the bit error rate and the frequency of power cycling actions in a given space environment.

As in all of today's high capacity memory devices, the device internal sequencing is decoupled from the timing of the external control signals. This results in a complex device structure. In consequence, a large number of different radiation induced error conditions can occur. These error conditions were classified according to their damaging potential into four main classes and several subclasses [8]. Distributed Single Byte Errors (class A) are of less concern, because they can be corrected by an appropriately designed error correcting scheme. The scheme tailored for SGDR is treated in Section 5. The classes B, C and D cover the more extended error patterns. Single Event Functional Interrupts (SEFI) of class B disappear after a short time. SEFI of class C are persistent, but can be resolved by power cycling. Class D contains destructive and consequently permanent failures. In SGDR any detected class B or class C SEFI initiates power cycling. This does not affect the stored data, in contrast to today's SDRAMs, where lock-up situations of the device internal control circuitry cannot be resolved without data loss.

Six heavy ion SEE tests, one proton test and one TID test were performed on 1G to 8G NAND-Flash devices. Test conditions, test execution and main results are described in [3, 4, 8-10]. The measured error cross sections vary from type to type, but in order of magnitude they are quite comparable to those of high capacity SDRAMs and well in accordance with recent test results [5-7]. In our first SEE test campaigns some device types experienced destructive failures in operational modes comprising frequent Erase. As a potential remedy we introduced a power cycle after each Erase, which proved to be a very effective countermeasure. Accordingly, SGDR makes use of this precaution.

The ESTEC Co-60 source was used for TID tests of 4G and 8G NAND-Flash devices. In Storage mode the 4G Samsung device delivered the first few errors at 40 krad and the 8G Samsung device at 30 krad. We got strong indications that occasional scrubbing would shift the first error occurrence to substantially higher dose values. This is to be verified within the next TID campaign.

### 4. BASIC CONCEPT

The SGDR architecture is based on the following main principles:

- Extremely high memory density of Flash devices enables ample redundancy to be used for highly efficient Error Detection and Correction (EDAC)
- Retry procedures against SEFIs cause no data loss
- Non-volatility enables reconfiguration after power loss
- Device individual power switches to prevent error propagation: a permanent short circuit in one memory device does not impair a complete wordgroup
- Robust software based on atomic operations to maintain consistency: user and meta data are written and read with the same access
- File system tables and bad block tables are stored twice: distributed in the non-volatile memory for initialization, and additionally volatile in compact manner to increase access speed
- Cyclic Redundancy Check (CRC) to detect exhausted error correction capability (specific SEFI situations)

### 5. ERROR CORRECTION AND SEFI HANDLING

#### Orthogonal Reed-Solomon Error Detection and Correction

Today's Flash devices typically have a data bus width of 8 bit. Therefore the use of an error correcting code oriented to 8-bit symbols is very advantageous. With such a code even the failure of one complete device is correctable. An extended, systematic Reed-Solomon (RS) code can be implemented with relatively low hardware expenditure. In order to correct 1 symbol out of  $\leq$255 symbols, two 8-bit parity symbols (bytes) are needed. A wordgroup (WG) is the smallest self-contained subunit of the SGDR memory array. To get a reasonable proportion of user symbols to parity symbols, a wordgroup is assembled from six Flash devices. Four devices store the user symbols and the two remaining devices store the parity symbols. This error correction capability takes effect in the vertical direction.
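The principle of correcting one byte symbol with two parity bytes can be illustrated in software. The sketch below is not the extended RS code of the SGDR, whose generator polynomial is not given here; it is a minimal single-symbol-correcting code over GF(2^8) with two parity symbols, and the field polynomial 0x11D as well as all function names are assumptions chosen for illustration.

```python
# GF(2^8) log/antilog tables; primitive polynomial 0x11D is an assumption.
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two field elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def syndromes(data, p, q):
    """S0 = P + sum(d_i), S1 = Q + sum(alpha^i * d_i); both zero if no error."""
    s0, s1 = p, q
    for i, d in enumerate(data):
        s0 ^= d
        s1 ^= gf_mul(EXP[i], d)
    return s0, s1


def encode(data):
    """Append parity symbols P and Q to the user symbols."""
    p, q = syndromes(data, 0, 0)
    return list(data) + [p, q]


def correct(word):
    """Correct at most one wrong symbol; return (word, error position or None)."""
    *data, p, q = word
    s0, s1 = syndromes(data, p, q)
    if s0 == 0 and s1 == 0:
        return list(word), None
    if s1 == 0:                       # only P is wrong
        return data + [p ^ s0, q], len(data)
    if s0 == 0:                       # only Q is wrong
        return data + [p, q ^ s1], len(data) + 1
    j = (LOG[s1] - LOG[s0]) % 255     # S1 / S0 = alpha^j locates the symbol
    if j >= len(data):
        raise ValueError("more than one symbol in error")
    data[j] ^= s0                     # S0 is the error value
    return data + [p, q], j


# One vertical code word of a wordgroup: 4 user bytes + 2 parity bytes.
word = encode([0x12, 0x34, 0x56, 0x78])
damaged = list(word)
damaged[2] ^= 0xA5                    # arbitrary corruption of device 2
print(correct(damaged) == (word, 2))  # True
```

Because the corrupted symbol may take any value, the same mechanism covers the complete failure of one of the six devices of a wordgroup.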



Further increase in error correction capability is achieved by implementing four EDAC units in the horizontal direction. In that direction the 2112 bytes of a device page can be used for 2112 / L\_SPG code words (L\_SPG = subpage length). This results in subpages of 6 x L\_SPG bytes (Fig. 2 shows an example of a 6 x 6 subpage). Due to the orthogonal single symbol error correction, all cases with fewer than 4 errors per subpage are correctable as a matter of principle.

Fig. 2: Example for 6x6 Subpage with L-shaped 3-Error Pattern

Error probability calculations for randomly distributed class A errors - and also for other error scenarios - have been carried out for code word lengths of n = 6, 32, 132 symbols. Fig. 3 depicts the post-correction error share esh in relation to the number of pre-correction errors nh. Repetition of correction cycles - each consisting of a vertical and a horizontal error correction - improves the correction efficiency by one order of magnitude (Fig. 4). More repetitions generate only small further improvement.
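The effect of alternating vertical and horizontal correction can be reproduced with a simple abstract model. The sketch below tracks only error positions in a 6 x 6 subpage and assumes that a code word with exactly one wrong symbol is always repaired, while a code word with two or more wrong symbols is left unchanged; miscorrection is ignored, and the function names are illustrative.

```python
from itertools import combinations


def correction_cycle(errors, rows=6, cols=6):
    """One cycle: vertical pass over columns, then horizontal pass over rows."""
    remaining = set(errors)
    for c in range(cols):
        hit = [e for e in remaining if e[1] == c]
        if len(hit) == 1:             # single symbol error: correctable
            remaining.discard(hit[0])
    for r in range(rows):
        hit = [e for e in remaining if e[0] == r]
        if len(hit) == 1:
            remaining.discard(hit[0])
    return remaining


def decode(errors, max_cycles=2):
    """Repeat correction cycles; return the error positions that survive."""
    remaining = set(errors)
    for _ in range(max_cycles):
        if not remaining:
            break
        remaining = correction_cycle(remaining)
    return remaining


def all_patterns_correctable(n_errors, rows=6, cols=6):
    """Exhaustively test every pattern of n_errors positions in the subpage."""
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    return all(not decode(p) for p in combinations(cells, n_errors))


l_shape = {(0, 0), (1, 0), (1, 1)}            # L-shaped 3-error pattern
square = {(0, 0), (0, 1), (1, 0), (1, 1)}     # 4 errors on a rectangle
print(decode(l_shape))                        # set()
print(decode(square) == square)               # True
print(all_patterns_correctable(3))            # True
```

Within this model the L-shaped pattern is resolved in a single cycle, whereas four errors on the corners of a rectangle leave two errors in every affected row and column and therefore remain uncorrected.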



Fig. 3: Single Symbol Error Share versus Error Count before Correction, Parameter: Code Word Length



Fig. 4: Single Symbol Error Share versus Error Count before Correction, Parameter: Number of Correction Cycles

#### Orthogonal Data Commutation

The SEE tests of 1G and 2G Samsung devices showed some error clustering, i.e. a particle hit corrupts not only a single cell but several neighboring cells. At oblique particle incidence this behavior could be observed with other devices, too. Clustering impairs the error correction efficiency in the horizontal direction. Data commutation (interleaving) in horizontal and vertical direction is applied to generate a pseudo-random data distribution and to break up clusters.
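The benefit of commutation against clustered errors can be shown with a plain block interleaver. The actual commutation rule of the SGDR is not specified here, so the mapping below (symbol k of every code word stored side by side, horizontal direction only) is an assumption for illustration.

```python
def interleave(symbols, n_words):
    """Store symbol k of every code word next to each other.

    `symbols` is the logical order: code word 0 first, then code word 1, ...
    """
    length, rest = divmod(len(symbols), n_words)
    if rest:
        raise ValueError("length must be a multiple of n_words")
    return [symbols[w * length + k]
            for k in range(length) for w in range(n_words)]


def deinterleave(stored, n_words):
    """Inverse of interleave(): restore the logical code word order."""
    length, rest = divmod(len(stored), n_words)
    if rest:
        raise ValueError("length must be a multiple of n_words")
    return [stored[k * n_words + w]
            for w in range(n_words) for k in range(length)]


def words_hit_by_burst(start, burst_len, n_words):
    """Code word index of each stored position inside an error cluster."""
    return [pos % n_words for pos in range(start, start + burst_len)]


logical = list(range(12))                 # 3 code words of 4 symbols each
stored = interleave(logical, 3)
print(stored)                             # [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
print(deinterleave(stored, 3) == logical) # True
print(words_hit_by_burst(4, 3, 3))        # [1, 2, 0]
```

A cluster of up to `n_words` neighboring cells then contributes at most one wrong symbol to each code word, which a single symbol correcting code can handle.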

#### SEFI Handling

In addition to EDAC, a second stage of error correction is provided by SEFI handling algorithms. A CRC hardware unit calculates the checksum across all user data in a WG page consisting of 6 device pages. In case of CRC errors the read access is repeated in combination with a power cycle of the respective wordgroup. Permanently high error counts detected by the EDAC trigger the swapping of the affected memory area to a non-affected memory area. In case of a SEFI during a write access, a new physical page address is selected, the data already written to the affected block is copied, and the affected block is marked as garbage. The logical block address does not change. A failed erase access is repeated once.
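The read retry described above can be outlined as a short control loop. This is a sketch only: `read_page` and `power_cycle` are hypothetical callbacks standing in for the hardware access, `zlib.crc32` replaces the unspecified CRC of the hardware unit, and the retry limit is an assumed parameter.

```python
import zlib


def read_wg_page(read_page, power_cycle, expected_crc, max_retries=2):
    """Read a WG page; on CRC mismatch power-cycle the wordgroup and retry.

    Returns (data, number of retries that were needed).
    """
    for attempt in range(max_retries + 1):
        data = read_page()
        if zlib.crc32(data) == expected_crc:
            return data, attempt
        if attempt < max_retries:
            power_cycle()             # resolves class B and class C SEFIs
    raise IOError("CRC still failing after retries, page treated as corrupted")


# Simulated wordgroup: the first read is disturbed, the second one is clean.
payload = b"user data of one WG page"
responses = [b"\x00" * len(payload), payload]
power_cycles = []

data, retries = read_wg_page(
    read_page=lambda: responses.pop(0),
    power_cycle=lambda: power_cycles.append("cycle"),
    expected_crc=zlib.crc32(payload),
)
print(data == payload, retries, len(power_cycles))   # True 1 1
```

Since the Flash content survives the power cycle, the retry costs time but no data, which is the property the SEFI handling relies on.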

### 6. SGDR FILE SYSTEM (SGDRFS)

The SGDR is designed for high data integrity. As already mentioned, it incorporates a very effective error correction mechanism, which is optimized to handle the error patterns of Flash devices under particle irradiation. Device specific errors and complete device failures can be handled by the applied error correction scheme. Should the error correction capability nevertheless be exhausted, the respective file is tagged as corrupted. Inconsistencies could also be induced by power failures. Therefore the data management of the Flash storage area is designed for high data consistency, i.e. in case of errors or malfunctions which can no longer be corrected, the data management remains consistent, so that other correctly stored files can be retrieved and corrupted files can be identified.

The file system on the SGDR aims at providing simple user access to data structures of variable size. It is composed of the File Management System (FMS) and the Flash Translation Layer (FTL). The FMS has interfaces to both (i) the main application, comprising for example task scheduling and Telecommand (TC) verification, and (ii) the FTL. The function of the FMS is the management of active files and the management of WG pages based on logical block addresses. The FTL exposes an interface to the FMS that hides the NAND-Flash specific device handling. For that purpose the FTL translates logical block addresses to physical block addresses. Consequently the FMS manages a pool of logical blocks, whereas "Bad Block Management" and "Garbage Collection" are performed by the FTL.
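The division of work between FMS and FTL can be pictured with a toy translation layer. The class below is a minimal sketch and not the SGDRFS implementation; the class name, its methods and the first-free allocation policy are assumptions, and the copying of already written data is left to the caller.

```python
class FlashTranslationLayer:
    """Toy FTL: maps logical to physical blocks and hides bad blocks."""

    def __init__(self, n_physical, bad_blocks=()):
        self.bad = set(bad_blocks)
        self.free = [b for b in range(n_physical) if b not in self.bad]
        self.mapping = {}             # logical block -> physical block
        self.garbage = set()          # blocks waiting for erase

    def _take_free(self):
        if not self.free:
            raise RuntimeError("no free physical block left")
        return self.free.pop(0)

    def allocate(self, logical):
        """Bind a logical block to a good physical block (idempotent)."""
        if logical not in self.mapping:
            self.mapping[logical] = self._take_free()
        return self.mapping[logical]

    def translate(self, logical):
        return self.mapping[logical]

    def remap_after_write_failure(self, logical):
        """Move a logical block elsewhere; its logical address is unchanged."""
        old = self.mapping[logical]
        new = self._take_free()
        self.mapping[logical] = new
        self.garbage.add(old)
        return old, new

    def collect_garbage(self, erase_ok=lambda block: True):
        """Erase garbage blocks; blocks that fail to erase become bad blocks."""
        for block in sorted(self.garbage):
            if erase_ok(block):
                self.free.append(block)
            else:
                self.bad.add(block)
        self.garbage.clear()


ftl = FlashTranslationLayer(n_physical=4, bad_blocks=[0])
print(ftl.allocate(7))                    # 1  (physical block 0 is skipped)
print(ftl.remap_after_write_failure(7))   # (1, 2)
print(ftl.translate(7))                   # 2
```

The FMS in this picture only ever sees logical block 7; which physical block currently backs it is known to the FTL alone.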

### 7. RECONFIGURATION AFTER POWER LOSS

After power up an autonomous reconfiguration sequence provides all necessary tables to the file system. These tables are reconstructed from meta data stored in the non-volatile memory array. The reconfiguration is performed in ten phases (runlevels). Runlevels 0 to 3 are executed by software stored in PROM.

- 0) Initial test, initialization of software
- 1) Power up of Flash memory wordgroups
- 2) Scan for active wordgroups, application software (ASW) and other system files
- 3) In case of errors wait for user command
- 4) If the ASW has been copied successfully to main memory, the SGDR is restarted and the software is executed from RAM from now on
- 5) Power up of Flash memory wordgroups
- 6) Scan for system and quick start files in active wordgroups
- 7) Automatic transmission of predefined files
- 8) Scan of complete Flash memory array to reconstruct complete file system tables
- 9) Nominal operational state

### 8. DEVELOPMENT MODEL

The development was performed in three phases. For the first model, orthogonal error correction, data commutation, cyclic redundancy check and direct memory access (DMA, between FPGA and Flash memory array) were implemented in software on the PPC405 CPU. The goal of this first model was a proof of function under Californium (Cf-252) irradiation at ESA/ESTEC. For this purpose a wordgroup was equipped with opened NAND-Flash devices. This test validated the proper function of the error correction and also the effectiveness of the retry procedures. The second model was targeted at a significant increase of the access speed. For that purpose the orthogonal error correction, data commutation, CRC and DMA were converted into firmware on the Xilinx FPGA. Data rates of 20 Mbps for write and 50 Mbps for read operation were achieved by this measure. The third model was upgraded with SpaceWire (SpW) and telemetry/telecommand (TM/TC) interfaces as needed for integration into the target environment "Aviation Lab" at ESA/ESTEC.

### 9. TELEMETRY BUFFER MEMORIES

In the frame of the SGDR study, we also investigated NAND-Flash device usage for telemetry buffer memories. At present 512 Mbit SDRAM devices are used for most mass memory applications in space. The SDRAM devices can be stacked into cubes containing 8 devices, so that one package provides a storage capacity of 4 Gbit. Current NAND-Flash technology provides a maximum capacity of 8 Gbit per die. These dies could be stacked by a similar approach. However, because commercial manufacturers offer high density TSOP1 packages containing four dies, the expenditure for additional stacking can be saved. Astrium is currently running a pre-qualification programme to make this package available for space applications, too. This package provides a storage capacity of 32 Gbit, i.e. a factor of 8 higher than the stacked SDRAM package. Additionally, the weight of a TSOP1 package is much lower. For one of our existing SDRAM memory modules, the UFM (Ultra Fast Memory Module), which has the form factor of a Double Europe Board, the capacities on module level compare as follows:

|                | 512M SDRAM         | 8G NAND Flash      |
|----------------|--------------------|--------------------|
| Gross Capacity | 144 Gbit           | 1536 Gbit          |
| Weight         | 800 g              | 560 g              |
| Board Size     | 243.35 mm x 160 mm | 243.35 mm x 160 mm |

Fig. 5: Comparison of a NAND Flash and an SDRAM UFM Memory Module. Same board size and similar mounting area for memory devices.

### 10. ACCESS PERFORMANCE AND DATA RATES

The lower write access performance is commonly countered by parallel operation of several devices. A mass memory implementation can improve the performance on at least two levels. The first level is the module level, i.e. parallel operation of several memory modules. The second level is the wordgroup level. A specific feature of Flash devices is that the write data transfer is followed by a comparatively long programming phase. During programming no further access to the device is possible. However, several other interleaved write data transfers to a series of other WGs can be executed within this pause. This interleaving concept boosts the data rate of a single module without the disadvantages imposed by pure parallel device operation, i.e. without increasing the accessible page and block sizes.
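The gain from interleaving can be estimated with a simple timing model. The sketch assumes a shared data bus that serves the wordgroups round-robin, one page per wordgroup and round; the page size and the transfer and programming times used in the example are illustrative values, not measured SGDR figures.

```python
def write_rate_mbps(page_bits, t_transfer_us, t_program_us, n_interleaved):
    """Sustained write rate for n wordgroups served round-robin.

    A wordgroup is busy for t_transfer + t_program per page, while the
    shared bus is busy for t_transfer per page. One round writes one page
    to each wordgroup and lasts as long as the slower of the two.
    """
    if n_interleaved < 1:
        raise ValueError("at least one wordgroup is required")
    round_us = max(n_interleaved * t_transfer_us,
                   t_transfer_us + t_program_us)
    return n_interleaved * page_bits / round_us   # bit/us equals Mbit/s


# Illustrative numbers only: 16000 bit page, 100 us transfer, 300 us program.
for n in (1, 2, 4, 8):
    print(n, write_rate_mbps(16000, 100, 300, n))
# 1 40.0
# 2 80.0
# 4 160.0
# 8 160.0
```

In this model the rate grows linearly until the programming pause is completely filled with transfers to other wordgroups; beyond that point the bus is the limit and further interleaving brings no gain.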

### 11. WEAR OUT MITIGATION

To mitigate the wear out limitation, most Flash memory systems, e.g. solid state replacements of magnetic disks, are equipped with an address management system which distributes the write accesses rather uniformly over the address space. This is called Wear Leveling. In typical planetary missions the telemetry buffer is only occasionally filled and retrieved during the cruise phase. During the observation phase the count of daily "load – retrieve" cycles is normally low, e.g. 1 to 10 cycles, resulting in less than 36000 write operations on the same logical address within 10 years. Furthermore, the very high capacity of NAND-Flash devices offers the opportunity to implement a physical address space which exceeds the required logical user address space by a factor of n. By this, the wear out limit of the logical addresses is raised by the factor n, too. In summary, there are two methods to keep the total count of write accesses to the same physical address below the wear out limit: Wear Leveling, and the implementation of more memory capacity, which is possible without drawbacks in weight and volume.
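The wear budget can be estimated with a short calculation. The sketch below assumes ideal wear leveling, so that the writes to one logical address are spread evenly over n physical addresses; the mission profile in the example (10 cycles per day over 3000 observation days) is an assumed illustration, and only the limit of $10^5$ write operations is taken from the introduction.

```python
WEAR_LIMIT = 10**5   # write operations per address, as quoted in Section 1


def writes_per_physical_address(cycles_per_day, days, overprovision=1):
    """Writes seen by one physical address under ideal wear leveling.

    `overprovision` is the factor n by which the physical address space
    exceeds the logical user address space.
    """
    if overprovision < 1:
        raise ValueError("overprovision factor must be at least 1")
    return cycles_per_day * days / overprovision


def wear_margin(cycles_per_day, days, overprovision=1):
    """Ratio of the wear out limit to the expected write count."""
    return WEAR_LIMIT / writes_per_physical_address(
        cycles_per_day, days, overprovision)


# Assumed profile: 10 load/retrieve cycles per day over 3000 days.
print(writes_per_physical_address(10, 3000))      # 30000.0
print(writes_per_physical_address(10, 3000, 4))   # 7500.0
print(wear_margin(10, 3000, 4) > 10)              # True
```

Under these assumptions the write count stays below the wear out limit even without over-provisioning, and a factor of n = 4 increases the margin to more than one order of magnitude.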

### 12. SUMMARY

In-situ radiation tests on device and on wordgroup level clearly proved the feasibility of a space-borne NAND-Flash based memory system designed for extreme data integrity. The device inherent imperfections can be mitigated by common procedures such as EDAC, periodic scrubbing and wear leveling. Access retry operations boost the data integrity further.

The lack of flight heritage of high storage density NAND-Flash devices could be regarded as a disadvantage. Therefore a representative version of the SGDR is going to be flown in the TDM test slot on PROBA-2 to gain space flight experience with state-of-the-art 8 Gbit NAND-Flash memory devices. For that purpose the SGDR core design is being transferred into a radiation tolerant Actel FPGA.

Further considerations in combination with the SGDR study have shown that NAND-Flash is also an adequate technology for many mass memory applications, in particular for telemetry buffer memories. Access performance can be increased by interleaving on wordgroup level without enlarging the accessible wordgroup page size.

### REFERENCES

- [1] Ashok K. Sharma, "Advanced Semiconductor Memories", IEEE Press, 2003
- [2] T.R. Oldham et al., "SEE and TID Characterization of an Advanced Commercial 2Gbit NAND Flash Nonvolatile Memory", IEEE Trans. Nucl. Sci., Vol. 53, 2006, pp. 3217-3222
- [3] M. Brüggemann et al., "SEE Tests of NAND Flash Memory Devices for Use in a Safeguard Data Recorder", RADECS 2006, A-3.
- [4] M. Brüggemann et al., "Further Heavy Ion and Proton SEE Evaluation of High Capacity NAND-FLASH Memory Devices for Safeguard Data Recorder", 8th ESA/ESTEC D/TEC-QCA Final Presentation Day, February 2007, <u>https://escies.org/GetFile?rsrcid=5512</u>.
- [5] T.R.Oldham et al., "TID and SEE Response of an Advanced 4Gb NAND Flash Memory", IEEE NSREC 2007, W-40L
- [6] F. Irom, D. N. Nguyen, "Single Event Characterization of High Density Commercial NAND and NOR Nonvolatile Memories", IEEE NSREC 2007, PJ-4.
- [7] D.N. Nguyen, F.Irom, "Total Ionizing Dose (TID) Tests on Non-Volatile Memories: Flash and MRAM", IEEE NSREC 2007, W-34
- [8] H. Schmidt et al., "Heavy Ion SEE Studies on 4-Gbit NAND-Flash Memories", RADECS 2007, DWL-14
- [9] H. Schmidt et al., "Annealing of Static Errors in NAND-Flash Memories", RADECS 2007, PJ-4
- [10] H. Schmidt et al., "TID and SEE Tests of an Advanced 8GBit NAND-Flash Memory", accepted for IEEE NSREC 2008