
Adaptive Read Thresholds for NAND Flash

Borja Peleato, Member, IEEE, Rajiv Agarwal, Member, IEEE, John Ciofﬁ, Fellow, IEEE,

Minghai Qin, Member, IEEE, and Paul H. Siegel, Fellow, IEEE

Abstract—A primary source of increased read time on NAND

ﬂash comes from the fact that in the presence of noise, the ﬂash

medium must be read several times using different read threshold

voltages for the decoder to succeed. This paper proposes an

algorithm that uses a limited number of re-reads to characterize

the noise distribution and recover the stored information. Both

hard and soft decoding are considered. For hard decoding, the

paper attempts to ﬁnd a read threshold minimizing bit-error-rate

(BER) and derives an expression for the resulting codeword-

error-rate. For soft decoding, it shows that minimizing BER and

minimizing codeword-error-rate are competing objectives in the

presence of a limited number of allowed re-reads, and proposes

a trade-off between the two.

The proposed method does not require any prior knowledge

about the noise distribution, but can take advantage of such

information when it is available. Each read threshold is chosen

based on the results of previous reads, following an optimal policy

derived through a dynamic programming backward recursion.

The method and results are studied from the perspective of an

SLC Flash memory with Gaussian noise but the paper explains

how the method could be extended to other scenarios.

Index Terms—Flash memory, multi-level memory, voltage

threshold, adaptive read, soft information, symmetric capacity.

I. INTRODUCTION

A. Overview

The introduction of Solid State Drives (SSD) based on

NAND ﬂash memories has revolutionized mobile, laptop, and

enterprise storage by offering random access to the information

with dramatically higher read throughput and power-efﬁciency

than hard disk drives. However, SSDs are considerably more

expensive, which poses an obstacle to their widespread use.

NAND ﬂash manufacturers have tried to pack more data in

the same silicon area by scaling the size of the ﬂash cells and

storing more bits in each of them, thus reducing the cost per

gigabyte (GB) and making ﬂash more attractive to consumers,

but this cell-size shrinkage has come at the cost of reduced

performance. As cell-size shrinks to sub-16nm limits, noise

can cause the voltage residing on the cell at read time to be

signiﬁcantly different from the voltage that was intended to

be stored at the time of write. Even in current state-of-the-art

19nm NAND, noise is signiﬁcant towards the end of life of

the drive. One way to recover host data in the presence of

noise is to use advanced signal processing algorithms [1]–[4],

but excessive re-reads and post-read signal processing could

jeopardize the advantages brought by this technology.

B. Peleato is with Purdue University, West Lafayette, IN 47907 USA (e-mail: bpeleato@purdue.edu).

R. Agarwal and J. Ciofﬁ are with Stanford University, Stanford, CA 94305

USA (e-mail: {rajivag, ciofﬁ}@stanford.edu).

M. Qin and P. H. Siegel are with the University of California, San Diego,

La Jolla, CA 92093, USA (e-mail:{mqin, psiegel}@ucsd.edu)

Manuscript received January 25, 2015; revised June 29, 2015.

Typically, all post-read signal processing algorithms require

re-reads using different thresholds, but the default read thresh-

olds, which are good for voltage levels intended during write,

are often suboptimal for read-back of host data. Furthermore,

the noise in the stored voltages is random and depends on

several factors such as time, data, and temperature; so a ﬁxed

set of read thresholds will not be optimal throughout the entire

life of the drive. Thus, ﬁnding optimal read thresholds in a

dynamic manner to minimize BER and speed up the post-

processing is essential.

The ﬁrst half of the paper proposes an algorithm for charac-

terizing the distribution of the noise for each nominal voltage

level and estimating the read thresholds which minimize BER.

It also presents an analytical expression relating the BER

found using the proposed methods to the minimum possible

BER. Though BER is a useful metric for algebraic error

correction codes, the distribution of the number of errors is

also important. Some ﬂash memory controllers use a weaker

decoder when the number of errors is small and switch to a

stronger one when the former fails, both for the same code

(e.g. bit-ﬂipping and min-sum for decoding an LDPC code

[5]). The average read throughput and total power consumption

depends on how frequently each decoder is used. Therefore,

the distribution of the number of errors, which is also derived

here, is a useful tool to ﬁnd NAND power consumption.

The second half of the paper modiﬁes the proposed algo-

rithm to address the quality of the soft information generated,

instead of just the number of errors. In some cases, the BER is

too large for a hard decoder to succeed, even if the read is done

at the optimal threshold. It is then necessary to generate soft

information by performing multiple reads with different read

thresholds. The choice of read thresholds has a direct impact

on the quality of the soft information generated, which in turn

dictates the number of decoder iterations and the number of

re-reads required. The paper models the ﬂash as a discrete

memoryless channel with mismatched decoding and attempts

to maximize its capacity through dynamic programming.

The overall scheme works as follows. First, the controller

reads with an initial threshold and attempts a hard-decoding

of the information. If the noise is weak and the initial

threshold was well chosen, the decoding succeeds and no fur-

ther processing is needed. Otherwise, the controller performs

additional reads with adaptively chosen thresholds to estimate

the mean and/or variance of the voltages for each level. These

estimates are in turn used to estimate the minimum feasible

BER and the corresponding optimal read threshold. The ﬂash

controller then decides whether to perform an additional read

with that estimated threshold to attempt hard decoding again,

or directly attempt a more robust decoding of the information,

leveraging the previous reads to generate soft information.

2

B. Literature review

Most of the existing literature on optimizing the read

thresholds for NAND ﬂash assumes that prior information on

the noise is available (e.g., [6]–[10]). Some methods, such as the one proposed by Wang et al. in [11], assume complete

knowledge of the noise and choose the read thresholds so as to

maximize the mutual information between the values written

and read, while others attempt to predict the noise from the

number of program-erase (PE) cycles and then optimize the

read thresholds based on that prediction. An example of the

latter was proposed by Cai et al. in [12]. References [13] and

[14] also address threshold selection and error-correction.

However, in some practical cases there is no prior informa-

tion available, or the prior information is not accurate enough

to build a reliable noise model. In these situations, a common

approach is to perform several reads with different thresholds

searching for the one that returns an equal number of cells on

either side, i.e., the median between the two levels1. However,

the median threshold is suboptimal in general, as was shown

in [1]. In [2] and [15] Zhou et al. proposed encoding the

data using balanced, asymmetric, or Berger codes to facilitate

the threshold selection. Balanced codes guarantee that all

codewords have the same number of ones and zeros, hence

narrowing the gap between the median and optimal thresholds.

Asymmetric and Berger codes, ﬁrst described in [16], leverage

the known asymmetry of the channel to tolerate suboptimal

thresholds. Berger codes are able to detect any number of

unidirectional errors. In cases of signiﬁcant leakage, where all

the cells reduce their voltage level, it is possible to perform

several reads with progressively decreasing thresholds until the

Berger code detects a low enough number of errors, and only

then attempt decoding to recover the host information.

Researchers have also proposed some innovative data rep-

resentation schemes with different requirements in terms of

read thresholds. For example, rank modulation [17]–[21] stores

information in the relative voltages between the cells instead

of using pre-deﬁned voltage levels. The strategy of writing

data represented by rank modulation in parallel to ﬂash

memories is studied in [22]. Theoretically, rank modulation

does not require actual read thresholds, but just comparisons

between the cell voltages. Unfortunately, there are a few

technological challenges that need to be overcome before

rank modulation becomes practical. Other examples include

constrained codes [23], [24]; write-once memory codes [25]–[27]; and other rewriting codes [28]. All these codes

impose restrictions on the levels that can be used during

a speciﬁc write operation. Since read thresholds need only

separate the levels being used, they can often take advantage

of these restrictions.

The scheme proposed in this paper is similar to those

described in [29] and [30] in that it assumes no prior in-

formation about the noise or data representation, but it is

signiﬁcantly simpler and more efﬁcient. We propose using

a small number of reads chosen by a dynamic program to

1In many cases this threshold is not explicitly identified as the median cell voltage, but only implicitly as the solution of $\frac{t-\mu_1}{\sigma_1} = \frac{\mu_2-t}{\sigma_2}$, where $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$ are the mean and standard deviation of the level voltages.

simultaneously estimate the noise and recover the information,

instead of periodically testing multiple thresholds (as in [29])

or running a computationally intensive optimization algorithm

to perfect the model (as in [30]). A prior version of this paper

was published in [31], but the work presented here has been

signiﬁcantly extended.

II. SYSTEM MODEL

Cells in a NAND ﬂash are organized in terms of pages,

which are the smallest units for write and read operations.

Writing the cells in a page is done through a program and

verify approach where voltage pulses are sent into the cells

until their stored voltage exceeds the desired one. Once a cell

has reached its desired voltage, it is inhibited from receiving

subsequent pulses and the programming of the other cells

in the page continues. However, the inhibition mechanism is

non-ideal and future pulses may increase the voltage of the

cell [12], creating write noise. The other two main sources

of noise are inter-cell interference (ICI), caused by interaction

between neighboring cells [32], and charges leaking out of the

cells with time and heat [33].

Some attempts have been made to model these sources

of noise as a function of time, voltage levels, amplitude

of the programming pulses, etc. Unfortunately, the noise is

temperature- and page-dependent as well as time- and data-

dependent [34]. Since the controller cannot measure those fac-

tors, it cannot accurately estimate the noise without performing

additional reads. This paper assumes that the overall noise

follows a Gaussian distribution for each level, as is common

in the literature, but assumes no prior knowledge about their

means or variances. Section VI will explain how the same idea

can be used when the noise is not Gaussian.

Reading the cells in a page is done by comparing their stored voltage with a threshold voltage t. The read operation returns a binary vector with one bit for each cell. Bits corresponding to cells with voltage lower than t are 1 and those corresponding to cells with voltage higher than t are 0.

However, the aforementioned sources of voltage disturbance

can cause some cells to be misclassiﬁed, introducing errors in

the bit values read. The choice of a read threshold therefore

becomes important to minimize the BER in the reads.

In a b-bit MLC flash, each cell stores one of $2^b$ distinct predefined voltage levels. When each cell stores multiple bits, i.e. $b \geq 2$, the mapping of information bits to voltage levels is done using Gray coding to ensure that only one bit

changes between adjacent levels. Since errors almost always

happen between adjacent levels, Gray coding minimizes the

average BER. Furthermore, each of the bbits is assigned to

a different page, as shown in Fig. 1. This is done so as to

reduce the number of comparisons required to read a page. For

example, the lower page of a TLC (b = 3) flash can be read

by comparing the cell voltages with a single read threshold

located between the fourth and ﬁfth levels, denoted by D in

Fig. 1. The ﬁrst four levels encode a bit value 1 for the lower

page, while the last four levels encode a value 0. Unfortunately, reading the middle and upper pages requires comparing the cell voltages with more read thresholds: two (B, F) for the middle page and four (A, C, E, G) for the upper page.
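The one/two/four-threshold structure of the three pages falls out of the Gray code itself. The following sketch uses the standard reflected binary Gray code; the particular assignment in Fig. 1 is a manufacturer-specific variant of this pattern (possibly with a different ordering and bit polarity), so this is an illustration rather than the exact mapping.

```python
def gray(i):
    # Standard reflected binary Gray code: consecutive integers map to
    # codewords that differ in exactly one bit.
    return i ^ (i >> 1)

b = 3                                    # bits per cell (TLC)
levels = [gray(i) for i in range(2 ** b)]

# Errors between adjacent voltage levels flip exactly one bit,
# and hence corrupt exactly one of the b pages.
assert all(bin(lo ^ hi).count("1") == 1 for lo, hi in zip(levels, levels[1:]))

# Read thresholds needed by each bit position (page): the number of
# adjacent level pairs at which that bit changes value.
flips = [sum((lo ^ hi) >> k & 1 for lo, hi in zip(levels, levels[1:]))
         for k in range(b)]
print(flips)  # [4, 2, 1]: one page needs 4 thresholds, one 2, one 1
```

The bit that changes only once across the eight levels plays the role of the lower page (one threshold), the bit that changes twice the middle page, and the bit that changes four times the upper page.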


[Fig. 1 here: the eight voltage levels with their Gray-coded bit assignments for the lower, middle, and upper pages, and the seven read thresholds labeled A through G.]

Fig. 1. Typical mapping of bits to pages and levels in a TLC Flash memory.

Upper pages take longer to read than lower ones, but

the difference is not as large as it might seem. Flash chips

generally incorporate dedicated hardware for performing all

the comparisons required to read upper pages, without the

additional overhead that would arise from issuing multiple

independent read requests. The ﬂash controller can then be

oblivious to the type of page being read. Read commands only

need to specify the page being read and a scalar parameter

representing the desired shift in the read thresholds from their

default value. If the page is a lower one, which employs only

one threshold, the scalar parameter is understood as the shift

in this threshold. If the page is an upper one, which employs

multiple read thresholds, their shifts are parameterized by the

scalar parameter. For example, a parameter value of $\Delta$ when reading the middle page in Fig. 1 could shift thresholds B and F by $\Delta$ and $-\frac{3}{2}\Delta$ mV, respectively. Then, cells whose voltage falls between the shifted thresholds B and F would be read as 0 and the rest as 1.

After ﬁxing this parametrization, the ﬂash controller views

all the pages in an MLC or TLC memory as independent

SLC pages with a single read shift parameter that needs

to be optimized. In theory, each low level threshold could

be independently optimized, but the large number of reads

and amount of memory required would render that approach impractical. Hence, most of the paper will assume an SLC

architecture for the ﬂash and Section VI will show how the

same method and results can be readily extended to memories

with more bits per cell.

Figure 2 (a) shows two overlapping Gaussian probability

density functions (pdfs), corresponding to the two voltage lev-

els to which cells can be programmed. Since data is generally

compressed before being written onto ﬂash, approximately

the same number of cells is programmed to each level. The

figure also includes three possible read thresholds. Denoting by $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$ the means and standard deviations of the two Gaussian distributions, the thresholds are: $t_{\mathrm{mean}} = \frac{\mu_1+\mu_2}{2}$, $t_{\mathrm{median}} = \frac{\mu_1\sigma_2+\mu_2\sigma_1}{\sigma_1+\sigma_2}$, and $t^\star$, which minimizes BER. If the

noise variance was the same for both levels all three thresholds

would be equal, but this is not the case in practice. The plot

legend provides the BER obtained when reading with each of

the three thresholds.

There exist several ways in which the optimal threshold,

t⋆, can be found. A common approach is to perform several

reads by shifting the thresholds in one direction until the

decoding succeeds. Once the data has been recovered, it can be

compared with the read outputs to ﬁnd the threshold yielding

the lowest BER [29]. However, this method can require a

large number of reads if the initial estimate is inaccurate,

[Fig. 2 here: (a) the two overlapping cell-voltage pdfs with the three thresholds and their BERs ($t_{\mathrm{mean}}$: 5.6%, $t_{\mathrm{median}}$: 4.8%, $t^\star$: 4.5%); (b) the corresponding cdf.]

Fig. 2. (a): Cell voltages pdf in an SLC page, and BER for three different thresholds: $t_{\mathrm{mean}} = (\mu_1+\mu_2)/2$ is the average of the cell voltages, $t_{\mathrm{median}}$ returns the same number of 1s and 0s, and $t^\star$ minimizes the BER and is located at the intersection of the two pdfs. (b): cdf corresponding to the pdf in (a).

which reduces read throughput, and additional memory to store

and compare the successive reads, which increases cost. The

approach taken in this paper consists of estimating (µ1, σ1)

and (µ2, σ2)and deriving t⋆analytically. It will be shown

how this can be done with as few reads as possible, thereby

reducing read time. Furthermore, the mean and standard

deviation estimates can also be used for other tasks, such as

generating soft information for LDPC decoding.

A read operation with a threshold voltage t returns a binary vector with a one for each cell whose voltage level is lower than t and a zero otherwise. The fraction of ones in the read output is then equal to the probability of a randomly chosen cell having a voltage level below t. Consequently, a read with a threshold voltage t can be used to obtain a sample from the cumulative distribution function (cdf) of cell voltages at t, illustrated in Fig. 2 (b).

The problem is then reduced to estimating the means and

variances of a mixture of Gaussians using samples from their

joint cdf. These samples will be corrupted by model, read, and

quantization noise. Model noise is caused by the deviation

of the actual distribution of cell voltages from a Gaussian

distribution. Read noise is caused by the intrinsic reading

mechanism of the ﬂash, which can read some cells as storing

higher or lower voltages than they actually have. Quantiza-

tion noise is caused by limited computational accuracy and

rounding of the Gaussian cdf2. All these sources of noise are

collectively referred to as read noise in this paper. It is assumed

to be zero mean, but no other restriction is imposed in our

derivations.

It is desirable to devote as few reads as possible to the

estimation of (µ1, σ1)and (µ2, σ2). The accuracy of the

estimates would improve with the number of reads, but read

time would also increase. Since there are four parameters to

be estimated, at least four reads will be necessary. Section III

describes how the locations of the read thresholds should be

2Since the Gaussian cdf has no analytical expression, it is generally quantized and stored as a lookup table.


chosen in order to achieve accurate estimates and Section IV

extends the framework to consider how these reads could

be reused to obtain soft information for an LDPC decoder.

If the soft information obtained from the ﬁrst four reads is

enough for the LDPC decoding to succeed, no additional

reads will be required, thereby reducing the total read time

of the ﬂash. Section V proposes a dynamic programming

method for optimizing the thresholds for a desired objective.

Finally, Section VI explains how to extend the algorithm

for MLC or TLC memories, as well as for non-Gaussian

noise distributions. Section VII provides simulation results

to evaluate the performance of the proposed algorithms and

Section VIII concludes the paper.

III. HARD DECODING: MINIMIZING BER

A. Parameter estimation

Let $t_i$, $i = 1,\ldots,4$ be four voltage thresholds used for reading a page and let $y_i$, $i = 1,\ldots,4$ be the fraction of ones in the output vector for each of the reads, respectively. If $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$ denote the voltage mean and variance for the cells programmed to the two levels, then

$$y_i = \frac{1}{2}Q\!\left(\frac{\mu_1-t_i}{\sigma_1}\right) + \frac{1}{2}Q\!\left(\frac{\mu_2-t_i}{\sigma_2}\right) + n_{y_i}, \quad i = 1,\ldots,4, \qquad (1)$$

where

$$Q(x) = \int_x^{\infty} (2\pi)^{-\frac{1}{2}} e^{-\frac{t^2}{2}}\, dt \qquad (2)$$

and $n_{y_i}$ denotes the read noise associated with $y_i$. In theory, it is possible to estimate $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$ from $(t_i, y_i)$, $i = 1,\ldots,4$ by solving the system of non-linear equations in Eq. (1), but in practice the computational complexity could be too large for some systems. Another possible approach would be to restrict the estimates to a pre-defined set of values and generate a lookup table for each combination. Finding the table which best fits the samples would require negligible time, but the amount of memory required could render this approach impractical for some systems. This section proposes and evaluates a progressive read algorithm that combines these two approaches, providing similar accuracy to the former and requiring only a standard normal ($\mu = 0$, $\sigma = 1$) lookup table.

Progressive Read Algorithm: The key idea is to perform two reads at locations where one of the $Q$ functions is known to be either close to 0 or close to 1. The problem with solving the system in Eq. (1) was that a sum of $Q$ functions cannot be easily inverted. However, once one of the two $Q$ functions is fixed at 0 or 1, the equation can be put in linear form using a standard normal table to invert the other $Q$ function. The system of linear equations can then be solved to estimate the first mean and variance. Once the first mean and variance have been estimated, they can be used to evaluate a $Q$ function from each of the two remaining equations in Eq. (1), which can then be solved in a similar way. For example, if $t_1$ and $t_2$ are significantly smaller than $\mu_2$, then

$$Q\!\left(\frac{\mu_2-t_1}{\sigma_2}\right) \simeq 0 \simeq Q\!\left(\frac{\mu_2-t_2}{\sigma_2}\right)$$

and Eq. (1) can be solved for $\hat{\mu}_1$ and $\hat{\sigma}_1$ to get

$$\hat{\sigma}_1 = \frac{t_2-t_1}{Q^{-1}(2y_1)-Q^{-1}(2y_2)}, \qquad \hat{\mu}_1 = t_2 + \hat{\sigma}_1 Q^{-1}(2y_2). \qquad (3)$$

Substituting these in the equations for the third and fourth reads and solving gives

$$\hat{\sigma}_2 = \frac{t_4-t_3}{Q^{-1}(2y_3-q_3)-Q^{-1}(2y_4-q_4)}, \qquad \hat{\mu}_2 = t_4 + \hat{\sigma}_2 Q^{-1}(2y_4-q_4), \qquad (4)$$

where

$$q_3 = Q\!\left(\frac{\hat{\mu}_1-t_3}{\hat{\sigma}_1}\right), \qquad q_4 = Q\!\left(\frac{\hat{\mu}_1-t_4}{\hat{\sigma}_1}\right).$$

It could be argued that, since the pdfs are not known a priori, it is not possible to determine two read locations where one of the $Q$ functions is close to 0 or close to 1. In practice, however, each read threshold can be chosen based on the result from the previous ones. For example, say the first randomly chosen read location returned $y_1 = 0.6$. This read, if used for estimating the higher level distribution, will be a bad choice because there will be significant overlap from the lower level. Hence, a smart choice would be to obtain two reads for the lower level that are clear of the higher level by reading to the far left of $t_1$. Once the lower level is canceled, the $y_1 = 0.6$ read can be used in combination with a fourth read to the right of $t_1$ to estimate the higher level distribution.
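The estimation step of Eqs. (3)-(4) can be sketched in a few lines of Python, with `statistics.NormalDist` standing in for the standard normal lookup table. The thresholds and level parameters below are illustrative:

```python
from statistics import NormalDist

_std = NormalDist()                       # the standard normal "lookup table"

def Q(x):
    return 1.0 - _std.cdf(x)

def Qinv(p):
    return _std.inv_cdf(1.0 - p)

def progressive_read(t, y):
    """Estimate (mu1, s1, mu2, s2) from four reads via Eqs. (3)-(4).
    Assumes t[0] and t[1] lie far below mu2, so the level-2 Q terms vanish."""
    t1, t2, t3, t4 = t
    y1, y2, y3, y4 = y
    s1 = (t2 - t1) / (Qinv(2 * y1) - Qinv(2 * y2))            # Eq. (3)
    m1 = t2 + s1 * Qinv(2 * y2)
    q3, q4 = Q((m1 - t3) / s1), Q((m1 - t4) / s1)             # level-1 mass
    s2 = (t4 - t3) / (Qinv(2 * y3 - q3) - Qinv(2 * y4 - q4))  # Eq. (4)
    m2 = t4 + s2 * Qinv(2 * y4 - q4)
    return m1, s1, m2, s2

# Noiseless sanity check against the forward model of Eq. (1).
mu1, sg1, mu2, sg2 = 1.0, 0.2, 2.0, 0.3   # hypothetical true parameters
ts = [0.8, 1.0, 1.9, 2.1]
ys = [0.5 * Q((mu1 - t) / sg1) + 0.5 * Q((mu2 - t) / sg2) for t in ts]
est = progressive_read(ts, ys)            # recovers ~(1.0, 0.2, 2.0, 0.3)
```

With these thresholds the residual level-2 mass at $t_1, t_2$ is on the order of $10^{-4}$, so the recovered parameters match the true ones to about three decimal places even before read noise is considered.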

Once the mean and variance of both pdfs have been estimated, it is possible to derive an estimate for the read threshold minimizing the BER. The BER associated with a given read threshold $t$ is given by

$$\mathrm{BER}(t) = \frac{1}{2}\left[Q\!\left(\frac{\mu_2-t}{\sigma_2}\right) + 1 - Q\!\left(\frac{\mu_1-t}{\sigma_1}\right)\right]. \qquad (5)$$

Setting its derivative equal to zero gives the following equation for the optimal threshold $t^\star$:

$$\frac{1}{\sigma_2}\,\phi\!\left(\frac{\mu_2-t^\star}{\sigma_2}\right) = \frac{1}{\sigma_1}\,\phi\!\left(\frac{\mu_1-t^\star}{\sigma_1}\right), \qquad (6)$$

where $\phi(x) = (2\pi)^{-1/2}e^{-x^2/2}$. The optimal threshold $t^\star$ is located at the point where both Gaussian pdfs intersect. An estimate $\hat{t}^\star$ for $t^\star$ can be found from the quadratic equation

$$2\log\frac{\hat{\sigma}_2}{\hat{\sigma}_1} = \left(\frac{\hat{t}^\star-\hat{\mu}_1}{\hat{\sigma}_1}\right)^2 - \left(\frac{\hat{t}^\star-\hat{\mu}_2}{\hat{\sigma}_2}\right)^2, \qquad (7)$$

which can be shown to be equivalent to solving Eq. (6) with $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$ replaced by their estimated values.

If some parameters are known, the number of reads can be reduced. For example, if $\mu_1$ is known, the first read can be replaced by $t_1 = \mu_1$, $y_1 = 0.25$ in the above equations. Similarly, if $\sigma_1$ is known, $(t_1, y_1)$ are not needed in Eqs. (3)-(4).
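Eq. (7) is an ordinary quadratic in $\hat{t}^\star$ and can be solved in closed form. The sketch below (with illustrative parameters) picks the root lying between the two means and can be checked against Eq. (5):

```python
import math
from statistics import NormalDist

def Q(x):
    return 1.0 - NormalDist().cdf(x)

def ber(t, m1, s1, m2, s2):
    # Eq. (5): level-2 mass below t plus level-1 mass above t, halved.
    return 0.5 * (Q((m2 - t) / s2) + 1.0 - Q((m1 - t) / s1))

def optimal_threshold(m1, s1, m2, s2):
    """Solve the quadratic Eq. (7): the intersection of the two pdfs,
    taking the root between the two means."""
    a = 1 / s1**2 - 1 / s2**2
    b = -2 * (m1 / s1**2 - m2 / s2**2)
    c = m1**2 / s1**2 - m2**2 / s2**2 - 2 * math.log(s2 / s1)
    if abs(a) < 1e-12:                # equal variances: midpoint of the means
        return -c / b
    r = math.sqrt(b * b - 4 * a * c)
    return next(t for t in ((-b + r) / (2 * a), (-b - r) / (2 * a))
                if m1 < t < m2)

t_star = optimal_threshold(1.0, 0.2, 2.0, 0.3)   # ~1.42 for these parameters
```

For equal variances the quadratic degenerates and the solution reduces to $t_{\mathrm{mean}} = (\mu_1+\mu_2)/2$, consistent with the discussion around Fig. 2.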


B. Error propagation

This subsection ﬁrst studies how the choice of read locations

affects the accuracy of the estimators (cµ1,cσ1),(cµ2,cσ2), and

correspondingly b

t⋆. Then it analyzes how the accuracy of

b

t⋆translates into BER(b

t⋆), and provides some guidelines as

to how the read locations should be chosen. Without loss

of generality, it will be assumed that (µ1, σ1)are estimated

ﬁrst using (t1, y1)and (t2, y2)according to the Progressive

Read Algorithm described in Section III-A, and (µ2, σ2)are

estimated in the second stage. In this case, Eq. (1) reduces to

Qµ1−ti

σ1= 2yi−2nyi

for i= 1,2and the estimates are given by Eqs. (3).

If the read thresholds are on the tails of the distributions, a small perturbation in the cdf value $y$ could cause a significant change in $Q^{-1}(y)$. This will in turn lead to a significant change in the estimates. Specifically, a first-order Taylor expansion of $Q^{-1}(y+n_y)$ at $y$ can be written as

$$Q^{-1}(y+n_y) = x - \sqrt{2\pi}\, e^{\frac{x^2}{2}} n_y + O(n_y^2), \qquad (8)$$

where $x = Q^{-1}(y)$. Since the exponent of $e$ is always positive, the first-order error term is minimized when $x = 0$, i.e., when the read is performed at the mean. The expressions for $(\hat{\mu}_1, \hat{\sigma}_1)$ and $(\hat{\mu}_2, \hat{\sigma}_2)$ as seen in Eqs. (3)-(4) use inverse $Q$ functions, so the estimation error due to read noise will be reduced when the reads are done close to the mean of the Gaussian distributions.

The first-order Taylor expansion of Eq. (3) at $\sigma_1$ is given by

$$\hat{\sigma}_1 = \sigma_1 - \frac{\sigma_1^2}{t_2-t_1}(n_2-n_1) + O(n_1^2, n_2^2), \qquad (9)$$

where

$$n_1 = 2\sqrt{2\pi}\, e^{\frac{(t_1-\mu_1)^2}{2\sigma_1^2}} n_{y_1} + O(n_{y_1}^2), \qquad n_2 = 2\sqrt{2\pi}\, e^{\frac{(t_2-\mu_1)^2}{2\sigma_1^2}} n_{y_2} + O(n_{y_2}^2). \qquad (10)$$

A similar expansion can be performed for $\hat{\mu}_1$, obtaining

$$\hat{\mu}_1 = \mu_1 - \sigma_1 \frac{(t_2-\mu_1)n_1 - (t_1-\mu_1)n_2}{t_2-t_1} + O(n_1^2, n_2^2). \qquad (11)$$

Two different tendencies can be observed in the above

expressions. On one hand, Eqs. (10) suggest that both t1and

t2should be chosen close to µ1so as to reduce the magnitude

of n1and n2. On the other hand, if t1and t2are very close

together, the denominators in Eq. (9) and (11) can become

small, increasing the estimation error.

The error expansions for $\hat{\mu}_2$, $\hat{\sigma}_2$, and $\hat{t}^\star$ are omitted for simplicity, but it can be shown that the dominant terms are linear in $n_{y_i}$, $i = 1,\ldots,4$, as long as all $n_{y_i}$ are small enough. The Taylor expansion for $\mathrm{BER}(\hat{t}^\star)$ at $t^\star$ is

$$\mathrm{BER}(\hat{t}^\star) = \mathrm{BER}(t^\star) + \left[\frac{1}{2\sigma_2}\,\phi\!\left(\frac{\mu_2-t^\star}{\sigma_2}\right) - \frac{1}{2\sigma_1}\,\phi\!\left(\frac{\mu_1-t^\star}{\sigma_1}\right)\right] e_{t^\star} + O(e_{t^\star}^2) = \mathrm{BER}(t^\star) + O(e_{t^\star}^2), \qquad (12)$$

[Fig. 3 here: log-log plot of the relative errors $|\hat{\mu}-\mu|/\mu$, $|\hat{\sigma}-\sigma|/\sigma$, $|\hat{t}-t^\star|/t^\star$, and $(\mathrm{BER}(\hat{t})-\mathrm{BER}(t^\star))/\mathrm{BER}(t^\star)$ as functions of the read noise amplitude.]

Fig. 3. The relative error in the mean, variance, and threshold estimates

increases linearly with the read noise (slope=1), but the BER error grows

quadratically (slope=2) and is negligible for a wide range of read noise

amplitudes.

where $\hat{t}^\star = t^\star + e_{t^\star}$. The cancellation of the first-order term is justified by Eq. (6). Summarizing, the mean and variance estimation error increases linearly with the read noise, as does the deviation in the estimated optimal read threshold. The increase in BER, on the other hand, is free from linear terms. As long as the read noise is not too large, the resulting $\mathrm{BER}(\hat{t}^\star)$ is close to the minimum possible BER. The numerical simulations in Fig. 3 confirm these results.
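The quadratic behavior of the BER penalty is easy to check numerically: doubling a small threshold offset roughly quadruples the excess BER. The parameters here are illustrative, with $t^\star \approx 1.4242$ at the intersection of the two pdfs:

```python
from statistics import NormalDist

def Q(x):
    return 1.0 - NormalDist().cdf(x)

def ber(t, m1=1.0, s1=0.2, m2=2.0, s2=0.3):
    # Eq. (5) with illustrative (hypothetical) level parameters.
    return 0.5 * (Q((m2 - t) / s2) + 1.0 - Q((m1 - t) / s1))

t_star = 1.4242                 # pdf intersection for these parameters
e = 0.01
p1 = ber(t_star + e) - ber(t_star)       # penalty for offset e
p2 = ber(t_star + 2 * e) - ber(t_star)   # penalty for offset 2e
print(round(p2 / p1, 1))        # close to 4: the penalty is quadratic in e
```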

In view of these results, it seems that the read thresholds

should be spread out over both pdfs but close to the levels’

mean voltages. Choosing the thresholds in this way will

reduce the error propagating from the reads to the estimates.

However, read thresholds can be chosen sequentially, using the

information obtained from each read in selecting subsequent

thresholds. Section V proposes a method for ﬁnding the

optimal read thresholds more precisely.

IV. SOFT DECODING: TRADEOFF BER-LLR

This section considers a new scenario where a layered

decoding approach is used for increased error-correction ca-

pability. After reading a page, the controller may ﬁrst attempt

to correct any bit errors in the read-back codeword using

a hard decoder alone, typically a bit-ﬂipping hard-LDPC

decoder [35]. Reading with the threshold $\hat{t}^\star$ found through Eq. (7) reduces the number of hard errors, but there are cases in which even $\mathrm{BER}(\hat{t}^\star)$ is too high for the hard decoder to succeed. When this happens, the controller will attempt a soft decoding, typically using a min-sum or sum-product soft LDPC decoder.

Soft decoders are more powerful, but also signiﬁcantly

slower and less power efﬁcient than hard decoders. Conse-

quently, invoking soft LDPC decoding too often can signiﬁ-

cantly impact the controller’s average read time. In order to

estimate the probability of requiring soft decoding, one must

look at the distribution of the number of errors, and not at BER

alone. For example, if the number of errors per codeword is

uniformly distributed between 40 and 60 and the hard decoder

can correct 75 errors, soft decoding will never be needed.

However, if the number of errors is uniformly distributed

between 0 and 100 (same BER), soft decoding will be required

to decode 25% of the reads. Section IV-A addresses this topic.

The error-correction capability of a soft decoder depends

heavily on the quality of the soft information at its input.

It is always possible to increase such quality by performing


Failure rate | pe = 0.008 | pe = 0.01 | pe = 0.012
α = 23       |   0.05     |   0.28    |   0.62
α = 25       |   0.016    |   0.15    |   0.46
α = 27       |   0.004    |   0.07    |   0.31

TABLE I
FAILURE RATE FOR AN N = 2048 BCH CODE AS A FUNCTION OF PROBABILITY OF BIT ERROR pe AND CORRECTION CAPABILITY α.

additional reads, but this decreases read throughput. Sec-

tion IV-B shows how the Progressive Read Algorithm from

the previous section can be modiﬁed to provide high quality

soft information.

A. Distribution of the number of errors

Let $N$ be the number of bits in a codeword. Assuming that both levels are equally likely, the probability of error for any given bit, denoted $p_e$, is given in Eq. (5). Errors can be considered independent, hence the number of them in a codeword follows a binomial distribution with parameters $N$ and $p_e$. Since $N$ is usually large, it becomes convenient to approximate the binomial by a Gaussian distribution with mean $N p_e$ and variance $N p_e(1-p_e)$, or by a Poisson distribution with parameter $N p_e$ when $N p_e$ is small.

Under the Gaussian approximation paradigm, a codeword fails to decode with probability $Q\!\left(\frac{\alpha - N p_e}{\sqrt{N p_e(1-p_e)}}\right)$, where $\alpha$ denotes the number of bit errors that can be corrected. Table I shows that a small change in the value of $\alpha$ may significantly increase the frequency with which a stronger decoder is needed. This has a direct impact on the average power consumption of the controller. The distribution of bit errors can thus be used to judiciously choose a value of $\alpha$ in order to meet a power constraint.
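The Gaussian approximation above reproduces the entries of Table I directly; a minimal sketch:

```python
from statistics import NormalDist

def failure_rate(N, pe, alpha):
    """Gaussian approximation to P(#errors > alpha) for N independent
    bit errors of probability pe (the expression behind Table I)."""
    mean = N * pe
    std = (N * pe * (1.0 - pe)) ** 0.5
    return 1.0 - NormalDist().cdf((alpha - mean) / std)

# Reproduce two entries of Table I (N = 2048 BCH code).
print(round(failure_rate(2048, 0.008, 23), 2))   # ≈ 0.05
print(round(failure_rate(2048, 0.012, 27), 2))   # ≈ 0.31
```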

B. Obtaining soft inputs

After performing $M$ reads on a page, each cell can be classified as falling into one of the $M+1$ intervals between the read thresholds. The problem of reliably storing information

on the ﬂash is therefore equivalent to the problem of reliable

transmission over a discrete memoryless channel (DMC), such

as the one in Fig. 4. Channel inputs represent the levels to

which the cells are written, outputs represent read intervals,

and channel transition probabilities specify how likely it is

for cells programmed to a speciﬁc level to be found in each

interval at read time.

[Fig. 4 here: a DMC with inputs 1 and 2 (the programmed levels), outputs 1 through 5 (the read intervals), and transition probabilities $P_{ij}$.]

Fig. 4. DMC channel equivalent to Flash read channel with four reads.

It is well known that the capacity of a channel is given by the

maximum mutual information between the input and output

over all input distributions (codebooks) [36]. In practice,

however, the code must be chosen at write time when the

channel is still unknown, making it impossible to adapt the

input distribution to the channel. Although some asymmetric

codes have been proposed (e.g. [15], [24], [37]), channel

inputs are equiprobable for most practical codes. The mutual

information between the input and the output is then given by

$$I(X;Y) = \frac{1}{2}\sum_{j=1}^{M+1} \Big[\, p_{1j}\log(p_{1j}) + p_{2j}\log(p_{2j}) - (p_{1j}+p_{2j})\log\Big(\frac{p_{1j}+p_{2j}}{2}\Big) \Big], \qquad (13)$$

where $p_{ij}$, $i = 1,2$, $j = 1,\ldots,M+1$, are the channel transition
probabilities. For Gaussian noise, these transition probabilities
can be found as

$$p_{ij} = Q\Big(\frac{\mu_i - t_j}{\sigma_i}\Big) - Q\Big(\frac{\mu_i - t_{j-1}}{\sigma_i}\Big), \qquad (14)$$

where $t_0 = -\infty$ and $t_{M+1} = \infty$.
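Equations (13) and (14) translate directly into code. The sketch below is an illustration, with the level parameters and thresholds borrowed from the numerical examples of Section VII; it builds the transition matrix for two Gaussian levels and four reads and evaluates the resulting mutual information in bits per cell.

```python
import math

def q(x):
    """Gaussian tail probability Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def transition_probs(mu, sigma, thresholds):
    """Eq. (14): p_ij = Q((mu_i - t_j)/sigma_i) - Q((mu_i - t_{j-1})/sigma_i),
    with t_0 = -inf and t_{M+1} = +inf appended to the read thresholds."""
    t = [-math.inf] + sorted(thresholds) + [math.inf]
    return [[q((m - t[j]) / s) - q((m - t[j - 1]) / s)
             for j in range(1, len(t))]
            for m, s in zip(mu, sigma)]

def mutual_information(P):
    """Eq. (13): mutual information of a binary-input DMC with
    equiprobable inputs, in bits per cell."""
    def xlogx(p):
        return p * math.log2(p) if p > 0 else 0.0
    I = 0.0
    for p1, p2 in zip(P[0], P[1]):
        I += xlogx(p1) + xlogx(p2)
        if p1 + p2 > 0:
            I -= (p1 + p2) * math.log2((p1 + p2) / 2)
    return 0.5 * I

# Example: levels at mu = (1, 2) with the S2 thresholds of Section VII.
P = transition_probs([1.0, 2.0], [0.12, 0.22], [1.2, 1.35, 1.45, 1.6])
I = mutual_information(P)
```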

The inputs to a soft decoder are given in the form of log-likelihood
ratios (LLR). The LLR value associated with a read
interval $k$ is defined as $\mathrm{LLR}_k = \log(p_{1k}/p_{2k})$. When the mean
and variance are known, it is possible to obtain good LLR
values by reading at the locations that maximize $I(X;Y)$ [11],
which tend to lie in the so-called uncertainty region, where
both pdfs are comparable. However, the mean and variance
are generally not known and need to be estimated. Section III
provided some guidelines on how to choose read thresholds
that yield accurate estimates, but those reads tend to produce
poor LLR values. Hence, there are two opposing trends:
spreading out the reads over a wide range of voltage values
yields more accurate mean and variance estimates but degrades
the performance of the soft decoder, while concentrating the
reads in the uncertainty region provides better LLR values but
might yield inaccurate estimates which in turn undermine the
soft decoding.
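The mapping from estimated transition probabilities to per-interval LLRs is itself a one-line computation. The sketch below adds a clipping parameter, a choice of ours rather than of the text, so that intervals with an empty estimate do not produce infinite LLRs.

```python
import math

def llr_per_interval(p1, p2, clip=20.0):
    """LLR_k = log(p_1k / p_2k) for each read interval, computed from the
    (estimated) transition probabilities of the two levels. Intervals where
    one estimate is zero are clipped instead of mapped to +/- infinity."""
    llrs = []
    for a, b in zip(p1, p2):
        if a <= 0.0 and b <= 0.0:
            llrs.append(0.0)        # empty interval: no information
        elif a <= 0.0:
            llrs.append(-clip)
        elif b <= 0.0:
            llrs.append(clip)
        else:
            llrs.append(max(-clip, min(clip, math.log(a / b))))
    return llrs

# Intervals where level 1 dominates get positive LLRs, intervals where
# level 2 dominates get negative LLRs, balanced intervals get LLRs near 0.
llrs = llr_per_interval([0.90, 0.08, 0.02], [0.02, 0.08, 0.90])
```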

Some ﬂash manufacturers are already incorporating soft

read commands that return 3 or 4 bits of information for each

cell, but the thresholds for those reads are often pre-speciﬁed

and kept constant throughout the lifetime of the device.

Furthermore, most controller manufacturers use a pre-deﬁned

mapping of read intervals to LLR values regardless of the

result of the reads. We propose adjusting the read thresholds

and LLR values adaptively to ﬁt our channel estimates.

Our goal is to ﬁnd the read locations that maximize the

probability of successful decoding when levels are equiproba-

ble and the decoding is done based on the estimated transition

probabilities. With this goal in mind, Section IV-C will derive

a bound for the (symmetric and mismatched) channel capacity

in this scenario and Section V will show how to choose the

read thresholds so as to maximize this bound. The error-free

coding rate speciﬁed by the bound will not be achievable in

practice due to ﬁnite code length, limited computational power,

etc., but the BER at the output of a decoder is closely related

to the capacity of the channel [38], [39]. The read thresholds

that maximize the capacity of the channel are generally the

same ones that minimize the BER, in practice.


C. Bound for maximum transmission rate

Shannon’s channel coding theorem states that all transmis-

sion rates below the channel capacity are achievable when the

channel is perfectly known to the decoder; unfortunately this

is not the case in practice. The channel transition probabilities

can be estimated by substituting the noise means and variances
$\hat\mu_1, \hat\mu_2, \hat\sigma_1, \hat\sigma_2$ into Eq. (14), but these estimates, denoted $\hat{p}_{ij}$,
$i = 1,2$, $j = 1,\ldots,5$, are inaccurate. The decoder is therefore
not perfectly matched to the channel.

The subject of mismatched decoding has been of interest

since the 1970’s. The most notable early works are by Hui [40]

and Csiszár and Körner [41], who provided bounds on the

maximum transmission rates under several different condi-

tions. Merhav et al. [42] related those results to the concept of

generalized mutual information and, more recently, Scarlett et

al. [39] found bounds and error exponents for the ﬁnite code

length case. It is beyond the scope of this paper to perform

a detailed analysis of the mismatched capacity of a DMC

channel with symmetric inputs; the interested reader can refer

to the above references as well as [43]–[46]. Instead, we will

derive a simpliﬁed lower bound for the capacity of this channel

in the same scenario that has been considered throughout the

paper.

Theorem 1. The maximum achievable rate of transmission
with vanishing probability of error over a Discrete Memoryless
Channel with equiprobable binary inputs, output alphabet $\mathcal{Y}$,
transition probabilities $p_{ij}$, $i = 1,2$, $j = 1,\ldots,|\mathcal{Y}|$, and
maximum likelihood decoding according to a different set of
transition probabilities $\hat{p}_{ij}$, $i = 1,2$, $j = 1,\ldots,|\mathcal{Y}|$, is lower
bounded by

$$C_{P,\hat{P}} = \frac{1}{2}\sum_{j=1}^{|\mathcal{Y}|} \Big[\, p_{1j}\log(\hat{p}_{1j}) + p_{2j}\log(\hat{p}_{2j}) - (p_{1j}+p_{2j})\log\Big(\frac{\hat{p}_{1j}+\hat{p}_{2j}}{2}\Big) \Big]. \qquad (15)$$

Proof: Provided in the Appendix.

It is worth noting that $C_{P,\hat{P}}$ is equal to the mutual information
given in Eq. (13) when the estimates are exact, and
decreases as the estimates become less accurate. In fact, the
probability of reading a given value $y \in \mathcal{Y}$ can be measured
directly as the fraction of cells mapped to the corresponding
interval, so it is usually the case that $\hat{p}_{1k} + \hat{p}_{2k} = p_{1k} + p_{2k}$.
The bound then becomes $C_{P,\hat{P}} = I(X;Y) - D(P\|\hat{P})$,
where $I(X;Y)$ is the symmetric capacity of the channel with
matched ML decoding and $D(P\|\hat{P})$ is the relative entropy
(also known as Kullback-Leibler distance) between the exact
and the estimated transition probabilities:

$$D(P\|\hat{P}) = \frac{1}{2}\sum_{j=1}^{|\mathcal{Y}|} \Big[\, p_{1j}\log\frac{p_{1j}}{\hat{p}_{1j}} + p_{2j}\log\frac{p_{2j}}{\hat{p}_{2j}} \Big]. \qquad (16)$$
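The bound of Eqs. (15)-(16) can be evaluated directly from the true and estimated transition matrices. The following sketch implements Eq. (15); it assumes, consistently with the discussion above, that the estimates are strictly positive wherever the true probabilities are.

```python
import math

def capacity_bound(P, P_hat):
    """Eq. (15): lower bound C_{P,Phat} on the achievable rate with ML
    decoding matched to the estimated transitions P_hat rather than the
    true transitions P (both given as 2 x |Y| row-stochastic lists).
    Assumes P_hat is strictly positive wherever P is."""
    C = 0.0
    for j in range(len(P[0])):
        p1, p2 = P[0][j], P[1][j]
        q1, q2 = P_hat[0][j], P_hat[1][j]
        if p1 > 0:
            C += p1 * math.log2(q1)
        if p2 > 0:
            C += p2 * math.log2(q2)
        if p1 + p2 > 0:
            C -= (p1 + p2) * math.log2((q1 + q2) / 2)
    return 0.5 * C

# Matched decoding recovers Eq. (13); a mismatched estimate lowers the bound.
P = [[0.9, 0.1], [0.1, 0.9]]
matched = capacity_bound(P, P)            # equals 1 - H(0.1) for this channel
mismatched = capacity_bound(P, [[0.8, 0.2], [0.2, 0.8]])
```

Since the output marginals of the two matrices coincide column by column, the gap between the two values above is exactly the relative entropy of Eq. (16).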

In this case $C_{P,\hat{P}}$ is a concave function of the transition probabilities
$(p_{ij}, \hat{p}_{ij})$, $i = 1,2$, $j = 1,\ldots,|\mathcal{Y}|$, since the relative
entropy is convex and the mutual information is concave [36].
The bound attains its maximum when the decoder is matched
to the channel (i.e., $p_{ij} = \hat{p}_{ij}\ \forall i,j$) and the read thresholds
are chosen so as to maximize the mutual information between
$X$ and $Y$, but that solution is not feasible for our problem.

In practice, both the capacity of the underlying channel

and the accuracy of the estimates at the decoder depend on

the location of the read thresholds and cannot be maximized

simultaneously. Finding the read thresholds $t_1, t_2, t_3$, and $t_4$
that maximize $C_{P,\hat{P}}$ is not straightforward, but it can be

done numerically. Section V describes a dynamic program-

ming algorithm for choosing each read threshold based on

prior information about the noise and the result of previous

reads.

V. OPTIMIZING READ THRESHOLDS

In most practical cases, the ﬂash controller has prior infor-

mation about the voltage distributions, based on the number

of PE cycles that the page has endured, its position within the

block, etc. This prior information is generally not enough to

produce accurate noise estimates, but it can be used to improve

the choice of read thresholds. We wish to determine a policy

to choose the optimal read thresholds sequentially, given the

prior information about the voltage distributions and the results

in previous reads.

This section proposes a dynamic programming framework
to find the read thresholds that maximize the expected value
of a user-defined reward function. If the goal is to minimize
the BER at the estimated threshold $\hat{t}^\star$, as in Section III, an
appropriate reward would be $1 - \mathrm{BER}(\hat{t}^\star)$. If the goal is to
maximize the channel capacity, the reward could be chosen to
be $I(X;Y) - D(P\|\hat{P})$, as shown in Section IV-C.

Let $\mathbf{x} = (\mu_1, \mu_2, \sigma_1, \sigma_2)$ and $\mathbf{r}_i = (t_i, y_i)$, $i = 1,\ldots,4$, be
vector random variables, so as to simplify the notation. If the
read noise distribution $f_n$ is known, the prior distribution for
$\mathbf{x}$ can be updated based on the result of each read $\mathbf{r}_i$ using
Bayes' rule and Eq. (1):

$$f_{\mathbf{x}|\mathbf{r}_1,\ldots,\mathbf{r}_i} = K \cdot f_{\mathbf{x}|\mathbf{r}_1,\ldots,\mathbf{r}_{i-1}} \cdot f_{y_i|\mathbf{x},t_i} = K \cdot f_{\mathbf{x}|\mathbf{r}_1,\ldots,\mathbf{r}_{i-1}} \cdot f_n\Big( y_i - \frac{1}{2}\Big[ Q\Big(\frac{\mu_1 - t_i}{\sigma_1}\Big) + Q\Big(\frac{\mu_2 - t_i}{\sigma_2}\Big) \Big] \Big), \qquad (17)$$

where $K$ is a normalization constant. Furthermore, let
$R(\mathbf{r}_1,\mathbf{r}_2,\mathbf{r}_3,\mathbf{r}_4)$ denote the expected reward associated with
the reads $\mathbf{r}_1,\ldots,\mathbf{r}_4$, after updating the prior $f_{\mathbf{x}}$ accordingly. In
the following, we will use $R$ to denote this function, omitting
the arguments for the sake of simplicity.
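A minimal grid-based implementation of the update in Eq. (17) is sketched below. Following the setup of Section VII, it assumes the read noise $n_y$ is uniform on $[-0.02, 0.02]$, so the likelihood reduces to an indicator function; the two candidate states in the example mirror the fresh and worn-out pages of that section, and the helper names are ours.

```python
import math

def q(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def update_posterior(grid, prior, t_i, y_i, noise_halfwidth=0.02):
    """One step of Eq. (17) over a discrete grid of candidate states
    x = (mu1, mu2, sigma1, sigma2). With read noise n_y uniform on
    [-noise_halfwidth, +noise_halfwidth], the likelihood of observing
    the fraction y_i at threshold t_i is an indicator function."""
    posterior = []
    for (mu1, mu2, s1, s2), p in zip(grid, prior):
        # predicted fraction of cells read below t_i for this state
        y_pred = 0.5 * (q((mu1 - t_i) / s1) + q((mu2 - t_i) / s2))
        posterior.append(p if abs(y_i - y_pred) <= noise_halfwidth else 0.0)
    total = sum(posterior)
    return [p / total for p in posterior] if total > 0 else list(prior)

# Two candidate states mimicking the fresh and worn-out pages of Sec. VII;
# a read at t = 1.07 returning y = 0.36 singles out the fresh page.
grid = [(1.0, 2.0, 0.12, 0.22), (1.0, 2.0, 0.18, 0.32)]
post = update_posterior(grid, [0.5, 0.5], 1.07, 0.36)
```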

Choosing the fourth read threshold $t_4$ after the first three
reads $\mathbf{r}_1,\ldots,\mathbf{r}_3$ is relatively straightforward: $t_4$ should be
chosen so as to maximize the expected reward, given the
results of the previous three reads. Formally,

$$t_4^\star = \arg\max_{t_4} E\{R \,|\, \mathbf{r}_1,\ldots,\mathbf{r}_3, t_4\}, \qquad (18)$$

where the expectation is taken with respect to $(y_4, \mathbf{x})$ by
factoring their joint distribution in a similar way to Eq. (17):
$f_{y_4,\mathbf{x}|\mathbf{r}_1,\ldots,\mathbf{r}_3} = f_{y_4|\mathbf{x},t_4} \cdot f_{\mathbf{x}|\mathbf{r}_1,\ldots,\mathbf{r}_3}$.

This defines a policy $\pi_4$ for the fourth read, and a value $V_3$
for each possible state after the first three reads:

$$\pi_4(\mathbf{r}_1,\ldots,\mathbf{r}_3) = t_4^\star \qquad (19)$$
$$V_3(\mathbf{r}_1,\ldots,\mathbf{r}_3) = E\{R \,|\, \mathbf{r}_1,\ldots,\mathbf{r}_3, t_4^\star\}. \qquad (20)$$


In practice, the read thresholds $t_i$ and samples $y_i$ can only
take a finite number of values, hence the number of feasible
arguments of these functions (states) is also finite. This number
can be fairly large, but it is only necessary to find the value
for a small number of them: those which have non-negligible
probability according to the prior $f_{\mathbf{x}}$ and value significantly
larger than 0. For example, states are invariant to permutations
of the reads, so they can always be reordered such that $t_1 <
t_2 < t_3$. Then, states which do not fulfill $y_1 < y_2 < y_3$ can be
ignored. If the number of states after discarding meaningless
ones is still too large, it is also possible to use approximations
for the policy and value functions [47], [48].

Equations (19) and (20) assign a value and a fourth read
threshold to each meaningful state after three reads. The same
idea, using a backward recursion, can be used to decide the
third read threshold and assign a value to each state after two
reads:

$$\pi_3(\mathbf{r}_1,\mathbf{r}_2) = \arg\max_{t_3} E\{V_3(\mathbf{r}_1,\ldots,\mathbf{r}_3) \,|\, \mathbf{r}_1,\mathbf{r}_2, t_3\} \qquad (21)$$
$$V_2(\mathbf{r}_1,\mathbf{r}_2) = \max_{t_3} E\{V_3(\mathbf{r}_1,\ldots,\mathbf{r}_3) \,|\, \mathbf{r}_1,\mathbf{r}_2, t_3\}, \qquad (22)$$

where the expectation is taken with respect to $(y_3, \mathbf{x})$. Similarly,
for the second read threshold,

$$\pi_2(\mathbf{r}_1) = \arg\max_{t_2} E\{V_2(\mathbf{r}_1,\mathbf{r}_2) \,|\, \mathbf{r}_1, t_2\} \qquad (23)$$
$$V_1(\mathbf{r}_1) = \max_{t_2} E\{V_2(\mathbf{r}_1,\mathbf{r}_2) \,|\, \mathbf{r}_1, t_2\}, \qquad (24)$$

where the expectation is taken with respect to $(y_2, \mathbf{x})$. Finally,
the optimal value for the first read threshold is

$$t_1^\star = \arg\max_{t_1} E\{V_1(t_1, y_1) \,|\, t_1\}.$$

These policies can be computed ofﬂine and then pro-

grammed in the memory controller. Typical controllers have

multiple modes tailored towards different conditions in terms

of number of PE cycles, whether an upper or lower page is

being read, etc. Each of these modes would have its own

prior distributions for (µ1, µ2, σ1, σ2), and would result in a

different policy determining where to perform each read based

on the previous results. Each policy can be stored as a partition

of the feasible reads, and value functions can be discarded,

so memory requirements are very reasonable. Section VII

presents an example illustrating this scheme.

As in Section III-A, the number of reads can be reduced
when some of the noise parameters are known or prior
information is available. The same backward recursion could
be used to optimize the choice of thresholds, but with fewer
steps.
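The backward recursion of Eqs. (18)-(24) can be expressed generically as follows. This is a sketch under interface assumptions of our own: the caller supplies the terminal reward (e.g., $1 - \mathrm{BER}$ or the capacity bound of Section IV-C) and the predictive probability of each read result, both derived from the prior and Eq. (17); memoization plays the role of the tabulated value functions.

```python
from functools import lru_cache

def backward_recursion(t_grid, y_grid, reward, obs_prob, n_reads):
    """Backward recursion of Section V in sketch form.
    A state is the tuple of past reads ((t1, y1), ..., (tk, yk));
    reward(state) is the terminal reward after n_reads reads and
    obs_prob(state, t, y) the predictive probability of result y
    when reading at threshold t, given the posterior for that state."""

    @lru_cache(maxsize=None)
    def value(state):
        # V_k(state): expected terminal reward under the optimal policy
        if len(state) == n_reads:
            return reward(state)
        return max(
            sum(obs_prob(state, t, y) * value(state + ((t, y),))
                for y in y_grid)
            for t in t_grid)

    def policy(state):
        # pi_{k+1}(state): threshold maximizing the expected value
        return max(
            t_grid,
            key=lambda t: sum(obs_prob(state, t, y) * value(state + ((t, y),))
                              for y in y_grid))

    return value, policy

# Toy example: two thresholds, two read outcomes, two reads; the
# (hypothetical) reward prefers policies whose first threshold is 0.
val, pol = backward_recursion(
    t_grid=[0, 1], y_grid=[0, 1],
    reward=lambda s: 1.0 if s[0][0] == 0 else 0.0,
    obs_prob=lambda s, t, y: 0.5,
    n_reads=2)
```

As in the text, only the policy (a partition of the feasible reads) needs to be retained after the offline computation; the value function can be discarded.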

VI. EXTENSIONS

Most of the paper has assumed that cells can only store

two voltage levels, with their voltages following Gaussian

distributions. This framework was chosen because it is the

most widely used in the literature, but the method described

can easily be extended to memories with more than two levels

and non-Gaussian noise distributions.

Section II explained how each wordline in a MLC (two

bits per cell, four levels) or TLC (three bits per cell, eight

levels) memory is usually divided into two or three pages

which are read independently as if the memory was SLC.

In that case, the proposed method can be applied without

any modiﬁcations. However, if the controller is capable of

simultaneously processing more than two levels per cell, it

is possible to accelerate the noise estimation by reducing the

number of reads. MLC and TLC memories generally have

dedicated hardware that performs multiple reads in the ranges

required to read the upper pages and returns a single binary

value. For example, reading the upper page of a TLC memory

with the structure illustrated in Fig. 1 requires four reads with

thresholds (A, C, E, G) but cells between A and C would

be indistinguishable from cells between E and G; all of them

would be read as 0. However, one additional read of the lower

page (D threshold) would allow the controller to tell them

apart.

Performing four reads $(t_1,\ldots,t_4)$ on the upper page of

a TLC memory would entail comparing the cell voltages

against 16 different thresholds but obtaining only four bits of

information for each cell. The means and variances in Eqs. (3)-

(4) would correspond to mixtures of all the levels storing

the same bit value, assumed to be approximately Gaussian.

The same process would then be repeated for the middle and

lower page. A better approach, albeit more computationally

intensive, would be to combine reads from all three pages

and estimate each level independently. Performing one single

read of the lower page (threshold D), two of the middle page

(each involving two comparisons, with thresholds B and F) and

three of the upper page (each involving four comparisons, with

thresholds A, C, E, G) would theoretically provide more than

enough data to estimate the noise in all eight Gaussian levels.

A similar process can be used for MLC memories performing,

for example, two reads of the lower page and three of the upper

page.

Hence, ﬁve page reads are enough to estimate the noise

mean and variance in all 4 levels of an MLC memory and

6 page reads are enough for the 8 levels in a TLC memory.

Other choices for the pages to be read are also possible, but it

is useful to consider that lower pages have smaller probabilities

of error, so they often can be successfully decoded with fewer

reads. Additional reads could provide more precise estimates

and better LLR values for LDPC decoding.

There are papers suggesting that a Gaussian noise model

might not be accurate for some memories [49]. The proposed

scheme can also be extended to other noise distributions,

as long as they can be characterized by a small number

of parameters. Instead of the Q-function in Eq. (2), the

estimation should use the cumulative distribution function (cdf)

for the corresponding noise distribution. For example, if the

voltage distributions followed a Laplace instead of Gaussian

distribution, Eq. (1) would become

$$y_i = \frac{1}{2} - \frac{1}{4}e^{-\frac{t_i-\mu_1}{b_1}} + \frac{1}{4}e^{-\frac{\mu_2-t_i}{b_2}} + n_{y_i}, \qquad (25)$$

for $\mu_1 \le t_i \le \mu_2$, and the estimator $\hat{b}_1$ of $b_1$ would become

$$\hat{b}_1 = \frac{t_2 - t_1}{\log(1 - 2y_1) - \log(1 - 2y_2)} \qquad (26)$$

when $t_1, t_2$ are significantly smaller than $\mu_2$. Similar formulas
can be found to estimate the other parameters.
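The estimator in Eq. (26) is easily checked on synthetic data. In the sketch below, the helper `f` generates noiseless read results from the reduced form of Eq. (25) with the second exponential neglected; the parameter values are illustrative.

```python
import math

def estimate_b1(t1, y1, t2, y2):
    """Eq. (26): estimate of the Laplace scale b1 from two reads
    (t1, y1) and (t2, y2) taken well below mu2, where Eq. (25)
    reduces to y ~ 1/2 - (1/4) exp(-(t - mu1)/b1)."""
    return (t2 - t1) / (math.log(1 - 2 * y1) - math.log(1 - 2 * y2))

# Hypothetical check: noiseless reads generated from the reduced model.
mu1, b1 = 1.0, 0.1
f = lambda t: 0.5 - 0.25 * math.exp(-(t - mu1) / b1)
b1_hat = estimate_b1(1.2, f(1.2), 1.4, f(1.4))   # recovers b1
```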


VII. NUMERICAL RESULTS

This section presents simulation results evaluating the performance
of the dynamic programming algorithm proposed in
Section V. Two scenarios will be considered, corresponding to
a fresh page with $\mathrm{BER}(t^\star) = 0.0015$ and a worn-out page with
$\mathrm{BER}(t^\star) = 0.025$. The mean voltage values for each level will
be the same in both scenarios, but the standard deviations will
differ. Specifically, $\mu_1 = 1$ and $\mu_2 = 2$ for both pages, but the
fresh page will be modeled using $\sigma_1 = 0.12$ and $\sigma_2 = 0.22$,
while the worn page will be modeled using $\sigma_1 = 0.18$
and $\sigma_2 = 0.32$. These values, however, are unknown to the
controller. The only information that it can use to choose the
read locations are uniform prior distributions on $\mu_1$, $\mu_2$, $\sigma_1$,
and $\sigma_2$, identical for both the fresh and the worn-out pages.
Specifically, $\mu_1$ is known to be in the interval $(0.75, 1.25)$, $\mu_2$
in $(1.8, 2.1)$, $\sigma_1$ in $(0.1, 0.24)$, and $\sigma_2$ in $(0.2, 0.36)$.

For each scenario, three different strategies for selecting the

read thresholds were evaluated. The ﬁrst strategy, S1, tries to

obtain accurate noise estimates by spreading out the reads. The

second strategy, S2, concentrates all of them on the uncertainty

region, attempting to attain highly informative LLR values.

Finally, the third strategy, S3, follows the optimal policy

obtained by the dynamic programming recursion proposed in

Section V, with $C_{P,\hat{P}}$ as the reward function. The three strategies

are illustrated in Fig. 5 and the results are summarized in

Table II, but before proceeding to their analysis we describe

the process employed to obtain S3.

The dynamic programming scheme assumed that read
thresholds were restricted to move in steps of 0.04, and
quantized all cdf measurements also in steps of 0.04 (making
the noise $n_y$ from Eq. (1) uniform between $-0.02$ and $0.02$).
Starting from these assumptions, Eqs. (19) and (20) were used
to find the optimal policy $\pi_4$ and expected value $V_3$ for all
meaningful combinations of $(t_1, y_1, t_2, y_2, t_3, y_3)$, which were
on the order of $10^6$ (very reasonable for offline computations).
The value function $V_3$ was then used in the backward recursion
to find the policies and values for the first three reads, as
explained in Section V. The optimal location for the first read,
in terms of maximum expected value for $I(X;Y) - D(P\|\hat{P})$
after all four reads, was found to be $t_1^\star = 1.07$. This read
resulted in $y_1 = 0.36$ for the fresh page and $y_1 = 0.33$ for
the worn page. The policy $\pi_2$ dictated that $t_2 = 0.83$ for
$y_1 \in (0.34, 0.38)$, and $t_2 = 1.63$ for $y_1 \in (0.3, 0.34)$, so those
were the next reads in each case. The third and fourth read
thresholds $t_3$ and $t_4$ were chosen similarly, according to the
corresponding policies.

Finally, as depicted in Fig. 5, the read thresholds were
• $S_1$: $t = (0.85, 1.15, 1.75, 2.125)$.
• $S_2$: $t = (1.2, 1.35, 1.45, 1.6)$.
• $S_3$ (fresh page): $t = (1.07, 0.83, 1.79, 1.31)$, resulting in $y = (0.36, 0.04, 0.58, 0.496)$, respectively.
• $S_3$ (worn page): $t = (1.07, 1.63, 1.19, 1.43)$, resulting in $y = (0.33, 0.56, 0.43, 0.51)$, respectively.

For the fresh page, the policy dictates that the ﬁrst three reads

should be performed well outside of the uncertainty region,

so as to obtain good estimates of the means and variances.

Then, the fourth read is performed as close as possible to
the BER-minimizing threshold.

TABLE II
TRADE-OFF BETWEEN BER AND LDPC FAILURE RATE

FRESH PAGE                                              S1      S2      S3
$|\hat\mu-\mu|/\mu$                                     0.004   0.182   0.012
$|\hat\sigma-\sigma|/\sigma$                            0.03    0.91    0.12
$|\hat{t}^\star-t^\star|/t^\star$                       0.01    0.07    0.02
$|\mathrm{BER}(\hat{t}^\star)-\mathrm{BER}(t^\star)|/\mathrm{BER}(t^\star)$   0.1   1.4   0.11
LDPC fail rate                                          1       0.15    0
Genie LDPC fail rate                                    1       0       0

OLD PAGE                                                S1      S2      S3
$|\hat\mu-\mu|/\mu$                                     0.005   0.053   0.021
$|\hat\sigma-\sigma|/\sigma$                            0.03    0.27    0.13
$|\hat{t}^\star-t^\star|/t^\star$                       0.006   0.015   0.011
$|\mathrm{BER}(\hat{t}^\star)-\mathrm{BER}(t^\star)|/\mathrm{BER}(t^\star)$   0.003   0.009   0.007
LDPC fail rate                                          1       0.19    0.05
Genie LDPC fail rate                                    1       0       0.01

Since the overlap between

both levels is very small, soft decoding would barely provide

any gain over hard decoding. Picking the ﬁrst three reads

for noise characterization regardless of their value towards

building LLRs seems indeed to be the best strategy. For the

worn-out page, the policy attempts to achieve a trade-off by

combining two reads away from the uncertainty region, good

for parameter estimation, with another two inside it to improve

the quality of the LLR values used for soft decoding.

[Figure: two panels, "Fresh page" and "Worn-out page", showing the cell-voltage pdfs (probability vs. voltage) with the read thresholds of strategies $S_1$, $S_2$, and $S_3$ marked on each.]

Fig. 5. Read thresholds for strategies $S_1$, $S_2$, and $S_3$ for a fresh and a
worn-out page.

Table II shows the relative error in our estimates and
sector failure rates averaged over 5000 simulation instances,
with read noise $n_{y_i}$, $i = 1,\ldots,4$, uniformly distributed
between $-0.02$ and $0.02$. The first three rows show the
relative estimation error of the mean, variance, and optimal
threshold. It can be observed that $S_1$ provides the lowest
estimation error, while $S_2$ produces clearly wrong estimates.
The estimates provided by $S_3$ are noisier than those provided
by $S_1$, but are still acceptable. The relative increase in BER
when reading at $\hat{t}^\star$ instead of at $t^\star$ is shown in the fourth row
of each table. It is worth noting that $\mathrm{BER}(\hat{t}^\star)$ does not


increase signiﬁcantly, even with inaccurate mean and variance

estimates. This validates the derivation in Section III-B.

Finally, the last two rows on each table show the failure

rate after 20 iterations of a min-sum LDPC decoder for two

different methods of obtaining soft information. The LDPC

code had 18% redundancy and codeword length equal to

35072 bits. The ﬁfth row corresponds to LLR values obtained

using the mean and variance estimates from the Progressive

Read Algorithm and the last row, labeled “Genie LDPC”,

corresponds to using the actual values instead of the estimated

ones. It can be observed that strategy S1, which provided very

accurate estimates, always fails in the LDPC decoding. This

is due to the wide range of cell voltages that fall between the

middle two reads, being assigned an LLR value close to 0.

The fact that the “Genie LDPC” performs better with S2than

with S3shows that the read locations chosen by the former

are better. However, S3provides lower failure rates in the

more realistic case where the means and variances need to

be estimated using the same reads used to produce the soft

information.

In summary, $S_3$ was found to be best from an LDPC
code point of view and $S_1$ from a pure BER-minimizing
perspective. $S_2$, as proposed in [11], is worse in both cases
unless the voltage distributions are known. When more than
four reads are allowed, all three schemes perform similarly.

After the ﬁrst four reads, all the strategies have relatively

good estimates for the optimal threshold. Subsequent reads

are located close to the optimal threshold, achieving small

BER. Decoding failure rates are then limited by the channel

capacity, rather than by the location of the reads.

VIII. CONCLUSION

NAND ﬂash controllers often require several re-reads using

different read thresholds to recover host data in the presence

of noise. In most cases, the controller tries to guess the noise

distribution based on the number of PE cycles and picks the

read thresholds based on that guess. However, unexpected

events such as excessive leakage or charge trapping can make

those thresholds suboptimal. This paper proposed algorithms

to reduce the total read time and sector failure rate by using a

limited number of re-reads to estimate the noise and improve

the read thresholds.

The overall scheme will work as follows. First, the con-

troller will generally have a prior estimation of what a good

read threshold might be. It will read at that threshold and

attempt a hard-decoding of the information. If the noise is

weak and the initial threshold was well chosen, this decoding

will succeed and no further processing will be needed. In

cases when this ﬁrst decoding fails, the controller will perform

additional reads to estimate the mean and/or variance of the

voltage values for each level. These estimates will in turn

be used to estimate the minimum achievable BER and the

corresponding optimal read threshold. The ﬂash controller

then decides whether to perform an additional read with this

threshold to attempt hard decoding again, or directly attempt a

more robust decoding of the information, for example LDPC,

leveraging the reads already performed to generate the soft

information.

The paper proposes using a dynamic programming back-

ward recursion to ﬁnd a policy for progressively picking the

read thresholds based on the prior information available and

the results from previous reads. This scheme will allow us

to ﬁnd the thresholds that optimize an arbitrary objective.

Controllers using hard decoding only (e.g., BCH) may wish to

ﬁnd the read threshold providing minimum BER, while those

employing soft decoding (e.g., LDPC) will prefer to maximize

the capacity of the resulting channel. The paper provides an

approximation for the (symmetric and mismatched) capacity

of the channel and presents simulations to illustrate the per-

formance of the proposed scheme in such scenarios.

REFERENCES

[1] B. Peleato and R. Agarwal, “Maximizing MLC NAND lifetime and

reliability in the presence of write noise,” in IEEE Int. Conf. on

Communications (ICC). IEEE, 2012, pp. 3752–3756.

[2] H. Zhou, A. Jiang, and J. Bruck, “Error-correcting schemes with

dynamic thresholds in nonvolatile memories,” in IEEE Int. Symp. on

Information Theory (ISIT). IEEE, 2011, pp. 2143–2147.

[3] B. Peleato, R. Agarwal, and J. Ciofﬁ, “Probabilistic graphical model

for ﬂash memory programming,” in IEEE Statistical Signal Processing

Workshop (SSP). IEEE, 2012, pp. 788–791.

[4] M. Asadi, X. Huang, A. Kavcic, and N. P. Santhanam, “Optimal detector

for multilevel NAND ﬂash memory channels with intercell interference,”

IEEE J. Sel. Areas Commun., vol. 32, no. 5, pp. 825–835, May 2014.

[5] M. Anholt, N. Sommer, R. Dar, U. Perlmutter, and T. Inbar, “Dual ECC

decoder,” Apr. 23 2013, US Patent 8,429,498.

[6] G. Dong, N. Xie, and T. Zhang, “On the use of soft-decision error-

correction codes in NAND flash memory,” IEEE Trans. Circuits Syst. I:

Reg. Papers, vol. 58, no. 2, pp. 429–439, Nov. 2011.

[7] F. Sala, R. Gabrys, and L. Dolecek, “Dynamic threshold schemes for

multi-level non-volatile memories,” IEEE Trans. Commun., vol. 61,

no. 7, pp. 2624–2634, Jul. 2013.

[8] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, “Error patterns in MLC

NAND ﬂash memory: Measurement, characterization, and analysis,” in

Proc. Conf. Design, Automation and Test in Europe. IEEE, Mar. 2012,

pp. 521–526.

[9] Q. Li, A. Jiang, and E. F. Haratsch, “Noise modeling and capacity

analysis for NAND ﬂash memories,” in IEEE Int. Symp. on Information

Theory (ISIT). IEEE, Jul. 2014, pp. 2262–2266.

[10] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, “Threshold voltage

distribution in MLC NAND ﬂash memory: Characterization, analysis,

and modeling,” in Proc. Conf. Design, Automation and Test in Europe.

EDA Consortium, Mar. 2013, pp. 1285–1290.

[11] J. Wang, T. Courtade, H. Shankar, and R. Wesel, “Soft information for

LDPC decoding in ﬂash: Mutual information optimized quantization,”

in IEEE Global Communications Conf. (GLOBECOM), 2011, pp. 5–9.

[12] Y. Cai, O. Mutlu, E. F. Haratsch, and K. Mai, “Program interference in

MLC NAND ﬂash memory: Characterization, modeling, and mitigation,”

in IEEE Int. Conf. on Computer Design (ICCD). IEEE, 2013, pp. 123–

130.

[13] R. Gabrys, F. Sala, and L. Dolecek, “Coding for unreliable ﬂash memory

cells,” IEEE Commun. Lett., vol. 18, no. 9, pp. 1491–1494, Jul. 2014.

[14] R. Gabrys, E. Yaakobi, and L. Dolecek, “Graded bit-error-correcting

codes with applications to ﬂash memory,” IEEE Trans. Inf. Theory,

vol. 59, no. 4, pp. 2315–2327, Apr. 2013.

[15] H. Zhou, A. Jiang, and J. Bruck, “Non-uniform codes for asymmetric

errors,” in IEEE Int. Symp. on Information Theory (ISIT) . IEEE, 2011.

[16] J. Berger, “A note on error detection codes for asymmetric channels,”

Inform. and Control, vol. 4, no. 1, pp. 68–73, 1961.

[17] A. Jiang, M. Schwartz, and J. Bruck, “Error-correcting codes for rank

modulation,” in IEEE Int. Symp. on Information Theory (ISIT) . IEEE,

2008, pp. 1736–1740.

[18] A. Jiang, R. Mateescu, M. Schwartz, and J. Bruck, “Rank modulation

for ﬂash memories,” IEEE Trans. on Inf. Theory, vol. 55, no. 6, pp.

2659–2673, June 2009.

[19] E. En Gad, A. Jiang, and J. Bruck, “Compressed encoding for rank

modulation,” in IEEE Int. Symp. on Information Theory Proc. (ISIT).

IEEE, Aug. 2011, pp. 884–888.

11

[20] Q. Li, “Compressed rank modulation,” in 50th Annu. Allerton Conf. on

Communication, Control, and Computing (Allerton). IEEE, Oct. 2012,

pp. 185–192.

[21] E. E. Gad, E. Yaakobi, A. Jiang, and J. Bruck, “Rank-modulation

rewriting codes for ﬂash memories,” in Proc. IEEE Int. Symp. on

Information Theory (ISIT). IEEE, Jul. 2013, pp. 704–708.

[22] M. Qin, A. A. Jiang, and P. H. Siegel, “Parallel programming of rank

modulation,” in Proc. IEEE Int. Symp. on Information Theory (ISIT).

IEEE, Jul. 2013, pp. 719–723.

[23] M. Qin, E. Yaakobi, and P. H. Siegel, “Constrained codes that mitigate

inter-cell interference in read/write cycles for ﬂash memories,” IEEE J.

Sel. Areas Commun., vol. 32, no. 5, pp. 836–846, May 2014.

[24] S. Kayser and P. H. Siegel, “Constructions for constant-weight ICI-free

codes,” in IEEE Int. Symp. on Information Theory (ISIT) . IEEE, Jul.

2014, pp. 1431–1435.

[25] R. Gabrys and L. Dolecek, “Constructions of nonbinary WOM codes

for multilevel ﬂash memories,” IEEE Trans. Inf. Theory, vol. 61, no. 4,

pp. 1905–1919, Apr. 2015.

[26] E. Yaakobi, P. H. Siegel, A. Vardy, and J. K. Wolf, “Multiple error-

correcting WOM-codes,” IEEE Trans. on Inf. Theory, vol. 58, no. 4,

pp. 2220–2230, Apr. 2012.

[27] A. Bhatia, M. Qin, A. R. Iyengar, B. M. Kurkoski, and P. H. Siegel,

“Lattice-based WOM codes for multilevel ﬂash memories,” IEEE J. Sel.

Areas Commun., vol. 32, no. 5, pp. 933–945, May 2014.

[28] Q. Li and A. Jiang, “Polar codes are optimal for write-efﬁcient mem-

ories.” in 51th Annu. Allerton Conf. on Communication, Control, and

Computing (Allerton), 2013, pp. 660–667.

[29] N. Papandreou, T. Parnell, H. Pozidis, T. Mittelholzer, E. Eleftheriou,

C. Camp, T. Grifﬁn, G. Tressler, and A. Walls, “Using adaptive read

voltage thresholds to enhance the reliability of MLC NAND ﬂash

memory systems,” in Proc. 24th Great Lakes Symp. on VLSI. ACM,

2014, pp. 151–156.

[30] D.-H. Lee and W. Sung, “Estimation of NAND ﬂash memory threshold

voltage distribution for optimum soft-decision error correction,” IEEE

Trans. Signal Process., vol. 61, no. 2, pp. 440–449, Jan. 2013.

[31] B. Peleato, R. Agarwal, J. Ciofﬁ, M. Qin, and P. H. Siegel, “To-

wards minimizing read time for NAND ﬂash,” in IEEE Global

Communications Conf. (GLOBECOM). IEEE, 2012, pp. 3219–3224.

[32] G. Dong, S. Li, and T. Zhang, “Using data postcompensation and

predistortion to tolerate cell-to-cell interference in MLC NAND ﬂash

memory,” IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 57, no. 10, pp.

2718–2728, Oct. 2010.

[33] A. Torsi, Y. Zhao, H. Liu, T. Tanzawa, A. Goda, P. Kalavade, and

K. Parat, “A program disturb model and channel leakage current study

for sub-20nm NAND flash cells,” IEEE Trans. Electron Devices, vol. 58,

no. 1, pp. 11–16, Jan. 2011.

[34] E. Yaakobi, J. Ma, L. Grupp, P. Siegel, S. Swanson, and J. Wolf,

“Error characterization and coding schemes for ﬂash memories,” in

GLOBECOM Workshops (GC Wkshps). IEEE, 2010, pp. 1856–1860.

[35] D. Nguyen, B. Vasic, and M. Marcellin, “Two-bit bit ﬂipping decoding

of LDPC codes,” in IEEE Int. Symp. on Information Theory (ISIT).

IEEE, 2011, pp. 1995–1999.

[36] T. M. Cover and J. A. Thomas, Elements of Information Theory. John

Wiley & Sons, 2012.

[37] A. Berman and Y. Birk, “Constrained ﬂash memory programming,” in

IEEE Int. Symp. on Information Theory (ISIT). IEEE, 2011, pp. 2128–

2132.

[38] Y. Polyanskiy, H. V. Poor, and S. Verdú, “Channel coding rate in the

ﬁnite blocklength regime,” IEEE Trans. Inf. Theory, vol. 56, no. 5, pp.

2307–2359, May 2010.

[39] J. Scarlett, A. Martinez, and A. G. i Fàbregas, “Mismatched decoding:
Finite-length bounds, error exponents and approximations,” submitted
for publication. [Online: http://arxiv.org/abs/1303.6166], 2013.

[40] J. Y. N. Hui, “Fundamental issues of multiple accessing,” Ph.D.

dissertation, Mass. Inst. Technol., 1983.

[41] I. Csiszár and J. Körner, “Graph decomposition: A new key to coding

theorems,” IEEE Trans. Inf. Theory, vol. 27, no. 1, pp. 5–12, Jan. 1981.

[42] N. Merhav, G. Kaplan, A. Lapidoth, and S. Shamai Shitz, “On informa-

tion rates for mismatched decoders,” IEEE Trans. Inf. Theory, vol. 40,

no. 6, pp. 1953–1967, Nov. 1994.

[43] A. Lapidoth, P. Narayan et al., “Reliable communication under channel

uncertainty,” IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2148–2177,

Oct. 1998.

[44] V. B. Balakirsky, “A converse coding theorem for mismatched decoding

at the output of binary-input memoryless channels,” IEEE Trans. Inf.

Theory, vol. 41, no. 6, pp. 1889–1902, Nov. 1995.

[45] M. Alsan and E. Telatar, “Polarization as a novel architecture to

boost the classical mismatched capacity of B-DMCs,” arXiv preprint

arXiv:1401.6097, 2014.

[46] E. Arikan, “Channel polarization: A method for constructing capacity-

achieving codes for symmetric binary-input memoryless channels,” IEEE

Trans. Inf. Theory, vol. 55, no. 7, pp. 3051–3073, Jul. 2009.

[47] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena

Scientiﬁc Belmont, MA, 1995, vol. 1, no. 2.

[48] Y. Wang, B. O’Donoghue, and S. Boyd, “Approximate dynamic pro-

gramming via iterated Bellman inequalities,” Int. Journal of Robust and

Nonlinear Control, 2014.

[49] T. Parnell, N. Papandreou, T. Mittelholzer, and H. Pozidis, “Modelling

of the threshold voltage distributions of sub-20nm nand ﬂash memory,”

in IEEE Global Communications Conf. (GLOBECOM). IEEE, 2014,

pp. 2351–2356.

Borja Peleato (S’12-M’13) is a Visiting Assistant Professor in the Electrical

and Computer Engineering department at Purdue University. He received

his B.S. degrees in telecommunications and mathematics from Universitat

Politecnica de Catalunya, Barcelona, Spain, in 2007, and his M.S. and Ph.D.

degrees in electrical engineering from Stanford University in 2009 and 2013,

respectively. He was a visiting student at the Massachusetts Institute of

Technology in 2006, and a Senior Flash Channel Architect with Proton

Digital Systems in 2013. His research interests include signal processing and

coding for non-volatile storage, convex optimization, and communications.

Dr. Peleato received a "La Caixa" Graduate Fellowship in 2006.

Rajiv Agarwal completed his B.Tech. degree in electrical engineering from I.I.T. Kanpur in 2003, and his M.S. and Ph.D. degrees from Stanford University in 2005 and ??? respectively. He has worked at ??? and is currently at ???. His academic interests include ???.

John M. Ciofﬁ (S’77-M’78-SM’90-F’96) received the B.S. degree in electri-

cal engineering from the University of Illinois at Urbana-Champaign, Urbana,

IL, USA, in 1978 and the Ph.D. degree in electrical engineering from Stanford

University, Stanford, CA, USA, in 1984. He was with Bell Laboratories

in 1978–1984 and IBM Research in 1984–1986. Since 1986, he has been

with Stanford University, where he was a Professor in electrical engineering

and is currently an Emeritus Professor. He founded Amati Communications

Corporation in 1991 (purchased by TI in 1997) and was an Ofﬁcer/Director

from 1991 to 1997. He is also an Adjunct Professor of computing/information

technology with King Abdulaziz University, Jeddah, Saudi Arabia. Currently,

he is also with the Board of Directors of ASSIA (Chairman and CEO),

Alto Beam, and the Marconi Foundation. He has published more than 600

papers and holds more than 100 patents, of which many are heavily licensed,

including key necessary patents for the international standards in ADSL,

VDSL, DSM, and WiMAX. His speciﬁc interests are in the area of high-

performance digital transmission. Prof. Ciofﬁ was the recipient of the IEEE’s

Alexander Graham Bell and Millennium Medals (2010 and 2000); Member

Internet Hall of Fame (2014); Economist Magazine 2010 Innovations Award;

International Marconi Fellow (2006); Member, U.S. National and U.K. Royal

Academies of Engineering (2001, 2009); IEEE Kobayashi and Kirchmayer

Awards (2001 and 2014); IEEE Fellow (1996); IEE J. J. Thomson Medal (2000);

1991 and 2007 IEEE Communications Magazine Best Paper; and numerous

conference best paper awards.

Minghai Qin (S’11) is a Research Principal Engineer in Storage Architecture

at HGST. He received the B.E. degree in electronic and electrical engineering

from Tsinghua University, Beijing, China, in 2009, and the Ph.D. degree in

electrical engineering from the University of California, San Diego, in 2014.

He was also associated with the Center for Magnetic Recording Research

(CMRR) from 2010 to 2014. His research interests include coding and signal

processing for non-volatile memories, polar codes implementation, and coding

for distributed storage.


Paul H. Siegel (M’82-SM’90-F’97) received the S.B. and Ph.D. degrees

in mathematics from the Massachusetts Institute of Technology (MIT),

Cambridge, in 1975 and 1979, respectively. He held a Chaim Weizmann

Postdoctoral Fellowship at the Courant Institute, New York University. He

was with the IBM Research Division in San Jose, CA, from 1980 to 1995.

He joined the faculty at the University of California, San Diego in July

1995, where he is currently Professor of Electrical and Computer Engineering

in the Jacobs School of Engineering. He is afﬁliated with the Center for

Magnetic Recording Research where he holds an endowed chair and served

as Director from 2000 to 2011. His primary research interests lie in the areas of

information theory and communications, particularly coding and modulation

techniques, with applications to digital data storage and transmission. Prof.

Siegel was a member of the Board of Governors of the IEEE Information

Theory Society from 1991 to 1996 and from 2009 to 2011. He was re-elected

for another 3-year term in 2012. He served as Co-Guest Editor of the May

1991 Special Issue on ”Coding for Storage Devices” of the IEEE Transactions

on Information Theory. He served the same Transactions as Associate Editor

for Coding Techniques from 1992 to 1995, and as Editor-in-Chief from July

2001 to July 2004. He was also Co-Guest Editor of the May/September 2001

two-part issue on "The Turbo Principle: From Theory to Practice" of the IEEE

Journal on Selected Areas in Communications. Prof. Siegel was co-recipient,

with R. Karabed, of the 1992 IEEE Information Theory Society Paper Award

and shared the 1993 IEEE Communications Society Leonard G. Abraham

Prize Paper Award with B. H. Marcus and J.K. Wolf. With J. B. Soriaga and

H. D. Pﬁster, he received the 2007 Best Paper Award in Signal Processing

and Coding for Data Storage from the Data Storage Technical Committee of

the IEEE Communications Society. He holds several patents in the area of

coding and detection, and was named a Master Inventor at IBM Research

in 1994. He is an IEEE Fellow and a member of the National Academy of

Engineering.

IX. APPENDIX

Proof: (Theorem 1) The proof is very similar to that for Shannon's Channel Coding Theorem, but a few changes will be introduced to account for the mismatched decoder. Let $X \in \{1,2\}^n$ denote the channel input and $Y \in \mathcal{Y}^n$ the channel output, with $X_i$ and $Y_i$ denoting their respective components for $i = 1,\ldots,n$. Throughout the proof, $\hat{P}(A)$ will denote the estimate for the probability of an event $A$ obtained using the transition probabilities $\hat{p}_{ij}$, $i = 1,2$, $j = 1,\ldots,|\mathcal{Y}|$, to differentiate it from the exact probability $P(A)$ obtained using the transition probabilities $p_{ij}$, $i = 1,2$, $j = 1,\ldots,|\mathcal{Y}|$. The inputs are assumed to be symmetric, so $\hat{P}(X) = P(X)$ and $\hat{P}(X,Y) = \hat{P}(Y|X)P(X)$.

We start by generating $2^{nR}$ random binary sequences of length $n$ to form a random code $\mathcal{C}$ with rate $R$ and length $n$. After revealing the code $\mathcal{C}$ to both the sender and the receiver, a codeword $x$ is chosen at random among those in $\mathcal{C}$ and transmitted. The conditional probability of receiving a sequence $y \in \mathcal{Y}^n$ given the transmitted codeword $x$ is given by $P(Y=y \mid X=x) = \prod_{i=1}^{n} p_{x_i y_i}$, where $x_i$ and $y_i$ denote the $i$-th components of $x$ and $y$, respectively.

The receiver then attempts to recover the codeword $x$ that was sent. However, the decoder does not have access to the exact transition probabilities $p_{ij}$ and must use the estimated probabilities $\hat{p}_{ij}$ instead. When $p_{ij} = \hat{p}_{ij}$ for all $i,j$, the optimal decoding procedure is maximum likelihood decoding (equivalent to maximum a posteriori decoding, since inputs are equiprobable). In maximum likelihood decoding, the decoder forms the estimate $\hat{x} = \arg\max_{x \in \mathcal{C}} \hat{P}(y \mid x)$, where $\hat{P}(Y=y \mid X=x) = \prod_{i=1}^{n} \hat{p}_{x_i y_i}$ is the estimated likelihood of $x$, given $y$ was received.
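This mismatched decoding rule is easy to make concrete. The following sketch (function names, the toy codebook, and the channel parameters are our own illustration, not part of the paper) scores every codeword with the estimated transition probabilities $\hat{p}_{ij}$, even though the channel itself generates outputs according to the true $p_{ij}$:

```python
import math

def mismatched_ml_decode(y, codebook, p_hat):
    """Return the codeword maximizing the estimated likelihood.

    y        : received sequence, y[i] in {0, ..., |Y|-1}
    codebook : list of codewords with symbols in {0, 1} (the paper's {1, 2})
    p_hat    : estimated transition matrix, p_hat[b][k] = P_hat(Y=k | X=b)
    """
    def est_log_lik(x):
        # log of prod_i p_hat[x_i][y_i]; log domain avoids underflow
        return sum(math.log(p_hat[xi][yi]) for xi, yi in zip(x, y))
    return max(codebook, key=est_log_lik)

# Toy example: true channel is a BSC with crossover 0.1, but the
# decoder only has a (mismatched) estimate of 0.2.
p_hat = [[0.8, 0.2], [0.2, 0.8]]
codebook = [(0, 0, 0, 0), (1, 1, 1, 1), (0, 1, 1, 1), (1, 0, 0, 0)]
y = (0, 1, 0, 0)  # one flip away from (0, 0, 0, 0)
print(mismatched_ml_decode(y, codebook, p_hat))  # -> (0, 0, 0, 0)
```

For a binary symmetric channel the mismatched rule still reduces to minimum Hamming distance, so the decision is unaffected; the theorem quantifies the rate penalty when the mismatch does change decisions.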

Denote by $\hat{A}_\epsilon^{(n)}$ the set of length-$n$ sequences $\{(x,y)\}$ whose estimated empirical entropies are $\epsilon$-close to the typical estimated entropies:

\begin{align}
\hat{A}_\epsilon^{(n)} = \Big\{ (x,y) \in \{1,2\}^n \times \mathcal{Y}^n : \tag{27}\\
\Big| -\tfrac{1}{n} \log P(X=x) - 1 \Big| < \epsilon, \tag{28}\\
\Big| -\tfrac{1}{n} \log \hat{P}(Y=y) - \mu_Y \Big| < \epsilon, \tag{29}\\
\Big| -\tfrac{1}{n} \log \hat{P}(X=x, Y=y) - \mu_{XY} \Big| < \epsilon \Big\}, \tag{30}
\end{align}

where $\mu_Y$ and $\mu_{XY}$ represent the expected values of $-\frac{1}{n}\log \hat{P}(Y)$ and $-\frac{1}{n}\log \hat{P}(X,Y)$, respectively, and the logarithms are in base 2. Hence,

\begin{align}
\mu_Y &= -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{|\mathcal{Y}|} P(Y_i=k) \log \hat{P}(Y_i=k) \tag{31}\\
&= -\sum_{k=1}^{|\mathcal{Y}|} \frac{p_{1k}+p_{2k}}{2} \log \frac{\hat{p}_{1k}+\hat{p}_{2k}}{2}, \tag{32}\\
\mu_{XY} &= -\frac{1}{n} \sum_{i=1}^{n} \sum_{b=1}^{2} \sum_{k=1}^{|\mathcal{Y}|} P(X_i=b, Y_i=k) \log \hat{P}(X_i=b, Y_i=k) \tag{33}\\
&= -\sum_{k=1}^{|\mathcal{Y}|} \left( \frac{p_{1k}}{2} \log \frac{\hat{p}_{1k}}{2} + \frac{p_{2k}}{2} \log \frac{\hat{p}_{2k}}{2} \right), \tag{34}
\end{align}

where the exact transition probabilities are used as weights in the expectation and the estimated ones are the variable values. In particular, $(x,y) \in \hat{A}_\epsilon^{(n)}$ implies that $\hat{P}(Y=y \mid X=x) > 2^{n(1-\mu_{XY}-\epsilon)}$ and $\hat{P}(Y=y) < 2^{-n(\mu_Y - \epsilon)}$. We will say that a sequence $x \in \{1,2\}^n$ is in $\hat{A}_\epsilon^{(n)}$ if it can be extended to a sequence $(x,y) \in \hat{A}_\epsilon^{(n)}$, and similarly for $y \in \mathcal{Y}^n$.
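The quantities $\mu_Y$ and $\mu_{XY}$ in (32) and (34) are cross-entropies: true distributions weight the logarithms of estimated ones. The short sketch below (a toy channel of our own choosing, not from the paper) evaluates both; with a matched estimate they reduce to the ordinary entropies $H(Y)$ and $H(X,Y)$:

```python
import math

def mu_Y(p, p_hat):
    # Eq. (32): cross-entropy of the true output distribution against
    # the estimated one, with equiprobable inputs over {1, 2}
    return -sum((p[0][k] + p[1][k]) / 2
                * math.log2((p_hat[0][k] + p_hat[1][k]) / 2)
                for k in range(len(p[0])))

def mu_XY(p, p_hat):
    # Eq. (34): cross-entropy of the true joint distribution against
    # the estimated one
    return -sum(p[b][k] / 2 * math.log2(p_hat[b][k] / 2)
                for b in range(2) for k in range(len(p[0])))

# Toy binary-input, ternary-output channel (each row sums to 1)
p     = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]   # true transition matrix
p_hat = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]   # decoder's estimate

print(mu_Y(p, p_hat), mu_XY(p, p_hat))
# Matched case mu_Y(p, p) is exactly H(Y); by Gibbs' inequality any
# mismatch can only increase these cross-entropies.
```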

First we show that with high probability, the transmitted and received sequences $(x,y)$ are in the $\hat{A}_\epsilon^{(n)}$ set. The weak law of large numbers states that for any given $\epsilon > 0$, there exists $n_0$ such that for any codeword length $n > n_0$,

\begin{align}
P\left( \Big| -\tfrac{1}{n} \log P(X=x) - 1 \Big| \geq \epsilon \right) &< \frac{\epsilon}{3}, \tag{35}\\
P\left( \Big| -\tfrac{1}{n} \log \hat{P}(Y=y) - \mu_Y \Big| \geq \epsilon \right) &< \frac{\epsilon}{3}, \tag{36}\\
P\left( \Big| -\tfrac{1}{n} \log \hat{P}(X=x, Y=y) - \mu_{XY} \Big| \geq \epsilon \right) &< \frac{\epsilon}{3}. \tag{37}
\end{align}

Applying the union bound to these events shows that for $n$ large enough, $P\big( (x,y) \notin \hat{A}_\epsilon^{(n)} \big) < \epsilon$.
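This concentration is easy to observe numerically. The Monte Carlo sketch below (our own illustration; the channel and sample size are arbitrary) draws i.i.d. channel uses and checks that the empirical $-\frac{1}{n}\log \hat{P}(Y=y)$ settles near $\mu_Y$ from (32), exactly as the weak law of large numbers guarantees:

```python
import math
import random

random.seed(0)

p     = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]   # true channel
p_hat = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]   # decoder's estimate

# mu_Y from Eq. (32)
mu_y_expected = -sum((p[0][k] + p[1][k]) / 2
                     * math.log2((p_hat[0][k] + p_hat[1][k]) / 2)
                     for k in range(3))

n = 100_000
acc = 0.0
for _ in range(n):
    x = random.randrange(2)                        # equiprobable input
    y = random.choices(range(3), weights=p[x])[0]  # true channel output
    # estimated output probability P_hat(Y = y) under uniform inputs
    acc -= math.log2((p_hat[0][y] + p_hat[1][y]) / 2)

print(acc / n, mu_y_expected)  # the two agree to a few decimal places
```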

When a codeword $x \in \{1,2\}^n$ is transmitted and $y \in \mathcal{Y}^n$ is received, an error will occur if there exists another codeword $z \in \mathcal{C}$ such that $\hat{P}(Y=y \mid X=z) \geq \hat{P}(Y=y \mid X=x)$. The estimated likelihood of $x$ is greater than $2^{n(1-\mu_{XY}-\epsilon)}$ with probability at least $1-\epsilon$, as was just shown. The other $2^{nR}-1$ codewords in $\mathcal{C}$ are independent of the received sequence. For a given $y \in \hat{A}_\epsilon^{(n)}$, let

$$S_y = \left\{ x \in \{1,2\}^n : \hat{P}(Y=y \mid X=x) \geq 2^{n(1-\mu_{XY}-\epsilon)} \right\}$$

denote the set of input sequences whose estimated likelihood is greater than $2^{n(1-\mu_{XY}-\epsilon)}$. Then

\begin{align}
1 &= \sum_{x \in \{1,2\}^n} \hat{P}(X=x \mid Y=y) \tag{38}\\
&> \sum_{x \in S_y} \frac{\hat{P}(Y=y \mid X=x)\, P(X=x)}{\hat{P}(Y=y)} \tag{39}\\
&> |S_y|\, 2^{n(1-\mu_{XY}-\epsilon)}\, 2^{-n}\, 2^{n(\mu_Y-\epsilon)}, \tag{40}
\end{align}

which implies $|S_y| < 2^{n(\mu_{XY} - \mu_Y + 2\epsilon)}$ for all $y \in \hat{A}_\epsilon^{(n)}$.

If $(x,y) \in \hat{A}_\epsilon^{(n)}$, any other codeword causing an error must be in $S_y$. Let $E_i$, $i = 1,\ldots,2^{nR}-1$, denote the event that the $i$-th codeword in the codebook $\mathcal{C}$ is in $S_y$, and $F$ the event that $(x,y)$ are in $\hat{A}_\epsilon^{(n)}$. The probability of error can be upper bounded by

\begin{align}
P(\hat{x} \neq x) &= P(F^c)P(\hat{x} \neq x \mid F^c) + P(F)P(\hat{x} \neq x \mid F) \tag{41}\\
&\leq \epsilon\, P(\hat{x} \neq x \mid F^c) + \sum_{i=1}^{2^{nR}-1} P(E_i \mid F) \tag{42}\\
&\leq \epsilon + 2^{nR} |S_y|\, 2^{-n} \tag{43}\\
&\leq \epsilon + 2^{n(R + \mu_{XY} - \mu_Y - 1 + 2\epsilon)}. \tag{44}
\end{align}

Consequently, as long as

$$R < \frac{1}{2} \sum_{k=1}^{|\mathcal{Y}|} \left( p_{1k} \log \hat{p}_{1k} + p_{2k} \log \hat{p}_{2k} - (p_{1k}+p_{2k}) \log \frac{\hat{p}_{1k}+\hat{p}_{2k}}{2} \right), \tag{45}$$

for any $\delta > 0$, we can choose $\epsilon$ and $n_\epsilon$ so that for any $n > n_\epsilon$ the probability of error, averaged over all codewords and over all random codes of length $n$, is below $\delta$. By choosing a code with average probability of error below $\delta$ and discarding the worst half of its codewords, we can construct a code of rate $R - \frac{1}{n}$ and maximal probability of error below $2\delta$, proving the achievability of any rate below the bound $C_{P,\hat{P}}$ defined in Eq. (15). This concludes the proof.
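As a numerical sanity check (our own, not part of the paper), the achievable-rate bound in (45) can be evaluated directly. With a matched estimate ($\hat{p}_{ij} = p_{ij}$) it reduces to the mutual information $I(X;Y)$ of the channel with equiprobable inputs, and a mismatched estimate can only lower it:

```python
import math

def achievable_rate(p, p_hat):
    """Evaluate the right-hand side of Eq. (45) in bits per channel use.

    p, p_hat: true and estimated transition matrices, rows indexed by
    the input b in {0, 1} (the paper's {1, 2}), columns by the output k.
    """
    total = 0.0
    for k in range(len(p[0])):
        total += p[0][k] * math.log2(p_hat[0][k])
        total += p[1][k] * math.log2(p_hat[1][k])
        total -= (p[0][k] + p[1][k]) * math.log2((p_hat[0][k] + p_hat[1][k]) / 2)
    return total / 2

p     = [[0.9, 0.1], [0.1, 0.9]]   # BSC with crossover 0.1
p_hat = [[0.8, 0.2], [0.2, 0.8]]   # mismatched estimate

matched    = achievable_rate(p, p)      # = I(X;Y) = 1 - H_b(0.1) ~ 0.531
mismatched = achievable_rate(p, p_hat)  # strictly smaller
print(matched, mismatched)
```

The gap between the two values is the rate penalty the theorem attributes to decoding with the wrong channel model, which is precisely what the adaptive read-threshold estimation in the body of the paper tries to shrink.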