Composition Check Codes
Kees A. Schouhamer Immink and Kui Cai
Abstract—We present composition check codes for noisy
storage and transmission channels with unknown gain and/or
offset. In the proposed composition check code, like in
systematic error correcting codes, the encoding of the main
data into a constant composition code is completely avoided.
To the main data a coded label is appended that carries
information regarding the composition vector of the main
data. Slepian’s optimal detection technique of codewords
that are taken from a constant composition code is applied
for detection. A first Slepian detector detects the label, and
subsequently restores the composition vector of the main data.
The composition vector, in turn, is used by a second Slepian
detector to optimally detect the main data. We compute the
redundancy and error performance of the new method, and
results of computer simulations are presented.
Index Terms—Constant composition code, permutation
code, flash memory, optical recording
I. INTRODUCTION
The receiver of a transmission or storage system is often
ignorant of the exact value of the amplitude (gain) and/or
offset (translation) of the received signal, which depend
on the actual, time-varying, conditions of the channel.
In wireless communications, for example, the amplitude
of the received signal may vary rapidly due to multi-
path propagation or due to obstacles affecting the wave
propagation. In optical disc recording, both the gain and
offset depend on the reflectivity of the disc surface
and the dimensions of the written features. Fingerprints on
optical discs may result in rapid gain and offset variations
of the retrieved signal. Assume the $q$-level pulse amplitude modulated (PAM) signal, $x_i$, $i = 1, 2, \ldots$, is sent and received as $r_i$, where
$$r_i = a(x_i + \nu_i) + b.$$
The reals $a > 0$ and $b$ are called the gain and offset of the received signal, respectively, and we assume that the receiver is ignorant of the actual values of $a$ and $b$. The stochastic component is called 'noise' and is denoted by $\nu_i$. We further assume that the parameters $a$ and $b$ vary slowly over time or position, so that for a plurality of $n$, $n > 1$, symbol time slots the parameters $a$ and $b$ can be considered fixed, but unknown to the receiver. The receiver's ignorance of the exact values of $a$ and $b$ may seriously degrade the error performance of the transmission or storage channel, as has been shown in [1].

Kees A. Schouhamer Immink is with Turing Machines Inc, Willemskade 15d, 3016 DK Rotterdam, The Netherlands. E-mail: immink@turing-machines.com.
Kui Cai is with Singapore University of Technology and Design (SUTD), 8 Somapah Rd, 487372, Singapore. E-mail: cai_kui@sutd.edu.sg.
This work is supported in part by a Singapore Agency for Science, Technology and Research (A*STAR) Public Sector Research Funding (PSF) grant.
Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
There is a myriad of proposals for handling the problem of the channel's unknown gain and offset. Automatic gain control (AGC) has been applied in many practical transmission systems, but AGC is close to useless if the gain and offset vary very rapidly.
Redundant training sequences or reference memory cells
with prescribed levels are placed between ‘user’ data for
estimating the unknown parameters. The parameter estima-
tion will, by necessity, be based on an average over a limited
time-interval, and the estimated values may be inaccurate as
they lag behind the actual values. A more frequent insertion
of reference cells may improve the parameter estimation,
which, however, comes at the cost of higher redundancy
and thus decreased payload.
Slepian showed in his seminal paper [2] that the error performance of optimal detection of codewords that are drawn from a single constant composition code is immune to gain and offset mismatch. He also presented an implementation of optimal detection whose complexity grows with $n \log n$. A constant composition code of length-$n$ codewords over the $q$-ary alphabet has the property that the numbers of occurrences of the symbols within a codeword are the same for each codeword [3].
In practice, however, Slepian’s detection method has
limited applicability as it depends heavily on the efficient
and simple encoding and decoding of arbitrary user data
into a constant composition code. Encoding and decoding of constant composition codes is a field of active research; see, for example, [4], [5], [6]. For the binary case, Weber and Immink [7] and Skachek et al. [8] presented methods that translate arbitrary data into a codeword having a prescribed number of ones and zeros.
for generating codewords have been presented in [9], [10],
[11]. A serious drawback of enumeration schemes is error
propagation, a phenomenon illustrated in Section VII. The
lack of simple and efficient encoding and decoding schemes
has been a major barrier for the application of Slepian’s
optimal detection method. Thus, an efficient technique to
eliminate, or at least significantly alleviate, the drawbacks
and deficiencies of Slepian’s prior art system has been a
desideratum.
The scheme proposed and analyzed here, coined com-
position check code, meets the above desideratum as it
has the virtues of Slepian’s optimal detection method, but
its drawback, the encoding of the main data, or payload
symbols, into a constant composition code, is removed.
In the proposed scheme, the main data are sent to the
receiver without modification. Attached to the main data
word is a relatively short, fixed-length label that informs the
receiver regarding the constant composition code to which
the sent main data word belongs. The information conveyed
by the label is used by the receiver to optimally recover
the main data using Slepian’s optimal detection method.
The proposed method is reminiscent of a systematic error
correcting code, where unmodified main data is sent, and a
parity check word is appended to make error correction or
detection possible. A system using a variant of the proposed
scheme was discussed recently by Li et al. [12]. In Li’s
method, however, the label is not encoded into a constant
composition code, and hence this portion is not immune
to the unknown gain and offset. Also, Li’s system needs
two detectors for the payload portion and the label portion
(i.e. the conventional threshold detector and the Slepian
detector), which all but doubles the detector complexity. The
proposed method is also reminiscent of Knuth’s method [13]
for generating codewords having equal numbers of ones and zeros, where an appended prefix carries information
regarding the specific segment of the codeword that has been
modified.
It should be noted that the proposed technique has the
principal virtues of Slepian’s prior art method, such as
enabling simple optimal detection of the noisy codewords
and immunity to the channel’s gain and offset mismatch.
However, the generated codewords do not belong to a
prescribed constant composition code, and therefore they
do not possess the spectral properties, specifically reduced
power at the low-frequency end, of codewords that are
drawn from a constant composition code.
A second advantage of the new scheme was noted in [12]. Since the payload is "systematic", the payload can be protected by a conventional error-correcting code. Much stronger error-correcting schemes (ECCs) are known for conventional channels than for codes that are subsets of constant composition codes.
The paper is organized as follows. In Section II, we
set the scene, introduce preliminaries and discuss the state
of the art. In Section III, we present our approach. In
Section IV we compute the redundancy of the proposed
method. Complexity issues are dealt with in Section V. In
Sections VI and VII, we analyze and compute the error
performance. In Section VIII, we describe our conclusions.
II. PRELIMINARIES
We assume that user data are recorded in groups of $n$ $q$-level symbols; such a group is called a codeword. We consider a codebook, $S_{\mathbf{w}}$, of chosen codewords $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ over the $q$-ary alphabet $\mathcal{Q} = \{0, \ldots, q-1\}$, where $n$, the length of $\mathbf{x}$, is a positive integer. In line with the adopted linear channel model, we assume that the codeword, $\mathbf{x}$, is retrieved as
$$\mathbf{r} = a(\mathbf{x} + \boldsymbol{\nu}) + b\mathbf{1}, \quad (1)$$
where $\mathbf{r} = (r_1, \ldots, r_n)$, $r_i \in \mathbb{R}$, and $\mathbf{1} = (1, \ldots, 1)$. The basic premises are that $\mathbf{x}$ is retrieved with an unknown (positive) gain $a > 0$, is offset by an unknown uniform offset, $b\mathbf{1}$, where $a, b \in \mathbb{R}$, and is corrupted by additive Gaussian noise $\boldsymbol{\nu} = (\nu_1, \ldots, \nu_n)$, whose entries $\nu_i \in \mathbb{R}$ are noise samples with distribution $\mathcal{N}(0, \sigma^2)$, where $\sigma^2 \in \mathbb{R}$ denotes the variance of the additive noise.
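For illustration, the channel model (1) is easily simulated; the minimal Python sketch below (with illustrative parameter values of our own choosing, not taken from the paper) generates a received vector $\mathbf{r}$ from a codeword $\mathbf{x}$:

```python
import numpy as np

rng = np.random.default_rng(1)

def channel(x, a=1.2, b=0.3, sigma=0.1):
    # r = a(x + nu) + b*1, per (1); the receiver knows neither a nor b
    # (parameter values here are illustrative only)
    nu = rng.normal(0.0, sigma, size=len(x))
    return a * (np.asarray(x, dtype=float) + nu) + b

x = [2, 0, 1, 1, 2]   # a q = 3 codeword
r = channel(x)        # received vector: scaled, offset, and noisy
```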
A. Constant composition codes
Define the composition vector $\mathbf{w}(\mathbf{x}) = (w_0, \ldots, w_{q-1})$ of $\mathbf{x}$, where the $q$ entries $w_j$, $j \in \mathcal{Q}$, of $\mathbf{w}(\mathbf{x})$ indicate the number of occurrences of the symbol $x_i = j \in \mathcal{Q}$, $1 \le i \le n$, in $\mathbf{x}$. That is, for a $q$-ary sequence $\mathbf{x}$, we denote the number of appearances of the symbol $j$ by
$$w_j = |\{i : x_i = j\}| \quad \text{for } j = 0, 1, \ldots, q-1. \quad (2)$$
Clearly, $\sum_j w_j = n$ and $w_j \in \{0, \ldots, n\}$. A constant composition code comprising all possible $n$-vectors with the same composition vector $\mathbf{w}(\mathbf{x})$ is denoted by $S_{\mathbf{w}}$. Evidently, every codeword has $w_j$ occurrences of symbol $j \in \mathcal{Q}$. The code $S_{\mathbf{w}}$ consists of all permutations of the symbols defined by the composition vector $\mathbf{w}$, so that the size of $S_{\mathbf{w}}$ equals the multinomial coefficient
$$|S_{\mathbf{w}}| = \frac{n!}{\prod_{i \in \mathcal{Q}} w_i!}. \quad (3)$$
A constant composition code is also known as a permutation modulation code (Variant I), which was introduced by Slepian [2] in 1965. Slepian showed that a constant composition code allows optimal detection using a simple algorithm.
B. Slepian’s Algorithm
The well-known (squared) Euclidean distance, $\delta_e(\mathbf{r}, \hat{\mathbf{x}})$, between the received signal vector $\mathbf{r}$ and the codeword $\hat{\mathbf{x}} \in S_{\mathbf{w}}$ is defined by
$$\delta_e(\mathbf{r}, \hat{\mathbf{x}}) = \sum_{i=1}^{n} (r_i - \hat{x}_i)^2. \quad (4)$$
A minimum Euclidean distance detector outputs the codeword $\mathbf{x}_o$ defined by
$$\mathbf{x}_o = \arg \min_{\hat{\mathbf{x}} \in S_{\mathbf{w}}} \delta_e(\mathbf{r}, \hat{\mathbf{x}}). \quad (5)$$
Working out (4) gives
$$\delta_e(\mathbf{r}, \hat{\mathbf{x}}) = \sum_{i=1}^{n} (x'_i + b)^2 - 2\sum_{i=1}^{n} x'_i \hat{x}_i - 2b\sum_{i=1}^{n} \hat{x}_i + \sum_{i=1}^{n} \hat{x}_i^2, \quad (6)$$
where $x'_i = a(x_i + \nu_i)$. Evidently, the Euclidean distance $\delta_e(\mathbf{r}, \hat{\mathbf{x}})$ depends on the quantities $a$ and $b$, which may lead to a serious degradation of the error performance [1].

The first term of (6), $\sum_{i=1}^{n} (x'_i + b)^2$, is independent of $\hat{\mathbf{x}}$, and clearly dropping this constant term does not affect the outcome of (5). In a similar fashion, we can drop the quantities $2b\sum_{i=1}^{n} \hat{x}_i$ and $\sum_{i=1}^{n} \hat{x}_i^2$, since the vector $\hat{\mathbf{x}}$ is drawn from a constant composition code, so that both quantities are constant for all $\hat{\mathbf{x}} \in S_{\mathbf{w}}$. Then we find
$$\delta_e(\mathbf{r}, \hat{\mathbf{x}}) \equiv -\sum_{i=1}^{n} r_i \hat{x}_i, \quad (7)$$
where the $\equiv$ sign denotes equivalence between (4) and (7), since the outcome of (5) is the same when (7) is used instead of (4). Thus the channel's unknown gain, $a$, and offset, $b$, do not affect the outcome of (5) when codewords are drawn from a constant composition code $S_{\mathbf{w}}$. We now address the efficient evaluation of the inner product (7) using Slepian's algorithm.
Slepian [2] showed that the minimization (5) can be replaced by a simple sorting of the symbols of the received signal vector $\mathbf{r}$. He proved that for two given vectors, $(\hat{x}_1, \ldots, \hat{x}_n)$ and $(r_1, \ldots, r_n)$, the inner product (7),
$$r_1 \hat{x}_{i_1} + r_2 \hat{x}_{i_2} + \ldots + r_n \hat{x}_{i_n}, \quad (8)$$
is maximized over all permutations $i_1, i_2, \ldots, i_n$ of the integers $1, 2, \ldots, n$ by pairing the largest $\hat{x}_i$ with the largest $r_i$, the second largest $\hat{x}_i$ with the second largest $r_i$, etc. To that end, the $n$ elements of the received vector, $\mathbf{r}$, are sorted from largest to smallest. From the composition vector, $\mathbf{w}$, of the codeword, $\mathbf{x}$, at hand, we deduce the reference vector $\mathbf{x}_r = (q-1, \ldots, q-1, q-2, \ldots, q-2, \ldots, 0, \ldots, 0)$, where the symbols are sorted from largest to smallest, and the numbers of $(q-1)$'s, $(q-2)$'s, and so on, in $\mathbf{x}_r$ equal $w_{q-1}, w_{q-2}, \ldots, w_0$. Slepian's algorithm is attractive since the complexity of sorting $n$ symbols grows with $n \log n$, which is far less complex than the evaluation of (5), whose complexity grows exponentially with $n$. A small example may clarify Slepian's algorithm.
Example 1: Let $n = 5$, $q = 3$, and let the composition vector be $\mathbf{w} = (1, 2, 2)$. Thus each sent codeword is a permutation of the reference vector $\mathbf{x}_r = (2, 2, 1, 1, 0)$, where the symbols of the reference vector have been sorted largest to smallest. Let the received vector be $\mathbf{r} = (0.2, 1.4, 0.9, 1.2, 1.6)$. We sort the symbols of the received vector $\mathbf{r}$ in decreasing order and obtain $(1.6, 1.4, 1.2, 0.9, 0.2)$. Then the detector assigns the symbols by pairing the largest symbols of $\mathbf{x}_r$ and $\mathbf{r}$: the symbol valued 1.6 is assigned to a '2', then 1.4 to a '2', 1.2 to a '1', 0.9 to a '1', and finally the symbol valued 0.2 to a '0'. The detector decides that the codeword (0, 2, 1, 1, 2) was sent.
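The assignment rule of Example 1 is compact in code; the sketch below (our own illustration of Slepian's detector, not the authors' implementation) sorts the received samples and assigns symbol values according to the composition vector $\mathbf{w}$:

```python
import numpy as np

def slepian_detect(r, w):
    # Slepian detector for the constant composition code S_w:
    # pair the largest received sample with the largest reference
    # symbol, the second largest with the second largest, etc.
    q = len(w)
    # reference vector x_r: w_{q-1} copies of q-1, ..., w_0 copies of 0
    x_ref = [s for s in range(q - 1, -1, -1) for _ in range(w[s])]
    order = np.argsort(r)[::-1]        # indices of r, largest first
    x_hat = np.empty(len(r), dtype=int)
    x_hat[order] = x_ref               # place sorted symbols back
    return x_hat

r = [0.2, 1.4, 0.9, 1.2, 1.6]
print(slepian_detect(r, w=(1, 2, 2)))  # -> [0 2 1 1 2], as in Example 1
```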
III. COMPOSITION CHECK CODES
A drawback of the usage of a constant composition code in Slepian's prior art is the complexity of the encoding and decoding operation in case the payload is large. Encoding algorithms, such as enumerative encoding [14], [15], require much smaller look-up tables than direct translation does, but they often require complex algorithms. In the proposed composition check code, the encoding of arbitrary data into a constant composition code is avoided. The $n$-symbol main data word, $\mathbf{x}$, is sent without any modification, and a separate $p$-symbol label, denoted by $\mathbf{z} = (z_1, \ldots, z_p)$, $z_i \in \mathcal{Q}$, is appended to the main data word. The appended $p$-symbol label, $\mathbf{z}$, informs the Slepian detector to which constant composition code the main data word, $\mathbf{x}$, belongs. To that end, we define a one-to-one correspondence between the set of all possible composition vectors of the $n$-symbol payload and the set of $p$-symbol labels. The number of possible distinct composition vectors, denoted by $N(q, n)$, of a $q$-ary $n$-vector equals [16, page 38]
$$N(q, n) = \binom{n+q-1}{q-1}. \quad (9)$$
The length of the label, $p$, must be chosen sufficiently large so that the label can uniquely convey the identity of the constant composition code. In the binary case, the encoded label represents the number of ones in the main data word. The procedure for encoding and decoding is succinctly written as follows.
Encoding/Decoding: The main (user) data, denoted by $\mathbf{x}$, which consist of $n$ $q$-ary symbols, are transferred to the encoder. The encoder first forms the composition vector $\mathbf{w} = (w_0, \ldots, w_{q-1})$ of $\mathbf{x}$ using (2), and translates the vector $\mathbf{w}$ into the $p$-symbol $q$-ary label, $\mathbf{z}$, using a predefined one-to-one correspondence, $\mathbf{z} = \phi(\mathbf{w})$. The label, $\mathbf{z}$, is appended to the main data, and the main data plus the label are sent. The one-to-one correspondence $\mathbf{z} = \phi(\mathbf{w})$ can simply be embodied by a look-up table for small values of $q$ and $n$. In practice, for larger values of $n$ and $q$, the function $\mathbf{z} = \phi(\mathbf{w})$ is partitioned into a cascade of two functions, $I = \phi_1(\mathbf{w})$ and $\mathbf{z} = \phi_2(I)$, where $I$ is a non-negative integer. In the first step, the (compression) function $I = \phi_1(\mathbf{w})$ translates the composition vector $\mathbf{w}$ into an integer in the range 0 to at most $(n+1)^q - 1$. The vector $\mathbf{w}$ is redundant since we have the constraint $\sum_{i=0}^{q-1} w_i = n$. In case the composition vector $\mathbf{w}$ is ideally compressed, the integer $I$ ranges from 0 to $N(q, n) - 1$. In the second step, the function $\mathbf{z} = \phi_2(I)$ translates the integer $I$ into the $p$-symbol $q$-ary label. Practical issues regarding the implementation of the functions $I = \phi_1(\mathbf{w})$ and $\mathbf{z} = \phi_2(I)$ for larger values of $n$ and $q$ are given in Section V.
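A minimal Python sketch of the encoder side may be helpful (our own illustration; the function names are hypothetical). It forms $\mathbf{w}(\mathbf{x})$ per (2) and realizes $\phi_1$ as a look-up table, which is feasible only for small $q$ and $n$; Section V replaces the table by enumerative coding:

```python
from itertools import product

def composition(x, q):
    # composition vector w(x) of (2): w_j = |{i : x_i = j}|
    w = [0] * q
    for s in x:
        w[s] += 1
    return tuple(w)

def build_phi1(q, n):
    # one-to-one map phi_1: composition vector -> integer index I,
    # realized as a table over all N(q, n) composition vectors
    vectors = [w for w in product(range(n + 1), repeat=q) if sum(w) == n]
    return {w: I for I, w in enumerate(vectors)}

phi1 = build_phi1(q=3, n=5)
x = [2, 0, 1, 1, 2]
I = phi1[composition(x, q=3)]  # integer still to be expressed as the
                               # p-symbol q-ary label z = phi_2(I)
```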
Note that the sent concatenation of $\mathbf{x}$ and $\mathbf{z}$ does not have special spectral characteristics; it is not 'balanced' or 'dc-free'.
The label, $\mathbf{z}$, is detected, preferably using Slepian's optimal method, and decoded by a look-up table, so that the composition vector $\mathbf{w} = \phi^{-1}(\mathbf{z})$ of the main data is retrieved. Following Slepian's method, see Subsection II-B, the received main data symbols are sorted and assigned to symbol values in accordance with the retrieved composition vector $\mathbf{w}$.
It is sufficient to uniquely encode the $N(q, n)$ different composition vectors into $p = \lceil \log_q N(q, n) \rceil$ label symbols, but in a preferred embodiment, the label is a codeword taken from a predefined $p$-symbol constant composition code. The preferred embodiment has the advantage that, firstly, Slepian's optimal method is used for both the main data word and the label, giving them both a high resilience to additive noise, and, secondly, both the main data word and the label are immune to channel mismatch. These attractive virtues come at a price, and in the next section, we compute the redundancy of composition check codes.
IV. REDUNDANCY ANALYSIS
We discuss two label formatting options, where a) the label is uncoded, as in [12], and b) the label is encoded using a constant composition code.

The $p$-symbol label must be able to uniquely represent all $N(q, n)$ distinct composition vectors of the $n$-symbol payload. Thus, for an uncoded label, we find the condition
$$p \ge \lceil \log_q N(q, n) \rceil. \quad (10)$$
For asymptotically large $n$ and limited $q$, we obtain, using Stirling's approximation for a binomial coefficient,
$$N(q, n) = \binom{n+q-1}{q-1} \approx \frac{1}{(q-1)!} n^{q-1}, \quad n \gg 1, \quad (11)$$
so that the code redundancy, $p$, equals
$$p \approx (q-1) \log_q n - \log_q (q-1)!. \quad (12)$$
In case the $p$-symbol label is encoded into a $q$-ary constant composition code, we have the condition
$$\frac{p!}{\prod_{i \in \mathcal{Q}} \hat{w}_i!} \ge N(q, n),$$
where $\hat{\mathbf{w}}$ denotes the composition vector of the $p$-symbol label. The number of labels is maximized if we choose $p = aq$ and $\hat{w}_i = a$ for some positive integer $a$. Then the label length, $p$, must be sufficiently large to satisfy
$$\frac{p!}{\left(\frac{p}{q}!\right)^q} \ge N(q, n). \quad (13)$$
Since, using Stirling's approximation,
$$\frac{p!}{\left(\frac{p}{q}!\right)^q} \approx \alpha_q \frac{q^p}{p^{(q-1)/2}}, \quad (14)$$
where
$$\alpha_q = \frac{q^{q/2}}{(2\pi)^{(q-1)/2}},$$
we have
$$\alpha_q \frac{q^p}{p^{(q-1)/2}} \ge \frac{1}{(q-1)!} n^{q-1}, \quad p, n \gg 1, \quad (15)$$
or
$$\log_q \alpha_q - \frac{q-1}{2} \log_q p + p \ge (q-1) \log_q n - \log_q (q-1)!. \quad (16)$$
For asymptotically large $n$, we have the estimate of the redundancy
$$p > (q-1) \log_q n. \quad (17)$$
For $q = 2$ we simply find
$$\binom{p}{\frac{p}{2}} \ge n + 1, \quad (18)$$
which yields approximately the redundancy required by Knuth's code for balancing binary sequences.
The redundancy, $r_s$, of Slepian's prior art method, where the payload is translated into a constant composition code in which all symbols appear with frequency $\frac{n}{q}$, is
$$r_s = \log_q \frac{q^n}{n! / \left(\frac{n}{q}!\right)^q} \approx \frac{q-1}{2} \log_q n - \log_q \alpha_q. \quad (19)$$
A comparison with (17) reveals that for large $n$ the redundancy of the proposed scheme is approximately a factor of two more than can be obtained by the conventional method using a fixed constant composition code. Apparently, this is the price to pay for a simple implementation. A variable-length label that takes into account the probability of occurrence of the composition vector, instead of the fixed-length label studied here, would reduce the required redundancy of the method [7].
V. COMPLEXITY ISSUES
For relatively small $n$ and $q$, the composition vector $\mathbf{w}$ can be straightforwardly translated into a $p$-symbol $q$-ary label $\mathbf{z}$ by using a look-up table that embodies the one-to-one correspondence $\mathbf{z} = \phi(\mathbf{w})$. We infer from (11) that, although $N(q, n)$ grows polynomially with the codeword length, $n$, for larger alphabet size $q$ the number of entries of a look-up table can be prohibitively large. For a practical application, we must try to find an algorithmic routine in lieu of look-up tables. We present two alternative scenarios. We encode (compress) the composition vector $\mathbf{w}$ using an algorithmic (enumeration) approach. Alternatively, we do not compress the composition vector $\mathbf{w}$, and we compute the redundancy loss.

We commence, in the next subsection, with the compression of the composition vector, $\mathbf{w}$, using Cover's enumerative coding techniques [17].
A. Compressed composition vector, enumerative encoding
of the composition vector
The translation function, $I = \phi_1(\mathbf{w})$, of the composition vector, $\mathbf{w}$, into an integer $I$, $0 \le I \le N(q, n) - 1$, can be accomplished using enumerative encoding. In an enumerative coding scheme, the codewords are ranked in lexicographical order [17]. The lexicographical index, or rank, $I$, of a codeword, $\mathbf{x}$, in the ordered list equals the number of codewords preceding $\mathbf{x}$ in the ordered list. Using the findings of [17], we write down the next theorem.

Theorem 1:
$$I = \sum_{i=1}^{q-1} \sum_{j=0}^{w_{i-1}-1} \binom{n' - j + q - i - 1}{q - i - 1}, \quad (20)$$
where
$$n' = n - \sum_{t=1}^{i-1} w_{t-1},$$
and $I \in \{0, \ldots, N(q, n) - 1\}$.

Proof: We follow Cover's approach [17]. Let $n_s(w_0, w_1, \ldots, w_{k-1})$ denote the number of composition vectors for which the first $k$ coordinates are given by $(w_0, w_1, \ldots, w_{k-1})$. According to Cover, the lexicographic index, $I$, is given by
$$I = \sum_{i=1}^{q-1} \sum_{j=0}^{w_{i-1}-1} n_s(w_0, w_1, \ldots, w_{i-2}, j). \quad (21)$$
We have
$$n_s(w_0, w_1, \ldots, w_{i-1}) = \binom{n' + q - i - 1}{q - i - 1},$$
where
$$n' = n - \sum_{t=1}^{i} w_{t-1}.$$
Substitution yields (20), which concludes the proof.

The inverse function, $\mathbf{w} = \phi_1^{-1}(I)$, is also calculated using an algorithmic approach, and we refer to [17] for details. The binomial coefficients can be computed on the fly, and look-up tables are not required.
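A rendering of Theorem 1 in code may be helpful. The following Python routines (our own sketch of Cover's counting argument; the names are hypothetical) compute the index $I = \phi_1(\mathbf{w})$ and its inverse, with the binomial coefficients computed on the fly:

```python
from math import comb

def rank_composition(w, n):
    # lexicographic index I of composition vector w, per (20)/(21):
    # count the composition vectors that precede w
    q = len(w)
    I, used = 0, 0
    for i in range(q - 1):            # coordinates w_0, ..., w_{q-2}
        for j in range(w[i]):
            # vectors agreeing with w on the first i entries and having
            # i-th entry j: the remaining q-i-1 entries sum to n-used-j
            I += comb(n - used - j + q - i - 2, q - i - 2)
        used += w[i]
    return I

def unrank_composition(I, q, n):
    # inverse function w = phi_1^{-1}(I), by the same counting argument
    w, used = [], 0
    for i in range(q - 1):
        j = 0
        while True:
            cnt = comb(n - used - j + q - i - 2, q - i - 2)
            if I < cnt:
                break
            I -= cnt
            j += 1
        w.append(j)
        used += j
    w.append(n - used)                # last entry is implied by the sum
    return tuple(w)

w = (1, 0, 1)                          # q = 3, n = 2
I = rank_composition(w, n=2)           # -> 3
assert unrank_composition(I, q=3, n=2) == w
```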
B. Uncompressed composition vector
Alternatively, we investigate the case that the vector, w, is
not compressed. The qentries wiof the composition vector
ware in the alphabet {0, . . . , n}, so that the composition
vector wcan be seen as a positive integer number of q
(n+ 1)-ary digits. We may slightly compress the vector
wby noting that the observation of q1entries uniquely
identifies wsince q1
i=0 wi=n. We study the increase of
the redundancy as the label must be able to accommodate
(n+ 1)qdifferent integer numbers that are associated with
the uncompressed w.
To that end, let pdenote the length of the label. In case
the label is uncoded, the vector wis translated into the
q-ary p-symbol label using a well-known base conversion
algorithm [18]. We have
pqlogq(n+ 1).(22)
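The base conversion is a standard exercise; the sketch below (our own illustration) reads the uncompressed vector $\mathbf{w}$ as a base-$(n+1)$ integer and re-expresses it with $p'$ $q$-ary digits:

```python
def w_to_label(w, n, q, p):
    # read w as a q-digit base-(n+1) integer and convert it to p
    # q-ary digits (one entry of w could be dropped, since the
    # entries sum to n; here the full vector is used, as in (22))
    I = 0
    for wi in w:
        I = I * (n + 1) + wi
    z = []
    for _ in range(p):
        I, d = divmod(I, q)
        z.append(d)
    return z[::-1]

def label_to_w(z, n, q):
    # inverse base conversion back to the composition vector
    I = 0
    for d in z:
        I = I * q + d
    w = []
    for _ in range(q):
        I, wi = divmod(I, n + 1)
        w.append(wi)
    return tuple(reversed(w))

assert label_to_w(w_to_label((1, 2, 2), n=5, q=3, p=5), n=5, q=3) == (1, 2, 2)
```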
The relative increase in redundancy with respect to the compressed vector, $\mathbf{w}$, is defined by
$$\eta = \frac{p' - p}{p'}. \quad (23)$$
Then
$$\eta = \frac{\lceil q \log_q (n+1) \rceil - (q-1) \log_q n + \log_q (q-1)!}{\lceil q \log_q (n+1) \rceil}. \quad (24)$$
For asymptotically large $n$, we find
$$\eta \approx \frac{1}{q}, \quad n \gg 1, \quad (25)$$
and we conclude that the relative increase in redundancy incurred by using the uncompressed composition vector, $\mathbf{w}$, is inversely proportional to $q$.
We proceed and take a look at the redundancy of the coded label. The algorithmic encoding and decoding of an integer number in any base into a constant composition codeword of symbols in any base using enumerative encoding has been published extensively in the literature; see, for example, [5]. The coded label length, $p'$, must be sufficiently large to satisfy (see (13) and (14))
$$\frac{p'!}{\left(\frac{p'}{q}!\right)^q} \ge (n+1)^q,$$
which, for large $n$, can be approximated by
$$\alpha_q \frac{q^{p'}}{p'^{(q-1)/2}} \ge (n+1)^q, \quad n \gg 1,$$
or
$$\log_q \alpha_q - \frac{q-1}{2} \log_q p' + p' \ge q \log_q (n+1).$$
For asymptotically large $n$, we find
$$p' > q \log_q (n+1). \quad (26)$$
The relative extra redundancy required by the unconstrained algorithmic encoding of the composition vector $\mathbf{w}$ equals
$$\eta \approx \frac{q \log_q (n+1) - (q-1) \log_q n}{q \log_q (n+1)} \approx \frac{1}{q}, \quad n \gg 1. \quad (27)$$
We infer that the relative extra redundancy for the method that employs traditional enumerative algorithmic encoding equals $\frac{1}{q}$. For small values of $q$ we may, dependent on the codeword length $n$, apply look-up tables for encoding the label, while for larger $q$ we may employ algorithmic encoding without significant loss in redundancy. The next example shows numerical results.
Example 2: Let $q = 3$ and $n = 64$. From (9), we find that the number of distinct composition vectors equals $N(q, n) = 2145$. The $N(q, n) = 2145$ vectors can be encoded into a ternary label taken from a constant composition code of length 10. In case the label is not a member of a specified constant composition code, the label length can be slightly shorter, namely $\lceil \log_3 2145 \rceil = 7$. In case we do not compress the composition vector, we require a look-up table of $(n+1)^2 = 65 \times 65 = 4225$ entries. The 4225 entries can be encoded into a specified constant composition code of length 11 or, alternatively, into an uncoded label of length $\lceil \log_3 4225 \rceil = 8$.
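The numbers quoted in Example 2 are easily verified; the short Python check below (our own illustration) computes them directly:

```python
from math import comb, ceil, factorial, log

q, n = 3, 64
N = comb(n + q - 1, q - 1)          # (9): number of composition vectors

def max_cc_size(p, q):
    # size of the largest p-symbol q-ary constant composition code,
    # obtained with the most balanced composition vector
    base, rem = divmod(p, q)
    w = [base + 1] * rem + [base] * (q - rem)
    m = factorial(p)
    for wi in w:
        m //= factorial(wi)
    return m

def min_cc_length(M, q):
    # shortest p whose largest constant composition code holds M labels
    p = 1
    while max_cc_size(p, q) < M:
        p += 1
    return p

print(N, min_cc_length(N, q), ceil(log(N, q)))    # -> 2145 10 7
M = (n + 1) ** (q - 1)                            # uncompressed table size
print(M, min_cc_length(M, q), ceil(log(M, q)))    # -> 4225 11 8
```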
VI. ERROR PERFORMANCE ANALYSIS
Decoding is a two-step process: first the label is detected, and subsequently the payload is retrieved by using the data conveyed by the label. Clearly, the payload is received in error if the $p$-symbol label is received in error, or if, in case the label is correctly received, the payload itself is received in error by the Slepian detector. We concentrate here on the block error rate of the outputted payload.

The $p$-symbol label is drawn from a fixed constant composition code, while the $n$-symbol payload is a member of a constant composition code (not necessarily the same code as that of the label), which may be different for each source word. We start by computing the error performance of a given constant composition code.
To that end, let the codeword $\mathbf{x}$ be taken from the constant composition code $S_{\mathbf{w}}$. The word error rate (WER) averaged over all words $\mathbf{x} \in S_{\mathbf{w}}$ is upperbounded (union bound) by
$$\text{WER} < \frac{1}{|S_{\mathbf{w}}|} \sum_{\mathbf{x} \in S_{\mathbf{w}}} \sum_{\hat{\mathbf{x}} \ne \mathbf{x}} Q\!\left(\frac{\sqrt{\delta_e(\mathbf{x}, \hat{\mathbf{x}})}}{2\sigma}\right), \quad (28)$$
where the $Q$-function is defined by
$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-\frac{u^2}{2}} \, du. \quad (29)$$
Note that the error performance of the proposed method is invariant to the unknown gain, $a$, and offset, $b$, see (7), and that, obviously, these parameters are not present in the word error rate (28). For asymptotically large signal-to-noise ratios (SNR), i.e., for $\sigma \ll 1$, the word error rate is overbounded by [19]
$$\text{WER} < N_{\mathbf{w}}(q, n) Q\!\left(\frac{d_{\min}}{2\sigma}\right), \quad (30)$$
where $N_{\mathbf{w}}(q, n)$ is the average number of pairs of codewords (neighbors) at minimum Euclidean distance, $d_{\min}$, and the squared minimum Euclidean distance is defined by
$$d_{\min}^2 = \min_{\mathbf{x}, \hat{\mathbf{x}} \in S, \, \mathbf{x} \ne \hat{\mathbf{x}}} \delta_e(\mathbf{x}, \hat{\mathbf{x}}). \quad (31)$$
A codeword $\mathbf{x} \in S_{\mathbf{w}}$ is at minimum (squared) Euclidean distance $\delta_e(\mathbf{x}, \hat{\mathbf{x}}) = 2$ from an $\hat{\mathbf{x}} \in S_{\mathbf{w}}$, since $\hat{\mathbf{x}}$ can be obtained by swapping two symbols in $\mathbf{x}$, say $x_i$ and $x_j$, where $|x_i - x_j| = 1$. So we have
$$d_{\min}^2 \le 2. \quad (32)$$
In our analysis we assume a simple code having $d_{\min} = \sqrt{2}$. The computation of the average number of neighboring pairs of codewords $\hat{\mathbf{x}}$ of $\mathbf{x}$, both in $S_{\mathbf{w}}$, at minimum distance $d_{\min}$ is a combinatorics exercise. Since $\mathbf{x}$ is a member of a constant composition code, we infer, for reasons of symmetry, that each $\mathbf{x}$ has the same number of nearest neighbors, so that it suffices to compute the number of nearest neighbors for one given $\mathbf{x}$. A codeword $\hat{\mathbf{x}}$ is at (squared) Euclidean distance $\delta_e(\mathbf{x}, \hat{\mathbf{x}}) = 2$ if $\hat{\mathbf{x}}$ can be obtained by swapping two symbols in $\mathbf{x}$, say $x_i$ and $x_j$, where $|x_i - x_j| = 1$. We conclude that the number of pairs of codewords at distance $\delta_e(\mathbf{x}, \hat{\mathbf{x}}) = 2$ equals
$$N_{\mathbf{w}}(q, n) = \sum_{i=0}^{q-2} w_i w_{i+1}. \quad (33)$$
For the binary case, $q = 2$, we simply find
$$N_{\mathbf{w}}(2, n) = w_0 w_1 = w_0(n - w_0), \quad (34)$$
where $w_0$ denotes the number of zeros in a codeword. In case all $q$ symbols appear exactly $u$, $u \ge 1$, times, thus $n = uq$ and $w_i = u$, $0 \le i \le q-1$, we simply obtain, using (33),
$$N_{\mathbf{w}}(q, n) = (q-1)u^2 = \frac{q-1}{q^2} n^2. \quad (35)$$
As the label is encoded into a $p$-symbol constant composition code, we can straightforwardly compute, using (30) and (33), the error rate, denoted by $\text{WER}_{\text{label}}$, of the $p$-symbol label:
$$\text{WER}_{\text{label}} < N_{\hat{\mathbf{w}}}(q, p) Q\!\left(\frac{1}{\sqrt{2}\sigma}\right), \quad (36)$$
where $\hat{\mathbf{w}}$ denotes the composition vector of the code used for encoding the label. The label error rate, in case the labels are taken from the constant composition code with composition vector $\hat{\mathbf{w}} = (\frac{p}{q}, \ldots, \frac{p}{q})$, equals
$$\text{WER}_{\text{label}} < \frac{q-1}{q^2} p^2 Q\!\left(\frac{1}{\sqrt{2}\sigma}\right). \quad (37)$$
The computation of the error performance of the $n$-symbol payload is more involved. The set of $n$-symbol payloads is partitioned into $\binom{n+q-1}{q-1}$ distinct constant composition codes, each of size
$$|S_{\mathbf{w}}| = \frac{n!}{\prod_{i \in \mathcal{Q}} w_i!}.$$
The error performance of the payload is the weighted error performance of the constituent constant composition codes. Let $\text{WER}_{\text{pl}}$ denote the word error rate of the payload given that the label is received correctly. Then
$$\text{WER}_{\text{pl}} < N_{\text{pl}}(q, n) Q\!\left(\frac{1}{\sqrt{2}\sigma}\right), \quad (38)$$
where
$$N_{\text{pl}}(q, n) = \frac{1}{q^n} \sum_{\mathbf{w}} |S_{\mathbf{w}}| N_{\mathbf{w}}(q, n) = \frac{1}{q^n} \sum_{w_0 + \ldots + w_{q-1} = n} \frac{n!}{\prod_{i \in \mathcal{Q}} w_i!} \sum_{i=0}^{q-2} w_i w_{i+1}. \quad (39)$$
The next theorem simplifies the above expression by invoking some well-known properties of multinomial coefficients.

Theorem 2:
$$\frac{1}{q^n} \sum_{w_0 + \ldots + w_{q-1} = n} \frac{n!}{\prod_{i \in \mathcal{Q}} w_i!} \sum_{i=0}^{q-2} w_i w_{i+1} = \frac{q-1}{q^2} n(n-1).$$

Proof: Following the multinomial theorem [20], we can write the $n$-th power of a sum of $q$ (dummy) terms, $x_i$, $i \in \mathcal{Q}$, as
$$(x_0 + \ldots + x_{q-1})^n = \sum_{w_0 + \ldots + w_{q-1} = n} \frac{n!}{\prod_{i \in \mathcal{Q}} w_i!} \prod_{t \in \mathcal{Q}} x_t^{w_t}. \quad (40)$$
We find, after substituting $x_0 = \ldots = x_{q-1} = 1$, the well-known identity
$$\sum_{w_0 + \ldots + w_{q-1} = n} \frac{n!}{\prod_{i \in \mathcal{Q}} w_i!} = q^n.$$
After differentiating the right- and left-hand sides of (40) with respect to $x_i$ and $x_j$, $i, j \in \mathcal{Q}$, $i \ne j$, and substituting $x_0 = \ldots = x_{q-1} = 1$, we obtain
$$\sum_{w_0 + \ldots + w_{q-1} = n} \frac{n!}{\prod_{i \in \mathcal{Q}} w_i!} w_i w_j = n(n-1) q^{n-2}.$$
Then,
$$\frac{1}{q^n} \sum_{w_0 + \ldots + w_{q-1} = n} \frac{n!}{\prod_{i \in \mathcal{Q}} w_i!} \sum_{i=0}^{q-2} w_i w_{i+1} = \frac{q-1}{q^2} n(n-1),$$
which proves the theorem.
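Theorem 2 can also be checked by brute force for small parameters; a short Python verification (our own illustration):

```python
from itertools import product
from math import factorial

def multinomial(n, w):
    m = factorial(n)
    for wi in w:
        m //= factorial(wi)
    return m

q, n = 3, 5
# left-hand side of Theorem 2, summed over all composition vectors
lhs = sum(
    multinomial(n, w) * sum(w[i] * w[i + 1] for i in range(q - 1))
    for w in product(range(n + 1), repeat=q) if sum(w) == n
) / q**n
rhs = (q - 1) / q**2 * n * (n - 1)
assert abs(lhs - rhs) < 1e-9      # both equal 40/9 for q = 3, n = 5
```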
With the above theorem, we simply have
$$\text{WER}_{\text{pl}} < \frac{q-1}{q^2} n(n-1) Q\!\left(\frac{1}{\sqrt{2}\sigma}\right). \quad (41)$$
A comparison of (37) and (41) makes it clear that at high SNRs the difference in label and payload WERs is approximately a factor of $n^2/p^2$. As the length of the label, $p$, is normally considerably shorter than the length, $n$, of the payload, we conclude that the probability of a label error is much smaller than that of a payload error, so that, in this range, only in rare cases will label errors be the cause of payload errors. In the range $n \gg 1$ and $\sigma \ll 1$, the bit error rate (BER) of the payload can be approximated by
$$\text{BER} \approx \frac{2}{n} \text{WER}_{\text{pl}}, \quad n \gg 1, \ \sigma \ll 1, \quad (42)$$
as the majority of errors is caused by the swapping of two symbols. In the next section, we present results of computations and simulations.
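The bounds (37) and (41) are straightforward to evaluate; the sketch below (our own illustration, using the parameters of Fig. 1) computes them as a function of the SNR:

```python
from math import erfc, sqrt

def Q(x):
    # Gaussian tail function (29): Q(x) = 0.5 * erfc(x / sqrt(2))
    return 0.5 * erfc(x / sqrt(2))

def wer_label(q, p, sigma):
    return (q - 1) / q**2 * p**2 * Q(1 / (sqrt(2) * sigma))          # (37)

def wer_payload(q, n, sigma):
    return (q - 1) / q**2 * n * (n - 1) * Q(1 / (sqrt(2) * sigma))   # (41)

for snr_db in (14, 15, 16, 17):       # SNR = -20 log10(sigma) dB
    sigma = 10 ** (-snr_db / 20)
    print(snr_db, wer_label(2, 8, sigma), wer_payload(2, 64, sigma))
```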
VII. RESULTS OF COMPUTATIONS AND SIMULATIONS
We have implemented the proposed coding and detection technique, and verified the computed error performance using computer simulations. Figure 1 shows an example of computed and simulated results for the case $q = 2$ and $n = 64$. The label has length $p = 8$, where each label has four zeros and four ones. The signal-to-noise ratio (SNR) equals $-20 \log_{10} \sigma$ dB. The diagram shows the word error rate of the main data (including the errors caused by errors in a label) and the word error rate of the label versus SNR. The difference between the WERs of the payload and label at high SNRs is approximately a factor of $n(n-1)/p^2 = 63$, see (37) and (41).

In order to compare the new technique with the prior art technology, we have simulated the encoding of a 64-bit payload into a 68-bit codeword having 34 ones and 34 zeros by applying Schalkwijk's enumeration technique [15]. Note that 68 is the smallest even integer $m$ for which $\binom{m}{m/2} > 2^{64}$. Figure 2 shows the bit error rate (BER) of a) the prior art using Schalkwijk's enumeration scheme, and b) the new technique that uses an 8-bit label (as displayed in Figure 1). Both schemes carry a 64-bit payload. The difference in rate of the two techniques, 64/68 versus 64/72, affects the magnitude of the noise variance. The rate effect is insignificant in this case and is therefore ignored in the simulations presented in Figure 2. We notice that Schalkwijk's prior art enumeration scheme shows severe error propagation, a phenomenon that has been reported in the literature [21].
VIII. CONCLUSIONS
In the proposed composition check codes, the $n$-symbol $q$-ary main data are sent unmodified to the receiver. The encoder computes the composition vector of the main data, and appends a $p$-symbol $q$-ary label to the main data, which carries information regarding the composition vector of the main data. The receiver detects the label using a first Slepian detector, and subsequently retrieves the composition vector of the main data. The retrieved composition vector, in turn, is used by a second Slepian detector to optimally detect the $n$-symbol $q$-ary main data. We have analyzed the redundancy of the proposed method, described complexity issues of the encoding and decoding of the $p$-symbol $q$-ary label, and analyzed the error performance of the main data and the label. We have shown results of simulations and computations of both word and bit error rates.
Fig. 1. Word error rate (WER) of the main data of length $n = 64$ and appended label of length $p = 8$ for the binary case $q = 2$, WER versus SNR (dB). The signal-to-noise ratio (SNR) equals $-20 \log_{10} \sigma$ dB. The dotted lines are obtained by simulations, while the undotted lines show the computed performance invoking (37) and (41).
Fig. 2. Bit error rate (BER) of the main binary data of length $n = 64$ using the prior art enumeration method and the new method, BER versus SNR (dB). The signal-to-noise ratio (SNR) equals $-20 \log_{10} \sigma$ dB. The dotted lines are obtained by simulations, while the undotted line shows the computed performance invoking (42).
REFERENCES
[1] K. A. S. Immink and J. H. Weber, "Minimum Pearson Distance Detection for Multi-Level Channels with Gain and/or Offset Mismatch," IEEE Trans. Inform. Theory, vol. IT-60, pp. 5966-5974, Oct. 2014.
[2] D. Slepian, "Permutation Modulation," Proc. IEEE, vol. 53, pp. 228-236, March 1965.
[3] W. Chu, C. J. Colbourn, and P. Dukes, "On Constant Composition Codes," Discrete Applied Mathematics, vol. 154, no. 6, pp. 912-929, April 2006.
[4] W. E. Ryan and S. Lin, Channel Codes, Classical and Modern, Cambridge University Press, 2009.
[5] S. Datta and S. W. McLaughlin, "An Enumerative Method for Runlength-Limited Codes: Permutation Codes," IEEE Trans. Inform. Theory, vol. IT-45, no. 6, pp. 2199-2204, Sept. 1999.
[6] D. Pelusi, S. Elmougy, L. G. Tallini, and B. Bose, "m-ary Balanced Codes With Parallel Decoding," IEEE Trans. Inform. Theory, vol. IT-61, pp. 3251-3264, May 2015.
[7] J. H. Weber and K. A. S. Immink, "Knuth's Balancing of Codewords Revisited," IEEE Trans. Inform. Theory, vol. 56, no. 4, pp. 1673-1679, 2010.
[8] V. Skachek and K. A. S. Immink, "Constant Weight Codes: An Approach Based on Knuth's Balancing Method," IEEE Journal on Selected Areas in Communications, Special Issue on Mass Storage Systems, vol. 32, no. 5, pp. 908-918, May 2014.
[9] R. M. Capocelli, L. Gargano, and U. Vaccaro, "Efficient q-ary immutable codes," Discrete Applied Mathematics, vol. 33, pp. 25-41, 1991.
[10] L. G. Tallini and U. Vaccaro, "Efficient m-ary balanced codes," Discrete Applied Mathematics, vol. 92, no. 1, pp. 17-56, 1999.
[11] T. G. Swart and J. H. Weber, "Efficient Balancing of q-ary Sequences with Parallel Decoding," IEEE International Symposium on Information Theory (ISIT 2009), Seoul, pp. 1564-1568, June 29 - July 3, 2009.
[12] Y. Li, E. En Gad, A. Jiang, and J. Bruck, "Data archiving in 1x-nm NAND flash memories: Enabling long-term storage using rank modulation and scrubbing," 2016 IEEE International Reliability Physics Symposium, 2016.
[13] D. E. Knuth, "Efficient Balanced Codes," IEEE Trans. Inform. Theory, vol. IT-32, no. 1, pp. 51-53, Jan. 1986.
[14] O. Milenkovic and B. Vasic, "Permutation (d, k) Codes: Efficient Enumerative Coding and Phrase Length Distribution Shaping," IEEE Trans. Inform. Theory, vol. IT-46, no. 7, pp. 2671-2675, Nov. 2000.
[15] J. P. M. Schalkwijk, "An Algorithm for Source Coding," IEEE Trans. Inform. Theory, vol. IT-18, pp. 395-399, 1972.
[16] W. Feller, An Introduction to Probability Theory and Its Applications, Volume I, Wiley and Sons, New York, 1950.
[17] T. M. Cover, "Enumerative Source Coding," IEEE Trans. Inform. Theory, vol. IT-19, no. 1, pp. 73-77, Jan. 1973.
[18] D. E. Knuth, "Positional Number Systems," The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, 3rd ed., Reading, MA: Addison-Wesley, pp. 195-213, 1998.
[19] G. D. Forney, Jr., "Maximum-Likelihood Sequence Estimation of Digital Sequences in the Presence of Intersymbol Interference," IEEE Trans. Inform. Theory, vol. IT-18, pp. 363-378, May 1972.
[20] J. Riordan, An Introduction to Combinatorial Analysis, Princeton University Press, 1980.
[21] K. A. S. Immink and A. J. E. M. Janssen, "Error propagation assessment of enumerative coding schemes," IEEE Trans. Inform. Theory, vol. IT-45, no. 7, pp. 2591-2594, Nov. 1999.
Kees Schouhamer Immink (M'81-SM'86-F'90) received his PhD degree from the Eindhoven University of Technology. From 1994 until 2014, he was an adjunct professor at the Institute for Experimental Mathematics, Essen, Germany. In 1998, he founded Turing Machines Inc., an innovative start-up focused on novel signal processing for hard disk drives and solid-state (Flash) memories.

He received a Knighthood in 2000, a personal Emmy award in 2004, the 2017 IEEE Medal of Honor, the 1999 AES Gold Medal, the 2004 SMPTE Progress Medal, and the 2015 IET Faraday Medal. He received the Golden Jubilee Award for Technological Innovation from the IEEE Information Theory Society in 1998. He was elected into the (US) National Academy of Engineering. He received an honorary doctorate from the University of Johannesburg in 2014.
Kui Cai received the B.E. degree in information and control engineering from Shanghai Jiao Tong University, Shanghai, China, the M.Eng. degree in electrical engineering from the National University of Singapore, and a joint Ph.D. degree in electrical engineering from the Technical University of Eindhoven, The Netherlands, and the National University of Singapore.

Currently, she is an Associate Professor with the Singapore University of Technology and Design (SUTD). She received the 2008 IEEE Communications Society Best Paper Award in Coding and Signal Processing for Data Storage. She served as the Vice-Chair (Academia) of the IEEE Communications Society Data Storage Technical Committee (DSTC) during 2015 and 2016. Her main research interests are in the areas of coding theory, information theory, and signal processing for various data storage systems and digital communications.