
Composition Check Codes

Kees A. Schouhamer Immink and Kui Cai

Abstract—We present composition check codes for noisy

storage and transmission channels with unknown gain and/or

offset. In the proposed composition check code, like in

systematic error correcting codes, the encoding of the main

data into a constant composition code is completely avoided.

To the main data a coded label is appended that carries

information regarding the composition vector of the main

data. Slepian’s optimal detection technique of codewords

that are taken from a constant composition code is applied

for detection. A first Slepian detector detects the label, and

subsequently restores the composition vector of the main data.

The composition vector, in turn, is used by a second Slepian

detector to optimally detect the main data. We compute the

redundancy and error performance of the new method, and

results of computer simulations are presented.

Index Terms—Constant composition code, permutation

code, flash memory, optical recording

I. INTRODUCTION

The receiver of a transmission or storage system is often

ignorant of the exact value of the amplitude (gain) and/or

offset (translation) of the received signal, which depend

on the actual, time-varying, conditions of the channel.

In wireless communications, for example, the amplitude

of the received signal may vary rapidly due to multipath propagation or due to obstacles affecting the wave

propagation. In optical disc recording, both the gain and

offset depend on the reflective index of the disc surface

and the dimensions of the written features. Fingerprints on

optical discs may result in rapid gain and offset variations

of the retrieved signal. Assume the q-level pulse amplitude modulated (PAM) signal, x_i, i = 1, 2, ..., is sent and received as r_i, where

r_i = a(x_i + ν_i) + b.

The reals a > 0 and b are called the gain and offset of the received signal, respectively, and we assume that the receiver is ignorant of the actual values of a and b. The stochastic component is called 'noise' and is denoted by ν_i. We further assume that the parameters a and b vary slowly over time or position, so that for a plurality of n, n > 1,

Kees A. Schouhamer Immink is with Turing Machines Inc., Willemskade 15d, 3016 DK Rotterdam, The Netherlands. E-mail: immink@turing-machines.com.

Kui Cai is with Singapore University of Technology and Design (SUTD), 8 Somapah Rd, 487372, Singapore. E-mail: cai_kui@sutd.edu.sg.

This work is supported in part by Singapore Agency of Science and

Technology (A*Star) Public Sector Research Funding (PSF) grant.

Copyright (c) 2017 IEEE. Personal use of this material is permitted.

However, permission to use this material for any other purposes must be

obtained from the IEEE by sending a request to pubs-permissions@ieee.org

symbol time slots the parameters a and b can be considered fixed, but unknown to the receiver. The receiver's ignorance of the exact values of a and b may seriously degrade the error performance of the transmission or storage channel, as has been shown in [1].

There is a myriad of proposals to handle the problem of the channel's unknown gain and offset. Automatic gain control (AGC) has been applied in many practical transmission systems, but an AGC is close to useless if the gain and offset vary very rapidly.

Redundant training sequences or reference memory cells with prescribed levels are placed between 'user' data for estimating the unknown parameters. The parameter estimation will, by necessity, be based on an average over a limited time interval, and the estimated values may be inaccurate as they lag behind the actual values. A more frequent insertion of reference cells may improve the parameter estimation, which, however, comes at the cost of higher redundancy and thus decreased payload.

Slepian showed in his seminal paper [2] that the error performance of optimal detection of codewords that are drawn from a single constant composition code is immune to gain and offset mismatch. He also presented an implementation of optimal detection whose complexity grows with n log n. A constant composition code of length-n codewords over the q-ary alphabet has the property that the numbers of occurrences of the symbols within a codeword are the same for each codeword [3].

In practice, however, Slepian’s detection method has

limited applicability as it depends heavily on the efﬁcient

and simple encoding and decoding of arbitrary user data

into a constant composition code. Encoding and decoding

of constant composition code is a ﬁeld of active research,

see, for example [4], [5], [6]. For the binary case, Weber

and Immink [7] and Skachek et al. [8] presented methods

that translate arbitrary data into a codeword having a pre-

scribed number of one’s and zero’s. Enumerative methods

for generating codewords have been presented in [9], [10],

[11]. A serious drawback of enumeration schemes is error

propagation, a phenomenon illustrated in Section VII. The

lack of simple and efﬁcient encoding and decoding schemes

has been a major barrier for the application of Slepian’s

optimal detection method. Thus, an efﬁcient technique to

eliminate, or at least signiﬁcantly alleviate, the drawbacks

and deﬁciencies of Slepian’s prior art system has been a

desideratum.

The scheme proposed and analyzed here, coined composition check code, meets the above desideratum as it has the virtues of Slepian's optimal detection method, but its drawback, the encoding of the main data, or payload symbols, into a constant composition code, is removed. In the proposed scheme, the main data are sent to the receiver without modification. Attached to the main data word is a relatively short, fixed-length label that informs the receiver regarding the constant composition code to which the sent main data word belongs. The information conveyed by the label is used by the receiver to optimally recover the main data using Slepian's optimal detection method. The proposed method is reminiscent of a systematic error correcting code, where unmodified main data is sent, and a parity check word is appended to make error correction or detection possible. A system using a variant of the proposed scheme was discussed recently by Li et al. [12]. In Li's method, however, the label is not encoded into a constant composition code, and hence this portion is not immune to the unknown gain and offset. Also, Li's system needs two detectors for the payload portion and the label portion (i.e., the conventional threshold detector and the Slepian detector), which all but doubles the detector complexity. The proposed method is also reminiscent of Knuth's method [13] for generating codewords having equal numbers of ones and zeros, where an appended prefix carries information regarding the specific segment of the codeword that has been modified.

It should be noted that the proposed technique has the principal virtues of Slepian's prior art method, such as enabling simple optimal detection of the noisy codewords and immunity to the channel's gain and offset mismatch. However, the generated codewords do not belong to a prescribed constant composition code, and therefore they do not possess the spectral properties, specifically reduced power at the low-frequency end, of codewords that are drawn from a constant composition code.

A second advantage of the new scheme was noted in [12]. Since the payload is "systematic", the payload can be protected by a conventional error-correcting code (ECC). Much stronger error correcting schemes are known for conventional channels than those that are subsets of constant composition codes.

The paper is organized as follows. In Section II, we

set the scene, introduce preliminaries and discuss the state

of the art. In Section III, we present our approach. In

Section IV we compute the redundancy of the proposed

method. Complexity issues are dealt with in Section V. In

Sections VI and VII, we analyze and compute the error

performance. In Section VIII, we describe our conclusions.

II. PRELIMINARIES

We assume that user data is recorded in groups of n q-level symbols, called a codeword. We consider a codebook, S_w, of chosen codewords x = (x_1, x_2, ..., x_n) over the q-ary alphabet Q = {0, ..., q−1}, where n, the length of x, is a positive integer. In line with the adopted linear channel model, we assume that the codeword, x, is retrieved as

r = a(x + ν) + b·1,    (1)

where r = (r_1, ..., r_n), r_i ∈ R, and 1 = (1, ..., 1). The basic premises are that x is retrieved with an unknown (positive) gain a, a > 0, is offset by an unknown uniform offset, b·1, where a and b ∈ R, and is corrupted by additive Gaussian noise ν = (ν_1, ..., ν_n), where the ν_i ∈ R are noise samples with distribution N(0, σ²), and σ² ∈ R denotes the variance of the additive noise.

A. Constant composition codes

Define the composition vector w(x) = (w_0, ..., w_q−1) of x, where the q entries w_j, j ∈ Q, of w(x) indicate the number of positions i, 1 ≤ i ≤ n, with x_i = j. That is, for a q-ary sequence x, we denote the number of appearances of the symbol j by

w_j = |{i : x_i = j}| for j = 0, 1, ..., q−1.    (2)

Clearly, Σ_j w_j = n and w_j ∈ {0, ..., n}. A constant composition code comprising all possible n-vectors with the same composition vector w(x) is denoted by S_w. Evidently, every codeword has w_j occurrences of symbol j ∈ Q. The code S_w consists of all permutations of the symbols defined by the composition vector w, so that the size of S_w equals the multinomial coefficient

|S_w| = n! / (∏_{i∈Q} w_i!).    (3)

A constant composition code is also known as a permutation modulation code (Variant I), which was introduced by Slepian [2] in 1965. Slepian showed that a constant composition code allows optimal detection using a simple algorithm.
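Definitions (2) and (3) are simple enough to sketch in a few lines of code. The following is a minimal illustration (Python here for concreteness; the function names are our own, not from the paper):

```python
from math import factorial

def composition_vector(x, q):
    # eq. (2): w_j counts the occurrences of symbol j in the word x
    return tuple(x.count(j) for j in range(q))

def code_size(w):
    # eq. (3): |S_w| = n! / (w_0! w_1! ... w_{q-1}!), with n = sum of the w_j
    size = factorial(sum(w))
    for wj in w:
        size //= factorial(wj)
    return size
```

For instance, the word (0, 2, 1, 1, 2) over Q = {0, 1, 2} has composition vector (1, 2, 2), and the constant composition code with that vector has 5!/(1!·2!·2!) = 30 codewords.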

B. Slepian’s Algorithm

The well-known (squared) Euclidean distance, δ_e(r, x̂), between the received signal vector r and the codeword x̂ ∈ S_w is defined by

δ_e(r, x̂) = Σ_{i=1}^{n} (r_i − x̂_i)².    (4)

A minimum Euclidean distance detector outputs the codeword x_o defined by

x_o = arg min_{x̂ ∈ S_w} δ_e(r, x̂).    (5)

Working out (4) gives

δ_e(r, x̂) = Σ_{i=1}^{n} (x′_i + b)² − 2 Σ_{i=1}^{n} x′_i x̂_i − 2b Σ_{i=1}^{n} x̂_i + Σ_{i=1}^{n} x̂_i²,    (6)

where x′_i = a(x_i + ν_i). Evidently, the Euclidean distance δ_e(r, x̂) depends on the quantities a and b, which may lead to a serious degradation of the error performance [1].

The first term of (6), Σ_{i=1}^{n} (x′_i + b)², is independent of x̂, and clearly dropping this constant term does not affect the outcome of (5). In a similar fashion, we can drop the quantities 2b Σ_{i=1}^{n} x̂_i and Σ_{i=1}^{n} x̂_i², since the vector x̂ is drawn from a constant composition code, so that both quantities are constant for all x̂ ∈ S_w. Then we find

δ_e(r, x̂) ≡ − Σ_{i=1}^{n} r_i x̂_i,    (7)

where the sign ≡ denotes equivalence between (4) and (7), since the outcome of (5) is the same when (7) is used instead of (4). Thus the channel's unknown gain, a, and offset, b, do not affect the outcome of (5) when codewords are drawn from a constant composition code S_w. We now address the efficient evaluation of the inner product (7) using Slepian's algorithm.

Slepian [2] showed that the minimization (5) can be replaced by a simple sorting of the symbols of the received signal vector r. He proved that for two given vectors, (x̂_1, ..., x̂_n) and (r_1, ..., r_n), the inner product (7),

r_1 x̂_{i_1} + r_2 x̂_{i_2} + ... + r_n x̂_{i_n},    (8)

is maximized over all permutations i_1, i_2, ..., i_n of the integers 1, 2, ..., n by pairing the largest x̂_i with the largest r_i, the second largest x̂_i with the second largest r_i, and so on. To that end, the n elements of the received vector, r, are sorted from largest to smallest. From the composition vector, w, of the codeword, x, at hand, we deduce the reference vector x_r = (q−1, ..., q−1, q−2, ..., q−2, ..., 0, ..., 0), where the symbols are sorted from largest to smallest, and the numbers of (q−1)'s, (q−2)'s, and so on in x_r equal w_{q−1}, w_{q−2}, ..., w_0. Slepian's algorithm is attractive since the complexity of sorting n symbols grows with n log n, which is far less complex than the evaluation of (5), whose complexity grows exponentially with n. A small example may clarify Slepian's algorithm.

Example 1: Let n = 5, q = 3, and let the composition vector be w = (1, 2, 2). Thus each sent codeword is a permutation of the reference vector x_r = (2, 2, 1, 1, 0), where the symbols of the reference vector have been sorted largest to smallest. Let the received vector be r = (0.2, 1.4, 0.9, 1.2, 1.6). We sort the symbols in the received vector r in decreasing order and obtain (1.6, 1.4, 1.2, 0.9, 0.2). Then the detector assigns the symbols by pairing the largest symbols of x_r and r, that is, the symbol valued 1.6 to a '2', then 1.4 to a '2', 1.2 to a '1', 0.9 to a '1', and finally the symbol valued 0.2 to a '0'. The detector decides that the codeword (0, 2, 1, 1, 2) was sent.
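The sorting detector of Subsection II-B can be sketched compactly; the following Python fragment reproduces Example 1 (the function name is our own, and ties in r are broken arbitrarily):

```python
def slepian_detect(r, w):
    q = len(w)
    # Reference vector x_r: w[j] copies of symbol j, largest symbol first.
    ref = [j for j in range(q - 1, -1, -1) for _ in range(w[j])]
    # Sort received samples from largest to smallest and pair the k-th
    # largest sample with the k-th entry of x_r.
    order = sorted(range(len(r)), key=lambda i: r[i], reverse=True)
    x = [0] * len(r)
    for rank, i in enumerate(order):
        x[i] = ref[rank]
    return x
```

With r = (0.2, 1.4, 0.9, 1.2, 1.6) and w = (1, 2, 2), the routine returns (0, 2, 1, 1, 2), as in the example.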

III. COMPOSITION CHECK CODES

A drawback of the usage of a constant composition code in Slepian's prior art is the complexity of the encoding and decoding operation in case the payload is large. Encoding algorithms, such as enumerative encoding [14], [15], require much smaller look-up tables than direct mapping, but they often still require complex look-up tables and algorithms.

In the proposed composition check code, the encoding of arbitrary data into a constant composition code is avoided. The n-symbol main data word, x, is sent without any modification, and a separate p-symbol label, denoted by z = (z_1, ..., z_p), z_i ∈ Q, is appended to the main data word. The appended p-symbol label, z, informs the Slepian detector to which constant composition code the main data word, x, belongs. To that end, we define a one-to-one correspondence between the set of all possible composition vectors of the n-symbol payload and the set of p-symbol labels. The number of possible distinct composition vectors, denoted by N(q, n), of a q-ary n-vector equals [16, page 38]

N(q, n) = C(n+q−1, q−1).    (9)

The length of the label, p, must be chosen sufficiently large so that the label can uniquely convey the identity of the constant composition code. In the binary case, the encoded label represents the number of ones in the main data word. The procedure for encoding and decoding is succinctly written as follows.

Encoding/Decoding: The main (user) data, denoted by x, which consists of n q-ary symbols, is transferred to the encoder. The encoder first forms the composition vector w = (w_0, ..., w_{q−1}) of x using (2), and translates the vector w into the p-symbol q-ary label, z, using a predefined one-to-one correspondence, z = ϕ(w). The label, z, is appended to the main data, and the main data plus the label are sent. The one-to-one correspondence z = ϕ(w) can simply be embodied by a look-up table for small values of q and n. In practice, for larger values of n and q, the function z = ϕ(w) is a two-step process, where z = ϕ(w) is partitioned into a cascade of two functions, I = ϕ_1(w) and z = ϕ_2(I), where I is a non-negative integer. In the first step, the (compression) function I = ϕ_1(w) translates the composition vector w into an integer in the range 0 to at most (n+1)^q − 1. The vector w is redundant since we have the constraint Σ_{i=0}^{q−1} w_i = n. In case the composition vector w is ideally compressed, the integer I ranges from 0 to N(q, n) − 1. In the second step, the function z = ϕ_2(I) translates the integer I into the p-symbol q-ary label. Practical issues regarding the implementation of the functions I = ϕ_1(w) and z = ϕ_2(I) for larger values of n and q are given in Section V.
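For small n and q, the whole encoding step can be illustrated with a table-based ϕ. The following Python sketch is our own illustration: the lexicographic enumeration order of the table and the function names are choices made here, not prescribed by the scheme; ϕ_2 is realized as a plain base-q conversion.

```python
def compositions(n, q):
    # All ways to write n as an ordered sum of q nonnegative integers,
    # generated in lexicographic order.
    if q == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, q - 1):
            yield (first,) + rest

def encode(x, q, p):
    n = len(x)
    w = tuple(x.count(j) for j in range(q))   # composition vector, eq. (2)
    # I = phi_1(w): index of w in a fixed enumeration (look-up table)
    I = {v: i for i, v in enumerate(compositions(n, q))}[w]
    z = []                                    # z = phi_2(I): p base-q digits
    for _ in range(p):
        z.append(I % q)
        I //= q
    return list(x) + z[::-1]                  # main data with appended label
```

For n = 5, q = 3 there are N(3, 5) = C(7, 2) = 21 composition vectors, so a label of p = 3 ternary symbols (27 ≥ 21) suffices.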

Note that the sent concatenation of x and z does not have special spectral characteristics; it is not 'balanced' or 'dc-free'.

The label, z, is detected, preferably using Slepian's optimal method, decoded by a look-up table, ϕ^{−1}(z), and the composition vector w = ϕ^{−1}(z) is retrieved. Following Slepian's method, see Subsection II-B, the received main data symbols are sorted and assigned to the symbols in accordance with the retrieved composition vector w.

It is sufficient to uniquely encode the N(q, n) different composition vectors into p = ⌈log_q N(q, n)⌉ label symbols, but in a preferred embodiment, the label is a codeword taken from a predefined p-symbol constant composition code. The preferred embodiment has the advantage that, firstly, Slepian's optimal method is used for both the main data word and the label, giving them both a high resilience to additive noise, and, secondly, both the main data word and the label are immune to channel mismatch. These attractive virtues come at a price, and in the next section, we compute the redundancy of composition check codes.

IV. REDUNDANCY ANALYSIS

We discuss two label formatting options, where a) the label is uncoded as in [12], and b) the label is encoded using a constant composition code.

The p-symbol label must be able to uniquely represent all N(q, n) distinct composition vectors of the n-symbol payload. Thus, for an uncoded label, we find the condition

p ≥ ⌈log_q N(q, n)⌉.    (10)

For asymptotically large n and limited q, we obtain, using Stirling's approximation for a binomial coefficient,

N(q, n) = C(n+q−1, q−1) ≈ n^{q−1} / (q−1)!,  n ≫ 1,    (11)

so that the code redundancy, p, equals

p ≈ (q−1) log_q n − log_q (q−1)!.    (12)

In case the p-symbol label is encoded into a q-ary constant composition code, we have the condition

p! / (∏_{i∈Q} ŵ_i!) ≥ N(q, n),

where ŵ denotes the composition vector of the p-symbol label. The number of labels is maximized if we choose p = aq and ŵ_i = a for all i ∈ Q. Then the label length, p, must be sufficiently large to satisfy

p! / ((p/q)!)^q ≥ N(q, n).    (13)

Since, using Stirling’s Approximation,

p!

(p

q!)q≈αq

qp

p(q−1)/2,(14)

where

αq=q(q/2)

(2π)(q−1)/2,

we have

αq

qp

p(q−1)/2≥1

(q−1)!nq−1, p, n ≫1,(15)

or

logqαq−q−1

2logqp+p≥(q−1) logqn−logq(q−1)!.

(16)

For asymptotically large n, we have the estimate of the

redundancy

p > (q−1) logqn. (17)

For q = 2 we simply find

C(p, ⌈p/2⌉) ≥ n + 1,    (18)

which is about the required redundancy of Knuth's code for balancing binary sequences.

The redundancy, rs, of Slepian’s prior art method, where

the payload is translated into a constant composition code

where all symbols appear with frequency n

q, is

rs= logq

qn

n!

(n

q!)n

≈q−1

2logqn−logqαq.(19)

A comparison with (17) reveals that for large nthe redun-

dancy of the proposed scheme is approximately a factor of

two more than can be obtained by the conventional method

using a ﬁxed constant composition code. Apparently, this is

the price to pay for a simple implementation. A variable-

length label that takes into account the probability of occur-

rence of the label instead of the ﬁxed-length label studied

here, will reduce the required redundancy of the method [7].

V. COMPLEXITY ISSUES

For relatively small n and q, the composition vector w can be straightforwardly translated into a p-symbol q-ary label z by using a look-up table that embodies the one-to-one correspondence z = ϕ(w). We infer from (11) that, although N(q, n) grows polynomially with the codeword length, n, for larger alphabet size q the number of entries of a look-up table can be prohibitively large. For a practical application, we must try to find an algorithmic routine in lieu of look-up tables. We present two alternative scenarios. We encode (compress) the composition vector w using an algorithmic (enumeration) approach. Alternatively, we do not compress the composition vector w, and we compute the redundancy loss.

We commence, in the next subsection, with the compression of the composition vector, w, using Cover's enumerative coding techniques [17].

A. Compressed composition vector, enumerative encoding of the composition vector

The translation function, I = ϕ_1(w), of the composition vector, w, into an integer I, 0 ≤ I ≤ N(q, n) − 1, can be accomplished using enumerative encoding. In an enumerative coding scheme, the codewords are ranked in lexicographical order [17]. The lexicographical index, or rank, I, of a codeword, x, in the ordered list equals the number of codewords preceding x in the ordered list. Using the findings of [17], we write down the next theorem.

Theorem 1:

I = Σ_{i=1}^{q−1} Σ_{j=0}^{w_{i−1}−1} C(n′ − j + q − i − 1, q − i − 1),    (20)

where

n′ = n − Σ_{i′=1}^{i−1} w_{i′−1},

and I ∈ {0, ..., N(q, n) − 1}.

Proof: We follow Cover’s approach [17]. Let

ns(w0, w1, . . . , wk−1)denote the number of composition

vectors for which the ﬁrst kcoordinates are given by

(w0, w1, . . . , wk−1). According to Cover, the lexicographic

index, I, is given by

I=

q−1

i=1

xi−1−1

j=0

ns(w0, w1, . . . , wi−2, j).(21)

We have

ns(w0, w1, . . . , wi−1) = n′+q−i−1

q−i−1,

where

n′=n−

i

i′=1

xi′−1.

Substitution yields (20), which concludes the proof.

The inverse function, w = ϕ_1^{−1}(I), is also calculated using an algorithmic approach, and we refer to [17] for details. The binomial coefficients can be computed on the fly, and look-up tables are not required.
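The ranking formula (20) translates into a short routine. The following is a sketch in Python (the function name is ours; `math.comb` stands in for the binomial coefficients computed on the fly):

```python
from math import comb

def rank_composition(w, n):
    # Lexicographic index I = phi_1(w) of the composition vector w, eq. (20).
    q = len(w)
    I, used = 0, 0
    for i in range(1, q):                 # i = 1, ..., q-1
        n_prime = n - used                # n' = n minus the earlier coordinates
        for j in range(w[i - 1]):         # j = 0, ..., w_{i-1} - 1
            I += comb(n_prime - j + q - i - 1, q - i - 1)
        used += w[i - 1]
    return I
```

For q = 3, n = 2, the composition vectors (0,0,2), (0,1,1), (0,2,0), (1,0,1), (1,1,0), (2,0,0) receive the indices 0 through 5, in lexicographic order.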

B. Uncompressed composition vector

Alternatively, we investigate the case that the vector, w, is not compressed. The q entries w_i of the composition vector w are in the alphabet {0, ..., n}, so that the composition vector w can be seen as a positive integer number of q (n+1)-ary digits. We may slightly compress the vector w by noting that the observation of q−1 entries uniquely identifies w, since Σ_{i=0}^{q−1} w_i = n. We study the increase of the redundancy, as the label must be able to accommodate the (n+1)^q different integer numbers that are associated with the uncompressed w.

To that end, let p′ denote the length of the label. In case the label is uncoded, the vector w is translated into the q-ary p′-symbol label using a well-known base conversion algorithm [18]. We have

p′ ≥ q ⌈log_q(n+1)⌉.    (22)

The relative increase in redundancy with respect to the compressed vector, w, is defined by

η = (p′ − p) / p.    (23)

Then

η = ( q⌈log_q(n+1)⌉ − ((q−1) log_q n − log_q (q−1)!) ) / ( (q−1) log_q n − log_q (q−1)! ).    (24)

For asymptotically large n, we find

η ≈ 1/q,  n ≫ 1,    (25)

and we conclude that the relative increase in redundancy by using the uncompressed composition vector, w, is inversely proportional to q.

We proceed and take a look at the redundancy of the coded label. The algorithmic encoding and decoding of an integer number in any base into a codeword of symbols in any base of a constant composition code using enumerative encoding has been published extensively in the literature; see, for example, [5]. The coded label length, p′, must be sufficiently large to satisfy (see (13) and (14))

p′! / ((p′/q)!)^q ≥ (n+1)^q,

which, for large n, can be approximated by

α_q q^{p′} / p′^{(q−1)/2} ≥ (n+1)^q,  n ≫ 1,

or

log_q α_q − ((q−1)/2) log_q p′ + p′ ≥ q log_q(n+1).

For asymptotically large n, we find

p′ > q log_q(n+1).    (26)

The relative extra redundancy required by the unconstrained algorithmic encoding of the composition vector w equals

η ≈ ( q log_q(n+1) − (q−1) log_q n ) / ( q log_q(n+1) ) ≈ 1/q,  n ≫ 1.    (27)

We infer that the relative extra redundancy for the method that employs traditional enumerative algorithmic encoding equals 1/q. For small values of q we may, dependent on the codeword length n, apply look-up tables for encoding the label, while for larger q we may employ algorithmic encoding without significant loss in redundancy. The next example shows numerical results.

Example 2: Let q = 3 and n = 64. From (9), we find that the number of distinct composition vectors equals N(q, n) = 2145. The N(q, n) = 2145 vectors can be encoded into a ternary label taken from a constant composition code of length 10. In case the label is not a member of a specified constant composition code, the label length can be slightly shorter, namely ⌈log_3 2145⌉ = 7. In case we do not compress the composition vector, we require a look-up table of (n+1)² = 65 × 65 = 4225 entries. The 4225 entries can be encoded into a specified constant composition code of length 11 or, alternatively, into an uncoded label of length ⌈log_3 4225⌉ = 8.
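The counts in Example 2 are easy to verify numerically. The sketch below (Python; `cc_size` is our own helper name) computes N(3, 64), the uncoded label length per (10), and the smallest length of a ternary constant composition code with at least N(q, n) codewords:

```python
from math import comb, factorial

q, n = 3, 64
N = comb(n + q - 1, q - 1)          # eq. (9): number of composition vectors

# Uncoded label: smallest p with q^p >= N, cf. eq. (10).
p_uncoded = 1
while q ** p_uncoded < N:
    p_uncoded += 1

def cc_size(p, q):
    # Size of the largest constant composition code of length p:
    # the composition is chosen as balanced as possible.
    base, extra = divmod(p, q)
    w = [base + 1] * extra + [base] * (q - extra)
    size = factorial(p)
    for wi in w:
        size //= factorial(wi)
    return size

# Coded label: smallest p whose best constant composition code covers N.
p_coded = 1
while cc_size(p_coded, q) < N:
    p_coded += 1
```

This yields N = 2145, p_uncoded = 7, and p_coded = 10, matching the example.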

VI. ERROR PERFORMANCE ANALYSIS

Decoding is a two-step process: first the label is detected, and subsequently the payload is retrieved by using the data conveyed by the label. Clearly, the payload is received in error if the p-symbol label is received in error, or, in case the label is correctly received, if the payload itself is received in error by the Slepian detector. We concentrate here on the block error rate of the outputted payload.

The p-symbol label is drawn from a fixed constant composition code, while the n-symbol payload is a member of a constant composition code (not necessarily the same code as that of the label), which may be different for each source word. We start by computing the error performance of a given constant composition code.

To that end, let x be a codeword taken from the constant composition code S_w. The word error rate (WER) averaged over all words x ∈ S_w is upper bounded (union bound) by

WER < (1/|S_w|) Σ_{x∈S_w} Σ_{x̂≠x} Q( √δ_e(x, x̂) / (2σ) ),    (28)

where the Q-function is defined by

Q(x) = (1/√(2π)) ∫_x^∞ e^{−u²/2} du.    (29)

Note that the error performance of the proposed method is invariant to the unknown gain, a, and offset, b, see (7), and that, obviously, these parameters are not present in the word error rate (28). For asymptotically large signal-to-noise ratios (SNR), i.e., for σ ≪ 1, the word error rate is overbounded by [19]

WER < N_w(q, n) Q( d_min / (2σ) ),    (30)

where N_w(q, n) is the average number of pairs of codewords (neighbors) at minimum Euclidean distance, d_min, and the squared minimum Euclidean distance is defined by

d²_min = min_{x, x̂ ∈ S_w, x ≠ x̂} δ_e(x, x̂).    (31)

A codeword x ∈ S_w is at minimum (squared) Euclidean distance δ_e(x, x̂) = 2 to x̂ ∈ S_w, since x̂ can be obtained by swapping two symbols in x, say x_i and x_j, where |x_i − x_j| = 1. So we have

d²_min ≥ 2.    (32)

In our analysis we assume a simple code having d_min = √2. The computation of the average number of neighboring pairs of codewords x̂ of x, both in S_w, at minimum distance d_min is a combinatorics exercise. Since x is a member of a constant composition code, we infer, for reasons of symmetry, that each x has the same number of nearest neighbors, so that it suffices to compute the number of nearest neighbors for one given x. A codeword x̂ is at (squared) Euclidean distance δ_e(x, x̂) = 2 if x̂ can be obtained by swapping two symbols in x, say x_i and x_j, where |x_i − x_j| = 1. We conclude that the number of pairs of codewords at distance δ_e(x, x̂) = 2 equals

N_w(q, n) = Σ_{i=0}^{q−2} w_i w_{i+1}.    (33)

For the binary case, q = 2, we simply find

N_w(2, n) = w_0 w_1 = w_0(n − w_0),    (34)

where w_0 denotes the number of zeros in a codeword. In case all q symbols appear exactly u, u ≥ 1, times, thus n = uq and w_i = u, 0 ≤ i ≤ q−1, we simply obtain, using (33),

N_w(q, n) = (q−1)u² = ((q−1)/q²) n².    (35)
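The neighbor counts (33)-(35) can be cross-checked with a one-line routine (Python; our own function name):

```python
def nearest_neighbor_pairs(w):
    # eq. (33): pairs of codewords at squared distance 2 for composition w
    return sum(w[i] * w[i + 1] for i in range(len(w) - 1))
```

For instance, for q = 2 with w = (4, 4) the count is 16 = w_0(n − w_0), per (34), and for the balanced case w = (u, ..., u) the count equals (q−1)u², per (35).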

As the label is encoded into a p-symbol constant composition code, we can straightforwardly compute, using (30) and (33), the error rate, denoted by WER_label, of the p-symbol label,

WER_label < N_ŵ(q, p) Q( 1/(√2 σ) ),    (36)

where ŵ denotes the composition vector of the constant composition code used for encoding the label. The label error rate, in case the labels are taken from the constant composition code with composition vector ŵ = (p/q, ..., p/q), equals

WER_label < ((q−1)/q²) p² Q( 1/(√2 σ) ).    (37)

The computation of the error performance of the n-symbol payload is more involved. The space of n-symbol payloads consists of C(n+q−1, q−1) distinct constant composition codes of size

|S_w| = n! / (∏_{i∈Q} w_i!).

The error performance of the payload is the weighted error performance of each constant composition code. Let WER_pl denote the word error rate of the payload given the label is received correctly. Then

WER_pl < N_pl(q, n) Q( 1/(√2 σ) ),    (38)

where

N_pl(q, n) = (1/q^n) Σ_w |S_w| N_w(q, n)    (39)
           = (1/q^n) Σ_{w_0+...+w_{q−1}=n} ( n! / ∏_{i∈Q} w_i! ) Σ_{i=0}^{q−2} w_i w_{i+1}.

The next theorem simplifies the above expression by invoking some well-known properties of multinomial coefficients.

Theorem 2:

(1/q^n) Σ_{w_0+...+w_{q−1}=n} ( n! / ∏_{i∈Q} w_i! ) Σ_{i=0}^{q−2} w_i w_{i+1} = ((q−1)/q²) n(n−1).

Proof: Following the multinomial theorem [20], we can write a sum of q (dummy) terms, x_i, i ∈ Q, as

(x_0 + ... + x_{q−1})^n = Σ_{w_0+...+w_{q−1}=n} ( n! / ∏_{i∈Q} w_i! ) ∏_{t∈Q} x_t^{w_t}.    (40)

We find, after substituting x_0 = ... = x_{q−1} = 1, the well-known identity

Σ_{w_0+...+w_{q−1}=n} n! / ∏_{i∈Q} w_i! = q^n.

After differentiating the right- and left-hand sides of (40) with respect to x_i and x_j, i, j ∈ Q, i ≠ j, and substituting x_0 = ... = x_{q−1} = 1, we obtain

Σ_{w_0+...+w_{q−1}=n} ( n! / ∏_{i∈Q} w_i! ) w_i w_j = n(n−1) q^{n−2}.

Then,

(1/q^n) Σ_{w_0+...+w_{q−1}=n} ( n! / ∏_{i∈Q} w_i! ) Σ_{i=0}^{q−2} w_i w_{i+1} = ((q−1)/q²) n(n−1),

which proves the theorem.
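For small parameters, Theorem 2 can also be checked by brute force, averaging (33) over all q^n words. A Python sketch (exhaustive, so only practical for tiny q and n; the function name is ours):

```python
from itertools import product

def avg_neighbor_pairs(q, n):
    # Left-hand side of Theorem 2: the average of eq. (33) over all
    # q-ary words of length n.
    total = 0
    for x in product(range(q), repeat=n):
        w = [x.count(j) for j in range(q)]
        total += sum(w[i] * w[i + 1] for i in range(q - 1))
    return total / q ** n
```

For q = 2, n = 3 the average is (1/4)·3·2 = 1.5, and for q = 3, n = 4 it is (2/9)·4·3, in agreement with the theorem.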

With the above theorem, we simply have

WER_pl < ((q−1)/q²) n(n−1) Q( 1/(√2 σ) ).    (41)

A comparison of (37) and (41) makes it clear that at high SNRs, the difference in label and payload WERs is approximately a factor of n²/p². As the length of the label, p, is normally considerably shorter than the length, n, of the payload, we conclude that the probability of a label error is much smaller than that of a payload error, so that, in this range, only in rare cases will label errors be the cause of payload errors. In the range n ≫ 1 and σ ≪ 1, the bit error rate (BER) of the payload can be approximated by

BER ≈ (2/n) WER_pl,  n ≫ 1, σ ≪ 1,    (42)

as the majority of word errors is caused by a swapping of two symbols, which affects two of the n symbols. In the next section, we present results of computations and simulations.

VII. RESULTS OF COMPUTATIONS AND SIMULATIONS

We have implemented the proposed coding and detection technique, and verified the computed error performance using computer simulations. Figure 1 shows an example of computed and simulated results for the case q = 2 and n = 64. The label has length p = 8, where each label has four zeros and four ones. The signal-to-noise ratio (SNR) equals −20 log_10 σ dB. The diagram shows the word error rate of the main data (including the errors caused by errors in a label) and the word error rate of the label versus SNR. The difference between the WERs of the payload and label at high SNRs is approximately a factor of n(n−1)/p² = 63, see (37) and (41).

In order to compare the new technique with the prior art technology, we have simulated the encoding of a 64-bit payload into a 68-bit codeword having 34 ones and 34 zeros by applying Schalkwijk's enumeration technique [15]. Note that 68 is the smallest even integer m for which C(m, m/2) > 2^64. Figure 2 shows the bit error rate (BER) of a) the prior art using Schalkwijk's enumeration scheme, and b) the new technique that uses an 8-bit label (as displayed in Figure 1). Both schemes carry a 64-bit payload. The difference in rate of the two techniques, 64/68 versus 64/72, affects the magnitude of the noise variance. The rate effect is insignificant in this case and therefore ignored in the simulations presented in Figure 2. We notice that Schalkwijk's prior art enumeration scheme shows severe error propagation, a phenomenon that has been reported in the literature [21].

VIII. CONCLUSIONS

In the proposed composition check codes, the n-symbol q-ary main data are sent unmodified to the receiver. The encoder computes the composition vector of the main data, and appends a p-symbol q-ary label to the main data, which carries information regarding the composition vector of the main data. The receiver detects the label using a first Slepian detector, and subsequently retrieves the composition vector of the main data. The retrieved composition vector, in turn, is used by a second Slepian detector to optimally detect the n-symbol q-ary main data. We have analyzed the redundancy of the proposed method, described complexity issues of the en(de)coding of the p-symbol q-ary label, and analyzed the error performance of the main data and the label. We have shown results of simulations and computations of both word and bit error rates.
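The two-stage decoder rests on Slepian's observation [2] that, once the composition of a word is known, detection reduces to sorting: the positions of the n0 smallest received samples are assigned symbol 0, the next n1 smallest symbol 1, and so on, which is insensitive to gain and offset. A minimal sketch (illustrative helper names, not from the paper):

```python
from collections import Counter

def composition_vector(word, q):
    """Occurrence count of each of the symbols 0..q-1 in a q-ary word."""
    counts = Counter(word)
    return [counts.get(s, 0) for s in range(q)]

def slepian_detect(received, composition):
    """Detect a word of known composition from noisy samples by rank:
    the positions of the n0 smallest samples get symbol 0, the next
    n1 get symbol 1, etc."""
    order = sorted(range(len(received)), key=lambda i: received[i])
    decided = [0] * len(received)
    k = 0
    for symbol, count in enumerate(composition):
        for i in order[k:k + count]:
            decided[i] = symbol
        k += count
    return decided

sent = [0, 1, 1, 3, 3, 3, 2, 0]          # q = 4 main data, n = 8
comp = composition_vector(sent, 4)        # [2, 2, 1, 3]
noise = [0.1, -0.2, 0.3, 0.2, -0.1, 0.1, -0.3, 0.2]
noisy = [s + e for s, e in zip(sent, noise)]
print(slepian_detect(noisy, comp) == sent)  # True
```

In the proposed scheme this detector is run twice: first on the label (whose composition is fixed by construction), then on the main data using the composition vector recovered from the label.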

Fig. 1. Word error rate (WER) of the main data of length n = 64 and appended label of length p = 8 for the binary case q = 2. The signal-to-noise ratio (SNR) equals −20 log10 σ dB. The dotted lines are obtained by simulations, while the solid lines show the computed performance invoking (37) and (41).

Fig. 2. Bit error rate (BER) of the main binary data of length n = 64 using the prior art enumeration method and the new method. The signal-to-noise ratio (SNR) equals −20 log10 σ dB. The dotted lines are obtained by simulations, while the solid line shows the computed performance invoking (42).

REFERENCES

[1] K. A. S. Immink and J. H. Weber, “Minimum Pearson Distance Detection for Multi-Level Channels with Gain and/or Offset Mismatch,” IEEE Trans. Inform. Theory, vol. IT-60, pp. 5966-5974, Oct. 2014.

[2] D. Slepian, “Permutation Modulation,” Proc. IEEE, vol. 53, pp. 228-236, March 1965.

[3] W. Chu, C. J. Colbourn, and P. Dukes, “On Constant Composition Codes,” Discrete Applied Mathematics, vol. 154, no. 6, pp. 912-929, April 2006.

[4] W. E. Ryan and S. Lin, Channel Codes, Classical and Modern, Cambridge University Press, 2009.

[5] S. Datta and S. W. McLaughlin, “An Enumerative Method for Runlength-Limited Codes: Permutation Codes,” IEEE Trans. Inform. Theory, vol. IT-45, no. 6, pp. 2199-2204, Sept. 1999.

[6] D. Pelusi, S. Elmougy, L. G. Tallini, and B. Bose, “m-ary Balanced Codes With Parallel Decoding,” IEEE Trans. Inform. Theory, vol. IT-61, pp. 3251-3264, May 2015.

[7] J. H. Weber and K. A. S. Immink, “Knuth’s Balancing of Codewords Revisited,” IEEE Trans. Inform. Theory, vol. 56, no. 4, pp. 1673-1679, 2010.

[8] V. Skachek and K. A. S. Immink, “Constant Weight Codes: An Approach Based on Knuth’s Balancing Method,” IEEE Journal on Selected Areas in Communications, Special Issue on Mass Storage Systems, vol. 32, no. 5, pp. 908-918, May 2014.

[9] R. M. Capocelli, L. Gargano, and U. Vaccaro, “Efficient q-ary immutable codes,” Discrete Applied Mathematics, vol. 33, pp. 25-41, 1991.

[10] L. G. Tallini and U. Vaccaro, “Efficient m-ary balanced codes,” Discrete Applied Mathematics, vol. 92, no. 1, pp. 17-56, 1999.

[11] T. G. Swart and J. H. Weber, “Efficient Balancing of q-ary Sequences with Parallel Decoding,” IEEE International Symposium on Information Theory, ISIT2009, pp. 1564-1568, Seoul, June 29 - July 3, 2009.

[12] Y. Li, E. En Gad, A. Jiang, and J. Bruck, “Data archiving in 1x-nm NAND flash memories: Enabling long-term storage using rank modulation and scrubbing,” 2016 IEEE International Reliability Physics Symposium, 2016.

[13] D. E. Knuth, “Efficient Balanced Codes,” IEEE Trans. Inform. Theory, vol. IT-32, no. 1, pp. 51-53, Jan. 1986.

[14] O. Milenkovic and B. Vasic, “Permutation (d, k) codes: Efficient Enumerative Coding and Phrase Length Distribution Shaping,” IEEE Trans. Inform. Theory, vol. IT-46, no. 7, pp. 2671-2675, Nov. 2000.

[15] J. P. M. Schalkwijk, “An Algorithm for Source Coding,” IEEE Trans. Inform. Theory, vol. IT-18, pp. 395-399, 1972.

[16] W. Feller, An Introduction to Probability Theory and Its Applications, Volume I, Wiley and Sons Inc., New York, 1950.

[17] T. M. Cover, “Enumerative Source Coding,” IEEE Trans. Inform. Theory, vol. IT-19, no. 1, pp. 73-77, Jan. 1973.

[18] D. E. Knuth, “Positional Number Systems,” The Art of Computer Programming, Vol. 2: Semi-numerical Algorithms, 3rd ed., Reading, MA: Addison-Wesley, pp. 195-213, 1998.

[19] G. D. Forney Jr., “Maximum-Likelihood Sequence Estimation of Digital Sequences in the Presence of Intersymbol Interference,” IEEE Trans. Inform. Theory, vol. IT-18, pp. 363-378, May 1972.

[20] J. Riordan, An Introduction to Combinatorial Analysis, Princeton University Press, 1980.

[21] K. A. S. Immink and A. J. E. M. Janssen, “Error propagation assessment of enumerative coding schemes,” IEEE Trans. Inform. Theory, vol. IT-45, no. 7, pp. 2591-2594, Nov. 1999.

Kees Schouhamer Immink (M’81-SM’86-F’90) received his PhD degree from the Eindhoven University of Technology. From 1994 until 2014 he was an adjunct professor at the Institute for Experimental Mathematics, Essen, Germany. In 1998, he founded Turing Machines Inc., an innovative start-up focused on novel signal processing for hard disk drives and solid-state (Flash) memories.

He received a Knighthood in 2000, a personal Emmy award in 2004, the 2017 IEEE Medal of Honor, the 1999 AES Gold Medal, the 2004 SMPTE Progress Medal, and the 2015 IET Faraday Medal. He received the Golden Jubilee Award for Technological Innovation from the IEEE Information Theory Society in 1998. He was elected into the (US) National Academy of Engineering. He received an honorary doctorate from the University of Johannesburg in 2014.

Kui Cai received the B.E. degree in information and control engineering from Shanghai Jiao Tong University, Shanghai, China, the M.Eng. degree in electrical engineering from the National University of Singapore, and a joint Ph.D. degree in electrical engineering from the Technical University of Eindhoven, The Netherlands, and the National University of Singapore.

Currently, she is an Associate Professor with the Singapore University of Technology and Design (SUTD). She received the 2008 IEEE Communications Society Best Paper Award in Coding and Signal Processing for Data Storage. She served as the Vice-Chair (Academia) of the IEEE Communications Society Data Storage Technical Committee (DSTC) during 2015 and 2016. Her main research interests are in the areas of coding theory, information theory, and signal processing for various data storage systems and digital communications.