
Knuth’s Balancing of Codewords Revisited

Jos H. Weber

TU Delft, IRCTR/CWPC,

Mekelweg 4, 2628 CD Delft The Netherlands

Email: J.H.Weber@ewi.tudelft.nl

Kees A. Schouhamer Immink

Turing Machines Inc.

Willemskade 15b-d, 3016 DK Rotterdam The Netherlands

Email: immink@turing-machines.com

Abstract— In 1986, Don Knuth published a very simple algorithm for constructing sets of bipolar codewords with equal numbers of '1's and '-1's, called balanced codes. Knuth's algorithm is, since look-up tables are absent, well suited for use with large codewords. The redundancy of Knuth's balanced codes is a factor of two larger than that of a code comprising the full set of balanced codewords. In our paper we will present results of our attempts to improve the performance of Knuth's balanced codes.

Key words: magnetic recording, optical recording, channel capacity, constrained code, balanced code.

I. INTRODUCTION

Sets of bipolar codewords that have equal numbers of ’1’s

and ’-1’s are usually called balanced codes. Such codes have

found application in cable transmission, optical and magnetic

recording. A survey of properties and methods for constructing

balanced codes can be found in [1]. A simple encoding

technique for generating balanced codewords, which is capable

of handling (very) large blocks was described by Knuth [2] in

1986.

Knuth's algorithm is extremely simple. An m-bit user word, m even, consisting of bipolar symbols valued ±1 is forwarded to the encoder. The encoder inverts the first k bits of the user word, where k is chosen in such a way that the modified word has equal numbers of '1's and '-1's. Knuth showed that such an index k can always be found. The index k is represented by a (preferably) balanced word of length p. The p-bit prefix word followed by the modified m-bit user word are both transmitted, so that the rate of the code is m/(m+p).

The receiver can easily undo the inversion of the first k bits received. Both encoder and decoder do not require look-up tables, and Knuth's algorithm is therefore very attractive for constructing long balanced codewords. Modifications of the generic scheme are discussed in Knuth [2], Alon et al. [3], Al-Bassam & Bose [4], and Tallini, Capocelli & Bose [5].
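The generic scheme just described can be sketched in a few lines of code. This is our own minimal illustration (the function names are ours, and the encoding of the index into a balanced prefix is omitted):

```python
def balance_index(word):
    """Return the smallest k such that inverting the first k symbols
    of the bipolar word yields equal numbers of +1's and -1's."""
    d = sum(word)                      # disparity of the word
    rds = 0                            # running digital sum
    for k, symbol in enumerate(word, start=1):
        rds += symbol
        if d - 2 * rds == 0:           # inverting first k bits zeroes the disparity
            return k
    raise ValueError("no balancing index; word length must be even")

def knuth_encode(word):
    """Return (k, modified word); a full encoder would also emit a
    balanced p-bit prefix representing k."""
    k = balance_index(word)
    return k, [-s for s in word[:k]] + list(word[k:])

def knuth_decode(k, codeword):
    """Undo the inversion of the first k symbols."""
    return [-s for s in codeword[:k]] + list(codeword[k:])

k, x = knuth_encode([-1, 1, 1, 1, -1, 1, -1, 1, 1, -1])
print(k, sum(x))   # -> 3 0
```

Note that neither function stores a table of any kind, which is exactly why the scheme scales to very large m.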

Knuth showed that p in his best construction is roughly equal to log2(m), so that the redundancy of Knuth's construction is [2]

r_Knuth ≈ log2(m).   (1)

The cardinality of a full set of balanced codewords of length n equals

binom(n, n/2) ≈ 2^n sqrt(2/(πn)),

where the approximation of the central binomial coefficient is due to Stirling. Then the redundancy of a full set of balanced codewords is

n − log2 binom(n, n/2) ≈ (1/2)log2(n) + (1/2)log2(π/2) ≈ (1/2)log2(n) + 0.326.   (2)

We conclude that the redundancy of a balanced code generated

by Knuth’s algorithm falls a factor of two short with respect

to a code that uses ’full’ balanced code sets. Clearly, the

loss in redundancy is the price one has to pay for a simple

construction. There are two features of Knuth’s construction

that could help to explain the difference in performance, and

they offer opportunities for code improvement.

The first feature that may offer a possibility of improving the code's performance stems from the fact that Knuth's algorithm is greedy as it takes the very first opportunity for balancing the codeword [1]; that is, in Knuth's basic scheme, the first, i.e. the smallest, index where balance is reached is selected. In case there is more than one position where balance can be achieved, the encoder will thus favor smaller values of the position index. As a result, we may expect that smaller values of the index are more probable than larger ones. Then, if the index distribution is non-uniform, we may conclude that the average length of the prefix required to transmit the position information is less than log2(m). A practical embodiment of a scheme that takes advantage of this feature is characterized by the fact that the length of the prefix word is not fixed, but user data dependent. The prefix assigned to a position with a smaller, more probable, index has a smaller length than a prefix assigned to a position with a larger index.

Secondly, it has been shown by Knuth that there is always a position where balance can be reached. It can be verified that there is, for some user words, more than one suitable position where balance of the word can be realized. It will be shown later that the number of positions where words can be balanced lies between 1 and m/2. This freedom offers a possibility to improve the redundancy of Knuth's basic construction. An enhanced Knuth's algorithm may transmit auxiliary data by using the freedom of selecting from the balancing positions possible. Assume there are t positions, 1 ≤ t ≤ m/2, where the encoder can balance the user word; then the encoder can convey an additional log2(t) bits. The number t depends on the user word, and therefore the amount of auxiliary data that can be transmitted is user data dependent.

We start, in Section II, with a survey of known properties

of Knuth’s coding method. Thereafter, we will compute the

ISIT 2008, Toronto, Canada, July 6 - 11, 2008

978-1-4244-2571-6/08/$25.00 ©2008 IEEE

distribution of the transmitted index in Section III. Given the distribution of the index, we will compute the entropy of the index, and evaluate the performance of a suitably modified scheme. In Section IV, we will compute the amount of additional data that can be conveyed in a modification of Knuth's basic scheme. Section V concludes this article.

II. KNUTH’S BASIC SCHEME

Knuth's balancing algorithm is based on the idea that there is a simple translation between the set of all m-bit bipolar user words, m even, and the set of all (m+p)-bit codewords. The translation is achieved by selecting a bit position k within the m-bit word that defines two segments, each having the same, but opposite, disparity. A zero-disparity or balanced block is now generated by the inversion of the first k bits (or the last m−k bits). The position digit k is encoded in the p-bit prefix. The rate of the code is simply m/(m+p).

The proof that there is at least one position, k, where balance in any even-length user word can be achieved is due to Knuth. Let the user word be w = (w_1, w_2, ..., w_m), w_i ∈ {−1, +1}, and let d(w) be the sum, or disparity, of the user symbols, or

d(w) = sum_{i=1}^{m} w_i.   (3)

Let σ_k be the running digital sum of the first k, 1 ≤ k ≤ m, bits of w, or

σ_k = sum_{i=1}^{k} w_i,   (4)

and let w^(k) be the word w with its first k bits inverted. For example, let

w = (−1, 1, 1, 1, −1, 1, −1, 1, 1, −1),

then we have d(w) = 2 and w^(4) = (1, −1, −1, −1, −1, 1, −1, 1, 1, −1). We let w^(0) stand for w; then the quantity d(w^(k)) is

d(w^(k)) = d(w) − 2σ_k.   (5)

It is immediate that d(w^(0)) = d(w) (no symbols inverted) and d(w^(m)) = −d(w) (all symbols inverted). We may, as |d(w^(k)) − d(w^(k−1))| = 2, conclude that every word w, m even, can be associated with at least one position k for which d(w^(k)) = 0, or w^(k) is balanced. This concludes the proof.
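The proof can be illustrated numerically. The sketch below (our own illustration, not part of the original) traces the disparity d(w^(k)) of eq. (5) for the example word above and lists every balancing position:

```python
w = (-1, 1, 1, 1, -1, 1, -1, 1, 1, -1)     # example word, d(w) = 2
d = sum(w)
rds, walk = 0, []
for symbol in w:
    rds += symbol
    walk.append(d - 2 * rds)               # d(w^(k)) = d(w) - 2*sigma_k, eq. (5)
balancing = [k for k, dk in enumerate(walk, start=1) if dk == 0]
print(walk)        # walk moves in steps of +-2 and ends at -d(w)
print(balancing)   # -> [3, 5, 7]
```

The walk starts near d(w), ends at −d(w), and moves in steps of ±2, so it must pass through zero; here it does so three times, and Knuth's greedy encoder would transmit the first index, k = 3.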

The value of k is encoded in a (preferably) balanced word of length p, p even. The maximum user word length m that can be handled is, since the prefix has an equal number of '1's and '-1's, governed by

m ≤ binom(p, p/2).   (6)
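Condition (6) directly yields the shortest usable balanced prefix for a given word length. A small sketch (the helper name `prefix_length` is ours):

```python
from math import comb

def prefix_length(m):
    """Smallest even p with comb(p, p//2) >= m, so that a balanced
    p-bit prefix can address all m possible index values, cf. (6)."""
    p = 2
    while comb(p, p // 2) < m:
        p += 2
    return p

print(prefix_length(6))     # -> 4   (comb(4, 2) = 6 >= 6)
print(prefix_length(256))   # -> 12  (comb(10, 5) = 252 < 256 <= comb(12, 6) = 924)
```

Since binom(p, p/2) ≈ 2^p sqrt(2/(πp)), the resulting p is roughly log2(m) plus a slowly growing correction term.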

In this article, we follow Knuth's generic format, where 1 ≤ k ≤ m. Note that in a slightly different format, we may opt for 0 ≤ k ≤ m−1, where the encoder has the option to invert or not to invert the codeword in case the user word is balanced. For small values of m, this will lead to slightly different results, though for very large values of m, the differences between the two formats are small. Knuth described some variations on the general framework. For example, if m and p are both odd, we can use a similar construction. The redundancy of Knuth's most efficient construction is p ≈ log2(m).

III. DISTRIBUTION OF THE TRANSMITTED INDEX

The basic Knuth algorithm, as described above, progressively scans the user word till it finds the first suitable position, k, where the word can be balanced. In case there is more than one position where balance can be obtained, it is expected that the encoder will favor smaller values of the position index. Then the distribution of the index is not uniform, and, thus, the entropy of the index is less than log2(m), which opens the door for a more efficient scheme. A practical embodiment of a more efficient scheme would imply that the prefix assigned to a smaller index has a smaller length than a prefix assigned to a larger index. We will compute the entropy of the index sent by the basic Knuth encoder, and in order to do so we first compute the probability distribution of the transmitted index. In our analysis it is assumed that all information words are equiprobable and independent. Let P(k) denote the probability that the transmitted index equals k, 1 ≤ k ≤ m.

Theorem 1: The distribution of the transmitted index k, 1 ≤ k ≤ m, is given by

P(k) = ((m − i)/(m 2^m)) binom(i, i/2) binom(m − i, (m − i)/2), with i = 2⌊(k−1)/2⌋.

Proof. This result can be shown using the On-Line Encyclopedia of Integer Sequences, sequence A033820.

Invoking Stirling's approximation, we have, for 0 < i < m,

P(k) ≈ (2/(πm)) sqrt((m − i)/i).

For k = 1, 2, we have P(k) = binom(m, m/2)/2^m ≈ sqrt(2/(πm)), and for k = m − 1, m, we have P(k) ≈ sqrt(2/(πm))/m.
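The closed form of Theorem 1, as reconstructed here, can be checked against a brute-force enumeration over all equiprobable user words; for small m the two agree exactly:

```python
from itertools import product
from math import comb

def first_balance_index(word):
    """Knuth's greedy index: the smallest k with 2*sigma_k = d(w)."""
    d, rds = sum(word), 0
    for k, s in enumerate(word, start=1):
        rds += s
        if 2 * rds == d:
            return k

def empirical(m):
    """Empirical index distribution over all 2^m words."""
    counts = [0] * m
    for word in product([-1, 1], repeat=m):
        counts[first_balance_index(word) - 1] += 1
    return [c / 2 ** m for c in counts]

def theorem1(m, k):
    """Closed form of Theorem 1 (our reconstruction)."""
    i = 2 * ((k - 1) // 2)
    return (m - i) * comb(i, i // 2) * comb(m - i, (m - i) // 2) / (m * 2 ** m)

m = 8
assert all(abs(empirical(m)[k - 1] - theorem1(m, k)) < 1e-12 for k in range(1, m + 1))
```

For m = 64 the formula gives P(1) = P(2) ≈ 0.0997, matching the peak value of about 0.1 visible in Figure 1.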

Figure 1 shows two examples of the distribution, P(k), for m = 64 and m = 256. The entropy of the transmitted index, denoted by H_P, is

H_P = − sum_{k=1}^{m} P(k) log2 P(k).   (7)

Given the distribution, it is now straightforward to compute the entropy, H_P, of the index. Figure 2 shows a few results of computations. The diagram shows that H_P is only slightly


Fig. 1. Distribution of the (normalized) transmitted index, Pr(j) versus j/m, for m = 64 and m = 256.

Fig. 2. Entropy H_P versus log2(m).

less than log2(m), and we conclude that the above proposed modification of Knuth's scheme using a variable-length prefix can offer only a small improvement in redundancy within the range of codeword lengths investigated. In particular, the proposed variable prefix-length scheme cannot bridge the factor of two in redundancy between the basic Knuth scheme and that of full-set balanced codes.

IV. ENCODING AUXILIARY DATA

There is at least one position and there are at most m/2 positions within an m-bit word, m even, where a word can be balanced. The 'at least' one position, which makes Knuth's algorithm possible, was proved by Knuth (see above). The 'at most' bound will be shown in the next theorem.

Theorem 2: There are at most m/2 positions within an m-bit word, m even, where a word can be balanced.

Proof. Let e denote a position where balance can be made. Then, at the neighboring positions e − 1 or e + 1, such a balance cannot be made, since the disparity d(w^(k)) changes by ±2 from position to position. We conclude that the number of positions where balance can be made is less than or equal to m/2.

Fig. 3. Distribution of the (normalized) number, t, of possible balancing positions for m = 64 and m = 256.

Note that the indices of a word with m/2 balance positions are either all even or all odd. It can easily be verified that there are three groups of words that can be balanced at m/2 positions, namely

a) the words consisting of the cascade of m/2 di-bits (+1,−1) or (−1,+1),

b) the words beginning with a +1 followed by (m−2)/2 di-bits (+1,−1) or (−1,+1), followed by a +1, and

c) the inverted words of case b).
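These groups together contain 2^(m/2) + 2·2^((m−2)/2) = 2^(m/2+1) words, which can be confirmed by enumeration. The sketch below (our own check) counts the words attaining the maximum of m/2 balancing positions and verifies the even/odd property:

```python
from itertools import product

def balancing_positions(word):
    """All indices k with 2*sigma_k = d(w)."""
    d, rds, positions = sum(word), 0, []
    for k, s in enumerate(word, start=1):
        rds += s
        if 2 * rds == d:
            positions.append(k)
    return positions

m = 8
maximal = [w for w in product([-1, 1], repeat=m)
           if len(balancing_positions(w)) == m // 2]
print(len(maximal))   # -> 32, i.e. 2^(m/2 + 1)
# all balancing indices of such a word share the same parity
assert all(len({k % 2 for k in balancing_positions(w)}) == 1 for w in maximal)
```

Since m/2 non-adjacent positions must be packed into {1, ..., m}, the only possibilities are the all-odd set {1, 3, ..., m−1} and the all-even set {2, 4, ..., m}, which is the parity observation above.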

Since, on average, the encoder has the degree of freedom of selecting from more than one balance position, it offers the encoder the possibility to transmit auxiliary data. Assume there are t positions, 1 ≤ t ≤ m/2, where the encoder can balance the user word; then the encoder can convey an additional log2(t) bits. The number t depends on the user word at hand, and therefore the amount of auxiliary data that can be transmitted is user data dependent.

Let Q(t) denote the probability that the encoder may choose between t, 1 ≤ t ≤ m/2, possible positions where balancing is possible.

Theorem 3: The distribution of the number of positions, t, where an m-bit word, m even, can be balanced is given by

Q(t) = 2^{t+1−m} binom(m − t − 1, m/2 − t), 1 ≤ t ≤ m/2.   (8)

Proof. See Appendix. Theorem 3 follows from Lemma 3 and the fact that there are 2^m sequences of length m, which are assumed to be equally probable.
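Theorem 3, as reconstructed here, can again be confirmed by exhaustive enumeration over all equiprobable words:

```python
from itertools import product
from math import comb

def num_balancing_positions(word):
    """Count the indices k with 2*sigma_k = d(w)."""
    d, rds, t = sum(word), 0, 0
    for s in word:
        rds += s
        t += (2 * rds == d)
    return t

m = 8
counts = [0] * (m // 2 + 1)
for word in product([-1, 1], repeat=m):
    counts[num_balancing_positions(word)] += 1

theory = [2 ** (t + 1) * comb(m - t - 1, m // 2 - t) for t in range(1, m // 2 + 1)]
print(counts[1:])        # -> [80, 80, 64, 32]
assert counts[1:] == theory
assert counts[0] == 0    # every word has at least one balancing position
```

The empty count at t = 0 is exactly Knuth's existence result, and the upper end of the list, 2^(m/2+1) words with t = m/2, matches the three groups listed after Theorem 2.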

Figure 3 shows two examples of the distribution, namely for m = 64 and m = 256. The average amount of information, H_a(m), that can be conveyed via the choice in the position


Fig. 4. The average amount of information, H_a(m), that can be conveyed via the choice in the index, as a function of log2(m).

index is

H_a(m) = sum_{t=1}^{m/2} Q(t) log2(t).   (9)

Results of computations are shown in Figure 4. We can recursively compute N_t(m), and hence Q(t), by invoking (13). For large m and t, we have

Q(t) ≈ sqrt(2/(πm)) exp(−t^2/(2m)),   (10)

that is, the number of balancing positions is approximately half-normally distributed with scale sqrt(m). We approximate the sum (9) by an integral, so that

H_a(m) ≈ ∫_0^∞ sqrt(2/(πm)) exp(−t^2/(2m)) log2(t) dt.   (11)

Now, for large m, we can approximate H_a(m) by

H_a(m) ≈ (1/2)log2(m) − (γ + ln 2)/(2 ln 2) ≈ (1/2)log2(m) − 0.916,   (12)

where γ ≈ 0.5772 is Euler's constant. We conclude that the average amount of information that can be conveyed by exploiting the choice of index compensates for the loss in rate between codes based on Knuth's algorithm and codes based on full balanced codeword sets.
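The quantity H_a(m) of (9) can be evaluated directly from the closed form (8) and cross-checked by brute force; the sketch below does both (the asymptotic constant is our reading of (12)):

```python
from itertools import product
from math import comb, log2

def Q(m, t):
    """Theorem 3 (as reconstructed above): Pr[t balancing positions]."""
    return 2 ** (t + 1 - m) * comb(m - t - 1, m // 2 - t)

def Ha(m):
    """Average auxiliary information of eq. (9), from the closed form."""
    return sum(Q(m, t) * log2(t) for t in range(1, m // 2 + 1))

# brute-force cross-check of (8)/(9) at m = 10
def count_positions(word):
    d, rds, t = sum(word), 0, 0
    for s in word:
        rds += s
        t += (2 * rds == d)
    return t

m = 10
counts = [0] * (m // 2 + 1)
for word in product([-1, 1], repeat=m):
    counts[count_positions(word)] += 1
brute = sum(c / 2 ** m * log2(t) for t, c in enumerate(counts) if t >= 1)
assert abs(brute - Ha(m)) < 1e-12

# for large m, Ha(m) approaches (1/2)*log2(m) - 0.916, cf. (12)
print(Ha(1024), 0.5 * log2(1024) - 0.916)
```

The exact computation is cheap even for large m, since (8) involves only a single binomial coefficient per term.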

V. CONCLUSIONS

We have investigated some characteristics and possible improvements of Knuth's algorithm for constructing bipolar codewords with equal numbers of '1's and '-1's. An (m+p)-bit codeword is obtained after a small modification of the m-bit user word plus appending a fixed-length, p-bit prefix. The p-bit prefix represents the position index within the codeword where the modification has been made.

We have derived the distribution of the index (assuming equiprobable user words), and have computed the entropy of the transmitted index. Our computations show that a modification of Knuth's generic scheme using a variable-length prefix of the position index will only offer a small improvement in redundancy.

The transmitter can, in general, choose from a plurality

of indices, so that the transmitter can transmit additional

information. The number of possible indices depends on the

given user word, so that the amount of extra information

that can be transmitted is data dependent. We have derived

the distribution of the number of positions where a word

can be balanced. We have computed the average information

that can be conveyed by using the freedom of choosing from

multiple indices. The average information rate can, for large user word length m, be approximated by (1/2)log2(m) − 0.916. This compensates for the loss in code rate between codes based on Knuth's algorithm and codes based on full balanced codeword sets.

VI. APPENDIX

In this appendix we give a combinatorial proof of Theorem 3. We also refer the reader to the On-Line Encyclopedia of Integer Sequences, sequence A112326. Let T_t(m) denote the set of bipolar sequences of even length m which can be balanced in exactly t positions (1 ≤ t ≤ m/2). Define N_t(m) = |T_t(m)|. We will derive an explicit expression for N_t(m) (in Lemma 3), from which Theorem 3 immediately follows. Let B_m denote the set of all balanced sequences of length m without internal balancing positions, i.e., there are no balancing positions j with 1 ≤ j ≤ m−1. Define B(m) = |B_m|. Any sequence w ∈ T_t(m), t ≥ 2, with balancing positions j_1 < j_2 < ... < j_t can be uniquely decomposed as w = (u, x, v), where u is of length j_{t−1}, x is of length 2i = j_t − j_{t−1}, and v is of length m − j_t. Note that (u, v) is in T_{t−1}(m − 2i) for every such w, and that x is in B_{2i}. From these observations, we can easily derive the recursive relation

N_t(m) = sum_{i≥1} B(2i) N_{t−1}(m − 2i)   (13)

for all t ≥ 2. Further, we have, for all m, the trivial equality

sum_{t=1}^{m/2} N_t(m) = 2^m.   (14)

A Dyck word of length 2n is a balanced bipolar sequence of length 2n such that no initial segment has more '1's than '-1's [6]; in other words, a balanced word w is a Dyck word if the running digital sum satisfies σ_k ≤ 0 for all 1 ≤ k ≤ 2n. The number of Dyck words of length 2n is equal to

C_n = binom(2n, n)/(n + 1),   (15)

which is the n-th Catalan number [6]. For example, 0101 and 0011 are the Dyck words of length 4, and 010101, 010011, 001101, 001011, and 000111 are the Dyck words of length 6, where for clerical convenience we have written '0' instead of '-1'. Note that a sequence is in B_m if and only if it has (the inverse of) the format (0, y, 1), where y is a Dyck word of length m − 2. Hence, for all even m ≥ 2,

B(m) = 2 C_{m/2−1}.   (16)

For example, B(2) = 2, B(4) = 2, B(6) = 4, and B(8) = 10, which is indeed the result provided by (16).
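Relation (16) is easy to confirm by enumeration. The sketch below (our own check) counts the balanced words whose running digital sum has no internal zero:

```python
from itertools import product
from math import comb

def catalan(n):
    """The n-th Catalan number, cf. (15)."""
    return comb(2 * n, n) // (n + 1)

def in_B(word):
    """Balanced word with no internal balancing position."""
    if sum(word) != 0:
        return False
    rds = 0
    for s in word[:-1]:        # internal positions 1 .. m-1
        rds += s
        if rds == 0:
            return False
    return True

for m in (2, 4, 6, 8):
    count = sum(in_B(w) for w in product([-1, 1], repeat=m))
    assert count == 2 * catalan(m // 2 - 1)
print([2 * catalan(n) for n in range(4)])   # -> [2, 2, 4, 10]
```

The factor 2 in (16) accounts for the strictly negative excursions (0, y, 1) and their inverses, the strictly positive ones.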

Lemma 1: For all n and k satisfying 2k > n, it holds that

binom(n, k) = sum_{i≥0} C_i binom(n − 2i − 1, k − i − 1).   (17)

Proof. Any bipolar sequence of length n containing k 'ones', 2k > n, can be uniquely written as (y, 1, z), where y is a Dyck word of length 2i, with 0 ≤ i ≤ n − k, and z is a bipolar sequence of length n − 2i − 1 containing k − i − 1 1's. Using (15) for Dyck word enumeration, a simple counting argument gives the stated result.

Lemma 2: For all n ≥ 1, it holds that

sum_{t=1}^{n} 2^t binom(2n − t − 1, n − t) = 2^{2n−1}.   (18)

Proof. Any bipolar sequence of length 2n − 1 having more than n − 1 1's can be uniquely written as (u, z), where u is of length 2n − t, with exactly n 1's of which the last one occupies the final position of u, and z is of length t − 1 and is arbitrary. Any bipolar sequence of length 2n − 1 containing less than n 1's can be uniquely written in the inverted format. Hence,

2^{2n−1} = 2 sum_{t=1}^{n} 2^{t−1} binom(2n − t − 1, n − 1),   (19)

which concludes the proof.
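Both counting identities, as reconstructed above, can be verified numerically over a range of parameters:

```python
from math import comb

def catalan(i):
    return comb(2 * i, i) // (i + 1)

# Lemma 1 / (17): binom(n, k) = sum_i C_i * binom(n-2i-1, k-i-1), for 2k > n
for n in range(1, 15):
    for k in range(n // 2 + 1, n + 1):
        total = sum(catalan(i) * comb(n - 2 * i - 1, k - i - 1)
                    for i in range((n + 1) // 2) if k - i - 1 >= 0)
        assert total == comb(n, k)

# Lemma 2 / (18): sum_t 2^t * binom(2n-t-1, n-t) = 2^(2n-1)
for n in range(1, 15):
    lhs = sum(2 ** t * comb(2 * n - t - 1, n - t) for t in range(1, n + 1))
    assert lhs == 2 ** (2 * n - 1)
```

In the induction step of Lemma 3, Lemma 1 collapses the convolution of (13) into a single binomial coefficient, and Lemma 2 supplies the remaining t = 1 case via (14).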

Lemma 3: For all 1 ≤ t ≤ m/2, it holds that

N_t(m) = 2^{t+1} binom(m − t − 1, m/2 − t).   (20)

Proof. Assuming that the statement holds for all even lengths smaller than m, we will show that it also holds for m. For all 2 ≤ t ≤ m/2, we have

N_t(m) = sum_{i≥1} B(2i) N_{t−1}(m − 2i) = 2^{t+1} sum_{j≥0} C_j binom(m − t − 2j − 2, m/2 − t − j) = 2^{t+1} binom(m − t − 1, m/2 − t),   (21)

where the first equality follows from (13), the second from (20) and (16), and the third from Lemma 1 (with n = m − t − 1 and k = m/2 − 1, after rewriting binom(m − t − 2j − 2, m/2 − t − j) = binom(m − t − 2j − 2, m/2 − 2 − j)). Further, we have

N_1(m) = 2^m − sum_{t=2}^{m/2} N_t(m) = 2^m − sum_{t=2}^{m/2} 2^{t+1} binom(m − t − 1, m/2 − t) = 2^2 binom(m − 2, m/2 − 1),   (22)

where the first equality follows from (14), the second from (21), and the third from Lemma 2 (with n = m/2). Hence, if the statement in the lemma holds for all even lengths smaller than m, then it holds for m as well. Since (14) gives that N_1(2) = 4, (20) holds for m = 2, and the lemma follows by induction on m.

REFERENCES

[1] K.A.S. Immink, Codes for Mass Data Storage Systems, Second Edition, ISBN 90-74249-27-2, Shannon Foundation Publishers, Eindhoven, Netherlands, 2004.

[2] D.E. Knuth, 'Efficient Balanced Codes', IEEE Trans. Inform. Theory, vol. IT-32, no. 1, pp. 51-53, Jan. 1986.

[3] N. Alon, E.E. Bergmann, D. Coppersmith, and A.M. Odlyzko, 'Balancing Sets of Vectors', IEEE Trans. Inform. Theory, vol. IT-34, no. 1, pp. 128-130, Jan. 1988.

[4] S. Al-Bassam and B. Bose, 'On Balanced Codes', IEEE Trans. Inform. Theory, vol. IT-36, no. 2, pp. 406-408, March 1990.

[5] L.G. Tallini, R.M. Capocelli, and B. Bose, 'Design of Some New Balanced Codes', IEEE Trans. Inform. Theory, vol. IT-42, no. 3, pp. 790-802, May 1996.

[6] R.P. Stanley, Enumerative Combinatorics, Vol. 2, Cambridge University Press, 1999.
