
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 4, APRIL 2010 1673

Knuth’s Balanced Codes Revisited

Jos H. Weber, Senior Member, IEEE, and Kees A. Schouhamer Immink, Fellow, IEEE

Abstract—In 1986, Don Knuth published a very simple algorithm for constructing sets of bipolar codewords with equal numbers of “+1”s and “−1”s, called balanced codes. Knuth’s algorithm is well suited for use with large codewords. The redundancy of Knuth’s balanced codes is a factor of two larger than that of a code comprising the full set of balanced codewords. In this paper, we will present results of our attempts to improve the performance of Knuth’s balanced codes.

Index Terms—Balanced code, channel capacity, constrained code, magnetic recording, optical recording.

I. INTRODUCTION

SETS of bipolar codewords that have equal numbers of “+1”s and “−1”s are usually called balanced codes. Such codes have found application in cable transmission, optical and magnetic recording. A survey of properties and methods for constructing balanced codes can be found in [1]. A simple encoding technique for generating balanced codewords, which is capable of handling (very) large blocks, was described by Knuth [2] in 1986.

Knuth’s algorithm is extremely simple. An m-bit user word, m even, consisting of bipolar symbols valued ±1 is forwarded to the encoder. The encoder inverts the first k bits of the user word, where k is chosen in such a way that the modified word has equal numbers of “+1”s and “−1”s. Knuth showed that such an index k can always be found. The index k is represented by a balanced word of length p. The p-bit prefix word followed by the modified m-bit user word are both transmitted, so that the rate of the code is m/(m + p). The receiver can easily undo the inversion of the first k bits received once k is computed from the prefix. Both encoder and decoder do not require large look-up tables, and Knuth’s algorithm is therefore very attractive for constructing long balanced codewords. Modifications of the generic scheme are discussed in Knuth [2], Alon et al. [3], Al-Bassam and Bose [4], and Tallini, Capocelli, and Bose [5].
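As a concrete illustration, the encoding and decoding steps described above can be sketched in a few lines of Python. The function names and the exhaustive check are our own, and for brevity the index k is returned directly rather than encoded in a balanced prefix word:

```python
from itertools import product

def balance_index(w):
    """Smallest k such that inverting the first k bits of the bipolar
    word w (length m, m even) yields a balanced word."""
    d = sum(w)                     # disparity of w
    sigma = 0                      # running digital sum
    for k, symbol in enumerate(w, start=1):
        sigma += symbol
        if d - 2 * sigma == 0:     # disparity after inverting first k bits
            return k
    raise ValueError("no balancing index: m must be even")

def encode(w):
    """Knuth encoder: invert the first k bits; return (k, balanced word)."""
    k = balance_index(w)
    return k, [-s for s in w[:k]] + list(w[k:])

def decode(k, c):
    """Knuth decoder: undo the inversion of the first k bits."""
    return [-s for s in c[:k]] + list(c[k:])

# Every even-length user word can be balanced, and decoding restores it.
for w in product((-1, +1), repeat=6):
    k, c = encode(w)
    assert sum(c) == 0 and decode(k, c) == list(w)
```

No look-up tables are involved, which is exactly why the scheme scales to very large codewords.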

Knuth showed that in his best construction [2], the redundancy, i.e., the number of redundant symbols p, is roughly equal to

p ≈ log_2(m). (1)

Manuscript received March 23, 2009. Current version published March 17, 2010. This work was supported by grant Theory and Practice of Coding and Cryptography, Award Number NRF-CRP2-2007-03. The material in this paper was presented in part at the IEEE International Symposium on Information Theory, Toronto, ON, Canada, July 2008.

J. H. Weber is with the IRCTR/CWPC, Delft University of Technology, 2628 CD Delft, The Netherlands (e-mail: J.H.Weber@tudelft.nl).

K. A. Schouhamer Immink is with the Nanyang Technological University of Singapore, Singapore, and with Turing Machines BV, 3016 DK Rotterdam, The Netherlands (e-mail: immink@turing-machines.com).

Communicated by H.-A. Loeliger, Associate Editor for Coding Techniques.

Color versions of Figures 1–4 in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIT.2010.2040868

The cardinality of a full set of balanced codewords of length m equals

binom(m, m/2) ≈ 2^m · sqrt(2/(π m)),

where the approximation of the central binomial coefficient follows from Stirling’s formula. Then the redundancy of a full set of balanced codewords is roughly equal to

m − log_2 binom(m, m/2) ≈ (1/2) log_2(π m / 2). (2)

We conclude that the redundancy of a balanced code generated by Knuth’s algorithm falls a factor of two short with respect to a code that uses “full” balanced code sets. Clearly, the loss in redundancy is the price one has to pay for a simple construction without look-up tables. There are two features of Knuth’s construction that could help to explain the difference in performance, and they offer opportunities for code improvement.

The first feature that may offer a possibility of improving the code’s performance stems from the fact that Knuth’s algorithm is greedy, as it takes the very first opportunity for balancing the codeword [1]; that is, in Knuth’s basic scheme, the first, i.e., the smallest, index k where balance is reached is selected. In case there is more than one position where balance can be achieved, the encoder will thus favor smaller values of the position index. As a result, we may expect that smaller values of the index are more probable than larger ones. Then, if the index distribution is non-uniform, we may conclude that the average length of the prefix required to transmit the position information is less than log_2(m). A practical embodiment of a scheme that takes advantage of this feature is characterized by the fact that the length of the prefix word is not fixed, but user data dependent. The prefix assigned to a position with a smaller, more probable, index has a smaller length than a prefix assigned to a position with a larger index.

Second, it has been shown by Knuth that there is always a position where balance can be reached. It can be verified that there is, for some user words, more than one suitable position where balance of the word can be realized. It will be shown later that the number of positions where words can be balanced lies between 1 and m/2. This freedom offers a possibility to improve the redundancy of Knuth’s basic construction. An enhanced Knuth algorithm may transmit auxiliary data by using the freedom of selecting from the possible balancing positions. Assume there are a positions, 1 ≤ a ≤ m/2, where the encoder can balance the user word; then the encoder can convey an additional log_2(a) bits. The number a depends on the user word, and therefore the amount of auxiliary data that can be transmitted is user data dependent.

We start, in Section II, with a survey of known properties of

Knuth’s coding method. Thereafter, in Section III, we will compute the distribution of the transmitted index in Knuth’s basic


scheme. Given the distribution of the index, we will compute the entropy of the index, and evaluate the performance of a suitably modified scheme. In Section IV, we will compute the amount of additional data that can be conveyed in a modification of Knuth’s basic scheme. Section V concludes this article.

II. KNUTH’S BASIC SCHEME

Knuth’s balancing algorithm is based on the idea that there is a simple translation between the set of all m-bit bipolar user words, m even, and the set of all (m + p)-bit codewords. This conversion is based on the observation that in any block of data, having an even number of binary digits, it is always possible to find a location which defines two digit segments having equal disparity. A balanced block can then be created by the inversion of all the digits within either segment. The translation is achieved by selecting a bit position k within the m-bit word that defines two segments, each having the same disparity. A zero-disparity, or balanced, block is now generated by the inversion of the first k bits (or the last m − k bits). The position digit k is encoded in the p-bit prefix. The rate of the code is simply m/(m + p).

The proof that there is at least one position, k, where balance in any even-length user word can be achieved is due to Knuth. Let the user word be w = (w_1, …, w_m), w_i ∈ {−1, +1}, and let d(w) be the sum, or disparity, of the user symbols, or

d(w) = Σ_{i=1}^{m} w_i. (3)

Let σ_k be the running digital sum of the first k, 1 ≤ k ≤ m, bits of w, or

σ_k = Σ_{i=1}^{k} w_i, (4)

and let w^(k) be the word w with its first k bits inverted. For example, let w = (+1, +1, −1, +1); then we have w^(1) = (−1, +1, −1, +1) and w^(3) = (−1, −1, +1, +1). We let d_k stand for d(w^(k)); then the quantity d_k is

d_k = d(w) − 2σ_k. (5)

It is immediate that d_0 = d(w) (no symbols inverted) and d_m = −d(w) (all symbols inverted). We may, as |d_k − d_{k+1}| = 2, conclude that every word w, m even, can be associated with at least one position k for which d_k = 0, or w^(k) is balanced. This concludes the proof.

The value of k is encoded in a balanced word of length p, p even. The maximum length m of the user word is, since the prefix has an equal number of “+1”s and “−1”s, governed by

m ≤ binom(p, p/2). (6)
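The quantities in this proof are easy to check numerically. The sketch below (our own naming, following the reconstruction of the disparities d_k and of the prefix condition (6)) verifies that d_k runs from d(w) to −d(w) in steps of size two, so it must pass through zero, and computes the smallest even prefix length p:

```python
from itertools import product
from math import comb

def disparities(w):
    """[d_0, d_1, ..., d_m]: d_k = d(w) - 2*sigma_k is the disparity of w
    after inverting its first k bits."""
    d, sigma, out = sum(w), 0, [sum(w)]
    for symbol in w:
        sigma += symbol
        out.append(d - 2 * sigma)
    return out

def prefix_length(m):
    """Smallest even p with comb(p, p//2) >= m: enough balanced p-bit
    prefix words to represent every index k in 1..m."""
    p = 2
    while comb(p, p // 2) < m:
        p += 2
    return p

# d_k walks from d(w) to -d(w) in steps of size 2, so it must hit zero.
for w in product((-1, +1), repeat=6):
    dk = disparities(w)
    assert dk[0] == -dk[-1] and 0 in dk[1:]
    assert all(abs(a - b) == 2 for a, b in zip(dk, dk[1:]))
```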

In this article, we follow Knuth’s generic format, where the index k ranges over 1 ≤ k ≤ m. Note that in a slightly different format, we may opt for 0 ≤ k ≤ m, where the encoder has the option to invert or not to invert the codeword in case the user word is balanced. For small values of m, this will lead to slightly different results, though for very large values of m, the differences between the two formats are small. Knuth described some variations on the general framework. For example, if m and p are both odd, we can use a similar construction. The redundancy of Knuth’s most efficient construction is approximately log_2(m), cf. (1).

III. DISTRIBUTION OF THE TRANSMITTED INDEX

The basic Knuth algorithm, as described above, progressively scans the user word till it finds the first suitable position, k, where the word can be balanced. In case there is more than one position where balance can be obtained, it is expected that the encoder will favor smaller values of the position index. Then the distribution of the index is not uniform, and, thus, the entropy of the index is less than log_2(m), which opens the door for a more efficient scheme. A practical embodiment of a more efficient scheme would imply that the prefix assigned to a smaller index has a smaller length than a prefix assigned to a larger index. We will compute the entropy of the index sent by the basic Knuth encoder, and in order to do so we first compute the probability distribution of the transmitted index. In our analysis it is assumed that all information words are equiprobable and independent. Let P(k) denote the probability that the transmitted index equals k, 1 ≤ k ≤ m.

Theorem 1: The distribution of the transmitted index k, 1 ≤ k ≤ m, is given by

P(k) = 2^{−m} N_m(k),

where N_m(k) is the number of user words whose smallest balancing index is k, given explicitly in Lemma 3 in the Appendix.

Proof: Theorem 1 follows from Lemma 3 in the Appendix and the fact that there are 2^m (equally probable) sequences of length m.
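Theorem 1 can be checked by brute force for small m. The following sketch (function names are ours) enumerates all 2^m user words and tabulates the fraction that is first balanced at index k:

```python
from itertools import product
from collections import Counter

def smallest_balancing_index(w):
    d, sigma = sum(w), 0
    for k, symbol in enumerate(w, start=1):
        sigma += symbol
        if d == 2 * sigma:         # inverting the first k bits balances w
            return k

def index_distribution(m):
    """P(k) over all 2^m equiprobable user words of (even) length m."""
    counts = Counter(smallest_balancing_index(w)
                     for w in product((-1, +1), repeat=m))
    return {k: counts[k] / 2 ** m for k in range(1, m + 1)}

print(index_distribution(4))
# -> {1: 0.375, 2: 0.375, 3: 0.125, 4: 0.125}
```

Small indices are indeed the most probable, in line with the non-uniformity argued above, and in this example the probabilities come in equal pairs P(2i − 1) = P(2i).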

Invoking Stirling’s approximation, we can approximate P(k) for small and for large values of the index k. Fig. 1 shows two examples of the distribution P(k). The entropy of the transmitted index, denoted by H, is

H = − Σ_{k=1}^{m} P(k) log_2 P(k). (7)

Given the distribution, it is now straightforward to compute the entropy, H, of the index. Fig. 2 shows a few results of computations. The diagram shows that H is only slightly less


Fig. 1. Distribution of the (normalized) transmitted index for two values of m.

Fig. 2. Entropy H versus m.

than log_2(m), and we conclude that the above proposed modification of Knuth’s scheme using a variable-length prefix can offer only a small improvement in redundancy within the range of codeword lengths investigated. We conclude that, at least within this range, the proposed variable prefix-length scheme cannot bridge the factor of two in redundancy between the basic Knuth scheme and that of full set balanced codes.
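The entropy computation of (7) can be reproduced by enumeration. The sketch below (our own helper) compares H with log_2(m) for a few small, even values of m:

```python
from itertools import product
from collections import Counter
from math import log2

def index_entropy(m):
    """Entropy of the transmitted index, eq. (7), for m-bit user words."""
    counts = Counter()
    for w in product((-1, +1), repeat=m):
        d, sigma = sum(w), 0
        for k, symbol in enumerate(w, start=1):
            sigma += symbol
            if d == 2 * sigma:     # smallest balancing index found
                counts[k] += 1
                break
    total = 2 ** m
    return -sum(c / total * log2(c / total) for c in counts.values())

for m in (4, 8, 12):
    print(m, round(index_entropy(m), 3), round(log2(m), 3))
```

For these word lengths the entropy stays close to log_2(m), which is why a variable-length prefix buys only a modest gain.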

IV. ENCODING AUXILIARY DATA

There is at least one position and there are at most m/2 positions within an m-bit word, m even, where a word can be balanced. The “at least” one position, which makes Knuth’s algorithm possible, was proved by Knuth (see above). The “at most” bound will be shown in the next theorem.

Theorem 2: There are at most m/2 positions within an m-bit word, m even, where a word can be balanced.

Proof: Let k denote a position where balance can be made. Then, at the neighboring positions k − 1 or k + 1 such a balance cannot be made, so that we conclude that the number of positions where balance can be made is less than or equal to m/2.


Fig. 3. Distribution of the (normalized) number, a, of possible balancing positions for two values of m.

Note that the indices of a word with m/2 balance positions are either all even or all odd. It can easily be verified that there are three groups of words that can be balanced at m/2 positions, namely

• the words consisting of the cascade of the di-bits (+1, −1) or (−1, +1),

• the words beginning with a “+1” followed by (m − 2)/2 di-bits (+1, −1) or (−1, +1), followed by a “+1”, and

• the inverted words of the previous case.
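Tallying the three groups above gives 2^{m/2} + 2 · 2^{(m−2)/2} = 2^{m/2+1} words with the maximum number of balancing positions (the count is our own arithmetic on the groups). This, and the parity observation, can be verified exhaustively:

```python
from itertools import product

def balancing_positions(w):
    """All indices k at which inverting the first k bits balances w."""
    d, sigma, positions = sum(w), 0, []
    for k, symbol in enumerate(w, start=1):
        sigma += symbol
        if d == 2 * sigma:
            positions.append(k)
    return positions

m = 6
rich = [w for w in product((-1, +1), repeat=m)
        if len(balancing_positions(w)) == m // 2]
print(len(rich))                   # -> 16, i.e. 2**(m//2 + 1)

# The balancing positions of each such word are all even or all odd.
assert all(len(set(p % 2 for p in balancing_positions(w))) == 1 for w in rich)
```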

Since, on average, the encoder has the degree of freedom of selecting from more than one balance position, it offers the encoder the possibility to transmit auxiliary data. Assume there are a positions, 1 ≤ a ≤ m/2, where the encoder can balance the user word; then the encoder can convey an additional log_2(a) bits. The number a depends on the user word at hand, and therefore the amount of auxiliary data that can be transmitted is user data dependent.

Let Q(a) denote the probability that the encoder may choose between a, 1 ≤ a ≤ m/2, possible positions, where balancing is possible.

Theorem 3: The distribution of the number of positions a, 1 ≤ a ≤ m/2, where an m-bit word, m even, can be balanced is given by

Q(a) = 2^{a+1−m} binom(m − a − 1, m/2 − a). (8)

Proof: Theorem 3 follows from Lemma 6 in the Appendix and the fact that there are 2^m (equally probable) sequences of length m.

Fig. 3 shows two examples of the distribution. The average amount of information, H′, that can be conveyed via the choice in the position data is

H′ = Σ_{a=1}^{m/2} Q(a) log_2(a). (9)
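Equation (9) can be evaluated by enumerating all user words. The sketch below (function name ours) also prints (1/2) log_2(m) for comparison with the large-m behavior discussed next:

```python
from itertools import product
from collections import Counter
from math import log2

def auxiliary_info(m):
    """Average number of extra bits, sum over a of Q(a) log2(a) (eq. (9)),
    obtained by enumerating all 2^m user words of (even) length m."""
    counts = Counter()
    for w in product((-1, +1), repeat=m):
        d, sigma, a = sum(w), 0, 0
        for symbol in w:
            sigma += symbol
            if d == 2 * sigma:     # one more position where w can be balanced
                a += 1
        counts[a] += 1
    total = 2 ** m
    return sum(c / total * log2(a) for a, c in counts.items())

for m in (4, 8, 12):
    print(m, round(auxiliary_info(m), 3), round(0.5 * log2(m), 3))
```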

Results of computations are shown in Fig. 4. We can recursively compute Q(a) by invoking the recursion (20) given in the Appendix. For large m and a, we can approximate Q(a), so that, for large m, we can approximate the average amount of information (9) by

(10)


Fig. 4. The average amount of information that can be conveyed via the choice in the index, as a function of m.

(11)

(12)

where γ ≈ 0.5772 is Euler’s constant. We conclude that the average amount of information that can be conveyed by exploiting the choice of index compensates for the loss in rate between codes based on Knuth’s algorithm and codes based on full balanced codeword sets.

V. CONCLUSION

We have investigated some characteristics and possible improvements of Knuth’s algorithm for constructing bipolar codewords with equal numbers of “+1”s and “−1”s. An (m + p)-bit codeword is obtained after a small modification of the m-bit user word plus appending a fixed-length p-bit prefix. The p-bit prefix represents the position index within the codeword where the modification has been made.

We have derived the distribution of the index (assuming equiprobable user words), and have computed the entropy of the transmitted index. Our computations show that a modification of Knuth’s generic scheme using a variable-length prefix of the position index will only offer a small improvement in redundancy.

The transmitter can, in general, choose from a plurality of indices, so that the transmitter can transmit additional information. The number of possible indices depends on the given user word, so that the amount of extra information that can be transmitted is data dependent. We have derived the distribution of the number of positions where a word can be balanced. We have computed the average information that can be conveyed by using the freedom of choosing from multiple indices. The average amount of information can, for large user word length m, be approximated by (12). This compensates for the loss in code rate between codes based on Knuth’s algorithm and codes based on full balanced codeword sets.

APPENDIX

In this Appendix, we give combinatorial proofs of Theorems 1 and 3. We first review some results on Dyck words and then derive lemmas leading to the proofs of the theorems. We also refer the reader to the On-Line Encyclopedia of Integer Sequences, sequences A033820 and A112326.

A Dyck word of length 2v is a balanced bipolar sequence of length 2v such that no initial segment has more “+1”s than “−1”s [6]; in other words, x is a Dyck word if the running digital sum σ_k(x) ≤ 0 for all 1 ≤ k ≤ 2v. The number of Dyck words of length 2v is equal to

C_v = (1/(v + 1)) binom(2v, v), (13)

which is the vth Catalan number [6]. For example, − − + + and − + − + are the Dyck words of length 4, and − − − + + +, − − + − + +, − − + + − +, − + − − + +, and − + − + − + are the Dyck words of length 6, where for clerical convenience we have written “+” and “−” instead of “+1” and “−1”.
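The definition of Dyck words and the Catalan count (13) can be checked directly (helper names ours):

```python
from itertools import product
from math import comb

def is_dyck(x):
    """Balanced, and no initial segment has more +1s than -1s."""
    sigma = 0
    for symbol in x:
        sigma += symbol
        if sigma > 0:
            return False
    return sigma == 0

def dyck_words(length):
    return [x for x in product((-1, +1), repeat=length) if is_dyck(x)]

# The number of Dyck words of length 2v is the Catalan number C_v.
for v in range(1, 7):
    assert len(dyck_words(2 * v)) == comb(2 * v, v) // (v + 1)
```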

Let B_{2v} denote the set of all balanced sequences of length 2v without internal balancing positions, i.e., there are no balancing positions k with 1 ≤ k ≤ 2v − 1. Define β_{2v} = |B_{2v}|. Note that a sequence is in B_{2v} if and only if it has the format (−1, x, +1) or its inverse, where x is a Dyck word of length 2v − 2. Hence, for all v ≥ 1,

β_{2v} = 2 C_{v−1}. (14)

For example, β_4 = 2, which is indeed the result provided by (14).
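Likewise, the count β_{2v} = 2 C_{v−1} of (14) can be verified by exhaustively testing which balanced words lack internal balancing positions (helper names ours):

```python
from itertools import product
from math import comb

def in_B(x):
    """Balanced word with no internal balancing position, i.e. inverting a
    proper prefix of x never yields a balanced word."""
    d, sigma = sum(x), 0
    if d != 0:
        return False
    for k, symbol in enumerate(x, start=1):
        sigma += symbol
        if k < len(x) and d == 2 * sigma:
            return False
    return True

def catalan(v):
    return comb(2 * v, v) // (v + 1)

# beta_{2v} = 2 C_{v-1}: each word in B_{2v} is (-1, Dyck word, +1) or its inverse.
for v in range(1, 7):
    beta = sum(in_B(x) for x in product((-1, +1), repeat=2 * v))
    assert beta == 2 * catalan(v - 1)
```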

Let S_m(k) denote the set of bipolar sequences of even length m for which the smallest balancing index is k. Define N_m(k) = |S_m(k)|. We will derive an explicit expression for N_m(k) (in Lemma 3), from which Theorem 1 immediately follows.

Lemma 1: For all 1 ≤ i ≤ m/2, it holds that

N_m(2i − 1) = N_m(2i). (15)

Proof: Let w = (u, v) ∈ S_m(2i), with u of length 2i. We define a mapping φ from S_m(2i) to S_m(2i − 1) by φ(w) = (u′, v), where u′ = (u_2, …, u_{2i}, −u_1) is the cyclic shift of u with an inversion of the last bit. The lemma follows from the observation that φ is a bijection.

Lemma 2: For all …, it holds that

(16)

Proof: Let V denote the set of all bipolar sequences w = (w_1, …, w_m) of length m, where … and … is balanced. Let w = (u, v) ∈ …, with u of length …. We define a mapping ψ from … to … by ψ(w) = (ū, v), where ū is the symbol-wise inverse of u. Since ψ is a bijection,

(17)

and the lemma follows using (14).

Lemma 3: For all 1 ≤ i ≤ m/2, it holds that

N_m(2i − 1) = N_m(2i) = Σ_{j=i}^{m/2} β_{2j} binom(m − 2j, (m − 2j)/2). (18)

Proof: The first equality follows from Lemma 1. Suppose that the second equality holds for i + 1, …, m/2. From Lemma 2,

(19)

and thus the second equality also holds for i. Since the second equality holds for i = m/2 because of (14), the result follows by induction.

Let U_m(a) denote the set of bipolar sequences of even length m which can be balanced in a positions k_1 < k_2 < ⋯ < k_a. Define T_m(a) = |U_m(a)|. We will derive an explicit expression for T_m(a) (in Lemma 6), from which Theorem 3 immediately follows. Any sequence w ∈ U_m(a) with balancing positions k_1 < ⋯ < k_a can be uniquely decomposed as w = (u_1, u_2, …, u_a, v), where u_j is of length k_j − k_{j−1}, with k_0 = 0, and v is of length m − k_a. Note that u_j is in B_{k_j − k_{j−1}} for all 2 ≤ j ≤ a and that (u_1, v) is in U_{m − k_a + k_1}(1). From these observations, we can easily derive the recursive relation

T_m(a) = Σ_{s=1}^{m/2 − a + 1} β_{2s} T_{m−2s}(a − 1) (20)

for all 2 ≤ a ≤ m/2. Further, we have, for all even m ≥ 2, the trivial equality

T_m(m/2) = 2^{m/2 + 1}. (21)

Lemma 4: For all v and j satisfying v < j ≤ 2v, it holds that

binom(2v, j) = Σ_{s=0}^{v−1} C_s binom(2v − 2s − 1, j − s − 1). (22)

Proof: Any bipolar sequence of length 2v containing j “+1”s, j > v, can be uniquely written as (x, +1, y), where x is a Dyck word of length 2s, with 0 ≤ s ≤ v − 1, and y is a bipolar sequence of length 2v − 2s − 1 containing j − s − 1 “+1”s. Using (13) for Dyck word enumeration, a simple counting argument gives the stated result.

Lemma 5: For all v ≥ 1, it holds that

Σ_{s=0}^{v−1} C_s 2^{2v − 2s − 1} = 2^{2v} − binom(2v, v). (23)

Proof: Any bipolar sequence of length 2v having more than v “+1”s can be uniquely written as (x, +1, y), where x is a Dyck word of length 2s, with 0 ≤ s ≤ v − 1, and y is of length 2v − 2s − 1 and has more than v − s − 1 “+1”s. Any bipolar sequence of length 2v containing less than v “+1”s can be uniquely written as (x̄, −1, ȳ), where x̄ is of length 2s, with 0 ≤ s ≤ v − 1, and ȳ is of length 2v − 2s − 1 and has less than v − s “+1”s. Hence,

(24)

which concludes the proof.

Lemma 6: For all 1 ≤ a ≤ m/2, it holds that

T_m(a) = 2^{a+1} binom(m − a − 1, m/2 − a). (25)

Proof: Assuming that the statement holds for a + 1, …, m/2, we will show that it also holds for a. For all …, we have

(26)

where the first equality follows from (20), the second from (25) and (14), and the third from Lemma 4 (with … and …). Further, we have

(27)

where the first equality follows from (21) (with …), the second from (26), and the third from Lemma 5 (with …). Hence, if the statement in the lemma holds for a + 1, …, m/2, then it holds for a as well. Since (21) gives that T_m(m/2) = 2^{m/2 + 1}, (25) holds for a = m/2, and the lemma follows by induction on a.

REFERENCES

[1] K. A. S. Immink, Codes for Mass Data Storage Systems, Second ed.

Eindhoven, Netherlands: Shannon Foundation Publishers, 2004.

[2] D. E. Knuth, “Efﬁcient balanced codes,” IEEE Trans. Inf. Theory, vol.

IT-32, pp. 51–53, Jan. 1986.

[3] N. Alon, E. E. Bergmann, D. Coppersmith, and A. M. Odlyzko,

“Balancing sets of vectors,” IEEE Trans. Inf. Theory, vol. IT-34, pp.

128–130, Jan. 1988.

[4] S. Al-Bassam and B. Bose, “On balanced codes,” IEEE Trans. Inf.

Theory, vol. 36, pp. 406–408, Mar. 1990.

[5] L. G. Tallini, R. M. Capocelli, and B. Bose, “Design of some new

balanced codes,” IEEE Trans. Inf. Theory, vol. 42, pp. 790–802, May

1996.

[6] R. P. Stanley, Enumerative Combinatorics. New York: Cambridge

University Press, 1999, vol. 2.

Jos H. Weber (S’87–M’90–SM’00) was born in Schiedam, The Netherlands, in 1961. He received the M.Sc. (in mathematics, with honors), Ph.D., and MBT (Master of Business Telecommunications) degrees from Delft University of Technology, Delft, The Netherlands, in 1985, 1989, and 1996, respectively.

Since 1985, he has been with the Faculty of Electrical Engineering, Mathematics, and Computer Science of Delft University of Technology. Currently, he is an associate professor at the Wireless and Mobile Communications Group. He is the chairman of the WIC (Werkgemeenschap voor Informatie- en Communicatietheorie in de Benelux) and the secretary of the IEEE Benelux Chapter on Information Theory. He was a Visiting Researcher at the University of California at Davis, the University of Johannesburg, South Africa, and the Tokyo Institute of Technology, Japan. His main research interests are in the areas of channel and network coding.

Kees A. Schouhamer Immink (M’81–SM’86–F’90) received the Ph.D. degree from the Eindhoven University of Technology, The Netherlands.

He founded and was named President of Turing Machines, Inc., in 1998. He has, since 1994, been an Adjunct Professor at the Institute for Experimental Mathematics, Essen University, Germany, and is affiliated with the Nanyang Technological University of Singapore. He designed coding techniques of a wealth of digital audio and video recording products, such as compact disc, CD-ROM, CD-video, digital compact cassette system, DCC, DVD, video disc recorder, and Blu-ray disc.

Dr. Immink received a Knighthood in 2000, a personal “Emmy” award in 2004, the 1996 IEEE Masaru Ibuka Consumer Electronics Award, the 1998 IEEE Edison Medal, 1999 AES Gold and Silver Medals, and the 2004 SMPTE Progress Medal. He was named a Fellow of the IEEE, AES, and SMPTE, was inducted into the Consumer Electronics Hall of Fame, and was elected into the Royal Netherlands Academy of Sciences and the US National Academy of Engineering. He served the profession as President of the Audio Engineering Society, Inc., New York, in 2003.