IEEE TRANSACTIONS ON INFORMATION THEORY, VOL.56, NO. 4, APRIL 2010 1673
Knuth’s Balanced Codes Revisited
Jos H. Weber, Senior Member, IEEE, and Kees A. Schouhamer Immink, Fellow, IEEE
Abstract—In 1986, Don Knuth published a very simple al-
gorithm for constructing sets of bipolar codewords with equal
numbers of “+1”s and “-1”s, called balanced codes. Knuth’s algorithm
is well suited for use with large codewords. The redundancy
of Knuth’s balanced codes is a factor of two larger than that of a
code comprising the full set of balanced codewords. In this paper,
we will present results of our attempts to improve the performance
of Knuth’s balanced codes.
Index Terms—Balanced code, channel capacity, constrained
code, magnetic recording, optical recording.
I. INTRODUCTION
SETS of bipolar codewords that have equal numbers of “+1”s
and “-1”s are usually called balanced codes. Such codes
have found application in cable transmission, optical and magnetic
recording. A survey of properties and methods for constructing
balanced codes can be found in [1]. A simple encoding
technique for generating balanced codewords, which is capable
of handling (very) large blocks, was described by Knuth [2] in 1986.
Knuth’s algorithm is extremely simple. An $m$-bit user word,
$m$ even, consisting of bipolar symbols valued $\pm 1$, is forwarded
to the encoder. The encoder inverts the first $k$ bits of the user
word, where $k$ is chosen in such a way that the modified word
has equal numbers of “+1”s and “-1”s. Knuth showed that such
an index $k$ can always be found. The index $k$ is represented
by a balanced word of length $p$. The $p$-bit prefix word followed
by the modified $m$-bit user word are both transmitted, so
that the rate of the code is $m/(m+p)$. The receiver can easily
undo the inversion of the first $k$ bits received once $k$ is computed
from the prefix. Neither the encoder nor the decoder requires large
look-up tables, and Knuth’s algorithm is therefore very attractive
for constructing long balanced codewords. Modifications of
the generic scheme are discussed in Knuth [2], Alon et al. [3],
Al-Bassam and Bose [4], and Tallini, Capocelli and Bose [5].
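The encoding and decoding steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the index range $1 \le k \le m$ is an assumption, and the balanced $p$-bit prefix is left out, so that $k$ is simply returned as an integer.

```python
def balance_index(word):
    """Smallest k in 1..m such that inverting the first k symbols balances word.

    word is a sequence of +1/-1 symbols of even length m (assumed format).
    """
    m = len(word)
    if m % 2:
        raise ValueError("word length must be even")
    d = sum(word)  # disparity of the user word
    z = 0          # running digital sum of the first k symbols
    for k in range(1, m + 1):
        z += word[k - 1]
        if d - 2 * z == 0:  # disparity after inverting the first k symbols
            return k
    raise AssertionError("unreachable for even-length bipolar words")


def encode(word):
    """Return (k, balanced body); the prefix encoding of k is omitted."""
    k = balance_index(word)
    return k, [-s for s in word[:k]] + list(word[k:])


def decode(k, body):
    """Undo the inversion of the first k symbols."""
    return [-s for s in body[:k]] + list(body[k:])


user_word = [1, 1, 1, -1, 1, 1]
k, body = encode(user_word)
print(k, body, decode(k, body) == user_word)
```

Neither function needs a look-up table; the encoder makes one pass over the word and the decoder inverts a prefix of known length.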
Knuth showed that in his best construction [2], the redundancy,
i.e., the number of redundant symbols, $p$, is roughly equal
to $\log_2 m$ for large $m$.
Manuscript received March 23, 2009. Current version published March 17,
2010. This work was supported by grant Theory and Practice of Coding and
Cryptography, Award Number: NRF-CRP2-2007-03. The material in this paper
was presented in part at the IEEE International Symposium on Information
Theory, Toronto, ON, Canada, July 2008.
J. H. Weber is with the IRCTR/CWPC, Delft University of Technology, 2628
CD Delft, The Netherlands (e-mail: J.H.Weber@tudelft.nl).
K. A. Schouhamer Immink is with the Nanyang Technological University of
Singapore, Singapore, and with Turing Machines BV, 3016 DK Rotterdam, The
Netherlands (e-mail: email@example.com).
Communicated by H.-A. Loeliger, Associate Editor for Coding Techniques.
Color versions of Figures 1–4 in this paper are available online at http://iee-
Digital Object Identiﬁer 10.1109/TIT.2010.2040868
The cardinality of a full set of balanced codewords of length
$m$ is
$$\binom{m}{m/2} \approx \frac{2^m}{\sqrt{\pi m/2}},$$
where the approximation of the central binomial coefficient follows
from Stirling’s formula. Then the redundancy of a full set
of balanced codewords is roughly equal to
$$\frac{1}{2}\log_2 m + 0.326.$$
We conclude that the redundancy of a balanced code generated
by Knuth’s algorithm falls a factor of two short with respect to
a code that uses ’full’ balanced code sets. Clearly, the loss in
redundancy is the price one has to pay for a simple construc-
tion without look-up tables. There are two features of Knuth’s
construction that could help to explain the difference in perfor-
mance, and they offer opportunities for code improvement.
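The factor of two can be checked numerically. The sketch below, with hypothetical function names, compares the exact redundancy $m - \log_2 \binom{m}{m/2}$ of the full set of balanced words with its Stirling estimate and with $\log_2 m$.

```python
from math import comb, log2, pi


def full_set_redundancy(m):
    """Exact redundancy m - log2 C(m, m/2) of the full balanced set, m even."""
    return m - log2(comb(m, m // 2))


def stirling_estimate(m):
    """Stirling-based estimate 0.5*log2(m) + 0.5*log2(pi/2)."""
    return 0.5 * log2(m) + 0.5 * log2(pi / 2)


for m in (16, 64, 256, 1024):
    print(m,
          round(full_set_redundancy(m), 3),
          round(stirling_estimate(m), 3),
          round(log2(m), 3))
```

The constant term $\frac{1}{2}\log_2(\pi/2)$ evaluates to about 0.326, and the estimate tracks the exact value more closely as $m$ grows.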
The ﬁrst feature that may offer a possibility of improving the
code’s performance stems from the fact that Knuth’s algorithm
is greedy as it takes the very ﬁrst opportunity for balancing the
codeword, that is, in Knuth’s basic scheme, the first, i.e., the
smallest, index where balance is reached is selected. In case
there is more than one position where balance can be achieved,
the encoder will thus favor smaller values of the position index.
As a result, we may expect that smaller values of the index are
more probable than larger ones. Then, if the index distribution
is non-uniform, we may conclude that the average length of the
prefix required to transmit the position information is less than
$\log_2 m$. A practical embodiment of a scheme that takes advantage
of this feature is characterized by the fact that the length of
the prefix word is not fixed, but user data dependent. The prefix
assigned to a position with a smaller, more probable, index has
a smaller length than a prefix assigned to a position with a larger
index.
Second, it has been shown by Knuth that there is always a
position where balance can be reached. It can be veriﬁed that
there is, for some user words, more than one suitable position
where balance of the word can be realized. It will be shown
later that the number of positions where words can be balanced
lies between 1 and $m/2$. This freedom offers a possibility to
improve the redundancy of Knuth’s basic construction. An enhanced
Knuth’s algorithm may transmit auxiliary data by using
the freedom of selecting from the balancing positions possible.
Assume there are $v$ positions, $1 \le v \le m/2$, where the encoder
can balance the user word, then the encoder can convey an additional
$\log_2 v$ bits. The number $v$ depends on the user word, and
therefore the amount of auxiliary data that can be transmitted is
user data dependent.
We start, in Section II, with a survey of known properties of
Knuth’s coding method. Thereafter, in Section III, we will com-
pute the distribution of the transmitted index in Knuth’s basic
scheme. Given the distribution of the index, we will compute the
entropy of the index, and evaluate the performance of a suitably
modiﬁed scheme. In Section IV, we will compute the amount
of additional data that can be conveyed in a modiﬁcation of
Knuth’s basic scheme. Section V concludes this article.
II. KNUTH’S BASIC SCHEME
Knuth’s balancing algorithm is based on the idea that there
is a simple translation between the set of all $m$-bit bipolar user
words, $m$ even, and the set of all $(m+p)$-bit codewords. This
conversion is based on the observation that in any block of data,
having an even number of binary digits, it is always possible to
find a location which defines two digit segments having equal
disparity. A balanced block can then be created by the inversion
of all the digits within either segment. The translation is
achieved by selecting a bit position $k$ within the $m$-bit word
that defines two segments, each having the same disparity. A
zero-disparity, or balanced, block is now generated by the inversion
of the first $k$ bits (or the last $m-k$ bits). The position
digit $k$ is encoded in the $p$-bit prefix. The rate of the code is
$$\frac{m}{m+p}.$$
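The proof that there is at least one position, $k$, where balance
in any even-length user word can be achieved is due to Knuth.
Let the user word be $\mathbf{w} = (w_1, \ldots, w_m)$, $w_i \in \{-1, +1\}$, and let
$d(\mathbf{w})$ be the sum, or disparity, of the user symbols, or
$$d(\mathbf{w}) = \sum_{i=1}^{m} w_i.$$
Let $z_k(\mathbf{w}) = \sum_{i=1}^{k} w_i$ be the running digital sum of the first $k$, $k \le m$, bits
of $\mathbf{w}$, and let $\mathbf{w}^{(k)}$ be the word $\mathbf{w}$ with its first $k$ bits inverted. For
$\mathbf{w} = (+1, +1, -1, +1)$ and $k = 1$, for example, we have $d(\mathbf{w}) = 2$ and
$\mathbf{w}^{(1)} = (-1, +1, -1, +1)$. We let
$\sigma_k(\mathbf{w})$ stand for $d(\mathbf{w}^{(k)})$, then the quantity $\sigma_k(\mathbf{w})$ is
$$\sigma_k(\mathbf{w}) = -\sum_{i=1}^{k} w_i + \sum_{i=k+1}^{m} w_i = d(\mathbf{w}) - 2 z_k(\mathbf{w}).$$
It is immediate that $\sigma_0(\mathbf{w}) = d(\mathbf{w})$ (no symbols inverted)
and $\sigma_m(\mathbf{w}) = -d(\mathbf{w})$ (all symbols inverted). Since $d(\mathbf{w})$ is even
when $m$ is even, we may, as $\sigma_{k+1}(\mathbf{w}) = \sigma_k(\mathbf{w}) \pm 2$,
conclude that every word $\mathbf{w}$, $m$ even,
can be associated with at least one position $k$ for which
$\sigma_k(\mathbf{w}) = 0$, or $\mathbf{w}^{(k)}$ is balanced. This concludes the proof.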
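The argument can be traced on a concrete word. The following sketch, with hypothetical function names and the assumed index range $1 \le k \le m$, lists the disparities obtained after inverting the first $k$ symbols and collects every $k$ at which the word is balanced.

```python
from itertools import product


def sigma(word):
    """Return [sigma_0, ..., sigma_m]: disparity after inverting the first k symbols."""
    d = sum(word)       # disparity of the original word
    values, z = [d], 0  # sigma_0 = d, running digital sum z_0 = 0
    for s in word:
        z += s
        values.append(d - 2 * z)
    return values


def balancing_positions(word):
    """All k in 1..m for which sigma_k equals zero."""
    return [k for k, value in enumerate(sigma(word)) if k >= 1 and value == 0]


w = (1, 1, -1, 1, 1, 1)
print(sigma(w))                # steps of +2 or -2 from d down to -d
print(balancing_positions(w))

# Exhaustive check for m = 6: every word has at least one balancing position.
assert all(balancing_positions(x) for x in product((1, -1), repeat=6))
```

Because consecutive values differ by exactly 2 and the sequence runs from $d$ to $-d$, a zero crossing cannot be skipped.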
The value of $k$ is encoded in a balanced word of length $p$,
$p$ even. The maximum word length $m$ is, since the prefix
has an equal number of “+1”s and “-1”s, governed by
$$m \le \binom{p}{p/2}.$$
In this article, we follow Knuth’s generic format, where
$1 \le k \le m$. Note that in a slightly different format, we may opt
for $0 \le k \le m$, where the encoder has the option to invert or
not to invert the codeword in case the user word is balanced.
For small values of $m$, this will lead to slightly different results,
though for very large values of $m$, the differences between the
two formats are small. Knuth described some variations on the
general framework. For example, if $m$ and $p$ are both odd, we
can use a similar construction. The redundancy of Knuth’s most
efficient construction is roughly $\log_2 m$.
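Since the index must be represented by a balanced prefix, and there are $\binom{p}{p/2}$ balanced words of even length $p$, the smallest usable prefix length can be found by direct search. The sketch below assumes one prefix word per index value $1, \ldots, m$; the function name is hypothetical, and the search does not reproduce Knuth's most efficient construction.

```python
from math import comb


def min_prefix_length(m):
    """Smallest even p such that comb(p, p // 2) >= m."""
    p = 2
    while comb(p, p // 2) < m:
        p += 2
    return p


for m in (6, 64, 1024):
    p = min_prefix_length(m)
    print(m, p, round(m / (m + p), 4))  # user length, prefix length, code rate
```

The rate $m/(m+p)$ approaches 1 for long user words, since the prefix length grows only logarithmically in $m$.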
III. DISTRIBUTION OF THE TRANSMITTED INDEX
The basic Knuth algorithm, as described above, progressively
scans the user word till it finds the first suitable position, $k$,
where the word can be balanced. In case there is more than one
position where balance can be obtained, it is expected that the
encoder will favor smaller values of the position index. Then
the distribution of the index is not uniform, and, thus, the entropy
of the index is less than $\log_2 m$, which opens the door for
a more efficient scheme. A practical embodiment of a more efficient
scheme would imply that the prefix assigned to a smaller
index has a smaller length than a prefix assigned to a larger
index. We will compute the entropy of the index sent by the
basic Knuth encoder, and in order to do so we ﬁrst compute the
probability distribution of the transmitted index. In our analysis
it is assumed that all information words are equiprobable and
independent. Let $P(k)$ denote the probability that the transmitted
index equals $k$, $1 \le k \le m$.
Theorem 1: The distribution of the transmitted index ,
, is given by
Proof: Theorem 1 follows from Lemma 3 in the Appendix and
the fact that there are $2^m$ (equally probable) sequences of length $m$.
Invoking Stirling’s approximation, we have
For , we have , and for
, we have . Fig. 1
shows two examples of the distribution for two values of $m$.
The entropy of the transmitted index, denoted by $H(m)$, is
$$H(m) = -\sum_{k} P(k) \log_2 P(k),$$
where $P(k)$ is the probability that the transmitted index equals $k$.
Given the distribution, it is now straightforward to compute the
entropy, $H(m)$, of the index. Fig. 2 shows a few results of computations.
The diagram shows that $H(m)$ is only slightly less
than $\log_2 m$, and we conclude that the above proposed modification
of Knuth’s scheme using a variable-length prefix can offer
only a small improvement in redundancy within the range of
codeword length investigated. We conclude that, at least within
this range, the proposed variable prefix-length scheme cannot
bridge the factor of two in redundancy between the basic Knuth
scheme and that of full set balanced codes.
Fig. 1. Distribution of the (normalized) transmitted index for two values of $m$.
Fig. 2. Entropy $H(m)$ versus $m$.
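For short words, the distribution of the transmitted index and its entropy can be obtained by exhaustive enumeration instead of the closed-form expression. This is a brute-force sketch under the assumptions of equiprobable user words and index range $1 \le k \le m$; the function names are hypothetical and the run time grows as $2^m$.

```python
from collections import Counter
from itertools import product
from math import log2


def first_balancing_index(word):
    """Smallest k in 1..m at which inverting the first k symbols balances word."""
    d, z = sum(word), 0
    for k, s in enumerate(word, start=1):
        z += s
        if d == 2 * z:
            return k


def index_distribution(m):
    """Probability of each transmitted index over all 2**m user words."""
    counts = Counter(first_balancing_index(w) for w in product((1, -1), repeat=m))
    return {k: counts[k] / 2 ** m for k in range(1, m + 1)}


def entropy(dist):
    """Entropy in bits of a probability distribution given as a dict."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


for m in (4, 8, 12):
    print(m, round(entropy(index_distribution(m)), 3), round(log2(m), 3))
```

The gap between the two printed columns is the saving a variable-length prefix could offer at best for that word length.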
IV. ENCODING AUXILIARY DATA
There is at least one position and there are at most $m/2$ positions
within an $m$-bit word, $m$ even, where a word can be balanced.
The “at least” one position, which makes Knuth’s algorithm
possible, was proved by Knuth (see above). The “at most”
bound will be shown in the next theorem.
Theorem 2: There are at most $m/2$ positions within an $m$-bit
word, $m$ even, where a word can be balanced.
Proof: Let $k$ denote a position where balance can be
made. Then, at the neighboring positions $k-1$ and $k+1$, such
a balance cannot be made, since inverting one more or one fewer
symbol changes the disparity by $\pm 2$. We conclude that the number
of positions where balance can be made is less than or equal to $m/2$.
Fig. 3. Distribution of the (normalized) number, $v$, of possible balancing positions for two values of $m$.
Note that the indices of a word with $m/2$ balance positions
are either all even or all odd. It can easily be verified that there
are three groups of words that can be balanced at $m/2$ positions,
• the words consisting of the cascade of the di-bits “+1 -1” or “-1 +1”,
• the words beginning with a “+1” followed by $m/2 - 1$
di-bits “+1 -1” or “-1 +1”, followed by a “+1”, and
• the inverted words of the previous case.
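The bound and the three groups of words can be checked by enumeration for small $m$. In this sketch the function names are hypothetical, the index range $1 \le k \le m$ is assumed, and the total $2^{m/2+1}$ used in the last check is our own tally of the three groups, not a figure taken from the text.

```python
from itertools import product


def num_balancing_positions(word):
    """Number of k in 1..m at which inverting the first k symbols balances word."""
    d, z, count = sum(word), 0, 0
    for s in word:
        z += s
        if d == 2 * z:
            count += 1
    return count


def zero_sum_pairs(seq):
    """True if seq splits into consecutive di-bits that each sum to zero."""
    return all(seq[i] + seq[i + 1] == 0 for i in range(0, len(seq), 2))


def in_three_groups(word):
    """True if word is a di-bit cascade, or s + di-bit cascade + s with s = +1 or -1."""
    if zero_sum_pairs(word):
        return True
    return word[0] == word[-1] and zero_sum_pairs(word[1:-1])


for m in (2, 4, 6, 8):
    words = list(product((1, -1), repeat=m))
    assert all(1 <= num_balancing_positions(w) <= m // 2 for w in words)
    assert all((num_balancing_positions(w) == m // 2) == in_three_groups(w)
               for w in words)
    print(m, sum(in_three_groups(w) for w in words))
```

The first group collects the words balanced at every even index, the other two the words balanced at every odd index.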
Since, on average, the encoder has the degree of freedom of
selecting from more than one balance position, it offers the encoder
the possibility to transmit auxiliary data. Assume there
are $v$ positions, $1 \le v \le m/2$, where the encoder can balance
the user word, then the encoder can convey an additional $\log_2 v$
bits. The number $v$ depends on the user word at hand, and therefore
the amount of auxiliary data that can be transmitted is user
data dependent.
Let denote the probability that the encoder may
choose between , , possible positions, where
balancing is possible.
Theorem 3: The distribution of the number of positions,
where an $m$-bit word, $m$ even, can be balanced is given by
Proof: Theorem 3 follows from Lemma 6 in the Appendix and
the fact that there are $2^m$ (equally probable) sequences of length $m$.
Fig. 3 shows two examples of the distribution, namely for
and . The average amount of information,
, that can be conveyed via the choice in the position data
Results of computations are shown in Fig. 4. We can recursively
compute by invoking
For large $m$ and $v$, we have
where . We approximate
Now, for large , we can approximate by
Fig. 4. The average amount of information that can be conveyed via the choice in the index as a function of $m$.
where $\gamma \approx 0.5772$ is Euler’s constant. We conclude that the average
amount of information that can be conveyed by exploiting
the choice of index compensates for the loss in rate between
codes based on Knuth’s algorithm and codes based on full bal-
anced codeword sets.
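For short words, the average amount of auxiliary information can be computed directly by enumerating all user words, which gives a reference point for the closed-form and asymptotic results. This is a brute-force sketch assuming equiprobable user words and index range $1 \le k \le m$; the function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def num_balancing_positions(word):
    """Number of k in 1..m at which inverting the first k symbols balances word."""
    d, z, count = sum(word), 0, 0
    for s in word:
        z += s
        if d == 2 * z:
            count += 1
    return count


def position_count_distribution(m):
    """Probability that a random m-bit word has exactly v balancing positions."""
    counts = Counter(num_balancing_positions(w) for w in product((1, -1), repeat=m))
    return {v: c / 2 ** m for v, c in sorted(counts.items())}


def average_auxiliary_bits(m):
    """Average of log2(v) over all 2**m user words."""
    return sum(p * log2(v) for v, p in position_count_distribution(m).items())


for m in (4, 8, 12):
    print(m, round(average_auxiliary_bits(m), 3))
```

Exhaustive enumeration is only practical for small $m$; larger word lengths call for the recursion or the approximation discussed above.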
V. CONCLUSIONS
We have investigated some characteristics and possible improvements
of Knuth’s algorithm for constructing bipolar codewords
with equal numbers of “+1”s and “-1”s. An $(m+p)$-bit
codeword is obtained after a small modification of the $m$-bit
user word plus appending a fixed-length $p$-bit prefix. The $p$-bit
prefix represents the position index within the codeword where
the modification has been made.
We have derived the distribution of the index (assuming
equiprobable user words), and have computed the entropy of
the transmitted index. Our computations show that a modification
of Knuth’s generic scheme using a variable-length prefix
of the position index will only offer a small improvement in
redundancy.
The transmitter can, in general, choose from a plurality of
indices, so that the transmitter can transmit additional infor-
mation. The number of possible indices depends on the given
user word, so that the amount of extra information that can be
transmitted is data dependent. We have derived the distribution
of the number of positions where a word can be balanced. We
have computed the average information that can be conveyed
by using the freedom of choosing from multiple indices. The
average amount of information can, for large user word length
$m$, be approximated by $\frac{1}{2}\log_2 m - 0.916$. This compensates for
the loss in code rate between codes based on Knuth’s algorithm
and codes based on full balanced codeword sets.
APPENDIX
In this Appendix, we give combinatorial proofs of Theorems
1 and 3. We first review some results on Dyck words and then
derive lemmas leading to the proofs of the theorems. We also
refer the reader to the On-Line Encyclopedia of Integer Sequences,
sequences A033820 and A112326.
A Dyck word of length $2n$ is a balanced bipolar sequence
of length $2n$ such that no initial segment has more “-1”s than
“+1”s, or in other words, $\mathbf{x}$ is a Dyck word if the running
digital sum $z_k(\mathbf{x}) \ge 0$ for all $k$, $1 \le k \le 2n$. The
number of Dyck words of length $2n$ is equal to
$$C_n = \frac{1}{n+1}\binom{2n}{n}, \tag{13}$$
which is the $n$th Catalan number [6]. For example, “+-+-” and
“++--” are the Dyck words of length 4, and “+-+-+-”,
“+-++--”, “++--+-”, “++-+--”, and “+++---” are the Dyck
words of length 6, where for clerical convenience we have
written “+” and “-” instead of “+1” and “-1”.
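The Catalan count can be confirmed by listing Dyck words for small lengths. This is a minimal sketch with hypothetical function names, which assumes the convention that every running digital sum of a Dyck word is nonnegative.

```python
from itertools import product
from math import comb


def is_dyck(word):
    """True if word is balanced and no running digital sum drops below zero."""
    z = 0
    for s in word:
        z += s
        if z < 0:
            return False
    return z == 0


def catalan(n):
    """n-th Catalan number, comb(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def dyck_words(n):
    """All Dyck words of length 2n, as tuples of +1/-1."""
    return [w for w in product((1, -1), repeat=2 * n) if is_dyck(w)]


for n in range(1, 6):
    print(2 * n, len(dyck_words(n)), catalan(n))
```

For lengths 4 and 6 the enumeration returns two and five words respectively, in line with the examples listed above.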
Let denote the set of all balanced sequences of length
without internal balancing positions, i.e., there are no balancing
positions with . Deﬁne . Note
that a sequence is in if and only if it has the format
or its inverse, where is a Dyck word of length .
Hence, for all
For example, , which is indeed the result provided by
Let denote the set of bipolar sequences of even length
for which the smallest balancing index is .
Deﬁne . We will derive an explicit
expression for (in Lemma 3), from which Theorem 1
Lemma 1: For all , it holds that
Proof: Let with of length .
We deﬁne a mapping from to by
, where is the inverse of , i.e., is the cyclic shift of
with an inversion of the last bit of . The lemma follows from
the observation that is a bijection.
Lemma 2: For all , it holds that
Proof: Let denote the set of all bipolar sequences
of length , where and is
balanced. Let with of length . We
deﬁne a mapping from to by ,
where is the symbol-wise inverse of . Since is a bijection
and the lemma follows using (14).
Lemma 3: For all , it holds that
Proof: The ﬁrst equality follows from Lemma 1. Suppose
that the second equality holds for . From Lemma 2
and thus the second equality also holds for . Since the
second equality holds for because of (14), the result
follows by induction.
Let denote the set of bipolar sequences of even length
which can be balanced in positions . De-
ﬁne . We will derive an explicit ex-
pression for (in Lemma 6), from which Theorem 3 im-
mediately follows. Any sequence with balancing
positions can be uniquely decomposed as
, where is of length , with
and . Note that is in for all
and that is in . From
these observations, we can easily derive the recursive relation
for all . Further, we have, for all , the trivial
Lemma 4: For all and satisfying , it holds
Proof: Any bipolar sequence of length containing
’ones’ can be uniquely written as , where is a Dyck
word of length , with , and is
a bipolar sequence of length containing
’s. Using (13) for Dyck word enumeration, a simple counting
argument gives the stated result.
Lemma 5: For all , it holds that
Proof: Any bipolar sequence of length having more than
’s can be uniquely written as , where is of length
, with , and is of length and has
’s. Any bipolar sequence of length containing less than
’s can be uniquely written as , where is of length
, with , and is of length and has
which concludes the proof.
Lemma 6: For all , it holds that
Proof: Assuming that the statement holds for all ,
we will show that it also holds for . For all
where the ﬁrst equality follows from (20), the second from (25)
and (14), and the third from Lemma 4 (with and
). Further, we have
where the ﬁrst equality follows from (21) (with ),
the second from (26), and the third from Lemma 5 (with
). Hence, if the statement in the lemma holds for all
, then it holds for as well. Since (21) gives that
, (25) holds for , and the lemma follows by
induction on .
REFERENCES
[1] K. A. S. Immink, Codes for Mass Data Storage Systems, 2nd ed.
Eindhoven, The Netherlands: Shannon Foundation Publishers, 2004.
[2] D. E. Knuth, “Efficient balanced codes,” IEEE Trans. Inf. Theory, vol.
IT-32, pp. 51–53, Jan. 1986.
[3] N. Alon, E. E. Bergmann, D. Coppersmith, and A. M. Odlyzko,
“Balancing sets of vectors,” IEEE Trans. Inf. Theory, vol. IT-34, pp.
128–130, Jan. 1988.
[4] S. Al-Bassam and B. Bose, “On balanced codes,” IEEE Trans. Inf.
Theory, vol. 36, pp. 406–408, Mar. 1990.
[5] L. G. Tallini, R. M. Capocelli, and B. Bose, “Design of some new
balanced codes,” IEEE Trans. Inf. Theory, vol. 42, pp. 790–802, May
1996.
[6] R. P. Stanley, Enumerative Combinatorics. New York: Cambridge
University Press, 1999, vol. 2.
Jos H. Weber (S’87–M’90–SM’00) was born in Schiedam, The Netherlands,
in 1961. He received the M.Sc. (in mathematics, with honors), Ph.D., and MBT
(Master of Business Telecommunications) degrees from Delft University of
Technology, Delft, The Netherlands, in 1985, 1989, and 1996, respectively.
Since 1985, he has been with the Faculty of Electrical Engineering, Mathe-
matics, and Computer Science of Delft University of Technology. Currently, he
is an associate professor at the Wireless and Mobile Communications Group.
He is the chairman of the WIC (Werkgemeenschap voor Informatie- en Com-
municatietheorie in de Benelux) and the secretary of the IEEE Benelux Chapter
on Information Theory. He was a Visiting Researcher at the University of Cal-
ifornia at Davis, the University of Johannesburg, South Africa, and the Tokyo
Institute of Technology, Japan. His main research interests are in the areas of
channel and network coding.
Kees A. Schouhamer Immink (M’81–SM’86–F’90) received the Ph.D. degree
from the Eindhoven University of Technology, The Netherlands.
He founded and was named President of Turing Machines, Inc., in 1998. He
has, since 1994, been an Adjunct Professor at the Institute for Experimental
Mathematics, Essen University, Germany, and is afﬁliated with the Nanyang
Technological University of Singapore. He designed coding techniques of a
wealth of digital audio and video recording products, such as compact disc,
CD-ROM, CD-video, digital compact cassette system, DCC, DVD, video disc
recorder, and blu-ray disc.
Dr. Immink received a Knighthood in 2000, a personal “Emmy” award in
2004, the 1996 IEEE Masaru Ibuka Consumer Electronics Award, the 1998
IEEE Edison Medal, 1999 AES Gold and Silver Medals, and the 2004 SMPTE
Progress Medal. He was named a Fellow of the IEEE, AES, and SMPTE, and
was inducted into the Consumer Electronics Hall of Fame, and elected into the
Royal Netherlands Academy of Sciences and the US National Academy of En-
gineering. He served the profession as President of the Audio Engineering So-
ciety inc., New York, in 2003.