Efficient encoding of constrained block codes
Kees A. Schouhamer Immink, Fellow, IEEE, and Kui Cai, Senior Member, IEEE
Abstract—We present coding methods for generating ℓ-symbol constrained codewords taken from a set, S, of allowed codewords. In standard practice, the size of the set S, denoted by M = |S|, is truncated to an integer power of two, which may lead to a serious waste of capacity. We present an efficient and low-complexity coding method for avoiding the truncation loss, where the encoding is accomplished in two steps: first, a series of binary input (user) data is translated into a series of M-ary symbols in the alphabet M = {0, ..., M − 1}. Then, in the second step, the M-ary symbols are translated into a series of admissible ℓ-symbol words in S by using a small look-up table. The presented construction of Pearson codes and fixed-weight codes offers a rate close to capacity. For example, the presented 255B320B balanced code, where 255 source bits are translated into 32 10-bit balanced codewords, has a rate 0.1 % below capacity.
Keywords—constrained code, code design, binary block code, balanced code, Pearson code.
I. INTRODUCTION
Constrained codes have found widespread application in a large variety of communication systems such as cable transmission [1, 2], vehicular communications systems, visible light communications (VLC) systems [3], and data storage products ranging from magnetic, optical, and solid-state (Flash) media to DNA. Runlength limited (RLL) codes [4] use codewords with restrictions on the minimum and maximum runlength (that is, the number of consecutive like symbols) of the encoded sequence. RLL codes are ubiquitous in optical disc and magnetic recording products [5], VLC [3, 6], and DNA-based storage media [7, 8]. Balanced and almost balanced codes employ codewords with equal, or almost equal, numbers of 1's and 0's [5]. A typical example of an almost balanced code is the 8B10B code. The 8B10B code has many embodiments [2, 9], and is widely used in gigabit telecommunication systems and data storage media. Combinations of RLL and balanced constraints are found in codes for data storage, energy harvesting, and communications [10, 11, 12, 13].
The codewords of a constrained block code are taken from a selected repertoire, S, S ⊆ Q^ℓ, of admissible codewords x = (x_1, x_2, ..., x_ℓ), x_i ∈ Q. The number of admissible codewords is denoted by M = |S|, the size of S. We do not concern ourselves with the selection or pairing of codewords. Each admissible codeword will be uniquely represented by an integer symbol taken from the alphabet M = {0, ..., M − 1}.
Kees A. Schouhamer Immink is with Turing Machines Inc, Willemskade 15d, 3016 DK Rotterdam, The Netherlands. E-mail: immink@turing-machines.com.
Kui Cai is with Singapore University of Technology and Design (SUTD), Science, Mathematics, and Technology Cluster, 8 Somapah Rd, 487372, Singapore. E-mail: cai_kui@sutd.edu.sg.
This work is supported by Singapore Ministry of Education Academic
Research Fund Tier 2 MOE2019-T2-2-123 and RIE2020 Advanced Manu-
facturing and Engineering (AME) programmatic grant A18A6b0057.
The information capacity, denoted by C, of the channel using constrained codewords equals

$C = \log_2 M.$   (1)
In standard practice, the size of the original set S of admissible codewords is truncated to the nearest integer power of two, 2^⌊C⌋, by judiciously deleting the surplus, M − 2^⌊C⌋, words, which may lead to a serious waste of capacity. A notorious example is the binary Pearson code, where only one word, the all-1 or all-0 word, is excluded [14]. With prior art fixed-block codes the rate equals (ℓ−1)/ℓ, which entails a significant redundancy for small ℓ. Another well-known example is M = 252 (the number of 10-bit balanced (binary) codewords with equal numbers of 1's and 0's), where the truncation to 2^7 = 128 codewords leads to an information rate waste of around 10 %. By combining ten 10-bit balanced words, we can translate 79 bits into ten 10-bit balanced codewords. The redundancy is then less than one percent, but the improvement comes at the cost of more complex look-up translation tables. There is a need to improve the rate efficiency of constrained codes without resorting to complex look-up tables.
We present and investigate a new encoding procedure that aims to improve the code rate efficiency without the need for large look-up tables. The new encoding method is accomplished in two steps. First, in the key step, binary source data are efficiently translated into a series of integer symbols in the alphabet M that are conveniently represented by q-bit binary words, where q is an integer satisfying q = ⌈C⌉. In the second step, the series of q-bit words is translated into a series of admissible codewords in S. The first encoding step scales linearly with the number of symbols in the codeword.
The paper is organized as follows. In Section II, we start with a survey of the properties of the radix conversion scheme. Section III presents the new coding technique and its rate efficiency. An alternative scheme, the variable-length-to-fixed-length scheme, is investigated in Section IV. Applications to binary Pearson codes and balanced codes are given in Section V. We present a high-rate 255B320B balanced code, where 255 source bits are translated into 32 10-bit balanced codewords. Our conclusions are presented in Section VI.
II. RADIX CONVERSION SCHEME
A straightforward method for translating an n-bit source file into a series of integer symbols in the alphabet M is base or radix conversion. The binary n-bit input word, considered as a number in radix 2, is converted into L_o integer symbols in M, where L_o is a user-defined positive integer [15]. The L_o-symbol radix-M integer, in turn, is translated, using a look-up table, into the corresponding admissible word. The number of distinct integers that can be addressed with L_o symbols in an M-radix system equals M^{L_o}, so that for a code to exist we must have 2^n ≤ M^{L_o}, or

$n = \lfloor L_o C \rfloor.$   (2)
An integer symbol in the alphabet M can carry at most C bits, so it is natural to define the rate efficiency of an encoder as the quotient of the (average) number of bits that are translated per symbol and the capacity C. Then, the rate efficiency of the radix-2-to-M conversion, denoted by R_o(L_o), equals

$R_o(L_o) = \frac{n}{L_o C} = \frac{\lfloor L_o C \rfloor}{L_o C}.$   (3)

The rate efficiency of a simple look-up table, L_o = 1, equals R_o(1) = ⌊C⌋/C. The best-case rate efficiency is obtained when M is an integer power of two; then R_o(L_o) = 1, and the coding step is lossless.
Example 1: Let M = 252. Then, we can transmit at most C = log_2(252) = 7.977 bits per symbol. A simple binary encoder, using a look-up table, has a rate R_o(1) = 7/7.977, which implies a 10 % relative rate loss. Let, for example, L_o = 10; then n = 79, so that the (relative) code redundancy of the radix conversion scheme equals 1 − R_o(10) = 1 − 79/79.77 ≈ 0.0097.
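To illustrate the radix conversion numerically, the following Python sketch (our own illustration, not the authors' implementation; the function names are ours) converts an n-bit source word into L_o radix-M symbols and back, reproducing the parameters of Example 1.

import math

def bits_to_symbols(bits, M, Lo):
    """Radix conversion: read `bits` as a radix-2 integer and rewrite it
    as Lo digits in radix M, most significant digit first."""
    value = int("".join(map(str, bits)), 2)
    symbols = []
    for _ in range(Lo):
        symbols.append(value % M)
        value //= M
    assert value == 0, "source word too long for Lo radix-M symbols"
    return symbols[::-1]

def symbols_to_bits(symbols, M, n):
    """Inverse conversion back to an n-bit source word."""
    value = 0
    for s in symbols:
        value = value * M + s
    return [int(b) for b in format(value, f"0{n}b")]

# Example 1: M = 252 and Lo = 10 give n = floor(Lo * log2 M) = 79 source bits.
M, Lo = 252, 10
n = math.floor(Lo * math.log2(M))             # 79
src = [1, 0] * 39 + [1]                       # an arbitrary 79-bit source word
symbols = bits_to_symbols(src, M, Lo)         # ten symbols in {0, ..., 251}
assert symbols_to_bits(symbols, M, n) == src  # the conversion is lossless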
A drawback of the radix conversion scheme, unless M is an integer power of two, is the increasing complexity with growing codeword length n, as we require n-bit addition/subtraction units and storage of the (M−1) × L_o coefficients of n-bit width. Let M be close to an integer power of two, say M = 2^u + v, where u and v are integers, u > 0, |v| ≪ 2^u; then C ≈ u + v/(2^u ln 2). Let v > 0; then R_o(1) = u/C, and we simply find that R_o(L_o) > R_o(1) for L_o ≳ 2^u ln(2)/v ≈ 0.693 · 2^u/v. In other words, in order to improve the rate with respect to that of a simple look-up table, R_o(1), we must increase L_o. For example, let M = 33; then C = log_2(33) ≈ 5.0444 and R_o(1) ≈ 5/C = 0.9911. We easily find that L'_o = 23 is the smallest L_o that can increase the rate, to R_o(23) = ⌊L'_o C⌋/(L'_o C) = 116/(23C) ≈ 0.9998 > R_o(1). In the next section, we describe a simple method for efficiently generating codewords that raises fewer complexity concerns.
III. DESCRIPTION OF THE NEW CODING METHOD
A. Basic encoder
An integer in M is represented by a q-bit word, q = ⌈log_2 M⌉, taken from a constrained set, C, where |C| = M. The aim of the new coding method is to efficiently translate binary source data into a series of q-bit words in C. The binary source data are assumed to be represented by (n−1)-bit words, denoted by (a_1, ..., a_{n−1}), a_i ∈ B = {0, 1}, where n is a conveniently chosen integer. The (n−1)-bit source word is translated, using the new algorithm, into L q-bit words, where qL = n. The q-bit words are denoted by u_i, 1 ≤ i ≤ L, where the first word, u_1 = (p, a_1, ..., a_{q−1}), called the pivot word, contains q−1 user bits plus a redundant bit, called the pivot bit, denoted by p, p ∈ B. The value of the pivot bit, p, is set by the encoder, as described below. The remaining (L−1) q-bit words carry the remaining input bits: u_i = (a_{(i−1)q}, ..., a_{iq−1}), 2 ≤ i ≤ L.
B. Description of the encoding and decoding algorithms
For clerical convenience we define two functions: dec(y) denotes the decimal representation of the q-bit word y = (y_1, ..., y_q), and, vice versa, y = bin_q(z) denotes the q-bit binary representation, y, of the integer z, 0 ≤ z ≤ 2^q − 1. Clearly, dec(y) = z. All variables are integers; the boldface variables denote q-bit words. Let w = 2^q − M denote the number of inadmissible words u, dec(u) < w. At the conclusion of the algorithm, all inadmissible q-bit words u_i, dec(u_i) < w, have been eliminated and replaced by admissible q-bit words, so that dec(u_i) ≥ w for all i.
Encoding routine
Input: The integers q, w = 2^q − M, L, and the binary (Lq − 1)-bit source data (a_1, ..., a_{Lq−1}).
Output: Series of encoded q-bit words u_i, where dec(u_i) ≥ w, 1 ≤ i ≤ L.
Initialize: Define the L q-bit words u_1 = (1, a_1, ..., a_{q−1}) and u_i = (a_{(i−1)q}, ..., a_{iq−1}), 2 ≤ i ≤ L. Set v = 1.
Replacing inadmissible words:
for i = 2 : L
    if dec(u_i) < w
        t = dec(u_i); u_i = u_v; u_v = bin_q((i−1)w + t); v = i;
    end
end.
Note that the pivot bit equals ‘1’ in case the user data is sent
unmodified, or it equals ‘0’ in case at least one word has
been modified. As a result, the receiver can easily detect that
modifications have been effected. Decoding of the received
codeword can be accomplished in a straightforward way by
recursively undoing the replacements.
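The encoding routine translates directly into a few lines of code. The following Python sketch is our own transcription of the routine above (not the authors' reference code); it uses 0-based indexing, so the pointer value stored at the old pivot position is i·w + t for the 0-based position i of the replaced word.

def encode(bits, q, M):
    """Pivot-based replacement encoder: map (L*q - 1) source bits onto
    L admissible q-bit words u with dec(u) >= w, where w = 2**q - M."""
    w = 2 ** q - M
    L = (len(bits) + 1) // q
    assert len(bits) == L * q - 1, "source length must be L*q - 1 bits"
    # words[0] is the pivot word: pivot bit '1' followed by q-1 user bits.
    words = [(1 << (q - 1)) | int("".join(map(str, bits[:q - 1])), 2)]
    words += [int("".join(map(str, bits[i * q - 1:(i + 1) * q - 1])), 2)
              for i in range(1, L)]
    v = 0                          # index of the current pivot word
    for i in range(1, L):
        if words[i] < w:           # inadmissible word found
            t = words[i]           # its original value
            words[i] = words[v]    # move the current pivot word here
            words[v] = i * w + t   # old pivot position now points to i
            v = i
    return words                   # every entry now satisfies words[i] >= w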
Decoding routine
Input: The integers q, w = 2^q − M, L, and the L q-bit words u_i, 1 ≤ i ≤ L, encoded by the above routine.
Output: Series of decoded q-bit words, denoted by û_i, 1 ≤ i ≤ L. The (Lq − 1)-bit source data (a_1, ..., a_{Lq−1}) are found after a reshuffling of the û_i.
Restoring the source data (÷ denotes integer division):
for i = 1 : L   û_i = u_i   end
if dec(u_1) < 2^{q−1}
    v = 1; c = dec(u_v);
    while (c < 2^{q−1})
        v = 1 + c ÷ w;
        û_v = bin_q(c − (v − 1)w);
        c = dec(u_v);
    end
    û_1 = bin_q(c − 2^{q−1});
end.
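A matching transcription of the decoding routine, again our own sketch with 0-based indexing; the received words are kept unmodified so that the replacement chain can be followed from them.

def decode(words, q, M):
    """Invert encode(): recover the (L*q - 1) source bits from the L
    received q-bit words."""
    w = 2 ** q - M
    half = 2 ** (q - 1)
    u = list(words)                 # decoded words, initialised to the input
    if u[0] < half:                 # pivot bit '0': replacements occurred
        c = words[0]
        while c < half:             # follow the replacement chain
            v = c // w              # 0-based position of a replaced word
            u[v] = c - v * w        # restore its original (inadmissible) value
            c = words[v]            # next link of the chain
        u[0] = c - half             # restore the data bits of the pivot word
    # Drop the pivot bit of the first word and concatenate all user bits.
    bits = [int(b) for b in format(u[0], f"0{q}b")[1:]]
    for x in u[1:]:
        bits += [int(b) for b in format(x, f"0{q}b")]
    return bits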
The arithmetic of the algorithms is easily embodied in a
look-up table. Two worked encoding examples will be helpful
to understand the encoding algorithm.
Example 2: Let q = 4, L = 4, w = 2 ('0000' and '0001' are forbidden words). Let the user data be '000 0001 0000 1111'. After prepending the pivot bit '1', we obtain the sequence '1000 0001 0000 1111'. At the start, set v = 1. We find the first inadmissible word at position i_1 = 2, and we replace it by the pivot word, that is, u_2 = u_1 = '1000'. Let w_{i_1} = 1 denote the value of u_2 before the replacement; then the pivot word becomes u_1 = bin_4((i_1 − 1)w + w_{i_1}) = bin_4((2 − 1)2 + 1) = bin_4(3) = '0011'. We obtain the intermediate result '0011 1000 0000 1111'. Set v = i_1 = 2. The second inadmissible word is found at index position i_2 = 3. We now set u_3 = u_v = u_2 = '1000' and u_2 = bin_4((i_2 − 1)w + w_{i_2}) = bin_4(4 + 0) = '0100', and we obtain the final result '0011 0100 1000 1111'.
Example 3: Let, as above, q = 4, L = 4, w = 2. Let the
user data be ‘000 0000 0000 0000’. Without much ado, we
write down the intermediate results ‘0010 1000 0000 0000’,
‘0010 0100 1000 0000’, and ‘0010 0100 0110 1000’. The
sent codeword is ‘0010 0100 0110 1000’. In case the source
word is ‘001 0001 0001 0001’, we obtain the codeword ‘0011
0101 0111 1001’.
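As a round-trip check, running the two sketches above on the data of Example 2 (q = 4 and M = 14, so that w = 2) reproduces the codeword given there:

src = [int(b) for b in "000000100001111"]        # 15 source bits of Example 2
cw = encode(src, q=4, M=14)                      # w = 2**4 - 14 = 2
assert cw == [0b0011, 0b0100, 0b1000, 0b1111]    # '0011 0100 1000 1111'
assert decode(cw, q=4, M=14) == src              # lossless round trip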
Unique decoding is possible if the number, L, of q-bit words that can maximally be translated equals [16]

$L = \left\lfloor \frac{2^{q-1}}{2^q - M} \right\rfloor,$   (4)

so that the rate efficiency of the code, denoted by R_1(M), is at most

$R_1(M) = \frac{Lq - 1}{LC}.$   (5)

If 2^q − M = 2^t is a power of two, 1 ≤ t ≤ q − 1, we simply find

$R_1(M) = \frac{q - 2^{t+1-q}}{C}.$   (6)
In case L = 1, that is, when M is in the range

$2^{q-1} < M \le \tfrac{3}{2}\, 2^{q-1} - 1,$   (7)

the algorithm does not improve the rate efficiency with respect to a simple look-up table. The worst-case rate efficiency equals

$R_1\!\left(\tfrac{3}{2}\, 2^{q-1} - 1\right) \approx \frac{q-1}{q - 2 + \log_2 3}.$   (8)

We may, in some instances, improve the rate efficiency by combining r, r ≥ 1, M-ary symbols into a single r-symbol word.
C. Combining words
The number of combined admissible r-symbol words equals M^r, so that we obtain

$q' = \lceil r \log_2 M \rceil,$   (9)

$L' = \left\lfloor \frac{2^{q'-1}}{2^{q'} - M^r} \right\rfloor.$   (10)
Fig. 1. Redundancy 1 − R_r(M) versus C = log_2 M for r = 1 and 2. The curve denoted by VF is discussed in Section IV.
Fig. 2. Redundancy 1 − R̂_3(M) versus C = log_2 M, where R̂_3(M) denotes the largest of R_1(M), R_2(M), and R_3(M).
Let R_r(M) denote the rate efficiency of the encoder; then

$R_r(M) = \frac{L' q' - 1}{L' r C} = R_1(M^r).$   (11)

Figure 1 shows the redundancy 1 − R_r(M) versus C = log_2 M for r = 1 and 2. The encoder cannot improve its rate efficiency with respect to an encoder using a simple look-up table if L' = 1, that is, when M is in the range (2^{q'−1})^{1/r} < M < ((3/2) 2^{q'−1})^{1/r}, a range which gets smaller with increasing r.

We may further observe that R_r(M) > R_1(M), r > 1, does not hold for all M. Let R̂_m(M) denote the highest R_r(M) achievable for a selected r = 1, ..., m, that is, R̂_m(M) = max{R_1(M), R_2(M), ..., R_m(M)}. Figure 2 shows the redundancy 1 − R̂_3(M) as a function of log_2 M.
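The quantities in (4)-(11) are easy to evaluate numerically; the following small Python helper (ours, for illustration only) computes R_r(M) and R̂_m(M):

import math

def rate_efficiency(M, r=1):
    """R_r(M) = R_1(M**r) of (11), with q' = ceil(log2 M**r) as in (9)
    and L' = floor(2**(q'-1) / (2**q' - M**r)) as in (10)."""
    Mr = M ** r
    q = (Mr - 1).bit_length()           # q' = ceil(log2 M**r)
    if 2 ** q == Mr:                    # M**r is a power of two: lossless
        return 1.0
    L = 2 ** (q - 1) // (2 ** q - Mr)   # L' of (10)
    return (L * q - 1) / (L * r * math.log2(M))

def best_rate(M, m=3):
    """R-hat_m(M): the largest of R_1(M), ..., R_m(M)."""
    return max(rate_efficiency(M, r) for r in range(1, m + 1))

# For M = 252 (ten-bit balanced words): R_1(252) = 255/(32*log2 252) ~ 0.999.
print(rate_efficiency(252), best_rate(252))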
D. Encoder complexity
The complexity of the encoder scales with L. If the source data are random, then the probability of t replacements, 0 ≤ t ≤ L, in the L q-bit words follows a binomial distribution. The average number of replacements, denoted by µ, is

$\mu = \frac{Lw}{2^q}.$   (12)

From (4), we have µ ≤ 1/2. The probability that no word is altered equals (1 − w/2^q)^L.

TABLE I
ENCODER TABLE OF VF CODE FOR M = 5.

input   output
000     0
001     1
01      2
10      3
11      4
IV. VARIABLE-TO-FIXED (VF) LENGTH ENCODING
Cao and Fair have investigated the application of variable-to-fixed length codewords (VF codes) for constrained systems [17, 18]. Again, let q = ⌈log_2 M⌉. In the VF scheme, the M source words may have two lengths, namely q − 1 and q. We define v_1 = 2^q − M source words of length q − 1 and v_2 = M − v_1 = 2M − 2^q source words of length q. The M source words are assigned to the M integers taken from M.
Example 4: Let M = 5; then q = 3, v_1 = 2^q − M = 3, and v_2 = 2. Define the five source words 000, 001, 01, 10, 11, of lengths 3 and 2, respectively. The M = 5 source words are arbitrarily assigned to the integers 0, ..., 4. Table I shows a possible assignment. Let the input string be 001011000011. The encoder parses the input string into the words 001, 01, 10, 000, 11 and translates them, using Table I, into the output string 1, 2, 3, 0, and 4.
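A short Python sketch (our own illustration) of the VF parsing of Example 4, with the dictionary of Table I hard-coded; because the dictionary is prefix-free, at most one entry matches at every position:

# Prefix-free dictionary of Table I for M = 5 (q = 3, v1 = 3, v2 = 2).
VF_TABLE = {"000": 0, "001": 1, "01": 2, "10": 3, "11": 4}

def vf_encode(bitstring):
    """Parse the source string into dictionary words and map each word to
    an integer symbol in {0, ..., M-1}."""
    symbols, i = [], 0
    while i < len(bitstring):
        for word, symbol in VF_TABLE.items():
            if bitstring.startswith(word, i):   # at most one entry matches
                symbols.append(symbol)
                i += len(word)
                break
        else:
            raise ValueError("input ends inside a dictionary word")
    return symbols

# Example 4: '001 01 10 000 11' is translated into 1, 2, 3, 0, 4.
assert vf_encode("001011000011") == [1, 2, 3, 0, 4]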
Assuming independent and identically distributed random input data, the average rate efficiency of the VF code, denoted by R_vl(M), equals

$R_{vl}(M) = \frac{1}{C}\left(\frac{q-1}{2^{q-1}}\, v_1 + \frac{q}{2^q}\, v_2\right) = \frac{1}{C}\left(q - 2 + \frac{M}{2^{q-1}}\right).$   (13)
The worst-case rate efficiency equals ⌊C⌋/C, which is the same as that of a simple block code. Note that, see (6), for M = 2^q − 2^t we have R_vl(M) = R_1(M). Figure 3 shows the redundancy 1 − R_vl(M) versus log_2 M. A survey plotted in Figure 1 shows the redundancy 1 − R_vl(M) for the variable-to-fixed length (VF) encoding and 1 − R_r(M), r = 1, 2, for the new method, versus C = log_2 M.
V. APPLICATIONS
In this section, we present applications to Pearson codes
and fixed-weight codes.
Fig. 3. Redundancy 1 − R_vl(M) of the VF scheme versus C = log_2 M.
A. Binary Pearson codes
Pearson codes have been advocated for channels whose gain and/or offset are unknown [19]. For binary channels, Q = {0, 1}, with unknown offset (the offset channel), it suffices to forbid the all-0 word (or the all-1 word), and for channels with both unknown gain and offset (the offset/gain channel), we forbid both the all-0 and all-1 words. Let the codeword size be q; then we simply have M = 2^q − 1, M' = 2^q − 2, and C = log_2 M, C' = log_2 M' for the offset and offset/gain channel, respectively. Although only one or two codewords are barred, prior art block codes face a serious loss for small q. By invoking the new coding method, we are able to improve the rate efficiency to R_1(2^q − 1) = (q − 2^{1−q})/C or R'_1(2^q − 2) = (q − 2^{2−q})/C', see (6). For the VF code we find the same rate efficiency results, namely R_vl(2^q − 1) = R_1(2^q − 1) and R'_vl(2^q − 2) = R'_1(2^q − 2), respectively, which accords with the results presented in [17, 18, 20].
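For reference, a few lines of Python (ours, for illustration) evaluate these Pearson-code rate efficiencies directly from (6):

import math

def pearson_rates(q):
    """Rate efficiencies of q-bit binary Pearson codes, per (6):
    offset channel (M = 2**q - 1, t = 0) and
    offset/gain channel (M' = 2**q - 2, t = 1)."""
    C  = math.log2(2 ** q - 1)
    Cp = math.log2(2 ** q - 2)
    R1  = (q - 2 ** (1 - q)) / C
    R1p = (q - 2 ** (2 - q)) / Cp
    return R1, R1p

print(pearson_rates(8))    # already close to 1 for modest q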
B. Fixed-weight codes
The weight of a binary codeword is the number of its symbols equal to '1'. A balanced code is a fixed-weight code whose codewords have equal numbers of 1's and 0's. The 6B8B [21] and 4B6B [3] codes are examples of balanced codes, where the short-hand notation mBℓB refers to codes that translate an m-bit input word into an ℓ-bit codeword. Balanced codes, such as the 6B8B and 5B10B codes, have a minimum Hamming distance of at least two. The 5B10B code features a minimum Hamming distance of 4 [22], which offers a greater noise resilience at the cost of a higher redundancy. Note that the 8B10B [2] code is not balanced, as it uses codewords of weight 4, 5, and 6. The codewords with weight 4 or 6 are sent alternately for balancing the numbers of 0's and 1's, so that the concatenation of codewords is almost balanced [2]. The minimum Hamming distance of the 8B10B code is unity.
We have applied the new coding method to the construction of balanced codes. Table II shows the rate efficiency of the new construction versus the codeword length ℓ, where M = $\binom{\ell}{\ell/2}$.
TABLE II
PERFORMANCE OF ℓ-BIT BALANCED CODES.

ℓ    M = $\binom{\ell}{\ell/2}$    ⌊C⌋/C    R_1(M)    R_vl(M)
8    70        0.979    0.979    0.994
10   252       0.877    0.999    0.999
12   924       0.914    0.995    0.995
14   3432      0.937    0.993    0.994
16   12870     0.952    0.989    0.994
Except for the case ℓ = 8, the resulting rate efficiency is close to capacity; in most cases the redundancy is less than half a percent. Example 5, ℓ = 10, details the construction of a rate 255/320, 255B320B balanced code with minimum Hamming distance two, whose rate is 0.3 % lower than that of an 8B10B code with a minimum Hamming distance of unity.
Example 5: There are M = 252 10-bit balanced words. A straightforward implementation of a block code translates 7 source bits into 10 channel bits. We may improve the efficiency by combining codewords, see Example 1, but its implementation requires impracticably large look-up tables. With the new scheme, we find q = ⌈log_2 M⌉ = 8 and w = 2^8 − 252 = 4. Then Lq − 1 = 255 source bits can be encoded into L = 32 (= 2^{q−1}/w) 10-bit balanced words. The rate efficiency is 0.999, see Table II. The new encoding method requires data storage of 32 bytes, the execution of the encoding algorithm, and a small look-up table for translating an 8-bit wide word into a 10-bit balanced word.
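The parameters of Example 5, and the R_1(M) column of Table II, can be reproduced with the following Python sketch (our own; the enumeration of the M balanced codewords themselves is left to the look-up table):

import math

def balanced_code_parameters(ell):
    """For ell-bit balanced codewords: M = binom(ell, ell/2), q = ceil(log2 M),
    w = 2**q - M, L = floor(2**(q-1)/w), and the rate efficiency R_1(M)."""
    M = math.comb(ell, ell // 2)
    q = (M - 1).bit_length()            # ceil(log2 M)
    w = 2 ** q - M
    L = 2 ** (q - 1) // w
    R1 = (L * q - 1) / (L * math.log2(M))
    return M, q, w, L, R1

# Example 5: ell = 10 gives M = 252, q = 8, w = 4, L = 32, i.e. 255 source
# bits are encoded into 32 balanced 10-bit words, R_1(252) ~ 0.999.
print(balanced_code_parameters(10))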
VI. CONCLUSIONS
We have presented an encoding method for efficiently translating binary source data into a series of integer symbols in the alphabet {0, ..., M − 1}. The series of integer symbols is translated, using a second encoder, into a series of constrained codewords. We have compared the rate efficiency of the new scheme with that of variable-to-fixed (VF) length codes. As application examples, we have presented constructions of Pearson codes and fixed-weight and balanced codes that offer a rate close to capacity. We have presented a high-rate 255B320B balanced code, where 255 source bits are translated into 32 10-bit balanced codewords; its rate is 0.1 % below capacity, and the minimum Hamming distance between the 10-bit words is two.
REFERENCES
[1] K. Balasubramanian, S. S. Agili and A. Morales, “Encoding
and compensation schemes using improved pre-equalization for the
64B/66B Encoder,” 2012 IEEE International Conference on Con-
sumer Electronics (ICCE), Las Vegas, NV, pp. 361-363, 2012, doi:
10.1109/ICCE.2012.6161902.
[2] A. X. Widmer and P. A. Franaszek, “A Dc-balanced, Partitioned-Block,
8B/10B Transmission Code,” IBM J. Res. Develop., vol. 27, no. 5, pp.
440-451, Sept. 1983, doi: 10.1147/rd.275.0440.
[3] “IEEE standard for local and metropolitan area networks, part
15.7-2011: Short range wireless optical communication using vis-
ible light,” IEEE Std 802.15.7-2011, pp. 1-309, Sept 2011, doi:
10.1109/IEEESTD.2011.6016195.
[4] B. H. Marcus, P. H. Siegel, and J. K. Wolf, “Finite-state Modulation
Codes for Data Storage,” IEEE Journal on Selected Areas in Commu-
nications, vol. 10, no. 1, pp. 5-37, Jan. 1992, doi: 10.1109/49.124467.
[5] P. H. Siegel, “Recording Codes for Digital Magnetic Storage,” IEEE
Transactions on Magnetics, vol. MAG-21, no. 5, pp. 1344-1349, Sept.
1985, doi: 10.1109/TMAG.1985.1063972.
[6] Z. Wang, Q. Wang, W. Huang, and Z. Xu, Visible Light Communica-
tions: Modulation and Signal Processing, Wiley-IEEE Press, Jan 2018.
[7] M. Blawat, K. Gaedke, I. Hutter, X. Cheng, B. Turczyk, S. Inverso, B. W.
Pruitt, and G. M. Church, “Forward Error Correction for DNA Data Stor-
age,” International Conference on Computational Science (ICCS 2016),
vol. 80, pp. 1011-1022, 2016, doi: 10.1016/j.procs.2016.05.398.
[8] K. A. S. Immink and K. Cai, “Properties and Constructions of Con-
strained Codes for DNA-based Data Storage,” IEEE Access, vol. 8, pp.
49523-49531, 2020, doi: 10.1109/ACCESS.2020.2980036.
[9] S. Fukuda, Y. Kojima, Y. Shimpuku, and K. Odaka, “8/10 Mod-
ulation Codes for Digital Magnetic Recording,” IEEE Transactions
on Magnetics, vol. MAG-22, no. 5, pp. 1194-1196, Sept. 1986, doi:
10.1109/TMAG.1986.1064445.
[10] V. Braun and K. A. S. Immink, “An Enumerative Coding Technique
for DC-free Runlength-Limited Sequences,” IEEE Transactions on
Communications, vol. 48, no. 12, pp. 2024-2031, Dec. 2000, doi:
10.1109/26.891213.
[11] K. A. S. Immink and K. Cai, “Properties and constructions of
energy-harvesting sliding-window constrained codes,” IEEE Commu-
nications Letters, vol. 24, no. 9, pp. 1890-1893, Sept. 2020, doi:
10.1109/LCOMM.2020.2993467.
[12] K. A. S. Immink, “A New DC-free Runlength Limited Coding Method
for Data Transmission and Recording,” IEEE Transactions on Con-
sumer Electronics, vol. CE-65, no. 4, pp. 502-505, Nov. 2019, doi:
10.1109/TCE.2019.2932795.
[13] K. A. S. Immink and K. Cai, “Spectral Shaping Codes,” IEEE Trans-
actions on Consumer Electronics, vol. CE-67, no. 2, pp. 158-165, May
2021, doi: 10.1109/TCE.2021.3073199.
[14] J. H. Weber, K. A. S. Immink, and S. R. Blackburn, “Pearson Codes,”
IEEE Transactions on Information Theory, vol. IT-62, no. 1, pp. 131-
135, Jan. 2016, doi: 10.1109/TIT.2015.2490219.
[15] D. E. Knuth, “Positional Number Systems,” The Art of Computer
Programming, vol. 2: Semi-numerical Algorithms, 3rd ed. Reading, MA:
Addison-Wesley, pp. 195-213, 1998.
[16] K. A. S. Immink, “High-Rate Maximum Runlength Constrained Coding
Schemes Using Nibble Replacement,” IEEE Transactions on Infor-
mation Theory, pp. 6572-6580, vol. IT-58, no. 10, Oct. 2012, doi:
10.1109/TIT.2012.2204034.
[17] C. Cao and I. Fair, “Construction of Multi-State Capacity-Approaching
Variable-Length Constrained Sequence Codes With State-Independent
Decoding,” IEEE Access, vol. 7, pp. 54746-54759, 2019, doi:
10.1109/ACCESS.2019.2913339.
[18] C. Cao and I. Fair, “Capacity-Approaching Variable-Length Pearson
Codes,” IEEE Communications Letters, vol. 22, no. 7, pp. 1310-1313,
July 2018, doi: 10.1109/LCOMM.2018.2829706.
[19] K. A. S. Immink and J. H. Weber, “Minimum Pearson Distance
Detection for Multi-Level Channels with Gain and/or Offset Mismatch,”
IEEE Transactions on Information Theory, vol. IT-60, no. 10, pp. 5966-
5974, Oct. 2014, doi: 10.1109/TIT.2014.2342744.
[20] J. H. Weber, T. G. Swart, and K. A. S. Immink, “Simple Sys-
tematic Pearson Coding,” IEEE International Symposium on In-
formation Theory, Barcelona, Spain, pp. 385-389, July 2016, doi:
10.1109/ISIT.2016.7541326.
[21] A. X. Widmer, “Dc-balanced 6B/8B Transmission Codes with Local
Parity,” US Patent 6,876,315, April 2005.
[22] V. A. Reguera, “New RLL code with improved error performance for
visible light communication,” arXiv preprint arXiv:1910.10079, 2019.
We report on a strong capacity boost in storing digital data in synthetic DNA. In principle, synthetic DNA is an ideal media to archive digital data for very long times because the achievable data density and longevity outperforms today's digital data storage media by far. On the other hand, neither the synthesis, nor the amplification and the sequencing of DNA strands can be performed error-free today and in the foreseeable future. In order to make synthetic DNA available as digital data storage media, specifically tailored forward error correction schemes have to be applied. For the purpose of realizing a DNA data storage, we have developed an efficient and robust forwarderror-correcting scheme adapted to the DNA channel. We based the design of the needed DNA channel model on data from a proof-of-concept conducted 2012 by a team from the Harvard Medical School [1]. Our forward error correction scheme is able to cope with all error types of today's DNA synthesis, amplification and sequencing processes, e.g. insertion, deletion, and swap errors. In a successful experiment, we were able to store and retrieve error-free 22 MByte of digital data in synthetic DNA recently. The found residual error probability is already in the same order as it is in hard disk drives and can be easily improved further. This proves the feasibility to use synthetic DNA as longterm digital data storage media.