
The sequence replacement technique converts an input sequence into a constrained sequence in which a prescribed subsequence is forbidden to occur. Several coding algorithms are presented that use this technique for the construction of maximum run-length limited sequences. The proposed algorithms show how all forbidden subsequences can be successively or iteratively removed to obtain a constrained sequence and how special subsequences can be inserted at predefined positions in the constrained sequence to represent the indices of the positions where the forbidden subsequences were removed. Several modifications are presented to reduce the impact of transmission errors on the decoding operation, and schemes to provide error control are discussed as well. The proposed algorithms can be implemented efficiently, and the rates of the constructed codes are close to their theoretical maximum. As such, the proposed algorithms are of interest for storage systems and data networks.
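As a sketch of the idea described above, the following illustrates one simplified variant of sequence replacement for a forbidden run of k+1 zeros; the leading flag bit, the pointer width k, and the function names are illustrative assumptions, not the exact constructions of the paper:

```python
def srt_encode(data: str, k: int) -> str:
    """Remove every run of k+1 zeros; record its position as a k-bit
    pointer prefixed by a '0' flag, so the word length is preserved.
    Requires len(data) + 1 <= 2**k so positions fit in k bits."""
    assert len(data) + 1 <= 2 ** k, "pointer must fit in k bits"
    w = '1' + data                       # leading '1' flags "no replacements"
    forb = '0' * (k + 1)
    while forb in w:
        p = w.index(forb)                # leftmost forbidden run
        w = '0' + format(p, f'0{k}b') + w[:p] + w[p + k + 1:]
    return w

def srt_decode(w: str, k: int) -> str:
    """Undo the replacements in reverse order of their insertion."""
    forb = '0' * (k + 1)
    while w[0] == '0':                   # a '0' flag means one more pointer
        p = int(w[1:k + 1], 2)
        rest = w[k + 1:]
        w = rest[:p] + forb + rest[p:]   # re-insert the removed zero run
    return w[1:]                         # drop the flag bit
```

Each replacement removes k+1 symbols and inserts 1+k, so the length stays n+1; the loop terminates because every recorded pointer contains at least one '1', strictly decreasing the number of zeros.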


... Then, in Section III, we present the main contribution of this work: algorithms for translating arbitrary binary source data into k-constrained q-ary data. Among the three code design methods we describe, the second removes forbidden substrings of q-ary sequences using a recursive 'sequence replacement' method, yielding a significant improvement in coding redundancy over the prior art binary sequence replacement method [9]. In the third method, standard binary maximum runlength limited sequences are transformed into maximum runlength limited q-ary sequences using two simple precoding steps, which opens the door to applying the vast prior art of binary code constructions to DNA-based storage. ...

... The three sequence replacement techniques published by Wijngaarden et al. [9] are recursive methods for removing forbidden substrings from a binary source word. The encoder removes the forbidden substrings; their positions are encoded as binary pointer words and subsequently inserted at predefined positions of the codeword. ...

... Very efficient constructions of binary k′-constrained codes that avoid long repetitions of a 'zero' have been published in the literature; see, for example, the survey in [9]. We show that after applying a simple coding step to a k′-constrained binary sequence, we obtain a strand of nucleotides, where the length of a homopolymer run is at most m = ...

We consider coding techniques that limit the lengths of homopolymer runs in strands of nucleotides used in DNA-based mass data storage systems. We compute the maximum number of user bits that can be stored per nucleotide when a maximum homopolymer runlength constraint is imposed. We describe simple and efficient implementations of coding techniques that avoid the occurrence of long homopolymers, and the rates of the constructed codes are close to the theoretical maximum. The proposed sequence replacement method for k-constrained q-ary data yields a significant improvement in coding redundancy over the prior art sequence replacement method for k-constrained binary data. Using a simple transformation, standard binary maximum runlength limited sequences can be transformed into maximum runlength limited q-ary sequences, which opens the door to applying the vast prior art of binary code constructions to DNA-based storage.
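The simple transformation mentioned above can be illustrated with a minimal differential precoder (an illustrative one-bit-per-nucleotide sketch, not the rate-efficient construction of the paper): bit 1 advances cyclically to the next base, bit 0 repeats the previous one, so a zero-run of length r becomes a homopolymer of length r + 1, and a k′-constrained binary input yields homopolymer runs of at most k′ + 1.

```python
BASES = "ACGT"

def bin_to_dna(bits: str) -> str:
    """Differential precoder: '1' advances to the next base (cyclically),
    '0' repeats the previous base.  The first base is a fixed reference."""
    idx, out = 0, [BASES[0]]
    for b in bits:
        if b == '1':
            idx = (idx + 1) % 4
        out.append(BASES[idx])
    return ''.join(out)

def dna_to_bin(strand: str) -> str:
    """Invert the precoder: equal neighbours decode to '0', unequal to '1'."""
    return ''.join('0' if a == b else '1'
                   for a, b in zip(strand, strand[1:]))
```

For example, the input 1101001 (maximum zero-run 2) maps to a strand whose longest homopolymer has length 3.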

... RLL codes have been applied in practice to various data storage devices, including virtually all magnetic and optical disc recording systems [14]. Over the years, different construction schemes for RLL codes, with varied enhancements, have been proposed and analyzed [15]- [18]. Study of RLL codes has continued to be an important research topic, and recent work includes its application to high density data storage [19], DNA-based storage [20], and visible light communication [21]. ...

... and w = ⌈LB⌉ belongs to the feasible set $A^{E_{\max}}_{\mathrm{SEC}}(B)$, and the proof is complete using (18). ...

... Over the years, different construction approaches with varied enhancements have been proposed for RLL codes [15]-[18]. We remark here that SEC codes are also amenable to efficient implementation via concatenation [35], where the inner code is a heavy-weight code [36] and the outer code is a high-rate code over a large alphabet, such as a Reed-Solomon code [37]. ...

Run-length limited (RLL) codes are a well-studied class of constrained codes with application in diverse areas such as optical and magnetic data recording systems, DNA-based storage, and visible light communication. RLL codes have also been proposed for the emerging area of simultaneous energy and information transfer, where the receiver uses the received signal both for decoding information and for harvesting energy to run its circuitry. In this paper, we show that RLL codes are not the best codes for simultaneous energy and information transfer in terms of the maximum number of codewords that avoid energy outage, i.e., the outage-constrained capacity. Specifically, we show that sliding window constrained (SWC) codes and subblock energy constrained (SEC) codes have significantly higher outage-constrained capacities than RLL codes for moderate to large energy buffer sizes.


... In many cases, the proposed techniques require careful selection of the order of repeat removals and involve a special encoding process for the repeats. Other replacement techniques were investigated in [16], [30], with the goal of imposing runlength or balancing constraints on a string. In these scenarios, removing offending substrings does not cause the introduction of other offending substrings, which makes the underlying problem solution simpler than repeat replacement. ...

... Next, we turn to the problem of designing an efficient encoder for an L-reconstruction code. Our constructive approach is inspired by techniques described in [25] and [30] for removing runs of zeros exceeding a certain length from arbitrary strings. Unlike the known runlength replacement strategy, our approach, repeat replacement, is iterative and may lead to the creation of new repeats in already processed substrings. ...

The problem of reconstructing strings from their substring spectra has a long history and, in its simplest incarnation, asks under which conditions the spectrum uniquely determines the string. We study the problem of coded string reconstruction from multiset substring spectra, where the strings are restricted to lie in some codebook. In particular, we consider binary codebooks that allow for unique string reconstruction and propose a new method, termed repeat replacement, to create the codebook. Our contributions include algorithmic solutions for repeat replacement and constructive redundancy bounds for the underlying coding schemes. We also consider extensions of the problem to noisy settings in which substrings are compromised by burst and random errors. The study is motivated by applications in DNA-based data storage systems that use high-throughput readout sequencers.

... The sequence replacement technique has been widely applied in the literature [9], [20]- [22]. This is an efficient method for removing forbidden substrings from a source word. ...

We propose coding techniques that simultaneously limit the length of homopolymer runs, ensure the GC-content constraint, and are capable of correcting a single edit error in strands of nucleotides in DNA-based data storage systems. In particular, for given ℓ, ϵ > 0, we propose simple and efficient encoders/decoders that transform binary sequences into DNA base sequences (codewords), namely sequences of the symbols A, T, C and G, that satisfy all of the following properties:
• Runlength constraint: the maximum homopolymer run in each codeword is at most ℓ;
• GC-content constraint: the GC-content of each codeword is within [0.5 − ϵ, 0.5 + ϵ];
• Error-correction: each codeword is capable of correcting a single deletion, insertion, or substitution error.
While various combinations of these properties have been considered in the literature, this work provides generalizations of code constructions that satisfy all the properties for arbitrary parameters ℓ and ϵ. Furthermore, for practical values of ℓ and ϵ, we show that our encoders achieve higher rates than existing results in the literature and approach capacity. Our methods have low encoding/decoding complexity and limited error propagation.

... II. DESIGN OF HIGH-RATE K CONSTRAINED CODES Among the many works available in the literature on the design of k constrained codes [9], [10], [11], [12], [13], the nibble replacement method recently proposed in [14] achieves code rates higher than most of the literature, with simple encoders and decoders and limited error propagation. However, the nibble replacement method designs codes in the NRZI format, in which a change in the state of the recording medium corresponds to a channel bit '1' and no change corresponds to a '0'. ...

This paper proposes systematic code design methods for constructing efficient spectrum shaping codes with the maximum runlength limited constraint k, which are widely used in data storage systems for digital consumer electronics products. By shaping the spectrum of the input user data sequence, the codes can effectively circumvent the interaction between the data signal and the servo signal in high-density data storage systems. In particular, we first propose novel methods to design high-rate k constrained codes in the non-return-to-zero (NRZ) format, which not only facilitate timing recovery of the storage system, but also avoid error propagation during decoding and reduce the system complexity. We further propose to combine the Guided Scrambling (GS) technique with the k constrained code design methods to construct highly efficient spectrum shaping k constrained codes. Simulation results demonstrate that the designed codes achieve a significant spectrum shaping effect with only around 1% code rate loss and reasonable computational complexity.
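The Guided Scrambling step can be sketched as follows. The r-bit augmentation, the self-synchronizing x^r + 1-style scrambler, and the running-digital-sum metric below are illustrative assumptions, not the exact choices of the paper: each of the 2^r augmented candidates is scrambled, and the candidate with the least low-frequency content is transmitted.

```python
def rds_metric(word: str) -> int:
    """Sum of squared running digital sums (0 -> -1, 1 -> +1);
    smaller values indicate less low-frequency spectral content."""
    rds, total = 0, 0
    for bit in word:
        rds += 1 if bit == '1' else -1
        total += rds * rds
    return total

def gs_encode(data: str, r: int = 2) -> str:
    """Try all 2^r augmenting prefixes, scramble each candidate with
    s[i] = c[i] XOR s[i-r], and keep the spectrally best candidate."""
    best = None
    for aug in range(2 ** r):
        cand = format(aug, f'0{r}b') + data
        s = []
        for i, c in enumerate(cand):
            s.append(int(c) ^ (s[i - r] if i >= r else 0))
        word = ''.join(map(str, s))
        if best is None or rds_metric(word) < rds_metric(best):
            best = word
    return best

def gs_decode(word: str, r: int = 2) -> str:
    """Descramble (c[i] = s[i] XOR s[i-r]) and drop the r-bit prefix."""
    s = [int(c) for c in word]
    cand = [s[i] ^ (s[i - r] if i >= r else 0) for i in range(len(s))]
    return ''.join(map(str, cand))[r:]
```

The decoder needs no knowledge of which candidate was selected, since descrambling is the same operation for every prefix.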

... Then, the polarities of the extrinsic LLRs of the bit positions corresponding to the non-zero bits in q are reversed in the LLR adjuster before being passed to the decoder and the equalizer. [Figure legend: (A1) Non-flipped system using a non-reset LDPC decoder; (A2) Non-flipped system using a reset LDPC decoder; (B1) Flipped system using the Soft-O method and a non-reset LDPC decoder; (B2) Flipped system using the Soft-O method and a reset LDPC decoder; (C1) Flipped system using the Soft-I method and a non-reset LDPC decoder; (C2) Flipped system using the Soft-I method and a reset LDPC decoder; (D1) Flipped system using the Soft-II method and a non-reset LDPC decoder; (D2) Flipped system using the Soft-II method and a reset LDPC decoder; (E1) Flipped system using the Soft-III method and a non-reset LDPC decoder; (E2) Flipped system using the Soft-III method and a reset LDPC decoder.] ...

In this paper, a low-density parity-check (LDPC) coded recording system is investigated, for which the run-length-limited (RLL) constraint is satisfied by deliberate flipping at the write side and by estimating the flipped bits at the read side. Two approaches are proposed for enhancing the error performance of such a system. The first approach is to alleviate the negative effect of incorrect estimation of the flipped bits by adjusting the soft information. The second approach is to increase the likelihood of the correct detection of flipped bits by designing a flipped-bit detection algorithm that utilizes both the RLL constraint and the parity-check constraint of the LDPC code. These two approaches can be combined to obtain significant improvement in performance over previously proposed methods.

... Kautz [4] was probably the first to present a simple algorithmic method, called enumerative encoding, for translating user words into -constrained codewords and vice versa. Wijngaarden and Immink presented various codes of rate , where subsequences that violate the maximum runlength are iteratively removed to obtain a -constrained sequence [5]. ...

In this paper, we will present coding techniques for the character-constrained channel, where information is conveyed using q-bit characters (nibbles), and where w prescribed characters are disallowed. Using codes for the character-constrained channel, we present simple and systematic constructions of high-rate binary maximum runlength constrained codes. The new constructions have the virtue that large lookup tables for encoding and decoding are not required. We will compare the error propagation performance of codes based on the new construction with that of prior art codes.
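The character-constrained channel admits a simple base-conversion view: treat the message as an integer and expand it in base 2^q − w over the allowed characters. The following sketch illustrates this idea; the function names, the padding convention, and the fixed-length handling are illustrative assumptions, not the systematic constructions of the paper.

```python
def cc_encode(msg_bits: str, banned: set, q: int = 4, length: int = None):
    """Write the message integer in base 2^q - w, mapping each digit
    to one of the allowed q-bit characters (nibbles)."""
    allowed = [c for c in range(2 ** q) if c not in banned]
    base = len(allowed)
    value = int(msg_bits, 2)
    digits = []
    while value:
        value, r = divmod(value, base)
        digits.append(allowed[r])
    while length and len(digits) < length:
        digits.append(allowed[0])        # pad to a fixed codeword length
    return digits[::-1]

def cc_decode(chars, banned, q: int = 4, nbits: int = None) -> str:
    """Invert the base conversion; nbits restores leading message zeros."""
    allowed = [c for c in range(2 ** q) if c not in banned]
    index = {c: i for i, c in enumerate(allowed)}
    value = 0
    for c in chars:
        value = value * len(allowed) + index[c]
    return format(value, f'0{nbits}b') if nbits else bin(value)[2:]
```

With q = 4 and the all-zero and all-one nibbles banned (w = 2), every codeword character lies in {1, …, 14}, so runs of more than seven identical channel bits cannot occur across nibble boundaries.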

... Example 1: Using the binary sequence replacement technique [23], we encode binary source sequences of length n ≤ 65 into binary codewords of length n + 1 that have a maximum runlength m = 6. Using a prefix of two bits plus one interfix bit, we balance the m-constrained word into a nearly balanced (m = 6)-constrained word of length n + 4. ...

We describe properties and constructions of constraint-based codes for DNA-based data storage that account for the maximum repetition length and the AT balance. We present algorithms for computing the number of sequences satisfying a maximum repetition length and an AT-balance constraint. We present efficient routines for translating binary runlength limited and/or balanced strings into DNA strands. We show that the implementation of AT-balanced codes is straightforwardly accomplished with binary balanced codes. We present codes that account for both the maximum repetition length and the AT balance.

... One of the interesting approaches is maximum RLL coding. Codes from this group are basically RLL(0,k) techniques which eliminate predefined, unwanted sequences from the output stream of symbols (Van Wijngaarden and Immink, 2010). The (1,7) code has a coding rate of R = 2/3 and achieves a better density rate of DR = 1.33. ...

A comprehensive study of (d,k) sequences is presented, complemented with the design of a new, efficient Run-Length Limited (RLL) code. The new code belongs to the group of constrained coding schemes, with a coding rate of R = 2/5 and a minimum run length between two successive transitions equal to 4. The presented RLL (4, ∞) code uses the channel capacity highly efficiently, at 98.7%, and consequently achieves a high density rate of DR = 2.0, implying that two bits can be recorded or transmitted with one transition. Coding techniques based on the presented constraints and the selected coding rate have better efficiency than many other currently used codes for high-density optical recording and transmission.
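The quoted efficiency can be checked against the Shannon capacity of the (d, ∞) constraint, which equals log₂ λ, where λ is the largest real root of the standard recurrence characteristic x^(d+1) = x^d + 1. A small sketch (the bisection bounds are assumptions that hold for all d ≥ 1):

```python
import math

def rll_capacity(d: int) -> float:
    """Shannon capacity log2(lambda) of the (d, inf) RLL constraint,
    with lambda the largest real root of x^(d+1) = x^d + 1 in (1, 2)."""
    f = lambda x: x ** (d + 1) - x ** d - 1
    lo, hi = 1.0, 2.0                      # f(1) < 0 < f(2)
    for _ in range(100):                   # bisection to machine precision
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return math.log2((lo + hi) / 2)

cap = rll_capacity(4)
eff = (2 / 5) / cap
print(f"capacity of (4, inf): {cap:.4f}; efficiency of R = 2/5: {eff:.1%}")
```

For d = 4 the capacity is about 0.4057 bits per channel bit, so the rate R = 2/5 attains roughly 98.6% of capacity, close to the 98.7% figure quoted above.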

... Hence, x ∈ $B_{4k}(n)$. Now, we may modify the sequence replacement techniques [5], [20] to encode for the restricted-sum-balanced constraint. ...

An indel refers to a single insertion or deletion, while an edit refers to a single insertion, deletion or substitution. In this paper, we investigate codes that combat either a single indel or a single edit and provide linear-time algorithms that encode binary messages into these codes of length n. Over the quaternary alphabet, we provide two linear-time encoders. One corrects a single edit with ⌈log n⌉ + O(log log n) redundancy bits, while the other corrects a single indel with ⌈log n⌉ + 2 redundant bits. These two encoders are order-optimal. The former encoder is the first known order-optimal encoder that corrects a single edit, while the latter encoder (which corrects a single indel) reduces the redundancy of the best known encoder of Tenengolts (1984) by at least four bits. Over the DNA alphabet, we impose an additional constraint, the GC-balanced constraint, and require that exactly half of the symbols of any DNA codeword be either C or G. In particular, via a modification of Knuth's balancing technique, we provide a linear-time map that translates binary messages into GC-balanced codewords, and the resulting codebook is able to correct a single indel or a single edit. These are the first known constructions of GC-balanced codes that correct a single indel or a single edit.


... The sequence replacement technique has been widely used in the literature [8], [15]- [17]. This is an efficient method for removing forbidden substrings from a source word. ...

We propose coding techniques that limit the length of homopolymer runs, ensure the GC-content constraint, and are capable of correcting a single edit error in strands of nucleotides in DNA-based data storage systems. In particular, for given $\ell, \epsilon > 0$, we propose simple and efficient encoders/decoders that transform binary sequences into DNA base sequences (codewords), namely sequences of the symbols A, T, C and G, that satisfy the following properties: (i) Runlength constraint: the maximum homopolymer run in each codeword is at most $\ell$; (ii) GC-content constraint: the GC-content of each codeword is within $[0.5-\epsilon, 0.5+\epsilon]$; (iii) Error-correction: each codeword is capable of correcting a single deletion, or single insertion, or single substitution error. For practical values of $\ell$ and $\epsilon$, we show that our encoders achieve much higher rates than existing results in the literature and approach the capacity. Our methods have low encoding/decoding complexity and limited error propagation.

... We now use the sequence replacement technique to construct W(m, L, [p₁L, p₂L]), where L = (1/c²) logₑ n and c = min{1/2 − p₁, p₂ − 1/2}. The sequence replacement technique has been widely used in the literature [17]-[20]. This is an efficient method for removing forbidden substrings from a source word. ...

The subblock energy-constrained codes (SECCs) have recently attracted attention due to various applications in communication systems, such as simultaneous energy and information transfer. In a SECC, each codeword is divided into smaller subblocks, and every subblock is constrained to carry sufficient energy. In this work, we study SECCs under more general constraints, namely bounded SECCs and sliding-window constrained codes (SWCCs), and propose two methods to construct such codes with low redundancy and linear-time complexity, based on Knuth's balancing technique and the sequence replacement technique. For certain code parameters, our methods incur only one redundant bit.
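Knuth's balancing technique, referenced above, can be sketched as follows. This is a minimal linear-search variant for illustration; in a complete construction the balancing index i would itself be transmitted in a short (balanced) prefix.

```python
def knuth_balance(word: str):
    """Find an index i such that inverting the first i bits yields a
    balanced word (equal numbers of 0s and 1s).  The prefix weight
    changes by exactly 1 per step, so such an i always exists for
    even-length inputs.  Returns (i, balanced_word)."""
    n = len(word)
    assert n % 2 == 0, "balancing requires even length"
    for i in range(n + 1):
        cand = ''.join('1' if b == '0' else '0' for b in word[:i]) + word[i:]
        if cand.count('1') == n // 2:
            return i, cand
    raise AssertionError("unreachable: a balancing index always exists")

def knuth_restore(i: int, cand: str) -> str:
    """Invert the first i bits again to recover the original word."""
    return ''.join('1' if b == '0' else '0' for b in cand[:i]) + cand[i:]
```

The search costs O(n) flips here; the classical scheme spends only ⌈log₂ n⌉ redundant bits to convey i.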

... A weaker bound than the one in Theorem 8 for σ = 2 was given in [13] (Theorem 13). Finally, to encode the (b, k)-constrained de Bruijn code efficiently with only a single symbol of redundancy, we may use sequence replacement techniques [62]. ...

The de Bruijn graph, its sequences, and their various generalizations have found many applications in information theory, including many new ones in the last decade. In this paper, motivated by a coding problem for emerging memory technologies, a set of sequences which generalize sequences in the de Bruijn graph are defined. These sequences can also be defined and viewed as constrained sequences. Hence, they will be called constrained de Bruijn sequences, and a set of such sequences will be called a constrained de Bruijn code. Several properties and alternative definitions for such codes are examined, and they are analyzed as generalized sequences in the de Bruijn graph (and its generalization) and as constrained sequences. Various enumeration techniques are used to compute the total number of sequences for any given set of parameters. A construction method for such codes from the theory of shift-register sequences is proposed. Finally, we show how these constrained de Bruijn sequences and codes can be applied in constructions of codes for correcting synchronization errors in the $\ell$-symbol read channel and in the racetrack memory channel. For this purpose, these codes are superior in size to previously known codes.


... Our method is based on the sequence replacement technique. The sequence replacement technique has been widely used in the literature [23]- [26]. It is an efficient method for removing forbidden windows from a source word. ...

The subblock energy-constrained codes (SECCs) and sliding window-constrained codes (SWCCs) have recently attracted attention due to various applications in communication systems, such as simultaneous energy and information transfer. In a SECC, each codeword is divided into smaller non-overlapping windows, called subblocks, and every subblock is constrained to carry sufficient energy. In a SWCC, the energy constraint is enforced over every window. In this work, we focus on the binary channel, where sufficient energy is achieved theoretically by using relatively high-weight codes, and study SECCs and SWCCs under more general constraints, namely bounded SECCs and bounded SWCCs. We propose two methods to construct such codes with low redundancy and linear-time complexity, based on Knuth's balancing technique and the sequence replacement technique. For certain code parameters, our methods incur only one redundant bit. We also impose a minimum distance constraint for error-correction capability of the designed codes, which helps to reduce error propagation during decoding as well.
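The bounded sliding-window constraint described above can be verified in linear time with a running window weight. The following validator is an illustrative sketch (names and parameters are assumptions); it checks the constraint itself rather than performing the encoding.

```python
def is_swcc(word: str, L: int, wmin: int, wmax: int) -> bool:
    """Bounded sliding-window check: every window of L consecutive bits
    must carry a weight (number of 1s) within [wmin, wmax]."""
    weight = word[:L].count('1')
    if not (wmin <= weight <= wmax):
        return False
    for i in range(L, len(word)):
        # slide the window by one position in O(1)
        weight += (word[i] == '1') - (word[i - L] == '1')
        if not (wmin <= weight <= wmax):
            return False
    return True
```

A subblock (SECC) check is the special case where only the non-overlapping windows at positions 0, L, 2L, … are tested.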

... The decoding algorithm is described in [27]. ...

Maximum run-length limited codes are constrained codes used in communication and data storage systems. Insertion/deletion correcting codes correct insertion or deletion errors caused in transmitted sequences and are used for combating synchronization errors. This paper investigates maximum run-length limited single insertion/deletion correcting (RLL-SIDC) codes. More precisely, we construct efficiently encodable and decodable RLL-SIDC codes. Moreover, we present the encoding algorithm and show the redundancy of the code.

The design of a new Run Length Limited (RLL) code is presented. The new coding scheme has a coding rate of R = 2/5, with the minimum runlength between two successive transitions equal to 4. This RLL (4, ∞) code uses the channel capacity extremely efficiently, at 98.7%, and consequently achieves a density rate of DR = 2.0. It has better efficiency than many other currently used codes for high-density optical recording or transmission.

Recent years have seen a fundamental shift in the way that mobile applications are delivered to users. Developers are increasingly moving away from custom deployment approaches towards the use of platform markets for advertising and distributing their applications. Application developers use the platform to manage distribution and payment for applications. In return, the application developer pays a fixed and/or variable fee to the platform provider. Platform providers benefit from the availability of quality applications necessary to attract and retain end users on the platform. In this paper we present results from an original survey. We find that while overall willingness to pay for applications remains low, consumers are willing to pay for key apps which are perceived to significantly enhance everyday life. We then discuss opportunities for developers to increase cooperation with platform providers in order to enhance value creation, value delivery and value capture.

We suggest an approach to constructing low-redundancy RLL (d, k)-codes whose complexity does not depend on the code length and is determined solely by the achievable redundancy r, the time and space complexity being O(log²(1/r)) and O(log(1/r)), respectively, as r → 0. First we select codewords whose combinations may constitute all (d, k)-constrained sequences of any length. Then we use arithmetic decoding to produce these codewords with optimal (or close to optimal) probabilities from an input sequence. The coding algorithms and estimates of performance are provided.

Designing run-length limited (RLL) codes for visible light communication systems must account for multiple performance factors, including spectral efficiency, power efficiency, DC balance, and flicker avoidance. This paper reports a new class of enhanced Miller codes, termed eMiller codes, which are capable of achieving highly desirable performance on all of these accounts. An improved Viterbi algorithm (VA), termed $mn$ VA, is developed to help further enhance the performance of eMiller codes by preserving multiple candidate sequences at each decoding stage. This performance-enhancing algorithm introduces little complexity increase compared to the original VA. Analysis of flicker control, power spectral density and minimum Hamming distance demonstrates the all-around wellness of these new codes. Extensive simulations are carried out to evaluate eMiller codes by themselves and in practical VLC systems. It is shown that the original VA already allows eMiller codes to deliver a performance noticeably better than conventional Miller and FM0/FM1 codes (and on par with Manchester codes). This result is particularly exciting, as eMiller codes are also more spectrally efficient than Manchester codes. The $mn$ VA further allows eMiller codes to surpass Manchester codes and 4B6B codes in practical RS-coded VLC systems. Simulation results confirm the superb performance of the RS-eMiller schemes.

For a recording system that has a run-length-limited (RLL) constraint, this approach imposes hard errors by flipping bits before recording. A high error-coding rate limits the capability of correcting the RLL bit errors. Since iterative decoding does not include the estimation technique, it has the potential capability of resolving the hard error bits within several iterations in an LDPC coded system. In this paper, we implement density evolution (DE) and the differential evolution approach to provide a performance evaluation of unequal error protection (UEP) LDPC codes and to investigate the optimal LDPC code distribution for an RLL flipped system.



In this paper, we first propose coding techniques for DNA-based data storage which account for the maximum homopolymer runlength and the GC-content. In particular, for arbitrary $\ell, \epsilon > 0$, we propose simple and efficient $(\epsilon, \ell)$-constrained encoders that transform binary sequences into DNA base sequences (codewords) that satisfy the following properties: • Runlength constraint: the maximum homopolymer run in each codeword is at most $\ell$; • GC-content constraint: the GC-content of each codeword is within $[0.5 - \epsilon, 0.5 + \epsilon]$. For practical values of $\ell$ and $\epsilon$, our codes achieve higher rates than the existing results in the literature. We further design efficient $(\epsilon, \ell)$-constrained codes with error-correction capability. Specifically, the designed codes satisfy the runlength constraint, the GC-content constraint, and can correct a single edit (i.e., a single deletion, insertion, or substitution) and its variants. To the best of our knowledge, no such codes were constructed prior to this work.
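The two constraints above translate into a simple predicate; the following sketch (function name is mine, assuming strings over the alphabet {A, C, G, T}) checks both:

```python
def satisfies_constraints(codeword, eps, ell):
    """Check the (eps, ell) constraints described above: no homopolymer
    run longer than ell, and GC-content within [0.5 - eps, 0.5 + eps]."""
    # Runlength constraint: longest run of identical symbols <= ell.
    run = 1
    for a, b in zip(codeword, codeword[1:]):
        run = run + 1 if a == b else 1
        if run > ell:
            return False
    # GC-content constraint: fraction of G or C symbols near one half.
    gc = sum(c in "GC" for c in codeword) / len(codeword)
    return 0.5 - eps <= gc <= 0.5 + eps
```

An encoder meeting the claims of the paper guarantees this predicate holds for every codeword it emits.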

We study simple and systematic constructions of high-rate binary maximum-runlength constrained codes based on base conversion, in which specific subsequences are disallowed. We compare the error-propagation performance of these base-change codes with that of prior art codes.

Preface to the Second Edition
About five years after the publication of the first edition, it was felt that an update of this text would be inescapable as so many relevant publications, including patents and survey papers, have been published. The author's principal aim in writing the second edition is to add the newly published coding methods, and discuss them in the context of the prior art. As a result about 150 new references, including many patents and patent applications, most of them younger than five years old, have been added to the former list of references. Fortunately, the US Patent Office now follows the European Patent Office in publishing a patent application after eighteen months of its first application, and this policy clearly adds to the rapid access to this important part of the technical literature. I am grateful to many readers who have helped me to correct (clerical) errors in the first edition and also to those who brought new and exciting material to my attention. I have tried to correct every error that I found or was brought to my attention by attentive readers, and seriously tried to
avoid introducing new errors in the Second Edition.
China is becoming a major player in the design, construction, and basic research of electronic storage systems. A Chinese translation of the first edition was published in early 2004. The author is indebted to prof. Xu, Tsinghua University, Beijing, for taking the initiative for this Chinese version, and also to Mr. Zhijun Lei, Tsinghua University, for undertaking the arduous task of translating this book from English to Chinese. Clearly, this translation makes it possible that a billion more people will now have access to it.
Kees A. Schouhamer Immink
Rotterdam, November 2004

General construction methods of prefix synchronized codes and
runlength limited codes are presented, which make use of so-called
sequence replacement techniques. These techniques provide a simple and
efficient conversion of data words into codewords of a constrained
block-code, where subsequences violating the imposed constraints are
replaced by encoded information to indicate their relative positions in
the data word. Several constructions are proposed for constrained codes
with low error propagation, and for variable length constrained codes.
The coding algorithms have a low computational and hardware complexity.
The rate of the constructed codes approaches the theoretical maximum. It
is feasible to apply these high rate constrained block codes in
communication and recording systems.
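The replacement idea can be illustrated with a toy binary scheme (illustrative only; the published constructions use more careful pointer formats to guarantee termination and the stated rates): delete an occurrence of the forbidden string and prepend a same-length pointer block recording its position.

```python
def sr_encode(data, F="0000"):
    """Toy sequence-replacement encoder: prepend a '1' flag bit, then
    repeatedly delete the first occurrence of the forbidden string F
    and prepend a same-length pointer block '0' + binary(position + 1).
    Encoding position + 1 keeps the pointer block distinct from F."""
    m = len(F)
    word = "1" + data
    while F in word:
        p = word.index(F)
        pointer = "0" + format(p + 1, f"0{m - 1}b")
        word = pointer + word[:p] + word[p + m:]
    return word

def sr_decode(word, F="0000"):
    """Undo the replacements in reverse (last-prepended-first) order."""
    m = len(F)
    while word[0] == "0":          # a pointer block is present
        p = int(word[1:m], 2) - 1  # recover the deletion position
        word = word[m:]
        word = word[:p] + F + word[p:]
    return word[1:]                # strip the '1' flag bit
```

Note that the word length never changes: each replacement removes len(F) symbols and prepends exactly len(F) pointer symbols, which is the property that makes this a fixed-length block code.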

We present advanced combinatorial techniques for constructing
maximum runlength-limited (RLL) block codes and maximum transition run
(MTR) codes. These codes find widespread application in recording
systems. The proposed techniques are used to construct a high-rate
multipurpose modulation code for recording systems. The code, a rate-16/17
(0,3,2,2) MTR code that also fulfills (0,15,9,9) RLL constraints,
is a high-rate distance-enhancing code with additional constraints for
improving timing and gain control. The encoder and decoder have a
particularly efficient architecture and allow an instantaneous
translation of 16-bit source words into 17-bit codewords and vice versa.
The code has been implemented in Lucent read-channel chips and has
excellent performance.

Efficient encoding algorithms are presented for two types of
constraints on two-dimensional binary arrays. The first constraint
considered is that of t-conservative arrays, where each row and each
column has at least t transitions of the form '0'→'1' or
'1'→'0'. The second constraint is that of two-dimensional DC-free
arrays, where in each row and each column the number of '0's equals the
number of '1's.
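The t-conservative constraint, for instance, is straightforward to verify; a minimal checker (function name is mine):

```python
def is_t_conservative(array, t):
    """Check the t-conservative property: every row and every column of
    a binary 2-D array has at least t transitions 0->1 or 1->0."""
    def transitions(seq):
        return sum(a != b for a, b in zip(seq, seq[1:]))
    rows = array
    cols = list(zip(*array))   # transpose to iterate over columns
    return all(transitions(line) >= t for line in rows + cols)
```

The encoding algorithms of the paper are the harder direction: mapping arbitrary data onto arrays that pass this check with low redundancy.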

A basic theory of frame synchronization for a single-channel digital communication system is presented, along with extensive references to the literature. The design of frame markers is discussed and comparisons are drawn with more exotic techniques such as comma-free coding.

This paper studies several topics concerning the way strings can overlap. The key notion of the correlation of two strings is introduced, which is a representation of how the second string can overlap into the first. This notion is then used to state and prove a formula for the generating function that enumerates the q-ary strings of length n which contain none of a given finite set of patterns. Various generalizations of this basic result are also discussed. This formula is next used to study a wide variety of seemingly unrelated problems. The first application is to the nontransitive dominance relations arising out of a probabilistic coin-tossing game. Another application shows that no algorithm can check for the presence of a given pattern in a text without examining essentially all characters of the text in the worst case. Finally, a class of polynomials arising in connection with the main result are shown to be irreducible.
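The correlation of two strings can be computed directly from its definition; a minimal sketch (assuming the Guibas-Odlyzko convention that bit i records whether the second string, shifted right by i positions, agrees with the first on their overlap):

```python
def correlation(x, y):
    """Correlation of x over y: bit i is 1 iff the suffix x[i:] is a
    prefix of y, i.e. y shifted right by i overlaps the end of x."""
    return [int(y.startswith(x[i:])) for i in range(len(x))]
```

For example, the autocorrelation of "aba" is [1, 0, 1]: the string matches itself at shift 0, fails at shift 1, and overlaps again at shift 2. This vector is exactly the ingredient the paper's generating-function formula consumes.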

We present a systematic procedure for mapping data sequences into
codewords of a prefix-synchronized code (PS-code), as well as for
performing the inverse mapping. A PS-code, proposed by Gilbert (1960),
belongs to a subclass of comma-free codes and is useful to recover word
synchronization when errors have occurred in the stream of codewords. A
PS-code is defined as a set of codewords with the property that each
codeword has a known sequence as a prefix, followed by a coded data
sequence in which this prefix is not allowed to occur. The largest
PS-code among all PS-codes of the same code length is called a maximal
prefix-synchronized code (MPS-code). We develop an encoding and decoding
algorithm for Gilbert's MPS-code with a prefix of the form 11...10 and
extend the algorithm to the class of PS-codes whose prefix is
self-uncorrelated. The computational complexity of the entire mapping
process is proportional to the length of the codewords.

Thesis (doctoral)--Universität Essen, 1998.

We present new methods to protect maximum run-length constrained
sequences against random and burst errors and to avoid error
propagation. Specific parallel conversion techniques and enumerative
coding algorithms for the transformation of binary user information into
constrained codewords are proposed. The new schemes are simple and very
efficient. The methods can be used for synchronization in communication
systems and for modulation coding in magnetic and optical recording
systems.

New combinatorial construction techniques are proposed which
convert binary user information into a (0,k) constrained sequence having
the virtue that at most k `zeroes' between logical `ones' will occur. In
this way sequences are constructed which have a limited runlength. These
codes find application in optical and magnetic recording systems. The
new construction methods provide efficient, high rate codes with a low
complexity. The low-complexity combinatorial structure of the encoder and
the decoder ensure a very fast and efficient parallel conversion of
binary information to codewords and vice versa. Specifically, we present
the combinatorial structures to convert 16 data bits into a 17 bit
constrained sequence to obtain an optimum (0,4) code, a (0,6) code with
at most one byte error propagation, and a (0,6/6)-code, respectively.
Serious error propagation is avoided by using constrained codes with
several unconstrained positions, which are reserved to store the parity
bits of an error control code which protects the constrained codeword

In digital recorders, the coded information is commonly grouped in
large blocks, called frames. The authors concentrate on the frame
synchronization problem of run-length-limited sequences, or (d, k)
sequences. They commence with a brief description of (d, k)-constrained
sequences, and proceed with the examination of the channel capacity. It
is shown that for certain sync patterns, called repetitive-free sync
patterns, the capacity can be formulated in a simple manner as it is
solely a function of the (d, k) parameters and the length of the sync
pattern. For each forbidden pattern and (d, k) constraints, methods
for enumerating constrained sequences are given. Design considerations
of schemes for encoding and decoding are addressed. Examples of
prefix-synchronized (d, k) codes, based for the purpose of illustration
on the sliding-block coding algorithm, are presented.

Many of the types of modulation codes designed for use in storage devices using magnetic recording are discussed. The codes are intended to minimize the negative effects of intersymbol interference. The channel model is first presented. The peak detection systems used in most commercial disk drives are described, as are the run-length-limited (d, k) codes they use. Recently introduced recording channel technology based on sampling detection-partial-response (or PRML) is then considered. Several examples are given to illustrate that the introduction of partial response equalization, sampling detection, and digital signal processing has set the stage for the invention and application of advanced modulation and coding techniques in future storage products.

We introduce the fixed-rate bit stuff (FRB) algorithm for efficiently encoding and decoding maximum-runlength-limited (MRL) sequences. Our approach is based on a simple, variable-rate technique called bit stuffing. Bit stuffing produces near-capacity achieving codes for a wide range of constraints, but encoding is variable-rate, which is unacceptable in most applications. In this work, we design near-capacity fixed-rate codes using a three-step procedure. The fixed-length input data block first undergoes iterative preprocessing, followed by variable-rate bit stuffing, and finally dummy-bit padding to a fixed output length. The iterative preprocessing is key to achieving high encoding rates. We discuss rate computation for the proposed FRB algorithm and show that the asymptotic (in input block length) encoding rate is close to the average rate of the variable-rate bit stuff code. Then, we proceed to explore the effect of decreasing/increasing the number of preprocessing iterations. Finally, we derive a lower bound on the encoding rate with finite-length input blocks and tabulate the parameters required to design FRB codes with rate close to 100/101 and 200/201.
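The variable-rate bit-stuffing step at the heart of the scheme can be sketched as follows (a toy version for the maximum-runlength constraint only, not the full FRB pipeline with preprocessing and padding; function names are mine):

```python
def bit_stuff(bits, k):
    """Variable-rate bit stuffing: after every run of k consecutive 0s,
    insert a 1 so that no run of more than k zeros appears."""
    out, run = [], 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == 0 else 0
        if run == k:
            out.append(1)  # stuffed bit, carries no information
            run = 0
    return out

def bit_unstuff(bits, k):
    """Inverse: drop the 1 that must follow every run of k zeros."""
    out, run, i = [], 0, 0
    while i < len(bits):
        b = bits[i]
        out.append(b)
        run = run + 1 if b == 0 else 0
        if run == k:
            i += 1  # skip the stuffed 1
            run = 0
        i += 1
    return out
```

The output length depends on the data, which is exactly the variable-rate behavior that the FRB preprocessing and padding steps are designed to tame.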

Let S be a given subset of binary n-sequences. We provide an explicit scheme for calculating the index of any sequence in S according to its position in the lexicographic ordering of S. A simple inverse algorithm is also given. Particularly nice formulas arise when S is the set of all n-sequences of weight k and also when S is the set of all sequences having a given empirical Markov property. Schalkwijk and Lynch have investigated the former case. The envisioned use of this indexing scheme is to transmit or store the index rather than the sequence, thus resulting in a data compression of (log |S|)/n.
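The weight-k case admits the following well-known binomial-coefficient scheme (a sketch of Cover's enumerative indexing; function names are mine):

```python
from math import comb

def enum_index(x):
    """Lexicographic index of binary sequence x among all sequences of
    the same length and Hamming weight. At each 1, count the sequences
    that instead place a 0 there and keep the same prefix."""
    n, r = len(x), sum(x)
    idx = 0
    for j, bit in enumerate(x):
        if bit:
            idx += comb(n - j - 1, r)  # sequences with a 0 at position j
            r -= 1
    return idx

def enum_unindex(n, k, idx):
    """Inverse: recover the length-n, weight-k sequence with this index."""
    x, r = [], k
    for j in range(n):
        c = comb(n - j - 1, r)  # sequences that put a 0 at position j
        if idx >= c:
            x.append(1)
            idx -= c
            r -= 1
        else:
            x.append(0)
    return x
```

For n = 3, k = 1 the lexicographic order is 001, 010, 100 with indices 0, 1, 2, and both directions run in time linear in n (given the binomial coefficients).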

A new coding technique is proposed that translates user
information into a constrained sequence using very long codewords. Huge
error propagation resulting from the use of long codewords is avoided by
reversing the conventional hierarchy of the error control code and the
constrained code. The new technique is exemplified by focusing on (d,
k)-constrained codes. A storage-effective enumerative encoding scheme is
proposed for translating user data into long dk sequences and vice
versa. For dk runlength-limited codes, estimates are given of the
relationship between coding efficiency versus encoder and decoder
complexity. We show that for most common d, k values, a code rate of
less than 0.5% below channel capacity can be obtained by using hardware
mainly consisting of a ROM lookup table of size 1 kbyte. For selected
values of d and k, the size of the lookup table is much smaller. The
paper is concluded by an illustrative numerical example of a rate
256/466, (d=2, k=15) code, which provides a serviceable 10% increase in
rate with respect to its traditional rate 1/2, (2, 7) counterpart.