Preliminaries - Coding over Sets for DNA Storage: Substitution Errors

Chapter V: Coding over Sets for DNA Storage: Substitution Errors

5.2 Preliminaries

To discuss the problem in its most general form and illustrate the ideas, we restrict our attention to binary strings. Most of the techniques in this chapter can be extended to non-binary cases. Generally, we denote scalars by lower-case letters $x, y, \ldots$, vectors by bold symbols $\mathbf{x}, \mathbf{y}, \ldots$, integers by letters such as $k, L, \ldots$, and $[k] \triangleq \{1, 2, \ldots, k\}$.

For integers $M$ and $L$ such that $M \le 2^L$,¹ we denote by $\binom{\{0,1\}^L}{M}$ the family of all subsets of size $M$ of $\{0,1\}^L$, and by $\binom{\{0,1\}^L}{\le M}$ the family of subsets of size at most $M$ of $\{0,1\}^L$. In our channel model, a word is an element $W \in \binom{\{0,1\}^L}{M}$, and a code $\mathcal{C} \subseteq \binom{\{0,1\}^L}{M}$ is a set of words (for clarity, we refer to words in a given code as codewords). To prevent ambiguity with classical coding theoretic terms, the elements in a word $W = \{\mathbf{x}_1, \ldots, \mathbf{x}_M\}$ are referred to as strings. We emphasize that the indexing in $W$ is merely a notational convenience, e.g., by the lexicographic order of the strings, and this information is not available at the decoder.

¹ We occasionally also assume that $M \le 2^{cL}$ for some $0 < c < 1$. This is in accordance with typical values of $M$ and $L$ in contemporary DNA storage prototypes (see Sec. 5.1).

For $k \le ML$, a $k$-substitution error ($k$-substitution, in short) is an operation that changes the values of at most $k$ different positions in a word. Notice that the result of a $k$-substitution is not necessarily an element of $\binom{\{0,1\}^L}{M}$, and might be an element of $\binom{\{0,1\}^L}{T}$ for some $M - k \le T \le M$. This gives rise to the following definition.

Definition 5.2.1. For a word $W \in \binom{\{0,1\}^L}{M}$, a ball $\mathcal{B}_k(W) \subseteq \bigcup_{j=M-k}^{M} \binom{\{0,1\}^L}{j}$ centered at $W$ is the collection of all subsets of $\{0,1\}^L$ that can be obtained by a $k$-substitution in $W$.

Example 5.2.1. For $M = 2$, $L = 3$, $k = 1$, and $W = \{001, 011\}$, we have that $\mathcal{B}_k(W) = \{\{001,011\}, \{101,011\}, \{011\}, \{000,011\}, \{001,111\}, \{001\}, \{001,010\}\}$.
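The ball in Example 5.2.1 can be checked by brute force. The following is a minimal sketch, assuming words are represented as Python sets of equal-length bit strings; the function names `flip` and `ball` are illustrative and do not come from the text. It enumerates every choice of at most $k$ distinct (string, position) pairs, flips those bits, and collects the resulting sets, so strings that coincide after a substitution merge automatically.

```python
from itertools import combinations


def flip(s, i):
    """Return the bit string s with the bit at index i flipped."""
    return s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]


def ball(word, k):
    """Enumerate all sets obtainable from `word` by at most k substitutions.

    `word` is a set of equal-length bit strings. The result is a set of
    frozensets; its members may have fewer strings than `word` when a
    substitution makes two strings equal.
    """
    strings = sorted(word)  # the ordering is only a bookkeeping convenience
    length = len(strings[0])
    positions = [(i, j) for i in range(len(strings)) for j in range(length)]
    result = set()
    for t in range(k + 1):
        for chosen in combinations(positions, t):
            modified = list(strings)
            for i, j in chosen:
                modified[i] = flip(modified[i], j)
            result.add(frozenset(modified))
    return result


if __name__ == "__main__":
    for subset in sorted(ball({"001", "011"}, 1), key=sorted):
        print(sorted(subset))
```

For $W = \{001, 011\}$ and $k = 1$ the enumeration yields seven sets, two of which (namely $\{001\}$ and $\{011\}$) have size $M - 1$. The running time grows quickly with $k$, $M$, and $L$, so the sketch is only suitable for toy parameters.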

In this chapter, we discuss bounds and constructions of codes in $\binom{\{0,1\}^L}{M}$ that can correct $k$ substitutions ($k$-substitution codes, for short), for various values of $k$. The size of a code, which is denoted by $|\mathcal{C}|$, is the number of codewords (that is, sets) in it. The redundancy of the code, a quantity that measures the amount of redundant information that is to be added to the data to guarantee successful decoding, is defined as $r(\mathcal{C}) \triangleq \log \binom{2^L}{M} - \log(|\mathcal{C}|)$, where the logarithms are in base 2.
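As a small numerical illustration of the redundancy definition, the sketch below evaluates $r(\mathcal{C})$ from $L$, $M$, and a code size. The function name `redundancy` and the code size used in the example are hypothetical; the example does not claim that a $k$-substitution code of that size exists.

```python
from math import comb, log2


def redundancy(L, M, code_size):
    """Redundancy in bits: log2 of the number of M-subsets of {0,1}^L
    minus log2 of the number of codewords."""
    total_words = comb(2 ** L, M)  # number of words in the ambient space
    if not 1 <= code_size <= total_words:
        raise ValueError("code_size must lie between 1 and C(2^L, M)")
    return log2(total_words) - log2(code_size)


if __name__ == "__main__":
    # For L = 3 and M = 2 there are C(8, 2) = 28 words in total.
    # A hypothetical code with 7 codewords would have log2(28 / 7) bits
    # of redundancy; using all 28 words gives zero redundancy.
    print(redundancy(3, 2, 7))
    print(redundancy(3, 2, 28))
```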

A code $\mathcal{C}$ is used in our channel as follows. First, the data to be stored (or transmitted) is mapped by a bijective encoding function to a codeword $C \in \mathcal{C}$. This codeword passes through a channel that might introduce up to $k$ substitutions, and as a result a word $W \in \mathcal{B}_k(C)$ is obtained at the decoder. In turn, the decoder applies some decoding function to extract the original data. The code $\mathcal{C}$ is called a $k$-substitution code if the decoding process always recovers the original data successfully. Having settled the channel model, we are now in a position to formally state our contribution.

Theorem 5.2.1. (Main) For any integers $M$, $L$, and $k$ such that $M \le 2^{L/(4k+2)}$, there exists an explicit code construction with redundancy $O(k^2 \log(ML))$ (Sec. 5.6).²

For $k = 1$, the redundancy of this construction is asymptotically at most six times larger than the optimal one (Sec. 5.5), when $L$ goes to infinity and $M \ge 4$. Furthermore, an improved construction for $k = 1$ achieves redundancy which is asymptotically at most three times the optimal one (Appendix 5.8), when $L$ goes to infinity and $4 \le M \le 2^{L/4}$.

² Throughout this chapter, we write $g(n) = O(f(n))$ for any functions $f, g : \mathbb{N} \to \mathbb{R}$ and integer $n$, if $\limsup_{n \to \infty} \frac{g(n)}{f(n)} < \infty$.

Remark 5.2.1. We note that under current technological restrictions, our code constructions apply to a limited parameter range. For example, when $k = 1$ and $L = 100$, our code requires that $M \le 2^{100/6} \approx 10^5$. The upper bound on $M$ becomes stricter as $k$ increases. Yet, our constructions provide the following insights.

First, as the technology advances, it is reasonable to hope that the error rate will decrease and the synthesis length $L$ will increase, in which case we have smaller $k$ and larger $L$. As $k$ gets smaller, the range of $M$ increases exponentially with $L$. Second, while $L$ is limited under current technology, the number of strings $M$ can be chosen more freely. One can get smaller $M$ by storing information in more DNA pools.

Third, the application of this work is not limited to DNA storage. Our techniques apply to other cases such as network packet transmission.
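The parameter range discussed in Remark 5.2.1 follows directly from the condition $M \le 2^{L/(4k+2)}$ of Theorem 5.2.1. The sketch below tabulates this bound for a few values of $k$ and $L$; the function name `max_strings` and the chosen parameter grid are illustrative assumptions, and the bound is evaluated in floating point, which is adequate only for moderate $L$.

```python
def max_strings(L, k):
    """Largest integer M satisfying M <= 2^(L / (4k + 2)).

    Floating-point sketch: adequate for illustration, not for very large L.
    """
    return int(2 ** (L / (4 * k + 2)))


if __name__ == "__main__":
    for L in (100, 200):
        for k in (1, 2, 3):
            print(f"L={L:3d}  k={k}  M <= {max_strings(L, k)}")
```

For $k = 1$ and $L = 100$ the bound is slightly above $10^5$, matching the figure quoted in the remark, and it shrinks rapidly as $k$ grows.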

A few auxiliary notions are used throughout this chapter, and are introduced herein.

For two strings $\mathbf{s}, \mathbf{t} \in \{0,1\}^L$, the Hamming distance $d_H(\mathbf{s}, \mathbf{t})$ is the number of entries in which they differ. To prevent confusion with common terms, a subset of $\{0,1\}^L$ is called a vector-code, and the set $\mathcal{B}^H_D(\mathbf{s})$ of all strings within Hamming distance $D$ or less of a given string $\mathbf{s}$ is called the Hamming ball of radius $D$ centered at $\mathbf{s}$. A linear vector-code is called an $[n, k]_q$ code if the strings in it form a subspace of dimension $k$ in $\mathbb{F}_q^n$, where $\mathbb{F}_q$ is the finite field with $q$ elements.
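The two notions above can be made concrete with a short sketch, assuming strings are represented as Python `str` objects over the alphabet {0, 1}. The names `hamming_distance` and `hamming_ball` are illustrative.

```python
from itertools import combinations


def hamming_distance(s, t):
    """Number of positions in which the equal-length strings s and t differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(s, t))


def hamming_ball(s, D):
    """All binary strings within Hamming distance at most D of s."""
    result = set()
    for r in range(D + 1):
        # Flip every possible choice of r distinct positions.
        for positions in combinations(range(len(s)), r):
            chars = list(s)
            for i in positions:
                chars[i] = "1" if chars[i] == "0" else "0"
            result.add("".join(chars))
    return result


if __name__ == "__main__":
    print(hamming_distance("001", "011"))
    print(sorted(hamming_ball("000", 1)))
```

The ball of radius $D$ contains $\sum_{r=0}^{D} \binom{L}{r}$ strings, so explicit enumeration is practical only for small $L$ and $D$.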

Several well-known vector-codes are used in the sequel, such as Reed-Solomon codes or Hamming codes. For an integer $t$, the Hamming code is a $[2^t - 1, 2^t - t - 1]_2$ code (i.e., there are $t$ redundant bits in every codeword), and its minimum Hamming distance is 3. Reed-Solomon (RS) codes over $\mathbb{F}_q$ exist for every length $n$ and dimension $k$, as long as $q \ge n - 1$ [79, Sec. 5], and require $n - k$ redundant symbols in $\mathbb{F}_q$. Whenever $q$ is a power of two, RS codes can be made binary by representing each element of $\mathbb{F}_q$ as a binary string of length $\log_2(q)$. In the sequel we use this form of RS code, which requires $\log(n)(n - k)$ redundant bits.

Finally, our encoding algorithms make use of combinatorial numbering maps [53], which are functions that map a number to an element in some structured set. Specifically, $F_{com} : \left[\binom{N}{M}\right] \to \{S : S \subset [N], |S| = M\}$ maps a number to a set of distinct elements, and $F_{perm} : [N!] \to S_N$ maps a number to a permutation in the symmetric group $S_N$. The function $F_{com}$ can be computed using a greedy algorithm with complexity $O(MN \log N)$, and the function $F_{perm}$ can be computed in a straightforward manner with complexity $O(N \log N)$. Using $F_{com}$ and $F_{perm}$ together, we define a map $F : \left[\binom{N}{M} M!\right] \to \{S : S \subset [N], |S| = M\} \times S_M$ that maps a number into an unordered set of size $M$ together with a permutation.
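One possible way to realize such numbering maps is sketched below. This is an illustration under stated assumptions, not the algorithm of [53]: `f_com` unranks a subset in lexicographic order, `f_perm` unranks a permutation through its factorial-base (Lehmer) digits, and `f` combines the two. Inputs are indexed from 0 rather than from 1 as in the text, and the function names are hypothetical. No attempt is made to match the complexities quoted above.

```python
from math import comb, factorial


def f_com(index, N, M):
    """Map index in {0, ..., C(N, M) - 1} to a sorted M-subset of {1, ..., N}."""
    if not 0 <= index < comb(N, M):
        raise ValueError("index out of range")
    subset, x, remaining = [], 1, M
    while remaining > 0:
        # Number of completions of the subset in which x is the next element.
        count = comb(N - x, remaining - 1)
        if index < count:
            subset.append(x)
            remaining -= 1
        else:
            index -= count
        x += 1
    return subset


def f_perm(index, N):
    """Map index in {0, ..., N! - 1} to a permutation of {1, ..., N}."""
    if not 0 <= index < factorial(N):
        raise ValueError("index out of range")
    items = list(range(1, N + 1))
    perm = []
    for i in range(N, 0, -1):
        # The leading factorial-base digit selects one of the unused items.
        digit, index = divmod(index, factorial(i - 1))
        perm.append(items.pop(digit))
    return perm


def f(index, N, M):
    """Map index in {0, ..., C(N, M) * M! - 1} to (M-subset, permutation of [M])."""
    set_index, perm_index = divmod(index, factorial(M))
    return f_com(set_index, N, M), f_perm(perm_index, M)


if __name__ == "__main__":
    for i in range(comb(4, 2) * factorial(2)):
        print(i, f(i, 4, 2))
```

Each of the three maps is a bijection onto its range, which is the property the encoding algorithms rely on: distinct numbers are sent to distinct sets, permutations, or pairs.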

Source: Correcting Errors in DNA Storage, California Institute of Technology (pages 145-148)