Chapter IV: Correcting Deletions/Insertions-Generalizations
4.9 Appendix: Generation of Random Seed $z$ with Size $O(\log n)$
In the following, we prove that we can generate strings in the $k$-mixed string set $\mathcal{M}(n,k)$ using $O(\log n)$ random bits. The proof follows similar steps to those in [22]. We first recall the definition of the $k$-mixed string set
$$\mathcal{M}(n,k) = \{\mathbf{x} \in \{0,1\}^n : \text{for the integers } \ell = \lceil \log k + \log\log(k+1) + 5 \rceil \text{ and } d = O(k(\log k)^2 \log n), \text{ and for any string } \mathbf{p} \in \{0,1\}^\ell, \text{ every substring of } d \text{ consecutive bits in } \mathbf{x} \text{ contains } \mathbf{p} \text{ as a substring}\}.$$
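For concreteness, membership in a mixed string set of this form can be checked directly by enumerating all patterns and sliding a window over the string. The following Python sketch does exactly that; the values of $\ell$ and $d$ passed to it below are small toy choices for illustration, not the ones prescribed by the definition.

```python
from itertools import product

def in_mixed_set(x: str, ell: int, d: int) -> bool:
    """Check the mixed-string-set condition: every window of d
    consecutive bits of x must contain every pattern in {0,1}^ell
    as a substring. (ell and d are toy parameters here; the chapter
    sets them as functions of n and k.)"""
    patterns = ["".join(p) for p in product("01", repeat=ell)]
    for start in range(len(x) - d + 1):
        window = x[start:start + d]
        if any(p not in window for p in patterns):
            return False
    return True
```

For instance, the periodic string $0110\,0110\,\ldots$ satisfies the condition for $\ell = 2$, $d = 8$, while the all-zero string does not.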
We now proceed to prove the following lemma, which appears in Sec. 4.3 as Lemma 4.3.2.
Lemma 4.9.1. There exists a $\mathrm{poly}(n)$-time algorithm that finds a seed $\mathbf{s}$ of length $O(\log n)$ such that for any $\mathbf{u} \in \mathbb{F}_2^n$, $\mathbf{u} + z(\mathbf{s}) \in \mathcal{M}(n,k)$.
Split the string $\mathbf{u}$ into $t = n/B$ blocks $\mathbf{u}_i$, $i \in [t]$, of length $B$. Split each block $\mathbf{u}_i$ into $b_0 = O(\log n)$ subblocks $\mathbf{u}_{i,j}$, $j \in [b_0]$, of length $B_0 = B/b_0 = O(k(\log k)^2)$. Using the random seed $\mathbf{s}$, we generate the same mask for each block $\mathbf{u}_i$; hence, in the following we focus on generating a random mask for the block $\mathbf{u}_1$. The idea is to generate the random seed $\mathbf{s} = (\mathbf{s}_1, \ldots, \mathbf{s}_{b_0})$ and the corresponding mask $\mathbf{e} = (\mathbf{e}_1, \ldots, \mathbf{e}_{b_0})$ subblock by subblock, one pair for each $\mathbf{u}_{1,j}$, $j \in [b_0]$. The goal is that $\mathbf{u}_{1,j} + \mathbf{e}_j$ contains $\mathbf{p}$ with at least constant probability $q$ for every $j \in [b_0]$. Then the probability that all subblocks $\mathbf{u}_{1,j}$ fail to contain $\mathbf{p}$ is upper bounded by $(1-q)^{b_0} = 1/\mathrm{poly}(n)$.
We begin with the following claim.
Claim 4.9.1. For any string $\mathbf{p} \in \{0,1\}^\ell$, a uniformly distributed random string $\mathbf{r}$ of length $B_0$ contains $\mathbf{p}$ with probability at least $2/3$.
Proof. Note that a uniformly random string of length $\ell$ equals $\mathbf{p}$ with probability $1/2^\ell$. Split $\mathbf{r}$ into $B_0/\ell$ blocks of length $\ell$. Then the probability that none of these blocks equals $\mathbf{p}$ is given by $(1 - 1/2^\ell)^{B_0/\ell} = (1 - 1/2^\ell)^{\Omega(2^\ell)} \le 1/e^{\Omega(1)}$. By properly choosing $B_0$ we can make this probability less than $1/3$. Then, the probability that $\mathbf{r}$ contains $\mathbf{p}$ is at least $2/3$. □

By virtue of Claim 4.9.1, if we generate the seeds $\mathbf{s}_j$ independently and uniformly at random and let the masks be $\mathbf{e}_j = \mathbf{s}_j$ for $j \in [b_0]$, then $\mathbf{u}_{1,j} + \mathbf{e}_j$ contains $\mathbf{p}$ with probability at least $q = 2/3$, and the probability that $\mathbf{u}_1$ does not contain $\mathbf{p}$ is upper bounded by $(1/3)^{b_0} = 1/\mathrm{poly}(n)$.
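Claim 4.9.1 can also be illustrated numerically. The sketch below estimates, by simulation, the probability that a uniform binary string of a given length contains a fixed pattern; the pattern and length used in the example are illustrative toy choices, not the parameters of the claim.

```python
import random

def contains_prob(p: str, length: int, trials: int = 2000, seed: int = 1) -> float:
    """Monte Carlo estimate of the probability that a uniformly random
    binary string of the given length contains p as a substring
    (a toy-scale illustration of Claim 4.9.1)."""
    rng = random.Random(seed)
    hits = sum(
        p in "".join(rng.choice("01") for _ in range(length))
        for _ in range(trials)
    )
    return hits / trials
```

With a length that is a sufficiently large multiple of $2^\ell \ell$, the estimated probability comfortably exceeds $2/3$, as the claim predicts.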
However, this requires the seed length to be $b_0 B_0 = \Theta(k(\log k)^2 \log n)$, which is larger than $O(\log n)$. To reduce the size of the seed, we generate the seeds $\mathbf{s}_j$ and the masks $\mathbf{e}_j$ by a random walk on expander graphs. For reference, we restate the definitions and the results below; Lemma 4.9.2 and Lemma 4.9.3 are also cited in [22].
Definition 4.9.1. If $G$ is a $D$-regular graph with $M$ vertices and $\lambda(G) \le \lambda$ for integers $M$ and $D$ and a real number $\lambda < 1$, we say that $G$ is an $(M, D, \lambda)$-expander graph. Here $\lambda(G)$ is the second largest eigenvalue of the normalized adjacency matrix of $G$.

For some constant integer $D$ and some number $\lambda < 1$, a family of graphs $\{G_M\}_{M=1}^{\infty}$ is a $(D, \lambda)$-expander graph family if for every $M \in \mathbb{N}$, $G_M$ is an $(M, D, \lambda)$-expander graph.
Lemma 4.9.2. Let $A_0, \ldots, A_S$ be vertex sets with densities $\alpha_0 = |A_0|/M, \ldots, \alpha_S = |A_S|/M$ in an $(M, D, \lambda)$-expander graph $G$, and let $V_0, \ldots, V_S$ be a random walk on $G$. Then we have that
$$\Pr[\forall i \in \{0, \ldots, S\},\ V_i \in A_i] \le \prod_{i=0}^{S-1} \left( \sqrt{\alpha_i \alpha_{i+1}} + \lambda \right).$$
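The product bound of Lemma 4.9.2 is easy to evaluate. For example, when all densities equal $1/3$ and $\lambda = 1/6$, every factor equals $1/2$, which is exactly how the bound is applied later in this appendix. A minimal sketch:

```python
import math

def walk_hit_bound(alphas, lam):
    """Upper bound from Lemma 4.9.2: probability that a random walk
    visits a set of density alphas[i] at every step i is at most the
    product of (sqrt(alphas[i] * alphas[i+1]) + lam)."""
    bound = 1.0
    for a, b in zip(alphas, alphas[1:]):
        bound *= math.sqrt(a * b) + lam
    return bound
```

With eleven sets of density $1/3$ and $\lambda = 1/6$ (i.e., a ten-step walk), the bound evaluates to $(1/2)^{10}$.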
Lemma 4.9.3. For every constant $\lambda < 1$ and some constant integer $D$ depending on $\lambda$, there exists a strongly explicit $(D, \lambda)$-expander graph family (i.e., the neighbors of any vertex can be computed in $\mathrm{poly}(\log M)$ time).
According to Lemma 4.9.3, we can generate a $(2^{B_0}, D, \lambda = 1/6)$-expander graph $G$, where $D$ is a constant integer depending on $\lambda$. Instead of generating the seeds $\mathbf{s}_j$ independently for each $j \in [b_0]$, we perform a $b_0$-step random walk on the graph $G$. We show that this $b_0$-step random walk on $G$ can be generated using $O(\log n)$ bits.
Let $\mathbf{s}_0$ be a uniformly random vertex of the graph $G$; the vertex $\mathbf{s}_0$ is the starting point of the random walk. Since each vertex is connected to a constant number $D$ of vertices, the $j$-th step can be generated using a $(\log D)$-bit random seed $\mathbf{s}_j$, $j \in [b_0]$, indicating which neighbor to move to in the $j$-th step. Hence, the total number of bits needed is $B_0 + b_0 \log D = O(\log n)$. Let $E_0, \ldots, E_{b_0}$ be the trace of the random walk, where each $E_j \in \{0,1\}^{B_0}$ is a vertex of the graph $G$. We use $\mathbf{e}_j = E_j$ as the mask for subblock $\mathbf{u}_{1,j}$.
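The expansion of the short seed into the walk trace can be sketched as follows. The neighbor map of the strongly explicit expander is abstracted as a function argument; the `neighbor` used in the example below is a toy stand-in (a cyclic shift), not an actual expander.

```python
def walk_from_seed(seed_bits, B0, logD, neighbor):
    """Expand a short bit seed into the trace E_0, ..., E_{b0} of a
    walk: the first B0 bits pick the start vertex, and each further
    logD-bit chunk selects which of the D neighbors to move to.
    `neighbor(v, j)` is a placeholder for the expander's strongly
    explicit neighbor map (assumed, not specified in the text)."""
    v = seed_bits[:B0]                       # E_0: starting vertex
    trace = [v]
    rest = seed_bits[B0:]
    for i in range(0, len(rest), logD):
        j = int(rest[i:i + logD], 2)         # index of next neighbor
        v = neighbor(v, j)
        trace.append(v)
    return trace
```

For instance, with $B_0 = 4$ and $\log D = 2$, an 8-bit seed yields a start vertex and a two-step trace, for a total of $B_0 + b_0 \log D$ seed bits.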
We are now ready to prove the main result of this section.
Proof of Lemma 4.3.2: Let $E_0, E_1, \ldots, E_{b_0}$ be a random walk on the $(2^{B_0}, D, \lambda = 1/6)$-expander graph $G$. Recall that $b_0$ represents the number of subblocks in each block of $\mathbf{x}$. For a string $\mathbf{p} \in \{0,1\}^\ell$ and a fixed $j \in [b_0]$, define the set $A^{\mathbf{p}}_j$ to be the set of vertices in $G$ such that for any vertex $\mathbf{s}'$ in this set, the sequence $\mathbf{u}_{1,j} + \mathbf{s}'$ does not contain $\mathbf{p}$. Then, according to Claim 4.9.1, we have that $\alpha_j = |A^{\mathbf{p}}_j|/2^{B_0} \le 1/3$.

From Lemma 4.9.2, the probability that $E_j \in A^{\mathbf{p}}_j$, i.e., that $E_j + \mathbf{u}_{1,j}$ does not contain $\mathbf{p}$, for all $j \in [b_0]$ is at most $(1/3 + 1/6)^{b_0} = (1/2)^{\Theta(\log n)} = 1/\mathrm{poly}(n)$. Let $z'(\mathbf{s}) = (E_1, \ldots, E_{b_0})$, and recall that $\mathbf{s}$ is the length-$O(\log n)$ random seed that generates the random walk $E_0, E_1, \ldots, E_{b_0}$. Then, the probability that $\mathbf{u}_1 + z'(\mathbf{s})$ does not contain $\mathbf{p}$ is at most $1/\mathrm{poly}(n)$. Similarly, each $\mathbf{u}_i + z'(\mathbf{s})$, $i \in [t]$, fails to contain $\mathbf{p}$ with probability at most $1/\mathrm{poly}(n)$. Hence, by the union bound, the probability that there exist a block $\mathbf{u}_i$, $i \in [t]$, and a string $\mathbf{p} \in \{0,1\}^\ell$ such that $\mathbf{u}_i + z'(\mathbf{s})$ does not contain $\mathbf{p}$ is upper bounded by $2^\ell n/\mathrm{poly}(n) = 1/\mathrm{poly}(n)$, by choosing $b_0$ to be sufficiently large.

Let $z(\mathbf{s})$ be the $t = n/B$-fold repetition of $z'(\mathbf{s})$. Then we conclude that $\mathbf{u} + z(\mathbf{s})$ belongs to $\mathcal{M}(n,k)$ with probability $1 - 1/\mathrm{poly}(n)$. Thus, it suffices to search in $2^{O(\log n)} = \mathrm{poly}(n)$ time for a seed $\mathbf{s}$ of length $O(\log n)$ such that $\mathbf{u} + z(\mathbf{s}) \in \mathcal{M}(n,k)$. This completes the proof. ∎
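The final search step is generic: since the seed has $O(\log n)$ bits, trying all $2^{O(\log n)} = \mathrm{poly}(n)$ seeds is feasible. In the sketch below, `expand` and `in_target` are placeholders for the mask map $z(\cdot)$ and the membership test of $\mathcal{M}(n,k)$; the instantiation in the usage example is purely illustrative.

```python
def find_seed(u, seed_len, expand, in_target):
    """Exhaustively try all 2^seed_len seeds (poly(n) many when
    seed_len = O(log n)) until one produces a mask that places the
    masked string in the target set. `expand` maps a seed string to
    a mask of the same length as u; `in_target` tests membership."""
    for s in range(2 ** seed_len):
        bits = format(s, f"0{seed_len}b")
        mask = expand(bits)
        masked = "".join(str(int(a) ^ int(b)) for a, b in zip(u, mask))
        if in_target(masked):
            return bits, masked
    return None
```

For example, with a 2-bit seed expanded by repetition and the (toy) target "contains a one", the search returns the first qualifying seed.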
We now proceed to the proof of Corollary 4.3.1.
Proof (of Corollary 4.3.1). The proof is similar to that of Lemma 4.3.2. By Theorem 4.3.1, it suffices to show that $\mathbf{u} \in \mathcal{M}(n,k)$ with high probability. For any sequence $\mathbf{p} \in \{0,1\}^\ell$ and integers $i$ and $a$ such that $i + a - 1 \le n$, let $A(\mathbf{p}, i, a)$ be the event that $(u_i, \ldots, u_{i+a-1})$ does not contain $\mathbf{p}$, where $\ell = \lceil \log k + \log\log(k+1) + 5 \rceil$ is given in the definition of $\mathcal{M}(n,k)$. According to Claim 4.9.1, the probability of $A(\mathbf{p}, i, B_0)$ is at most $1/3$ for sufficiently large $B_0 = O(k(\log k)^2)$, for any $\mathbf{p}$ and any $i \in [n - B_0 + 1]$. Hence, the probability of $A(\mathbf{p}, i, b_0 B_0)$, where $b_0 = O(\log n)$, is at most $(1/3)^{b_0} = 1/\mathrm{poly}(n)$ for any $\mathbf{p}$ and any $i \in [n - b_0 B_0 + 1]$.

Then, by the union bound, the probability of $\bigcup_{\mathbf{p} \in \{0,1\}^\ell,\, i \in [n - b_0 B_0 + 1]} A(\mathbf{p}, i, b_0 B_0)$ is upper bounded by $2^\ell n/\mathrm{poly}(n)$, which is $1/\mathrm{poly}(n)$ for sufficiently large $b_0$. Therefore, the probability that $\mathbf{u} \notin \mathcal{M}(n,k)$ is at most $1/\mathrm{poly}(n)$, which goes to zero as $n$ grows to infinity. Thus the corollary is proved. □
Chapter 5
CODING OVER SETS FOR DNA STORAGE: SUBSTITUTION ERRORS
In this chapter, we study codes over unordered sets of sequences that correct substitution errors.
5.1 Introduction
In the model of coding over sets, the data to be stored is encoded as a set of $K$ strings of length $L$ over a certain alphabet, for some integers $K$ and $L$ such that $K < 2^L$; typical values for $K$ and $L$ are currently within the order of magnitude of $10^7$ and $10^2$, respectively [73]. Each individual string is subject to various types of errors, such as deletions (i.e., omissions of symbols, which result in a shorter string), insertions (which result in a longer string), and substitutions (i.e., replacements of one symbol by another).
One of the reasons for error correction is that errors in synthesis might cause the PCR process to amplify a string that was written erroneously, and hence the reconstructed origins might include this erroneous string. In some cases, error correction after synthesis is possible, yet our model in this chapter is most suitable for substitution errors that were amplified by the PCR process. Deletions and insertions will be discussed in the next chapter. The work in [62] provided a scheme that efficiently corrects a single deletion, which is easier to handle since a single deletion only results in a change of the sequence length. Substitution errors, however, are more challenging to combat, and are discussed next.
A substitution error that occurs prior to amplification by PCR can induce either one of two possible error patterns. In one, the newly created string already exists in the set of strings, and hence the reconstructed origins will constitute a set of $K - 1$ strings. In the other, which is undetectable by counting the size of the set of reconstructed origins, the substitution generates a string which is not equal to any other string in the set; in this case the output set has the same size as the error-free one. These error patterns, which are referred to simply as substitutions, are the main focus of this chapter.
Following a formal definition of the channel model in Sec. 5.2, previous work is discussed in Sec. 5.3. Upper and lower bounds on the amount of redundant bits that are required to combat substitutions are given in Sec. 5.4. In Sec. 5.5 we provide a construction of a code that can correct a single substitution; the redundancy of this construction is shown to be optimal up to some constant, which is later improved in Appendix 5.8. In Sec. 5.6 the construction for a single substitution is generalized to multiple substitutions, and is shown to be order-wise optimal whenever the number of substitutions is a constant. Finally, open problems for future research are discussed in Sec. 5.7.
Remark 5.1.1. The channel which is discussed in this chapter can essentially be seen as taking a string of length $KL$ as input. Then, during transmission, the string is sliced into $K$ substrings of equal length, and each substring is subject to substitution errors in the usual sense. Moreover, the order between the slices is lost during transmission, and they arrive at the decoder as an unordered set.
It follows from the sphere-packing bound [79, Sec. 4.2] that without the slicing operation, one must introduce at least $T \log(KL)$ redundant bits at the encoder in order to combat $T$ substitutions. The surprising result of this chapter is that the slicing operation does not incur a substantial increase in the amount of redundant bits that are required to correct these $T$ substitutions. In the case of a single substitution, our codes attain an amount of redundancy that is asymptotically equivalent to that of the ordinary (i.e., unsliced) channel, whereas for a larger number of substitutions we do not quite attain it, but prove that a comparable amount of redundancy is achievable.
5.2 Preliminaries
To discuss the problem in its most general form and illustrate the ideas, we restrict our attention to binary strings; most of the techniques in this chapter extend to the non-binary case. Generally, we denote scalars by lower-case letters $x, y, \ldots$, vectors by bold symbols $\mathbf{x}, \mathbf{y}, \ldots$, integers by capital letters $K, L, \ldots$, and $[K] \triangleq \{1, 2, \ldots, K\}$.
For integers $K$ and $L$ such that $K \le 2^L$,¹ we denote by $\binom{\{0,1\}^L}{K}$ the family of all subsets of size $K$ of $\{0,1\}^L$, and by $\binom{\{0,1\}^L}{\le K}$ the family of subsets of size at most $K$ of $\{0,1\}^L$. In our channel model, a word is an element $W \in \binom{\{0,1\}^L}{K}$, and a code $\mathcal{C} \subseteq \binom{\{0,1\}^L}{K}$ is a set of words (for clarity, we refer to words in a given code as codewords). To prevent ambiguity with classical coding-theoretic terms, the elements in a word $W = \{\mathbf{x}_1, \ldots, \mathbf{x}_K\}$ are referred to as strings. We emphasize that the indexing in $W$ is merely a notational convenience, e.g., by the lexicographic order of the strings, and this information is not available at the decoder.

¹We occasionally also assume that $K \le 2^{cL}$ for some $0 < c < 1$. This is in accordance with typical values of $K$ and $L$ in contemporary DNA storage prototypes (see Sec. 5.1).
For $T \le KL$, a $T$-substitution error ($T$-substitution, in short) is an operation that changes the values of at most $T$ different positions in a word. Notice that the result of a $T$-substitution is not necessarily an element of $\binom{\{0,1\}^L}{K}$, and might be an element of $\binom{\{0,1\}^L}{K'}$ for some $K - T \le K' \le K$. This gives rise to the following definition.
Definition 5.2.1. For a word $W \in \binom{\{0,1\}^L}{K}$, the ball $\mathcal{B}_T(W) \subseteq \bigcup_{K'=K-T}^{K} \binom{\{0,1\}^L}{K'}$ centered at $W$ is the collection of all subsets of $\{0,1\}^L$ that can be obtained by a $T$-substitution in $W$.
Example 5.2.1. For $K = 2$, $L = 3$, $T = 1$, and $W = \{001, 011\}$, we have that
$$\mathcal{B}_T(W) = \{\{001, 011\}, \{101, 011\}, \{011\}, \{000, 011\}, \{001, 111\}, \{001\}, \{001, 010\}\}.$$
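For $T = 1$, the ball of Definition 5.2.1 can be generated exhaustively by flipping each bit of each string in the word; the following sketch reproduces Example 5.2.1.

```python
def ball_one(W):
    """All sets obtainable from the word W (a set of equal-length
    binary strings) by at most one substitution, per Definition 5.2.1.
    Flipping a bit may create a duplicate, shrinking the set size."""
    ball = {frozenset(W)}                    # zero substitutions
    for s in W:
        for i in range(len(s)):
            flipped = s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]
            ball.add(frozenset((set(W) - {s}) | {flipped}))
    return ball
```

On $W = \{001, 011\}$ this yields exactly the seven sets listed in the example, including the two shrunken sets $\{011\}$ and $\{001\}$.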
In this chapter, we discuss bounds on and constructions of codes in $\binom{\{0,1\}^L}{K}$ that can correct $T$ substitutions ($T$-substitution codes, for short), for various values of $T$. The size of a code, which is denoted by $|\mathcal{C}|$, is the number of codewords (that is, sets) in it. The redundancy of the code, a quantity that measures the amount of redundant information that is to be added to the data to guarantee successful decoding, is defined as $r(\mathcal{C}) \triangleq \log \binom{2^L}{K} - \log(|\mathcal{C}|)$, where the logarithms are in base 2.
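The redundancy $r(\mathcal{C})$ is directly computable from the definition; for instance, the code consisting of all words has redundancy zero. A minimal sketch:

```python
from math import comb, log2

def redundancy(L, K, code_size):
    """r(C) = log2(binom(2^L, K)) - log2(|C|), per the definition."""
    return log2(comb(2 ** L, K)) - log2(code_size)
```

With $L = 3$ and $K = 2$ there are $\binom{8}{2} = 28$ words in total, so a code containing all of them has redundancy $0$, and a single-codeword code has redundancy $\log_2 28$.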
A code $\mathcal{C}$ is used in our channel as follows. First, the data to be stored (or transmitted) is mapped by a bijective encoding function to a codeword $C \in \mathcal{C}$. This codeword passes through a channel that might introduce up to $T$ substitutions, and as a result a word $W \in \mathcal{B}_T(C)$ is obtained at the decoder. In turn, the decoder applies some decoding function to extract the original data. The code $\mathcal{C}$ is called a $T$-substitution code if the decoding process always recovers the original data successfully. Having settled the channel model, we are now in a position to formally state our contribution.
Theorem 5.2.1 (Main). For any integers $K$, $L$, and $T$ such that $K \le 2^{L/(4T+2)}$, there exists an explicit code construction with redundancy $O(T^2 \log(KL))$² (Sec. 5.6). For $T = 1$, the redundancy of this construction is asymptotically at most six times larger than the optimal one (Sec. 5.5), when $L$ goes to infinity and $K \ge 4$. Furthermore, an improved construction for $T = 1$ achieves redundancy which is asymptotically at most three times the optimal one (Appendix 5.8), when $L$ goes to infinity and $4 \le K \le 2^{L/4}$.

²Throughout this chapter, we write $f(n) = O(g(n))$ for functions $f, g : \mathbb{N} \to \mathbb{R}$ and integer $n$ if $\limsup_{n \to \infty} f(n)/g(n) < \infty$.
Remark 5.2.1. We note that under current technology restrictions, our code constructions apply to a limited parameter range. For example, when $T = 1$ and $L = 100$, our code requires that $K \le 10^5$, and the upper bound constraint on $K$ becomes stricter as $T$ increases. Yet, our constructions provide the following insights.

First, as the technology advances, it is reasonable to hope that the error rate will decrease and the synthesis length will increase, in which case we have smaller $T$ and larger $L$; as $T$ gets smaller, the admissible range of $K$ increases exponentially with $L$. Second, while $L$ is limited under current technology, the number of strings $K$ can be chosen more freely, and one can obtain a smaller $K$ by storing the information in more DNA pools.

Third, the application of this work is not limited to DNA storage; our techniques apply to other settings, such as network packet transmission.
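The parameter constraint of Theorem 5.2.1 can be evaluated directly; for $T = 1$ and $L = 100$ it gives $K \le 2^{100/6} \approx 10^5$, matching the figure in the remark. A quick check:

```python
def max_K(L, T):
    """Largest K allowed by the constraint K <= 2^(L/(4T+2))
    of Theorem 5.2.1 (rounded down to an integer)."""
    return int(2 ** (L / (4 * T + 2)))
```

Note how quickly the bound shrinks as $T$ grows: for $L = 100$, moving from $T = 1$ to $T = 2$ reduces the bound from roughly $10^5$ to $2^{10}$.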
A few auxiliary notions are used throughout this chapter, and are introduced herein.
For two strings $\mathbf{s}, \mathbf{t} \in \{0,1\}^L$, the Hamming distance $d_H(\mathbf{s}, \mathbf{t})$ is the number of entries in which they differ. To prevent confusion with common terms, a subset of $\{0,1\}^L$ is called a vector-code, and the set $\mathcal{B}^H_D(\mathbf{s})$ of all strings within Hamming distance $D$ or less of a given string $\mathbf{s}$ is called the Hamming ball of radius $D$ centered at $\mathbf{s}$. A linear vector-code is called an $[n, k]_q$ code if the strings in it form a subspace of dimension $k$ in $\mathbb{F}_q^n$, where $\mathbb{F}_q$ is the finite field with $q$ elements.
Several well-known vector-codes are used in the sequel, such as Reed-Solomon codes and Hamming codes. For an integer $t$, the Hamming code is a $[2^t - 1, 2^t - t - 1]_2$ code (i.e., there are $t$ redundant bits in every codeword), and its minimum Hamming distance is 3. Reed-Solomon (RS) codes over $\mathbb{F}_q$ exist for every length $n$ and dimension $k$, as long as $q \ge n - 1$ [79, Sec. 5], and require $n - k$ redundant symbols in $\mathbb{F}_q$. Whenever $q$ is a power of two, RS codes can be made binary by representing each element of $\mathbb{F}_q$ as a binary string of length $\log_2(q)$. In the sequel we use this form of RS codes, which requires $\log(q)(n - k)$ redundant bits.
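These parameter counts are easy to sanity-check; e.g., $t = 3$ gives the $[7, 4]$ Hamming code, and the binary image of a $[255, 223]$ RS code over $\mathbb{F}_{2^8}$ carries $8 \cdot 32 = 256$ redundant bits (the $[255, 223]$ parameters are an illustrative choice, not taken from the text).

```python
def hamming_params(t):
    """Length and dimension of the [2^t - 1, 2^t - t - 1] Hamming code;
    the difference is exactly t redundant bits."""
    n = 2 ** t - 1
    return n, n - t

def rs_redundant_bits(log2_q, n, k):
    """Redundant bits in the binary image of an [n, k] RS code over
    a field of size 2^log2_q: log2(q) * (n - k)."""
    return log2_q * (n - k)
```
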
Finally, our encoding algorithms make use of combinatorial numbering maps [53], which are functions that map a number to an element of some structured set. Specifically, $F_{com} : [\binom{N}{M}] \to \{S : S \subseteq [N], |S| = M\}$ maps a number to a set of $M$ distinct elements, and $F_{perm} : [N!] \to S_N$ maps a number to a permutation in the symmetric group $S_N$. The function $F_{com}$ can be computed using a greedy algorithm with complexity $O(MN \log N)$, and the function $F_{perm}$ can be computed in a straightforward manner with complexity $O(N \log N)$. Using $F_{com}$ and $F_{perm}$ together, we define a map $F : [\binom{N}{M} \cdot M!] \to \{S : S \subseteq [N], |S| = M\} \times S_M$ that maps a number to an unordered set of size $M$ together with a permutation.
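A minimal sketch of the two numbering maps follows, assuming zero-based indices, lexicographic order for $F_{com}$, and the factorial number system for $F_{perm}$; these orderings are an implementation choice, since the text only requires the maps to be bijections.

```python
from math import comb, factorial

def f_com(idx, N, M):
    """Greedy unranking: map idx in [0, binom(N, M)) to the idx-th
    M-subset of {1, ..., N} in lexicographic order."""
    out, x, m = [], idx, M
    for v in range(1, N + 1):
        if m == 0:
            break
        c = comb(N - v, m - 1)   # subsets whose smallest new element is v
        if x < c:
            out.append(v)
            m -= 1
        else:
            x -= c
    return out

def f_perm(idx, N):
    """Map idx in [0, N!) to a permutation of [N] via the
    factorial number system."""
    elems = list(range(1, N + 1))
    perm, x = [], idx
    for i in range(N, 0, -1):
        q, x = divmod(x, factorial(i - 1))
        perm.append(elems.pop(q))
    return perm
```

For example, with $N = 4$ and $M = 2$, index 0 maps to $\{1, 2\}$ and index 5 (the last) to $\{3, 4\}$; likewise, for $N = 3$, index 0 gives the identity permutation and index 5 its reversal.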
5.3 Previous Work
The idea of manipulating atomic particles for engineering applications dates back to the 1950s, with R. Feynman's famous quote, "There's plenty of room at the bottom" [31]. The specific idea of manipulating DNA molecules for data storage has been circulating in the scientific community for a few decades, and yet it was not until 2012–2013 that two prototypes were implemented [24, 36]. These prototypes have ignited the imagination of practitioners and theoreticians alike, and many works followed suit with various implementations and channel models [14, 32, 46, 52, 77, 99].
By and large, all practical implementations to this day follow the aforementioned channel model, in which multiple short strings are stored inside a solution. Normally, deletions and insertions are also taken into account, but substitutions were found to be the most common form of error [73, Fig. 3.b]; strings that were subject to insertions and deletions are scarcer, and some of them can be corrected using clustering and sequence reconstruction algorithms.
The channel model in this work has been studied by several authors in the past.
The work of [46] addressed this channel model under the restriction that individual strings are read in an error-free manner, and some strings might get lost as a result of random sampling of the DNA pool. In their techniques, the strings in a codeword are appended with an indexing prefix, a solution which already incurs $K \log(K)(1 - o(1))$ redundant bits [62, Remark 1] and which will be shown to be strictly sub-optimal in our case. Such index-based constructions were also considered in [61] and [83], which attempted to correct errors in the index.
In a different setting, where substitution errors occur probabilistically and the decoding error probability fades with $K$ and $L$, using an indexing prefix was proven to be asymptotically optimal in terms of coding rate [85]. The indexing-prefix technique was also studied in an adversarial setting under substitution errors [61, 83], where codes correcting errors in the indexing prefix were proposed.
The recent work of [62] addressed this model under a bounded number of substitutions, deletions, and insertions per string. When discussing substitutions only, [62] suggested a code construction for $T = 1$ with $2L + 1$ bits of redundancy. Furthermore, by using a reduction to constant-Hamming-weight vector-codes, it was shown that there exists a code that can correct $e$ errors in each one of the $K$ sequences with redundancy $Ke \log(L + 1)$.
The work of [89] studied a more general model where, in addition to substitution errors, insertions and deletions of strings were considered. A distance called the sequence-subset distance was defined, and upper bounds and constructions were presented based on it. When considering substitution errors only, those upper bounds address cases where the number of errors is at least a fraction of $L$.
The work of [54] addressed a similar model, where multisets are received at the decoder rather than sets. In addition, errors in the stored strings are not seen in a fine-grained manner; that is, any set of errors in an individual string is counted as a single error, regardless of how many substitutions, insertions, or deletions it contains. As a result, the specific structure of $\{0,1\}^L$ is immaterial, and the problem reduces to decoding histograms over an alphabet of a certain size.
The specialized reader might suggest the use of fountain codes, such as LT codes [68] or Raptor codes [84]. However, we stress that these solutions rely on randomness and operate at much higher redundancy rates, whereas this work aims for a deterministic and rigorous solution whose redundancy is close to optimal.
Finally, we also mention the permutation channel [55, 58, 96], which is similar to our setting and yet farther in spirit from the aforementioned works. In that channel, a vector over a certain alphabet is transmitted, and its symbols are received at the decoder under a certain permutation. If no restriction is applied to the possible permutations, then this channel reduces to multiset decoding, as in [54]. This channel is applicable in networks in which different packets are routed along different paths of varying lengths and are obtained at the decoder in an unordered and possibly erroneous form. Yet, this line of work is less relevant to ours, and to DNA storage in general, since the specific error pattern in each "symbol" (which corresponds to a string in $\{0,1\}^L$ in our case) is not addressed, and perfect knowledge of the number of appearances of each "symbol" is assumed.