Correcting Errors in DNA Storage - California Institute of Technology

Introduction

DNA-Based Storage: Models

Due to the limitations of synthesis and sequencing technology, the length of each string, i.e., the number of bases in the strand, is not large. Since the desired DNA molecules are amplified in the PCR process, it is very likely that each amplified DNA molecule is sampled multiple times.

Error Correction in DNA Storage: Challenges

Substitution errors replace some of the bases in the strand, at unknown locations, with other bases. The purpose of the decoder is to retrieve the set of strings from a collection of noisy copies of the strings, and then decode the strings back to data.

Contributions

Similar to the coding in Proposition 3.6.3 and Proposition 3.6.4, we show that the integer 𝑛′ marks the end of the unsigned bits, i.e., the successor (𝑇(c)𝑛′+1,. Note that there are two sources of redundancy: the redundancy of the Hamming code, which is at most 3(log 𝑀𝐿 + log 𝑀 + 2), and the fact that the groups {a𝑖}𝑀.

Binary 2-Deletion/Insertion Correcting Codes

Introduction

In particular, for non-adjacent sequences, the "syndrome", i.e., the difference between the parity checks of a codeword and an erroneous sequence, can be expressed as a linear function. For any integer 𝑘, let 𝐵𝑘(c) be the 𝑘-deletion ball of c, i.e., the set of bit sequences that share a common length-(𝑛−𝑘) subsequence with c.
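As a minimal sketch of the deletion-ball definition (not taken from the thesis), the following Python snippet enumerates the length-(𝑛−𝑘) subsequences of a sequence and tests whether a second sequence lies in 𝐵𝑘(c). The function names are hypothetical, the ball is assumed to contain only sequences of the same length as c, and the brute-force enumeration is practical only for short sequences.

```python
from itertools import combinations


def subsequences_after_deletions(c: str, k: int) -> set[str]:
    """Return every sequence obtained from c by deleting exactly k symbols."""
    n = len(c)
    return {
        "".join(c[i] for i in kept)
        for kept in combinations(range(n), n - k)
    }


def in_deletion_ball(c: str, d: str, k: int) -> bool:
    """Check whether d shares a length-(n-k) subsequence with c.

    Assumption: the ball contains only sequences of the same length as c.
    """
    if len(c) != len(d):
        return False
    shared = subsequences_after_deletions(c, k) & subsequences_after_deletions(d, k)
    return bool(shared)
```

A 𝑘-deletion correcting code can then be viewed as a set of sequences in which no codeword lies in the deletion ball of another.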

Outline

Furthermore, one example of the proof of Lemma 2.2.2 requires a specialized version of Proposition 2.1.3, as follows. In addition, the proof of Statement 2.2.3 contains one of the main ideas in this chapter: proving that two deletions can be corrected using higher-order parity checks.

Protecting 𝟙₁₀ Indicator Vectors

Note that in cases (a) and (b) we used Proposition 2.1.3, which implies that only two parity checks with the weights m⁽⁰⁾ and m⁽¹⁾ are needed. If all columns of 𝐴 are not zero columns, then according to the multilinearity of the determinant.
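To make the objects in this section concrete, the sketch below computes the 10 and 01 indicator vectors of a binary sequence and a weighted parity check over an indicator vector. It rests on two assumptions that the excerpt does not state: entry 𝑖 of the indicator vector is 1 exactly when the pattern starts at position 𝑖, and the weight vector m⁽ᵉ⁾ has entries equal to the running sums of 𝑗ᵉ. The moduli used by the actual construction are omitted, and all function names are hypothetical.

```python
def indicator(c: list[int], pattern: tuple[int, int]) -> list[int]:
    """Indicator vector of a two-bit pattern: entry i is 1 iff (c[i], c[i+1]) == pattern."""
    return [1 if (c[i], c[i + 1]) == pattern else 0 for i in range(len(c) - 1)]


def weights(length: int, e: int) -> list[int]:
    """Assumed weights: the i-th entry is 1^e + 2^e + ... + i^e (1-indexed)."""
    out, total = [], 0
    for j in range(1, length + 1):
        total += j ** e
        out.append(total)
    return out


def weighted_check(ind: list[int], e: int) -> int:
    """Weighted parity check of an indicator vector, without any modulo reduction."""
    return sum(m * x for m, x in zip(weights(len(ind), e), ind))


c = [1, 0, 1, 1, 0, 0, 1]
ind10 = indicator(c, (1, 0))   # positions where a "10" pattern starts
ind01 = indicator(c, (0, 1))   # positions where a "01" pattern starts
checks = (weighted_check(ind10, 0), weighted_check(ind10, 1))
```

Under these assumptions, the order-0 weights are simply 1, 2, 3, ..., so the order-0 check is the position-weighted sum familiar from the VT code.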

Figure 2.2: Case (a), where ℓ₁ ≤ ℓ₂ < 𝑘₂ ≤ 𝑘₁. The diagonal lines indicate equality between the respective entries of 𝟙₁₀(c) and 𝟙₁₀(c′).

Protecting 𝟙₀₁ Indicator Vectors

Since the sum of the (integer) expressions |𝜋𝑖−𝜏𝑖| is at most 4, and since the values of the individual expressions are equal for adjacent values of 𝑖 (i.e., for 𝑖 = 2𝑟 and 𝑖 = 2𝑟+1 for an integer 𝑟), an inequality between the 01 indicators of c and c′ can be only one of the following two cases. This easily implies that (2.38) is equal to ±(𝑠−𝑡) or ±(𝑠+𝑡+2), and since neither of these is 0 modulo 2𝑛, another contradiction arises.

Decoding of Two-Deletion Correcting Codes

As will be shown, the two calls of the function D2 aim to recover the redundancy 𝑓(c) and ℎ(c) and the sequence, respectively. We now show that the entry of 𝐴(𝑒) we are looking for is equal modulo 𝑛𝑒 to the difference in entry 𝑒 of the 𝑓 redundancies of c and of d.

Figure 2.5: The path of Algorithm 2 on the matrix 𝐴⁽⁰⁾.

Appendix

Since 𝑅′′(c) is a (𝑘+1)-fold repetition code, and thus a 𝑘-deletion correcting code that protects 𝐻(𝑅′(c)), the hash 𝐻(𝑅′(c)) can be recovered from (𝑑𝑛+𝑁 . The nice properties of 𝐼𝑤(c) are as follows. 2) The vector 𝐼𝑤(c) is very limited if 𝐼𝑤+1(c) is known, as will be discussed in the proof of Lemma 4.4.3.
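The claim that a (𝑘+1)-fold repetition code corrects 𝑘 deletions can be checked with a short sketch. The function names below are hypothetical and the decoder is the standard run-length argument rather than a procedure quoted from the thesis. A run of 𝑚 equal bits becomes a run of 𝑚(𝑘+1) symbols, and at most 𝑘 deletions can neither erase a run nor shrink it enough to change ⌈run length / (𝑘+1)⌉.

```python
from itertools import groupby


def rep_encode(bits: str, k: int) -> str:
    """Repeat every bit k+1 times."""
    return "".join(b * (k + 1) for b in bits)


def rep_decode(received: str, k: int) -> str:
    """Recover the original bits from a codeword that suffered at most k deletions.

    Each run of length r in the received word came from ceil(r / (k+1))
    equal bits of the original message.
    """
    out = []
    for symbol, run in groupby(received):
        r = len(list(run))
        out.append(symbol * -(-r // (k + 1)))  # ceiling division
    return "".join(out)
```

For example, with 𝑘 = 2 the message 0110 is encoded as 000111111000, and deleting any two symbols still decodes to 0110.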

Binary Codes Correcting 𝑘 Deletions/Insertions

Introduction

Then the general VT construction can be used to protect the pointer vector and thus the locations of the block boundary markers. Our generalization of the VT code construction shows that the vector 𝑓(𝟙𝑠𝑦𝑛𝑐(c)) helps protect the synchronization vector 𝟙𝑠𝑦𝑛𝑐(c) from 𝑘 deletions in c.
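For background, the classical Varshamov-Tenengolts (VT) code, which the construction above generalizes, corrects a single deletion using the checksum Σ 𝑖·𝑐𝑖 mod (𝑛+1). The sketch below shows this well-known single-deletion scheme only; it is not the generalized construction of this chapter, and the function names are hypothetical.

```python
def vt_syndrome(x: list[int]) -> int:
    """Classical VT checksum: sum of i * x_i (1-indexed) modulo n + 1."""
    return sum(i * b for i, b in enumerate(x, start=1)) % (len(x) + 1)


def vt_correct_deletion(y: list[int], a: int) -> list[int]:
    """Reinsert the single deleted bit, given the syndrome a of the original word."""
    n = len(y) + 1
    w = sum(y)  # number of ones in the received word
    deficiency = (a - sum(i * b for i, b in enumerate(y, start=1))) % (n + 1)
    if deficiency <= w:
        # A 0 was deleted: it had exactly `deficiency` ones to its right.
        pos, ones = len(y), 0
        while ones < deficiency:
            pos -= 1
            ones += y[pos]
        return y[:pos] + [0] + y[pos:]
    # A 1 was deleted: it had exactly deficiency - w - 1 zeros to its left.
    zeros_needed = deficiency - w - 1
    pos, zeros = 0, 0
    while zeros < zeros_needed:
        zeros += 1 - y[pos]
        pos += 1
    return y[:pos] + [1] + y[pos:]
```

The decoder only learns the run in which the deletion happened, which is enough because reinserting a bit anywhere inside that run yields the same sequence.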

Outline and Preliminaries

Combining Lemma 3.2.2 and Lemma 3.2.3, we obtain a hash function of size 𝑂(𝑘 log 𝑛) to correct deletions for a 𝑘-dense sequence. Lemma 3.2.6 will be used in the proof of Lemma 3.2.1, where we need an upper bound on the number of deletions/insertions in 𝟙𝑠𝑦𝑛𝑐(c) caused by a deletion in c.

Table 3.1: Summary of Notations in Ch. 3. 𝐵𝑘(c) ≜ The set of sequences that share a

Proof of Theorem 3.1.1

The function 𝑇(c) is defined in Lemma 3.2.4 and transforms c into a 𝑘-dense sequence (see Definition 3.2.1). To show how the sequence can be recovered from a length-(𝑁−𝑘) subsequence of E(c), and to implement the computation of the decoding function D(d), we prove that.

Figure 3.2: Dependencies of the claims in Ch. 3.

Protecting the Synchronization Vectors

The following lemma shows that the sequences in R3𝑘 can be protected by using higher-order parity checks. Note that, compared to the higher-order parity checks 𝑓(c), the higher-order parity checks in the following proposition do not have modulo operations.

Hash for 𝑘-Dense Sequences

For blocks that are not recovered, we show that they can be recovered with up to 𝑘 deletion errors. Note that the redundancy of a 𝑘-error-correcting Reed-Solomon code of length 𝑛 and alphabet size 𝑞 ≥ 𝑛−1 is 2𝑘 log 𝑞 bits [79].

Generating 𝑘-Dense Sequences

In the initialization step, the length of the deleted subsequence is greater than the length of the appended subsequence by 2. Note that in the encoding procedure, the appended subsequence in the Initialization Step is terminated by a 0.

Figure 3.3: An example of how the encoding in Proposition 3.6.3 proceeds.

Conclusion and Future Work

Note that the multiplier of |𝜕𝑓𝑘𝑊| in the above expression can be bounded as follows. where the former inequality follows since |𝜕𝑓𝑊| ≥ 𝜖𝑀𝐿 by Lemma 5.4.6; the latter inequality follows since 𝑘 ≤ 𝑐𝜖. 𝑀, which holds for every non-constant 𝑀 and 𝐿. of Theorem 5.4.2) It is clear that no two 𝑘-balls can intersect codewords in C, and so we must have |C| ≤ 2. The step starts at layer 1, i.e., the root of the binary tree, and ends at layer 𝐿′+1 at one of the leaf nodes.

Correcting Deletions/Insertions-Generalizations

Introduction

For the case of multiple deletions, Helberg codes [47], which were originally proposed for the binary deletion channel, were adapted and shown to produce non-binary deletion-correcting codes [59]. Then, there is a systematic 𝑘-deletion code of message length 𝑛 over an alphabet of size 𝑞 that takes at most 4𝑘log

Syndrome Compression

We now prove the key lemma that will be used to prove the main result in Theorem 4.2.1. In the case where 𝑘 = O(1), the technique allows for a polynomial-time encoding/decoding algorithm (since decoding can also be performed using standard brute-force methods).

Non-Systematic Deletion Correcting Codes

Note from the previous lemma that if we want to use the mapping 𝑓𝑘, we must restrict our sequences to the set of 𝑘-mixed strings. To transform arbitrary information sequences into 𝑘-mixed strings from the set M(𝑚, 𝑘), we will require the following result, which is proved in Appendix 4.9 using techniques similar to those in [22].

Systematic Deletion Correcting Codes

Given the upper bounds on the numbers of deletions and substitutions in Lemma 4.4.1, we next show that the generalization of VT constraints 𝑓(x) can be used to correct these deletions and substitutions for constrained sequences x ∈ R𝑘,𝑛. Since 𝑔𝑐(𝑔𝑐(c)) is a 𝑘-deletion-correcting hash of 𝑔𝑐(c), the hash 𝑔𝑐(c) can be recovered. 𝑧𝑛−𝑘) is a length-(𝑛−𝑘) sequence of 𝑛, we can use 𝑔𝑐(c) to recover.

Codes Capable of Correcting Bursts of Deletions

To prove the result, suppose that it is the result of a burst of deletions of length at most 𝑘 that happen to us 𝑓𝑏𝑐. 𝐿(x) be the set of possible vectors given that 𝑘 bursts occur within a window of maximum length 𝑡𝐿 in x.

Since the deletion ranges have a short length, most symbols in the sequence u ∈ [[𝑞]]ⁿ can be recovered. The correctness of the code can be proven with the same argument as in the proof of Theorem 4.6.1.

Encoding step: We add additional redundancy symbols to our codewords to ensure that we can recover their 𝐿-spectrum, subject to erasure errors. Since each symbol of this sequence has its position encoded into it, it follows that we can determine the locations of deletions from any length 29𝑘-subsequence.

Conclusion

Appendix: Generation of Random Seed z with Size 𝑂(log 𝑚)

When only substitution errors are considered, the upper bounds handle cases where the number of errors is at least a fraction of 𝐿. The work of [54] addressed a similar model, where multisets, instead of sets, are received at the decoder. Each sequence a𝑖, 𝑖 ∈ [𝑀], is represented by the walking path. 𝑗=1) to the sibling node of the node 𝑎𝑖,ℓ, i.e., the node that shares the same parent node with 𝑎𝑖,ℓ.

Coding over Sets for DNA Storage: Substitution Errors

Introduction

In one, the newly created string already exists in the set of strings, and so the received set will contain only 𝑀−1 strings. The surprising result of this chapter is that the slicing operation does not cause a significant increase in the number of redundant bits needed to correct these 𝑘 substitutions.
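The effect of a substitution on an unordered set of strings can be illustrated with a short sketch. The function name and the example strings are hypothetical, and a binary alphabet is assumed, so that a substitution is a bit flip. When the corrupted string coincides with another member of the set, the set shrinks to 𝑀−1 elements.

```python
def substitute(strings: frozenset[str], target: str, position: int) -> frozenset[str]:
    """Flip one bit of `target` and return the resulting set of strings."""
    flipped = "1" if target[position] == "0" else "0"
    corrupted = target[:position] + flipped + target[position + 1:]
    # Set union silently merges the corrupted string with an identical member.
    return (strings - {target}) | {corrupted}


W = frozenset({"000", "001", "110"})
collided = substitute(W, "000", 2)   # "000" -> "001", which is already in W
distinct = substitute(W, "000", 0)   # "000" -> "100", a new string
```

Here `collided` has two elements while `distinct` still has three, so the decoder cannot assume that it always receives 𝑀 strings.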

Preliminaries

In Section 5.6, the construction for a single substitution is generalized to multiple substitutions, and it is shown that it is order-optimal when the number of substitutions is constant. For 𝑘 = 1, the redundancy of this construction is asymptotically at most six times greater than the optimal one (Sec. 5.5), when 𝐿 goes to infinity and 𝑀 ≥ 4.

Previous Work

Furthermore, using a reduction to vector codes with constant Hamming weight, it is shown that there exists a code that can correct errors in each of the sequences with redundancy 𝑒log(𝐿+1). In that channel a vector is sent over a certain alphabet, and its symbols are received at the decoder under a certain permutation.

Bounds

When the minimum Hamming distance between the strings in a codeword is not large enough, different substitution errors can result in identical words, and the size of the ball is smaller than the given upper bound. After establishing the relation between B1(𝑊) and the boundary of 𝑓𝑊, the following Fourier analytic claim will help to prove a lower bound.

Codes for a Single Substitution

By applying a Hamming decoder to one of the s𝑖’s, the decoder obtains possible candidates for {a𝑖}𝑀. Therefore, the decoder can apply a majority vote of the candidates based on decoding each ′𝑖, and the winning values are {a𝑖}𝑀.
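The majority-vote step can be sketched as follows. This is only an illustration under assumptions: each candidate is represented as a frozenset of strings produced by some per-part decoder, the helper name is hypothetical, and ties are not handled.

```python
from collections import Counter


def majority_vote(candidates: list[frozenset[str]]) -> frozenset[str]:
    """Return the candidate value proposed most often."""
    winner, _count = Counter(candidates).most_common(1)[0]
    return winner


# Hypothetical candidates from three decoding attempts; two of them agree.
candidates = [
    frozenset({"00", "11"}),
    frozenset({"01", "11"}),
    frozenset({"00", "11"}),
]
winner = majority_vote(candidates)
```

The vote succeeds whenever the part affected by the substitution is in the minority, which is the intuition behind decoding each part separately.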

Figure 5.1: Illustration of single-substitution correcting codes over unordered sets.

Codes for Multiple Substitutions

This approach results in redundancy of 3 log 𝑀𝐿 + log 3𝑀 + 𝑂(1), and a similar approach can also be used in the next section. To state the equivalent of (5.9): for a binary string t, let 𝑅𝑆𝑘(t) be the redundancy bits resulting from 𝑘-substitution-correcting RS encoding of t, in its binary representation.

Conclusions and Future Work

Appendix

We briefly present an improved construction of a single-substitution code that achieves a redundancy of 2 log 𝑀𝐿 + log 2𝑀 + 𝑂(1). We use the bit 𝑏𝑒 = e·x⊕ mod 2 to indicate in which part the substitution error occurs.

Generating a𝑖, 𝑖 ∈ [𝑀], in the encoding procedure can be intuitively characterized as walking on a complete binary tree of 𝐿′+1 layers. To prove that the decoding is correct, we prove that the strings a𝑖, 𝑖 ∈ [𝑀], generated in the encoding procedure satisfy. 6.12).
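The tree picture can be made concrete with a small sketch. It assumes a representation that the excerpt does not spell out: a node at layer ℓ+1 is identified with a binary prefix of length ℓ, so that a string of length 𝐿′ corresponds to a root-to-leaf path through 𝐿′+1 layers. The function names are hypothetical.

```python
def walk(a: str) -> list[str]:
    """Nodes visited by the string a, from the root (empty prefix) to its leaf."""
    return [a[:depth] for depth in range(len(a) + 1)]


def sibling(node: str) -> str:
    """The other child of the node's parent, obtained by flipping the last bit."""
    if not node:
        raise ValueError("the root has no sibling")
    last = "1" if node[-1] == "0" else "0"
    return node[:-1] + last


path = walk("101")          # root, then one node per bit of the string
other = sibling(path[2])    # sibling of the node reached after two steps
```

With this representation, moving to a sibling node at some layer corresponds to changing a single bit of the prefix while keeping the earlier bits fixed.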

Robust Indexing: Optimal Codes Correcting Deletion/Insertion

Introduction

Very few works have studied the problem of correcting a combination of deletion/insertion errors, the most related being [62], which dealt with the correction of a single deletion. 4, we can present a code that corrects a combination of 𝑘 deletions and insertions with 𝑂(𝑘log𝑀 𝐿) redundancy, which is our second main result.

Preliminaries

Similarly, a 𝑘 deletion/insertion error is an operation that removes and/or inserts at most 𝑘 bits in the codeword. Our code constructions make use of the well-known Reed-Solomon code, which is capable of correcting 𝑘 substitutions in a length 𝑛 codeword over an alphabet of size 𝑞, with 2𝑘 log 𝑞 bits of redundancy, as long as 𝑞 ≥ 𝑛−1 [79].
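As a quick numerical check of the redundancy expression 2𝑘 log 𝑞, the following sketch evaluates it in bits for given parameters. The helper name is hypothetical, and the logarithm is assumed to be base 2, since the redundancy is measured in bits.

```python
import math


def rs_redundancy_bits(n: int, k: int, q: int) -> float:
    """Redundancy 2k * log2(q) of a k-substitution-correcting Reed-Solomon code.

    Raises ValueError when the alphabet is too small for the stated
    condition q >= n - 1.
    """
    if q < n - 1:
        raise ValueError("alphabet size q must satisfy q >= n - 1")
    return 2 * k * math.log2(q)


# Example: length 255 over an alphabet of size 256, correcting 3 substitutions.
bits = rs_redundancy_bits(n=255, k=3, q=256)
```

In this example each symbol takes 8 bits and 2𝑘 = 6 redundant symbols are needed, giving 48 bits.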

Robust Indexing for Codes over Sets: Substitution Errors

Define S𝐻 as the set of all codes with cardinality 𝑀 and a minimum Hamming distance of at least 2𝑘+1 that contain 1𝐿′, i.e. Since the number of strings within Hamming distance at most 2𝑘 of at least one of a1,.
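Membership in a set such as S𝐻 can be tested directly from the stated conditions. The sketch below assumes that all strings in a code have the same length and reads 1𝐿′ as the all-ones string of that length. The function names are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions in which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def in_S_H(code: set[str], M: int, k: int) -> bool:
    """Check cardinality M, presence of the all-ones string, and distance >= 2k+1."""
    lengths = {len(a) for a in code}
    if len(code) != M or len(lengths) != 1:
        return False
    length = lengths.pop()
    if "1" * length not in code:
        return False
    return all(hamming(a, b) >= 2 * k + 1 for a, b in combinations(code, 2))
```

A minimum distance of 2𝑘+1 means that the radius-𝑘 Hamming balls around the strings are disjoint, so 𝑘 substitutions cannot turn one string of the code into something closer to another.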

Computing 𝐹 𝐻

Furthermore, we have 0 < 𝑞 ≤ 𝑤(𝑎𝑖,ℓ) after the ℓ-th inner for loop in the 𝑖-th outer for loop. This is formalized in the following lemma, which can be used to prove that Eq

Robust Indexing for Deletion/Insertion Errors

Since the computation of 𝑁𝐻(a, 𝐴) has polynomial complexity, the complexity of the encoding/decoding procedure is polynomial in and 𝐿′. To prove the correctness of the algorithm, we need to show that 𝑁𝐷(a, 𝐴) satisfies the two properties similar to those in Eq.

Conclusion

Appendix