Chapter IV: Correcting Deletions/Insertions-Generalizations
4.1 Introduction
In the early 1960s, the problem of constructing codes for the deletion channel was introduced through the seminal works of Sellers and Levenshtein [82,64]. Despite the inherent difficulty of coding for this channel, the initial results were promising.
It was shown in [64] that the Varshamov-Tenengolts (VT) code can correct either a single deletion or a single insertion over a binary alphabet. Furthermore, [64] showed that $\log n + O(1)$ bits of redundancy are necessary for a single-deletion-correcting code, which implies that the Varshamov-Tenengolts code is nearly optimal. For the case where the channel deletes a burst of symbols of length 1 or 2, a nearly optimal (and very elegant) construction was provided by Levenshtein in [63]. In the early 1980s, Tenengolts constructed a nearly optimal code for a single deletion over a non-binary alphabet, based upon a clever mapping between a binary and a non-binary sequence [93].
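To make the VT construction concrete, the following sketch (our own Python illustration, not code from [64]) implements Levenshtein's decoder for the code $\mathrm{VT}_a(n) = \{x \in \{0,1\}^n : \sum_{i=1}^n i\,x_i \equiv a \pmod{n+1}\}$:

```python
def vt_checksum(x):
    # Sum of the 1-indexed positions of the 1-bits of x.
    return sum((i + 1) * b for i, b in enumerate(x))

def vt_decode(y, a, n):
    # Recover x in VT_a(n) from y, a copy of x with one bit deleted.
    w = sum(y)                           # weight of the received word
    d = (a - vt_checksum(y)) % (n + 1)   # checksum deficit
    if d <= w:
        # A 0 was deleted; it had exactly d ones to its right.
        if d == 0:
            return y + [0]
        j = [i for i, b in enumerate(y) if b == 1][-d]
        return y[:j] + [0] + y[j:]
    # A 1 was deleted; it had d - w - 1 zeros to its left.
    z = d - w - 1
    if z == 0:
        return [1] + y
    j = [i for i, b in enumerate(y) if b == 0][z - 1] + 1
    return y[:j] + [1] + y[j:]

# x = [1, 0, 1, 1, 0] has checksum 8 = 2 (mod 6), so x lies in VT_2(5);
# deleting any one of its bits is corrected:
assert vt_decode([1, 0, 1, 0], 2, 5) == [1, 0, 1, 1, 0]
```

Since the parameter $a$ takes one of only $n+1$ values, some choice of $a$ yields a code with at most $\log(n+1)$ bits of redundancy, matching the near-optimality claim above.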
Unfortunately, until recently, progress was slow and results on the subject were limited. Notable exceptions include the work of Schulman and Zuckerman [81], which showed the existence of efficient codes capable of correcting a constant fraction of deletions, and the work of Helberg and Ferreira [47], which attempted to extend the Varshamov-Tenengolts code to handle more than a single deletion. Despite these efforts, up until the recent work of Brakensiek, Guruswami, and Zbarsky [12], no efficiently constructable codes were known that could correct even two deletions with less than $\sqrt{n}$ bits of redundancy.
In [12], Brakensiek et al. constructed $k$-deletion correcting codes that require $O(k^2 \log k \log n)$ bits of redundancy. Several works quickly followed [12] that dramatically improved upon this result. Haeupler [41] gives an explicit systematic binary code construction which requires $\Theta(k \log^2 \frac{n}{k} + k)$ bits of redundancy. In [22], another construction was derived by Cheng et al., which is not systematic, but is order-optimal in the sense that it requires $O(k \log n)$ bits of redundancy.
Independent of [22], in Ch. 3 we presented explicit constructions of $k$-deletion correcting codes with redundancy $8k \log n + o(\log n)$, which for the first time achieves asymptotically four times the Gilbert-Varshamov lower bound in the Levenshtein metric. In this chapter, we generalize one of the underlying techniques of Ch. 3, which we refer to as syndrome compression. We show that using the syndrome compression technique, it is possible to construct codes that are within a factor of two of the Gilbert-Varshamov lower bound in the Levenshtein metric. Note that the multi-deletion correcting codes above are not systematic, except for the ones in [41] and [22], which have redundancy $\Theta(k \log^2 \frac{n}{k} + k)$. Using the syndrome compression technique, we furthermore present a systematic deletion correcting code and show that it achieves asymptotically twice the Gilbert-Varshamov lower bound, which is asymptotically the same amount of redundancy as our non-systematic construction. We then apply the technique to three different types of combinatorial deletion channels: (a) the binary adversarial deletion channel that can delete at most $k$ symbols, (b) the binary adversarial deletion channel that can cause at most $k$ bursts of deletions (each of length at most $t_L$), and (c) the non-binary adversarial deletion channel that can delete at most $k$ symbols.
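Since syndrome compression is the engine behind most results in this chapter, we preview the idea with a toy sketch (our own illustration; the function $f$ and the Hamming-ball candidates below merely stand in for the actual syndromes and deletion balls). A long syndrome $f(x)$ that distinguishes $x$ from every confusable word is stored only modulo a small prime $p$, together with $p$ itself; each candidate $x'$ rules out only the primes dividing the nonzero difference $f(x) - f(x')$, so a suitable $p$ of size polynomial in the number of candidates always exists, and storing the pair $(p, f(x) \bmod p)$ costs roughly twice the logarithm of the number of candidates.

```python
def is_prime(m):
    # Trial division; fine for the small moduli of this toy example.
    if m < 2:
        return False
    i = 2
    while i * i <= m:
        if m % i == 0:
            return False
        i += 1
    return True

def compress_syndrome(x, candidates, f, p_bound):
    # Find a prime p <= p_bound such that f(x) mod p still differs from
    # f(x') mod p for every confusable candidate x' with f(x') != f(x).
    fx = f(x)
    for p in range(2, p_bound + 1):
        if is_prime(p) and all(
            f(c) % p != fx % p for c in candidates if f(c) != fx
        ):
            return p, fx % p
    raise ValueError("p_bound too small")

# Toy demo: f reads a word as an integer, and the "confusable set" is
# the Hamming ball of radius 1 around x (standing in for a deletion ball).
def f(v):
    return int("".join(map(str, v)), 2)

x = [1, 0, 1, 1, 0, 1]
candidates = [x[:i] + [1 - x[i]] + x[i + 1:] for i in range(len(x))]
p, r = compress_syndrome(x, candidates, f, p_bound=100)

# Decoder side: among all candidates, only x matches the stored pair (p, r).
assert [c for c in candidates + [x] if f(c) % p == r] == [x]
```

In the actual constructions, the candidate set is the set of words confusable with $x$ under $k$ deletions, of size roughly $n^{2k}$, and $2\log(n^{2k}) = 4k\log n$ is what drives the redundancy figure.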
We note that although our focus is on deletion channels, we believe the syndrome compression technique can yield meaningful results when applied to other types of channels, particularly those that, like the deletion channel, are traditionally considered difficult to code for. In addition, we introduce new ideas for the systematic deletion code and the non-binary deletion correcting codes, presented later, which might be of independent interest.
In the following, we highlight our contributions for several variants of the deletion channel.
Codes (non-systematic and systematic) correcting $k$ deletions
We begin with a non-systematic $k$-deletion correcting code construction that achieves $4k \log n + O(\log n)$ bits of redundancy, as a result of the syndrome compression technique.
Theorem 4.1.1. Let $k$ be a constant with respect to $n$. Then, there exists a non-systematic, efficiently encodable/decodable binary $k$-deletion correcting code of length $n$ that requires at most $4k(1+\epsilon)\log n$ bits of redundancy, where $\epsilon = O(1/k) + O(k^2 \log k / \log \log n)$ is a small number that depends on $n$ and $k$. The encoding/decoding complexity is $O(n^{2k+1})$.
In addition to non-systematic deletion correcting codes, we are interested in systematic constructions of deletion codes, motivated by the document exchange application [25], where Alice wishes to send a sequence $c$ to Bob, who has knowledge of a sequence $d$ that is within limited edit distance of $c$. The edit distance between $c$ and $d$ is measured by the minimum number of deletions, insertions, or substitutions needed to change $c$ into $d$. It turns out that Alice can send the sequence $c$ to Bob by transmitting only a hash of $c$, since Bob knows $d$. Such a hash implies a systematic deletion correcting code and vice versa. Note that systematic deletion codes also have applications in storage systems with deletion errors, where concatenated code constructions are used.
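As a toy illustration of this protocol (our own sketch, restricted to a single deletion for brevity, with a cryptographic hash standing in for the deterministic hash of Theorem 4.1.2 below, so the guarantee here is only probabilistic):

```python
import hashlib

def h(c):
    # Stand-in hash; the construction in this chapter replaces this with
    # a deterministic hash of 4k log n + o(log n) bits.
    return hashlib.sha256(c.encode()).hexdigest()

def bob_recover(d, hash_c):
    # Bob holds d, obtained from Alice's c by one deletion; he tries
    # every single-bit insertion into d and keeps the hash match.
    for i in range(len(d) + 1):
        for b in "01":
            candidate = d[:i] + b + d[i:]
            if h(candidate) == hash_c:
                return candidate
    return None

c = "110100"          # Alice's sequence
d = "11010"           # Bob's copy: c with its last bit deleted
assert bob_recover(d, h(c)) == c
```

With the hash of Theorem 4.1.2, the match is unique by construction rather than with high probability, and Bob's search runs in polynomial time.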
The document exchange problem, equivalently the systematic deletion code construction problem, has been studied in both random and adversarial settings. For random settings, where a sequence can be recovered with high probability, document exchange algorithms with hash sizes $O(k \log^2 n)$ and $O(k \log^2 n \log^* n)$¹ were presented in [50] and [51], respectively. The results for randomized settings were improved to $O(k^2 \log n)$ in [13] and to $O(k \log n)$ in [7]. In [44], a randomized systematic $k$-deletion correcting code with $O(k^2 \log n)$ bits of redundancy was presented. For adversarial settings, where every sequence needs to be recovered in the worst case, the state-of-the-art document exchange schemes [6, 22, 41] have hash size $O(k \log^2 \frac{n}{k})$, which is bounded away from the lower bound $k \log n + o(\log n)$. Our next result gives a systematic deletion correcting code with $4k \log n + o(\log n)$ bits of redundancy, and thus a document exchange scheme with hash size $4k \log n + o(\log n)$. Note that this improves the redundancy of Theorem 4.1.1 by $O(\log n)$.
Theorem 4.1.2. For any sequence $c \in \{0,1\}^n$, there exists a hash function $H_k : \{0,1\}^n \to \{0,1\}^{4k \log n + o(\log n)}$, computable in $O(n^{2k+1})$ time, such that
$\{(c, H_k(c)) : c \in \{0,1\}^n\}$
forms a $k$-deletion correcting code. The encoding/decoding complexity of the code is $O(n^{k+1})$.
¹$\log^* n$ is the minimum number of times the logarithm needs to be iteratively applied to $n$ before the result is at most 1.
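For concreteness, the iterated logarithm in the hash size above can be computed as follows (a minimal sketch, base-2 logarithm assumed):

```python
import math

def log_star(x):
    # Number of times log2 must be applied to x before the result is <= 1.
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count

# log_star(2) == 1, log_star(16) == 3, log_star(65536) == 4
```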
To the best of our knowledge, our results achieve asymptotically the best known redundancy. The work in [90] proposed a non-systematic construction with $(4k-1)\log n + o(\log n)$ bits of redundancy by combining the syndrome compression technique with the VT code construction. That result is $\log n$ smaller than the result in this chapter; nevertheless, both achieve the same order of redundancy asymptotically.
Codes correcting bursts of deletions
The results in this subsection pertain to designing codes capable of correcting bursts of deletions. The following theorem presents a systematic code correcting a single burst of at most $k$ deletions.
Theorem 4.1.3. Suppose $k$ is a constant with respect to $n$. Then, there exists a systematic code of length $n$ capable of correcting a burst of consecutive deletions of length at most $k$ that requires at most $4 \log n + o(\log n)$ bits of redundancy.
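For intuition on the $4 \log n$ figure, note that a single burst of at most $k$ deletions can produce at most $kn$ distinct outcomes, so on the order of $\log n$ bits of redundancy are needed just to tell these outcomes apart; a brute-force enumeration of this ball (illustrative only):

```python
def burst_deletion_ball(x, k):
    # All outcomes of deleting one burst of 1..k consecutive symbols
    # from the string x; there are at most k * len(x) of them.
    ball = set()
    for length in range(1, k + 1):
        for i in range(len(x) - length + 1):
            ball.add(x[:i] + x[i + length:])
    return ball

# A length-8 word has at most 8 + 7 = 15 outcomes for k = 2:
assert len(burst_deletion_ball("10110100", 2)) <= 15
```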
Prior to this, codes capable of correcting a burst of consecutive deletions were provided in [35], where a construction is given that requires $\log k \log n + O(k^2 \log k \log \log n)$ bits of redundancy, so our work represents a significant improvement. Furthermore, our construction is systematic, whereas the one from [35] is not. Recently, the work of [60] presented a non-systematic code correcting a burst of at most $k$ deletions with redundancy $\log n + \frac{k(k+1)}{2} \log \log n + C_k$, where $C_k$ is a constant depending only on $k$. We also consider the case where the deletions do not occur consecutively, as well as the case where more than one burst occurs, and we provide constructions that improve upon prior art [80] in these cases as well. The following result shows that we can correct $k$ bursts, each of length at most $t_L$ deletions, with redundancy at most $4k \log n + o(\log n)$.
Theorem 4.1.4. Let $k$ and $t_L$ be two constant integers with respect to $n$. Then, there exists a systematic code of length $n$ capable of correcting $k$ bursts of consecutive deletions, each of length at most $t_L$, that requires at most $4k \log n + o(\log n)$ bits of redundancy.
Non-binary deletion correcting codes
Our next two results pertain to correcting deletions over non-binary alphabets, which has applications in DNA storage (where the alphabet size is 4) and in network transmission with packet loss. Despite the recent progress on binary deletion codes, the problem of constructing codes for the $q$-ary deletion channel has received significantly less attention. Tenengolts constructed a nearly optimal code for the case of a single deletion [93]. The main idea in [93] is to use a parity code to identify the value of the deleted symbol and an associated Levenshtein code to determine the location of the deletion (a short sketch of these two checks is given after this paragraph). For the case of multiple deletions, the Helberg codes [47], which were originally proposed for the binary deletion channel, were adapted and shown to produce non-binary deletion correcting codes [59]. The primary drawback of this class of codes is their low rate [59]: even for the case of two deletions, the codes have rates that do not approach 1 as $n$ becomes large. It was shown in [65] that the optimal redundancy of a $q$-ary $k$-deletion code asymptotically falls between $k \log n + k \log q + o(\log n)$ and $2k \log n + k \log q + o(\log n)$.
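For concreteness, the two checks underlying Tenengolts' construction can be sketched as follows (one common formulation, written by us; the exact moduli and the convention for the first signature bit vary across presentations):

```python
def tenengolts_syndromes(x, q):
    # x is a q-ary sequence. Two checks are used:
    #  * a VT-type checksum on the binary "signature" of x, which
    #    locates the deletion as in the binary case, and
    #  * the symbol sum mod q, which identifies the deleted value.
    n = len(x)
    signature = [1] + [1 if x[i] >= x[i - 1] else 0 for i in range(1, n)]
    location_check = sum((i + 1) * b for i, b in enumerate(signature)) % (n + 1)
    value_check = sum(x) % q
    return location_check, value_check

# Codewords are the q-ary sequences whose pair of syndromes equals a
# fixed pair (a, b); e.g.:
print(tenengolts_syndromes([2, 0, 3, 1], q=4))  # -> (4, 2)
```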
We attempt to bring the existing results for non-binary codes closer to the results obtained in the binary domain.
Theorem 4.1.5. Let $k$ be a constant with respect to $n$ and suppose that $q \le \log n$. Then, there exists a systematic $k$-deletion code of message length $n$ over an alphabet of size $q$ that requires at most $4k \log n + o(\log n)$ bits of redundancy. When $\log n < q < n$, there exists a non-systematic $k$-deletion code of message length $n$ over an alphabet of size $q$ that requires $2k(1+\epsilon)(2 \log n + \log q) + o(\log n)$ bits of redundancy.
For the case where $q \ge n$, we use a technique different from syndrome compression, one that extends to the case where $k$ is a constant fraction of $n$.
Theorem 4.1.6. Let $k$ be a constant with respect to $n$ and suppose that $q \ge n$. Then, there exists a non-systematic, efficiently encodable/decodable $k$-deletion code of message length $n$ over an alphabet of size $q$ that requires at most $(30k+1)\log n$ bits of redundancy.
As will be explained in more detail in Sec. 4.7, the result stated in Theorem 4.1.6 also extends to the case where $k$ is a constant fraction of the code length $n$, provided $q$ is large enough. A similar case, in which $k$ is a fraction of $n$ and $q$ is polynomial in $n$, was solved in [42].
We make use of two different approaches to obtain these results. For Theorem 4.1.5, we rely on the syndrome compression technique and on ideas that generalize the construction of [93]. For Theorem 4.1.6, we make use of results on repeat-free sequences from [28]. To the best of the authors' knowledge, the best previously known constructions of non-binary codes for the deletion channel are those in [59], so our results represent a significant improvement over existing work.
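Here, a sequence is repeat-free in the sense of [28] when every substring of a prescribed length occurs at most once; this property (though not the code construction itself) can be checked directly:

```python
def is_repeat_free(x, ell):
    # True iff every length-ell window of x occurs exactly once.
    windows = [tuple(x[i:i + ell]) for i in range(len(x) - ell + 1)]
    return len(windows) == len(set(windows))

# "abcd" has all length-2 windows distinct; "abab" repeats "ab".
assert is_repeat_free("abcd", 2) and not is_repeat_free("abab", 2)
```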
Organization
This chapter is organized as follows. In Sec. 4.2, we introduce the syndrome compression technique and prove its correctness. Sec. 4.3 describes our results for binary codes capable of correcting $k$ deletions, stated in Theorems 4.1.1 and 4.1.2. Sec. 4.5 provides constructions for the burst deletion channel. Secs. 4.6 and 4.7 construct codes for the non-binary deletion channel. Finally, Sec. 4.8 concludes the chapter.