Chapter IV: Correcting Deletions/Insertions-Generalizations
4.1 Introduction
In the early 1960s, the problem of constructing codes for the deletion channel was introduced through the seminal works of Sellers and Levenshtein [82,64]. Despite the inherent difficulty of coding for this channel, the initial results were promising.
It was shown in [64] that the Varshamov-Tenengolts (VT) code can correct either a single deletion or a single insertion over a binary alphabet. Furthermore, [64] showed that $\log n + O(1)$ bits of redundancy are necessary for a single-deletion-correcting code, which implies that the Varshamov-Tenengolts code is nearly optimal. For the case where the channel deletes a burst of symbols of length 1 or 2, a nearly optimal (and very elegant) construction was provided by Levenshtein in [63]. In the early 1980s, Tenengolts constructed a nearly optimal code for a single deletion over a non-binary alphabet, based upon a clever mapping between a binary and a non-binary sequence [93].
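To make the VT construction concrete, the following sketch (our own Python illustration, not code from [64]) implements Levenshtein's decoder for the code $\mathrm{VT}_a(n) = \{x \in \{0,1\}^n : \sum_{i=1}^n i\,x_i \equiv a \pmod{n+1}\}$:

```python
def vt_checksum(x):
    # Sum of the 1-indexed positions of the 1-bits of x.
    return sum((i + 1) * b for i, b in enumerate(x))

def vt_decode(y, a, n):
    # Recover x in VT_a(n) from y, a copy of x with one bit deleted.
    w = sum(y)                           # weight of the received word
    d = (a - vt_checksum(y)) % (n + 1)   # checksum deficit
    if d <= w:
        # A 0 was deleted; it had exactly d ones to its right.
        if d == 0:
            return y + [0]
        j = [i for i, b in enumerate(y) if b == 1][-d]
        return y[:j] + [0] + y[j:]
    # A 1 was deleted; it had d - w - 1 zeros to its left.
    z = d - w - 1
    if z == 0:
        return [1] + y
    j = [i for i, b in enumerate(y) if b == 0][z - 1] + 1
    return y[:j] + [1] + y[j:]

# x = [1, 0, 1, 1, 0] has checksum 8 = 2 (mod 6), so x lies in VT_2(5);
# deleting any one of its bits is corrected:
assert vt_decode([1, 0, 1, 0], 2, 5) == [1, 0, 1, 1, 0]
```

Since the parameter $a$ takes one of only $n+1$ values, some choice of $a$ yields a code with at most $\log(n+1)$ bits of redundancy, matching the near-optimality claim above.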
Unfortunately, until recently, progress was slow and results on the subject were limited. Notable exceptions include the work of Schulman and Zuckerman [81], which showed the existence of efficient codes capable of correcting a constant fraction of deletions, and the work of Helberg and Ferreira [47], which attempted to extend the Varshamov-Tenengolts code to handle more than a single deletion. Despite these efforts, up until the recent work of Brakensiek, Guruswami, and Zbarsky [12], no efficiently constructable codes were known that could correct even two deletions with less than $\sqrt{n}$ bits of redundancy.
In [12], Brakensiek et al. constructed $k$-deletion correcting codes that require $O(k^2 \log k \log n)$ bits of redundancy. Several works quickly followed [12] that dramatically improved upon this result. Haeupler [41] gives an explicit systematic binary code construction which requires $\Theta(k \log^2 \frac{n}{k} + k)$ bits of redundancy. In [22], another construction was derived by Cheng et al., which is not systematic, but is order-optimal in the sense that it requires $O(k \log n)$ bits of redundancy.
Independent of [22], in Ch. 3 we presented explicit constructions of $k$-deletion correcting codes with redundancy $8k \log n + o(\log n)$, which for the first time achieves asymptotically four times the Gilbert-Varshamov lower bound in the Levenshtein metric. In this chapter, we generalize one of the underlying techniques of Ch. 3, which we refer to as syndrome compression. We show that using the syndrome compression technique, it is possible to construct codes that are within a factor of two of the Gilbert-Varshamov lower bound in the Levenshtein metric. Note that the multi-deletion correcting codes above are not systematic, except for the ones in [41] and [22], which have redundancy $\Theta(k \log^2 \frac{n}{k} + k)$. Using the syndrome compression technique, we furthermore present a systematic deletion correcting code and show that it achieves asymptotically twice the Gilbert-Varshamov lower bound, which is asymptotically the same amount of redundancy as our non-systematic construction. We then apply the technique to three different types of combinatorial deletion channels: (a) the binary adversarial deletion channel that can delete at most $k$ symbols, (b) the binary adversarial deletion channel that can cause at most $k$ bursts of deletions (each of length at most $t_L$), and (c) the non-binary adversarial deletion channel that can delete at most $k$ symbols.
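Since syndrome compression is the engine behind most results in this chapter, we preview the idea with a toy sketch (our own illustration; the function $f$ and the Hamming-ball candidates below merely stand in for the actual syndromes and deletion balls). A long syndrome $f(x)$ that distinguishes $x$ from every confusable word is stored only modulo a small prime $p$, together with $p$ itself; each candidate $x'$ rules out only the primes dividing the nonzero difference $f(x) - f(x')$, so a suitable $p$ of size polynomial in the number of candidates always exists, and storing the pair $(p, f(x) \bmod p)$ costs roughly twice the logarithm of the number of candidates.

```python
def is_prime(m):
    # Trial division; fine for the small moduli of this toy example.
    if m < 2:
        return False
    i = 2
    while i * i <= m:
        if m % i == 0:
            return False
        i += 1
    return True

def compress_syndrome(x, candidates, f, p_bound):
    # Find a prime p <= p_bound such that f(x) mod p still differs from
    # f(x') mod p for every confusable candidate x' with f(x') != f(x).
    fx = f(x)
    for p in range(2, p_bound + 1):
        if is_prime(p) and all(
            f(c) % p != fx % p for c in candidates if f(c) != fx
        ):
            return p, fx % p
    raise ValueError("p_bound too small")

# Toy demo: f reads a word as an integer, and the "confusable set" is
# the Hamming ball of radius 1 around x (standing in for a deletion ball).
def f(v):
    return int("".join(map(str, v)), 2)

x = [1, 0, 1, 1, 0, 1]
candidates = [x[:i] + [1 - x[i]] + x[i + 1:] for i in range(len(x))]
p, r = compress_syndrome(x, candidates, f, p_bound=100)

# Decoder side: among all candidates, only x matches the stored pair (p, r).
assert [c for c in candidates + [x] if f(c) % p == r] == [x]
```

In the actual constructions, the candidate set is the set of words confusable with $x$ under $k$ deletions, of size roughly $n^{2k}$, and $2\log(n^{2k}) = 4k\log n$ is what drives the redundancy figure.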
We note that although our focus is on deletion channels, we believe the syndrome compression technique can yield meaningful results when applied to other types of channels, particularly those that, like the deletion channel, are traditionally considered difficult to code for. In addition, we introduce new ideas for the systematic deletion code and the non-binary deletion correcting codes, presented later, which might be of independent interest.
In the following, we highlight our contributions for several variants of the deletion channel.
Codes (non-systematic and systematic) correcting $k$ deletions
We begin with a non-systematic $k$-deletion correcting code construction that achieves $4k \log n + O(\log n)$ bits of redundancy, as a result of the syndrome compression technique.
Theorem 4.1.1. Let $k$ be a constant with respect to $n$. Then, there exists a non-systematic, efficiently encodable/decodable binary $k$-deletion correcting code of length $n$ that requires at most $4k(1+\epsilon)\log n$ bits of redundancy, where $\epsilon = O(1/k) + O(k^2 \log k / \log \log n)$ is a small number that depends on $n$ and $k$. The encoding/decoding complexity is $O(n^{2k+1})$.
In addition to non-systematic deletion correcting codes, we are interested in systematic constructions of deletion codes, motivated by the document exchange application [25], where Alice wishes to send a sequence $c$ to Bob, who has knowledge of a sequence $d$ that is within limited edit distance of $c$. The edit distance between $c$ and $d$ is measured by the minimum number of deletions, insertions, or substitutions needed to change $c$ into $d$. It turns out that Alice can send the sequence $c$ to Bob by transmitting only a hash of $c$, since Bob knows $d$. Such a hash implies a systematic deletion correcting code and vice versa. Note that systematic deletion codes also have applications in storage systems with deletion errors, where concatenated code constructions are used.
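As a toy illustration of this protocol (our own sketch, restricted to a single deletion for brevity, with a cryptographic hash standing in for the deterministic hash of Theorem 4.1.2 below, so the guarantee here is only probabilistic):

```python
import hashlib

def h(c):
    # Stand-in hash; the construction in this chapter replaces this with
    # a deterministic hash of 4k log n + o(log n) bits.
    return hashlib.sha256(c.encode()).hexdigest()

def bob_recover(d, hash_c):
    # Bob holds d, obtained from Alice's c by one deletion; he tries
    # every single-bit insertion into d and keeps the hash match.
    for i in range(len(d) + 1):
        for b in "01":
            candidate = d[:i] + b + d[i:]
            if h(candidate) == hash_c:
                return candidate
    return None

c = "110100"          # Alice's sequence
d = "11010"           # Bob's copy: c with its last bit deleted
assert bob_recover(d, h(c)) == c
```

With the hash of Theorem 4.1.2, the match is unique by construction rather than with high probability, and Bob's search runs in polynomial time.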
The document exchange problem, equivalently the systematic deletion code construction problem, has been studied in both random and adversarial settings. For random settings, where a sequence can be recovered with high probability, document exchange algorithms with hash sizes $O(k \log^2 n)$ and $O(k \log^2 n \log^* n)$¹ were presented in [50] and [51], respectively. The results for randomized settings were improved to $O(k^2 \log n)$ in [13] and to $O(k \log n)$ in [7]. In [44], a randomized systematic $k$-deletion correcting code with $O(k^2 \log n)$ bits of redundancy was presented. For adversarial settings, where every sequence needs to be recovered in the worst case, the state-of-the-art document exchange schemes [6, 22, 41] have hash size $O(k \log^2 \frac{n}{k})$, which is bounded away from the lower bound $k \log n + o(\log n)$. Our next result gives a systematic deletion correcting code with $4k \log n + o(\log n)$ bits of redundancy, and thus a document exchange scheme with hash size $4k \log n + o(\log n)$. Note that this improves the redundancy of Theorem 4.1.1 by $O(\log n)$.
Theorem 4.1.2. For any sequence $c \in \{0,1\}^n$, there exists a hash function $H_k : \{0,1\}^n \to \{0,1\}^{4k \log n + o(\log n)}$, computable in $O(n^{2k+1})$ time, such that
$\{(c, H_k(c)) : c \in \{0,1\}^n\}$
forms a $k$-deletion correcting code. The encoding/decoding complexity of the code is $O(n^{k+1})$.
¹$\log^* n$ is the minimum number of times the logarithm needs to be iteratively applied to $n$ before the result is at most 1.
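For concreteness, the iterated logarithm in the hash size above can be computed as follows (a minimal sketch, base-2 logarithm assumed):

```python
import math

def log_star(x):
    # Number of times log2 must be applied to x before the result is <= 1.
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count

# log_star(2) == 1, log_star(16) == 3, log_star(65536) == 4
```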
To the best of our knowledge, our results achieve asymptotically the best known redundancy. The work in [90] proposed a non-systematic construction with $(4k-1)\log n + o(\log n)$ bits of redundancy by combining the syndrome compression technique with the VT code construction. That result is $\log n$ smaller than the result in this chapter; nevertheless, both achieve the same order of redundancy asymptotically.
Codes correcting bursts of deletions
The results in this subsection pertain to designing codes capable of correcting bursts of deletions. The following theorem presents a systematic code correcting a single burst of at most $k$ deletions.
Theorem 4.1.3. Suppose $k$ is a constant with respect to $n$. Then, there exists a systematic code of length $n$ capable of correcting a burst of consecutive deletions of length at most $k$ that requires at most $4 \log n + o(\log n)$ bits of redundancy.
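For intuition on the $4 \log n$ figure, note that a single burst of at most $k$ deletions can produce at most $kn$ distinct outcomes, so on the order of $\log n$ bits of redundancy are needed just to tell these outcomes apart; a brute-force enumeration of this ball (illustrative only):

```python
def burst_deletion_ball(x, k):
    # All outcomes of deleting one burst of 1..k consecutive symbols
    # from the string x; there are at most k * len(x) of them.
    ball = set()
    for length in range(1, k + 1):
        for i in range(len(x) - length + 1):
            ball.add(x[:i] + x[i + length:])
    return ball

# A length-8 word has at most 8 + 7 = 15 outcomes for k = 2:
assert len(burst_deletion_ball("10110100", 2)) <= 15
```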
Prior to this, codes capable of correcting a burst of consecutive deletions were provided in [35], where a construction is given that requires $\log k \log n + O(k^2 \log k \log \log n)$ bits of redundancy, so our work represents a significant improvement. Furthermore, our construction is systematic, whereas the one from [35] is not. Recently, the work of [60] presented a non-systematic code correcting a burst of at most $k$ deletions with redundancy $\log n + \frac{k(k+1)}{2} \log \log n + C_k$, where $C_k$ is a constant depending only on $k$. We also consider the case where the deletions do not occur consecutively, as well as the case where more than one burst occurs, and we provide constructions that improve upon prior art [80] in these cases as well. The following result shows that we can correct $k$ bursts, each of length at most $t_L$ deletions, with redundancy at most $4k \log n + o(\log n)$.
Theorem 4.1.4. Let $k$ and $t_L$ be two constant integers with respect to $n$. Then, there exists a systematic code of length $n$ capable of correcting $k$ bursts of consecutive deletions, each of length at most $t_L$, that requires at most $4k \log n + o(\log n)$ bits of redundancy.
Non-binary deletion correcting codes
Our next two results pertain to correcting deletions over non-binary alphabets, which has applications in DNA storage (where the alphabet size is 4) and in network transmission with packet loss. Despite the recent progress on binary deletion codes, the problem of constructing codes for the $q$-ary deletion channel has received significantly less attention. Tenengolts constructed a nearly optimal code for the case of a single deletion [93]. The main idea in [93] is to use a parity code to identify the value of the deleted symbol and an associated Levenshtein code to determine the location of the deletion (a short sketch of these two checks is given after this paragraph). For the case of multiple deletions, the Helberg codes [47], which were originally proposed for the binary deletion channel, were adapted and shown to produce non-binary deletion correcting codes [59]. The primary drawback of this class of codes is their low rate [59]: even for the case of two deletions, the codes have rates that do not approach 1 as $n$ becomes large. It was shown in [65] that the optimal redundancy of a $q$-ary $k$-deletion code asymptotically falls between $k \log n + k \log q + o(\log n)$ and $2k \log n + k \log q + o(\log n)$.
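For concreteness, the two checks underlying Tenengolts' construction can be sketched as follows (one common formulation, written by us; the exact moduli and the convention for the first signature bit vary across presentations):

```python
def tenengolts_syndromes(x, q):
    # x is a q-ary sequence. Two checks are used:
    #  * a VT-type checksum on the binary "signature" of x, which
    #    locates the deletion as in the binary case, and
    #  * the symbol sum mod q, which identifies the deleted value.
    n = len(x)
    signature = [1] + [1 if x[i] >= x[i - 1] else 0 for i in range(1, n)]
    location_check = sum((i + 1) * b for i, b in enumerate(signature)) % (n + 1)
    value_check = sum(x) % q
    return location_check, value_check

# Codewords are the q-ary sequences whose pair of syndromes equals a
# fixed pair (a, b); e.g.:
print(tenengolts_syndromes([2, 0, 3, 1], q=4))  # -> (4, 2)
```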
We attempt to bring the existing results for non-binary codes closer to the results obtained in the binary domain.
Theorem 4.1.5. Let $k$ be a constant with respect to $n$ and suppose that $q \le \log n$. Then, there exists a systematic $k$-deletion code of message length $n$ over an alphabet of size $q$ that requires at most $4k \log n + o(\log n)$ bits of redundancy. When $\log n < q < n$, there exists a non-systematic $k$-deletion code of message length $n$ over an alphabet of size $q$ that requires $2k(1+\epsilon)(2 \log n + \log q) + o(\log n)$ bits of redundancy.
For the case where $q \ge n$, we use a technique different from syndrome compression, one that extends to the case where $k$ is a constant fraction of $n$.
Theorem 4.1.6. Let $k$ be a constant with respect to $n$ and suppose that $q \ge n$. Then, there exists a non-systematic, efficiently encodable/decodable $k$-deletion code of message length $n$ over an alphabet of size $q$ that requires at most $(30k+1)\log n$ bits of redundancy.
As will be explained in more detail in Sec. 4.7, the result stated in Theorem 4.1.6 also extends to the case where $k$ is a constant fraction of the code length $n$, provided $q$ is large enough. A similar case, in which $k$ is a fraction of $n$ and $q$ is polynomial in $n$, was solved in [42].
We make use of two different approaches to obtain these results. For Theorem 4.1.5, we rely on the syndrome compression technique and on ideas that generalize the construction of [93]. For Theorem 4.1.6, we make use of results on repeat-free sequences from [28]. To the best of the authors' knowledge, the best previously known constructions of non-binary codes for the deletion channel are those in [59], so our results represent a significant improvement over existing work.
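Here, a sequence is repeat-free in the sense of [28] when every substring of a prescribed length occurs at most once; this property (though not the code construction itself) can be checked directly:

```python
def is_repeat_free(x, ell):
    # True iff every length-ell window of x occurs exactly once.
    windows = [tuple(x[i:i + ell]) for i in range(len(x) - ell + 1)]
    return len(windows) == len(set(windows))

# "abcd" has all length-2 windows distinct; "abab" repeats "ab".
assert is_repeat_free("abcd", 2) and not is_repeat_free("abab", 2)
```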
Organization
This chapter is organized as follows. In Sec. 4.2, we introduce the syndrome compression technique and prove its correctness. Sec. 4.3 describes our results for binary codes capable of correcting $k$ deletions, stated in Theorems 4.1.1 and 4.1.2. Sec. 4.5 provides constructions for the burst deletion channel. Secs. 4.6 and 4.7 construct codes for the non-binary deletion channel. Finally, Sec. 4.8 concludes the chapter.