Chapter III: Binary Codes Correcting k Deletions/Insertions
3.1 Introduction
The problem of constructing efficient k-deletion codes (defined in Ch. 2) has long been unsettled. Before [12], which proposed a k-deletion correcting code construction with O(k^2 log k log n) redundancy, no deletion codes correcting more than a single error with rate approaching 1 were known.
After [12], the work in [44] considered correction with high probability and proposed a k-deletion correcting code construction with redundancy (k+1)(2k+1) log n + o(log n) and encoding/decoding complexity O(n^{k+1}/log^{k-1} n). The result for this randomized coding setting was improved in [41], where redundancy O(k log(n/k)) and complexity poly(n, k) were achieved. However, finding a deterministic k-deletion correcting code construction that achieves the optimal order of redundancy O(k log n) remained elusive. Note that the optimality of the code is redundancy-wise rather than cardinality-wise, namely, the focus is on the asymptotic rather than exact size of the code.
In this chapter, we provide a solution to this longstanding open problem for constant k: we present a code construction that achieves 8k log n + o(log n) redundancy when k = o(√(log log n)), with O(n^{2k+1}) encoding/decoding computational complexity. Note that the complexity is polynomial in n when k is a constant. The following theorem summarizes our main result.
Theorem 3.1.1. Let n and k be two integers satisfying k = o(√(log log n)). For integer N = n + 8k log n + o(log n), there exists an encoding function E : {0,1}^n → {0,1}^N, computed in O(n^{2k+1}) time, and a decoding function D : {0,1}^{N-k} → {0,1}^n, computed in O(N^{k+1}) = O(n^{k+1}) time, such that for any c ∈ {0,1}^n and any subsequence d ∈ {0,1}^{N-k} of E(c), we have that D(d) = c.
An independent work [22] proposed a k-deletion correcting code with O(k log n) redundancy and better complexity of poly(n, k). In contrast to the 8k log n redundancy in our construction, the redundancy in [22] is not specified explicitly, and it is estimated to be at least 200k log n. Moreover, the techniques in [22] and in our construction are different. Our techniques can be applied to obtain a systematic k-deletion code construction with 4k log n + o(log n) bits of redundancy, as well as deletion codes for other settings, which will be given in the next chapter.
Here are the key building blocks of our code construction: (i) generalizing the VT construction to k deletions by considering constrained sequences, (ii) separating the encoded vector into blocks and using concatenated codes, and (iii) a novel strategy for separating the vector into blocks by a single pattern. The following gives a more specific description of these ideas.
In the previous chapter, we generalized the VT construction. In particular, we proved that while the higher-order parity checks Σ_{i=1}^n i^j c_i mod (n^j + 1), j = 0, 1, …, t, do not work in general, they do work in the two-deletion case when the sequences are constrained to have no consecutive 1's. In this chapter we generalize this idea; specifically, the higher-order parity checks can be used to correct k = t/2 deletions in sequences that satisfy the following constraint: the index distance between any two 1's is at least k, i.e., there is a 0-run of length at least k-1 between any two 1's.
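As an illustration, here is a minimal Python sketch (the helper names are ours, not from the text) that tests the run-length constraint and evaluates the higher-order parity checks Σ_{i=1}^n i^j c_i mod (n^j + 1); per the relation k = t/2 above, t = 2k checks would be used to correct k deletions:

```python
def satisfies_constraint(c, k):
    """True if any two 1's in c have index distance at least k,
    i.e., every pair of 1's is separated by a 0-run of length >= k-1."""
    ones = [i for i, bit in enumerate(c, start=1) if bit == 1]
    return all(b - a >= k for a, b in zip(ones, ones[1:]))

def higher_order_checks(c, t):
    """The parity checks sum_{i=1}^n i^j * c_i mod (n^j + 1), j = 0, ..., t."""
    n = len(c)
    return [sum((i ** j) * bit for i, bit in enumerate(c, start=1)) % (n ** j + 1)
            for j in range(t + 1)]

k = 3                                  # constraint: 1's at index distance >= k
c = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]    # 1's at positions 1, 4, 8
assert satisfies_constraint(c, k)
print(higher_order_checks(c, t=2 * k))  # t = 2k checks for k deletions
```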
The fact that we can correct k deletions using the generalization of the VT construction on constrained sequences enables a concatenated code construction, which is the underlying structure of our k-deletion correcting codes. In concatenated codes, each codeword c is split into blocks of small length by certain markers at the boundaries between adjacent blocks. Each block is protected by an inner deletion code that can be efficiently computed when the block length is small. The block boundary markers are chosen such that k deletions in a codeword result in at most O(k) block errors. Then, it suffices to use an outer code, such as a Reed-Solomon code, to correct the block errors.
Concatenated codes were used in a number of k-deletion correcting code constructions [12, 40, 81]. One of the main differences among these constructions is the choice of the block boundary markers that separate the codewords. The constructions in [81, 40] insert extra bits as markers between blocks, which introduces O(n) bits of redundancy. In [12], occurrences of short subsequences in the codeword, called patterns, were used as markers. The construction in [12] requires O(k log k) different patterns, where each pattern introduces O(k log n) bits of redundancy.
We improve upon the redundancy in [12] by using a single type of pattern for the block boundary markers. Specifically, we choose a pattern such that its indicator vector, a binary vector in which the 1 entries indicate the occurrences of the pattern in c, is a constrained sequence of the kind described above, and is hence protectable against deletions. Then, the generalized VT construction can be used to protect the indicator vector, and thus the locations of the block boundary markers. Knowing the boundaries between blocks, we can recover the blocks with at most 2k block errors, which can then be corrected using a combination of short deletion correcting codes and Reed-Solomon codes.
The concatenated code construction consists of short blocks; hence, the pattern has to occur frequently in the codeword, so that every window of consecutive bits of a certain length contains at least one occurrence of the pattern. Namely, the first step of the encoding is a mapping of an arbitrary binary sequence to a sequence with frequent synchronization patterns. The constructions in [12, 22] compute such mappings using randomized algorithms. In this chapter, we provide a deterministic algorithm to generate sequences with frequent synchronization patterns.
We now formally define the synchronization pattern and its indicator vector.
Definition 3.1.1. A synchronization pattern is a length-(3k + ⌈log k⌉ + 4) sequence a = (a_1, …, a_{3k+⌈log k⌉+4}) satisfying:
• a_{3k+i} = 1 for i ∈ [0, ⌈log k⌉ + 4], where [0, ⌈log k⌉ + 4] = {0, …, ⌈log k⌉ + 4}.
• There does not exist a j ∈ [1, 3k-1] such that a_{j+i} = 1 for all i ∈ [0, ⌈log k⌉ + 4].
Namely, a synchronization pattern is a length-(3k + ⌈log k⌉ + 4) sequence that ends with ⌈log k⌉ + 5 consecutive 1's and contains no other 1-run of length ⌈log k⌉ + 5.
For a sequence c = (c_1, …, c_n), the indicator vector of the synchronization pattern, referred to as the synchronization vector 1_sync(c) ∈ {0,1}^n, is defined by

1_sync(c)_i = { 1, if (c_{i-3k+1}, c_{i-3k+2}, …, c_{i+⌈log k⌉+4}) is a synchronization pattern,
              { 0, else.                                                            (3.1)
Note that 1_sync(c)_i = 0 for i ∈ [1, 3k-1] and for i ∈ [n-⌈log k⌉-3, n]. The entry 1_sync(c)_i = 1 if and only if c_i is the first bit of the final 1-run in a synchronization pattern. It can be seen from the definition that any two 1 entries in 1_sync(c) have index distance at least 3k, i.e., any two 1 entries are separated by a 0-run of length at least 3k-1. Hence, 1_sync(c) is a constrained sequence as described above.
Example 3.1.1. Let the integers k = 2 and n = 35. Then the sequence (1,0,0,1,0,1,1,1,1,1,1) is a synchronization pattern. Let the length-n sequence

c = (1,1,1,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,0,1,0,1,1,1,1,1,1,0,1,1,1,1,1,1).

Then the synchronization vector 1_sync(c) is given by

1_sync(c) = (0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0).
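To make Definition 3.1.1 concrete, the following minimal Python sketch (the helper names is_sync_pattern and sync_vector are ours, not from the chapter) checks windows against the definition and reproduces Example 3.1.1:

```python
import math

def is_sync_pattern(a, k):
    """A synchronization pattern has length 3k + ceil(log2 k) + 4, ends with
    ceil(log2 k) + 5 consecutive 1's, and has no other 1-run of that length."""
    L = math.ceil(math.log2(k)) + 5           # length of the final 1-run
    if len(a) != 3 * k + L - 1:
        return False
    if any(bit != 1 for bit in a[3 * k - 1:]):  # 1-based positions 3k, ..., 3k+L-1
        return False
    # no run of L ones may start at a 1-based index j in [1, 3k-1]
    return all(any(bit != 1 for bit in a[j:j + L]) for j in range(0, 3 * k - 1))

def sync_vector(c, k):
    """1_sync(c)_i = 1 iff (c_{i-3k+1}, ..., c_{i+ceil(log2 k)+4}) is a
    synchronization pattern (1-based indices)."""
    n, L = len(c), math.ceil(math.log2(k)) + 5
    out = [0] * n
    for i in range(1, n + 1):
        lo, hi = i - 3 * k + 1, i + L - 1     # window endpoints, 1-based
        if lo >= 1 and hi <= n and is_sync_pattern(c[lo - 1:hi], k):
            out[i - 1] = 1
    return out

k = 2
assert is_sync_pattern([1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1], k)
c = [1,1,1,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,0,1,0,1,1,1,1,1,1,0,1,1,1,1,1,1]
# positions of the 1 entries of 1_sync(c):
print([i + 1 for i, b in enumerate(sync_vector(c, k)) if b == 1])  # [6, 14, 23, 30]
```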
Next, we present the definition of the generalized VT code. Define the integer vectors

m(j) ≜ (1^j, 1^j + 2^j, …, Σ_{i=1}^n i^j)   (3.2)
for j ∈ [0, 6k], where the i-th entry m_i(j) of m(j) is the sum of the j-th powers of the first i positive integers. Given a sequence c ∈ {0,1}^n, we compute the generalized VT redundancy f(c) of dimension 6k + 1 as follows:
f(c)_j ≜ c · m(j) mod 3k n^{j+1}
       = Σ_{i=1}^n c_i m_i(j) mod 3k n^{j+1},   (3.3)

for j ∈ [0, 6k], where f(c)_j is the j-th component of f(c) and · denotes the inner product over the integers. Our generalization of the VT code construction shows that the vector f(1_sync(c)) helps protect the synchronization vector 1_sync(c) against k deletions in c.
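To illustrate (3.2) and (3.3), here is a minimal Python sketch (the function names m_vector and vt_redundancy are ours) that computes m(j) and the generalized VT redundancy f(c); applied to the synchronization vector from Example 3.1.1 (k = 2, n = 35), it yields the 6k + 1 = 13 check values that protect 1_sync(c):

```python
def m_vector(j, n):
    """m(j) = (1^j, 1^j + 2^j, ..., sum_{i=1}^n i^j): the i-th entry is
    the sum of the j-th powers of the first i positive integers."""
    out, acc = [], 0
    for i in range(1, n + 1):
        acc += i ** j
        out.append(acc)
    return out

def vt_redundancy(c, k):
    """Generalized VT redundancy f(c): f(c)_j = <c, m(j)> mod 3k*n^{j+1},
    for j = 0, 1, ..., 6k, as in equation (3.3)."""
    n = len(c)
    return [sum(ci * mi for ci, mi in zip(c, m_vector(j, n))) % (3 * k * n ** (j + 1))
            for j in range(6 * k + 1)]

# synchronization vector of Example 3.1.1: 1's at positions 6, 14, 23, 30
s = [0]*5 + [1] + [0]*7 + [1] + [0]*8 + [1] + [0]*6 + [1] + [0]*5
print(vt_redundancy(s, k=2))   # 13 check values f(1_sync(c))_0, ..., f(1_sync(c))_12
```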
The rest of the chapter is organized as follows. Sec. 3.2 provides an outline of our construction and some of the basic lemmas. Based on the results of these lemmas, Sec. 3.3 presents the encoding and decoding procedures of our code. Sec. 3.4 presents our VT generalization for recovering the synchronization vector. Sec. 3.5 explains how to correct k deletions based on the synchronization vector when the synchronization patterns appear frequently. Sec. 3.6 describes an algorithm to transform a sequence into one with frequently occurring synchronization patterns. Sec. 3.7 concludes this chapter.