
Chapter III: Binary Codes Correcting $k$ Deletions/Insertions

3.1 Introduction

The problem of constructing efficient $k$-deletion codes (defined in Ch. 2) has long been unsettled. Before [12], which proposed a $k$-deletion correcting code construction with $O(k^2 \log k \log n)$ redundancy, no deletion code correcting more than a single error with rate approaching 1 was known.

After [12], the work in [44] considered correction with high probability and proposed a $k$-deletion correcting code construction with redundancy $(k+1)(2k+1)\log n + o(\log n)$ and encoding/decoding complexity $O(n^{k+1}/\log^{k-1} n)$. The result for this randomized coding setting was improved in [41], where redundancy $O(k \log(n/k))$ and complexity $\mathrm{poly}(n, k)$ were achieved. However, finding a deterministic $k$-deletion correcting code construction that achieves the optimal order of redundancy $O(k \log n)$ remained elusive. Note that the optimality of the code is redundancy-wise rather than cardinality-wise; namely, the focus is on the asymptotic rather than exact size of the code.

In this chapter, we provide a solution to this longstanding open problem for constant $k$: we present a code construction that achieves $8k \log n + o(\log n)$ redundancy when $k = o(\sqrt{\log \log n})$, with $O(n^{2k+1})$ encoding/decoding computational complexity. Note that the complexity is polynomial in $n$ when $k$ is a constant. The following theorem summarizes our main result.

Theorem 3.1.1. Let $k$ and $n$ be two integers satisfying $k = o(\sqrt{\log \log n})$. For integer $N = n + 8k \log n + o(\log n)$, there exists an encoding function $\mathcal{E}: \{0,1\}^n \to \{0,1\}^N$, computed in $O(n^{2k+1})$ time, and a decoding function $\mathcal{D}: \{0,1\}^{N-k} \to \{0,1\}^n$, computed in $O(n^{k+1}) = O(N^{k+1})$ time, such that for any $\mathbf{c} \in \{0,1\}^n$ and any subsequence $\mathbf{d} \in \{0,1\}^{N-k}$ of $\mathcal{E}(\mathbf{c})$, we have $\mathcal{D}(\mathbf{d}) = \mathbf{c}$.

An independent work [22] proposed a $k$-deletion correcting code with $O(k \log n)$ redundancy and better complexity of $\mathrm{poly}(n, k)$. In contrast to the redundancy of $8k \log n$ in our construction, the redundancy in [22] is not specified, and it is estimated to be at least $200k \log n$. Moreover, the techniques in [22] and in our construction are different. Our techniques can be applied to obtain a systematic $k$-deletion code construction with $4k \log n + o(\log n)$ bits of redundancy, as well as deletion codes for other settings, which will be given in the next chapter.

Here are the key building blocks in our code construction: (i) generalizing the VT construction to $k$ deletions by considering constrained sequences, (ii) separating the encoded vector into blocks and using concatenated codes, and (iii) a novel strategy to separate the vector into blocks by a single pattern. The following gives a more specific description of these ideas.

In the previous chapter, we generalized the VT construction. In particular, we proved that while the higher order parity checks $\sum_{i=1}^{N} i^j c_i \bmod (N^j + 1)$, $j = 0, 1, \dots, t$, do not work in general, those parity checks work in the two-deletion case when the sequences are constrained to have no consecutive 1's. In this chapter we generalize this idea; specifically, the higher order parity checks can be used to correct $k = t/2$ deletions in sequences that satisfy the following constraint: the index distance between any two 1's is at least $k$, i.e., there is a 0-run of length at least $k-1$ between any two 1's.
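To make the parity-check structure concrete, the following Python sketch computes the higher order checks $\sum_{i=1}^{N} i^j c_i \bmod (N^j + 1)$ for $j = 0, \dots, t$ (the function name and interface are ours, for illustration only):

```python
def higher_order_checks(c, t):
    """Compute the higher order parity checks
    sum_{i=1}^{N} i^j * c_i  mod  (N^j + 1),  for j = 0, 1, ..., t,
    where c is a binary sequence of length N (1-indexed in the math)."""
    N = len(c)
    return [sum(i ** j * ci for i, ci in enumerate(c, start=1)) % (N ** j + 1)
            for j in range(t + 1)]

# The j = 1 check on a length-4 sequence is the classical VT checksum mod N + 1:
print(higher_order_checks([1, 0, 1, 1], 1))  # [1, 3]
```

Here the $j = 0$ check is simply the parity of the sequence weight, and $j = 1$ recovers the classical VT checksum.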

The fact that we can correct $k$ deletions using the generalization of the VT construction on constrained sequences enables a concatenated code construction, which is the underlying structure of our $k$-deletion correcting codes. In concatenated codes, each codeword $\mathbf{c}$ is split into blocks of small length, by certain markers at the boundaries between adjacent blocks. Each block is protected by an inner deletion code that can be efficiently computed when the block length is small. The block boundary markers are chosen such that $k$ deletions in a codeword result in at most $O(k)$ block errors. Then, it suffices to use an outer code, such as a Reed-Solomon code, to correct the block errors.
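As a toy illustration of this block structure (the function name and marker-position interface are our own, not the construction's actual encoder), the following sketch splits a codeword into blocks at given boundary positions:

```python
def split_into_blocks(c, boundaries):
    """Split the sequence c into blocks at the given sorted 0-indexed
    boundary positions; each boundary starts a new block. A toy model of
    the block structure underlying concatenated deletion codes."""
    cuts = [0] + list(boundaries) + [len(c)]
    return [c[cuts[t]:cuts[t + 1]] for t in range(len(cuts) - 1)]

# Two markers cut a length-7 codeword into three blocks:
print(split_into_blocks([1, 0, 1, 1, 0, 0, 1], [2, 5]))
```

In the actual construction the boundaries are not stored explicitly; they are recovered from occurrences of a synchronization pattern, as described next.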

Concatenated codes were used in a number of $k$-deletion correcting code constructions [12, 40, 81]. One of the main differences among these constructions is the choice of the block boundary markers that separate the codewords. The constructions in [40, 81] insert extra bits as markers between blocks, which introduces $O(N)$ bits of redundancy. In [12], occurrences of short subsequences in the codeword, called patterns, were used as markers. The construction in [12] requires $O(k \log k)$ different patterns, where each pattern introduces $O(k \log N)$ bits of redundancy.

We improve the redundancy in [12] by using a single type of pattern for the block boundary markers. Specifically, we choose a pattern such that its indicator vector, a binary vector in which the 1 entries indicate the occurrences of the pattern in $\mathbf{c}$, is a constrained sequence that is immune to deletions. Then, the generalized VT construction can be used to protect the indicator vector and thus the locations of the block boundary markers. Knowing the boundaries between blocks, we can recover the blocks with at most $2k$ block errors, which can then be corrected using a combination of short deletion correcting codes and Reed-Solomon codes.

The concatenated code construction consists of short blocks; hence, the pattern has to occur frequently in the codeword, so that every window of consecutive bits of a certain length contains at least one occurrence of the pattern. Namely, the first step of the encoding is a mapping of an arbitrary binary sequence to a sequence with frequent synchronization patterns. The constructions in [12, 22] compute such mappings using randomized algorithms. In this chapter, we provide a deterministic algorithm to generate sequences with frequent synchronization patterns.

We now formally define the synchronization pattern and its indicator vector.

Definition 3.1.1. A synchronization pattern is a length-$(3k + \lceil \log k \rceil + 4)$ sequence $\mathbf{a} = (a_1, \dots, a_{3k + \lceil \log k \rceil + 4})$ satisfying:

• $a_{3k+i} = 1$ for $i \in [0, \lceil \log k \rceil + 4]$, where $[0, \lceil \log k \rceil + 4] = \{0, \dots, \lceil \log k \rceil + 4\}$.

• There does not exist a $j \in [1, 3k-1]$ such that $a_{j+i} = 1$ for all $i \in [0, \lceil \log k \rceil + 4]$.

Namely, a synchronization pattern is a length-$(3k + \lceil \log k \rceil + 4)$ sequence that ends with $\lceil \log k \rceil + 5$ consecutive 1's and contains no other 1-run of length $\lceil \log k \rceil + 5$.
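Definition 3.1.1 can be checked mechanically. The following Python sketch (our own helper, with $\log$ taken base 2) tests whether a given sequence is a synchronization pattern:

```python
import math

def is_sync_pattern(a, k):
    """Return True iff a satisfies Definition 3.1.1: length 3k + ceil(log k) + 4,
    final ceil(log k) + 5 entries all equal to 1, and no run of ceil(log k) + 5
    ones starting at an earlier position j in [1, 3k - 1] (1-indexed)."""
    r = math.ceil(math.log2(k)) + 5           # length of the required final 1-run
    if len(a) != 3 * k + r - 1:               # 3k + ceil(log k) + 4
        return False
    if any(a[3 * k - 1 + i] != 1 for i in range(r)):   # a_{3k+i} = 1 (0-indexed)
        return False
    return not any(all(a[j + i] == 1 for i in range(r))
                   for j in range(3 * k - 1))          # 0-indexed j in [0, 3k - 2]

# For k = 2 the pattern length is 11 and the final 1-run has length 6:
print(is_sync_pattern([1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1], 2))  # True
print(is_sync_pattern([1] * 11, 2))  # False: an earlier 1-run of length 6 exists
```

The all-ones sequence fails because the second condition forbids any additional 1-run of length $\lceil \log k \rceil + 5$ before the final one.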

For a sequence $\mathbf{c} = (c_1, \dots, c_n)$, the indicator vector of the synchronization pattern, referred to as the synchronization vector $\mathbb{1}_{sync}(\mathbf{c}) \in \{0,1\}^n$, is defined by
$$\mathbb{1}_{sync}(\mathbf{c})_i = \begin{cases} 1, & \text{if } (c_{i-3k+1}, c_{i-3k+2}, \dots, c_{i+\lceil \log k \rceil + 4}) \text{ is a synchronization pattern}, \\ 0, & \text{else}. \end{cases} \quad (3.1)$$

Note that $\mathbb{1}_{sync}(\mathbf{c})_i = 0$ for $i \in [1, 3k-1]$ and for $i \in [n - \lceil \log k \rceil - 3, n]$. The entry $\mathbb{1}_{sync}(\mathbf{c})_i = 1$ if and only if $c_i$ is the first bit of the final 1-run in a synchronization pattern. It can be seen from the definition that any two 1 entries in $\mathbb{1}_{sync}(\mathbf{c})$ have index distance at least $3k$, i.e., any two 1 entries are separated by a 0-run of length at least $3k - 1$. Hence, $\mathbb{1}_{sync}(\mathbf{c})$ is a constrained sequence as described above.

Example 3.1.1. Let integers $k = 2$ and $n = 35$. Then the sequence $(1,0,0,1,0,1,1,1,1,1,1)$ is a synchronization pattern. Let the length-$n$ sequence

$\mathbf{c} = (1,1,1,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,0,1,0,1,1,1,1,1,1,0,1,1,1,1,1,1)$.

Then the synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$ is given by

$\mathbb{1}_{sync}(\mathbf{c}) = (0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0)$.
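A direct way to sanity-check such examples is to compute $\mathbb{1}_{sync}(\mathbf{c})$ straight from the definition. Below is a self-contained Python sketch (our own code; $\log$ is base 2), applied to a small sequence that embeds the synchronization pattern stated in the example:

```python
import math

def is_sync_pattern(a, k):
    """Definition 3.1.1, with r = ceil(log2 k) + 5 the final 1-run length."""
    r = math.ceil(math.log2(k)) + 5
    if len(a) != 3 * k + r - 1:
        return False
    if any(a[3 * k - 1 + i] != 1 for i in range(r)):
        return False
    return not any(all(a[j + i] == 1 for i in range(r))
                   for j in range(3 * k - 1))

def sync_vector(c, k):
    """1_sync(c)_i = 1 iff the window (c_{i-3k+1}, ..., c_{i+ceil(log k)+4})
    lies inside c and is a synchronization pattern (i is 1-indexed)."""
    n, r = len(c), math.ceil(math.log2(k)) + 5
    out = [0] * n
    for i in range(1, n + 1):
        lo, hi = i - 3 * k + 1, i + r - 1     # 1-indexed window bounds
        if lo >= 1 and hi <= n and is_sync_pattern(c[lo - 1:hi], k):
            out[i - 1] = 1
    return out

# Embed the example's length-11 pattern at the start of a length-35 sequence:
c = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1] + [0] * 24
print([i + 1 for i, v in enumerate(sync_vector(c, 2)) if v])  # [6]
```

The single 1 appears at position 6, the first bit of the pattern's final 1-run, matching the rule that each occurrence contributes exactly one 1 to the synchronization vector.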

Next, we present the definition of the generalized VT code. Define the integer vectors
$$\mathbf{m}(e) \triangleq \Big( 1^e,\; 1^e + 2^e,\; \dots,\; \sum_{j=1}^{n} j^e \Big) \quad (3.2)$$
for $e \in [0, 6k]$, where the $i$-th entry $\mathbf{m}_i(e)$ of $\mathbf{m}(e)$ is the sum of the $e$-th powers of the first $i$ positive integers. Given a sequence $\mathbf{c} \in \{0,1\}^n$, we compute the generalized VT redundancy $f(\mathbf{c})$ of dimension $6k+1$ as follows:
$$f(\mathbf{c})_e \triangleq \mathbf{c} \cdot \mathbf{m}(e) \bmod 3kn^{e+1} = \sum_{i=1}^{n} c_i \mathbf{m}_i(e) \bmod 3kn^{e+1}, \quad (3.3)$$
for $e \in [0, 6k]$, where $f(\mathbf{c})_e$ is the $e$-th component of $f(\mathbf{c})$ and $\cdot$ denotes inner product over the integers. Our generalization of the VT code construction shows that the vector $f(\mathbb{1}_{sync}(\mathbf{c}))$ helps protect the synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$ from $k$ deletions in $\mathbf{c}$.
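Equations (3.2)-(3.3) translate directly into code. The following Python sketch (the function name is ours) computes the generalized VT redundancy $f(\mathbf{c})$:

```python
def vt_redundancy(c, k):
    """Generalized VT redundancy f(c) of dimension 6k + 1 (Eq. (3.3)):
    f(c)_e = sum_{i=1}^{n} c_i * m_i(e)  mod  3k * n^{e+1},
    where m_i(e) = 1^e + 2^e + ... + i^e is the i-th entry of m(e) in Eq. (3.2)."""
    n = len(c)
    f = []
    for e in range(6 * k + 1):
        m, s = [], 0
        for j in range(1, n + 1):             # prefix sums of e-th powers
            s += j ** e
            m.append(s)
        f.append(sum(ci * mi for ci, mi in zip(c, m)) % (3 * k * n ** (e + 1)))
    return f

# For k = 1 the redundancy vector has 6k + 1 = 7 components:
print(len(vt_redundancy([1, 0, 1], 1)))  # 7
```

For instance, with $\mathbf{c} = (1,0,1)$ and $k = 1$, the $e = 0$ component is $\mathbf{m}(0) = (1,2,3)$ dotted with $\mathbf{c}$, i.e., $1 + 3 = 4$, reduced modulo $3kn = 9$.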

The rest of the chapter is organized as follows. Sec. 3.2 provides an outline of our construction and some of the basic lemmas. Based on the results of the basic lemmas, Sec. 3.3 presents the encoding and decoding procedures of our code. Sec. 3.4 presents our VT generalization for recovering the synchronization vector. Sec. 3.5 explains how to correct $k$ deletions based on the synchronization vector, when the synchronization patterns appear frequently. Sec. 3.6 describes an algorithm to transform a sequence into one with frequently occurring synchronization patterns. Sec. 3.7 concludes this chapter.