• Tidak ada hasil yang ditemukan

Chapter II: Binary 2-Deletion/Insertion Correcting Codes

2.6 Appendix

→ i = 7, j = 14, w(d, i, j) = (0,1,0,0,0,1,1), x_0 = 9, x_1 = 36, x_2 = 180

→ i = 7, j = 13, w(d, i, j) = (0,1,0,0,0,1,1), x_0 = 9, x_1 = 36, x_2 = 180

→ i = 7, j = 12, w(d, i, j) = (0,1,0,0,1,0,1), x_0 = 8, x_1 = 30, x_2 = 144

Recover the original sequence c

Let (i, j) be the output of Algorithm 2, for which we have that w(d, i, j) = 1_{10}(c).

Let c' be a length-n supersequence obtained after two insertions to d such that 1_{10}(c') = 1_{10}(c).

If b_i = 1, then inserting b_i to 1_{10}(d) corresponds to either inserting a 0 to d as the (p_i + 1)-th bit in c' or inserting a 1 to d as the p_i-th bit in c' (see Table 2.1). If b_i = 0, then inserting b_i to 1_{10}(d) corresponds to inserting a 0 or a 1 in the first 0-run or 1-run, respectively, after the k'-th bit in c', where k' = max{k : w(d, i, j)_k = 1, k < p_i} is the index of the last 10-pattern that occurs before the p_i-th bit in c'. The same arguments hold for the insertion of b_j.

Therefore, given the (i, j) pair that Algorithm 2 returns, there are at most four possible supersequences c' of d such that 1_{10}(c') = 1_{10}(c). One can check whether these c' sequences satisfy h(c). According to Theorem 2.1.2, there is a unique such sequence, namely the original sequence c, which satisfies both f(c) and h(c) simultaneously.
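The candidate-filtering step above can be sketched in code: enumerate all length-n supersequences reachable from d by two insertions, then keep those matching the stored check values. Here `toy_f` and `toy_h` are hypothetical stand-ins for the checks f(c) and h(c) of this chapter (the real checks are the VT-style parities); only the enumeration-and-filter structure is the point.

```python
def supersequences_two_insertions(d):
    """All distinct sequences obtained from tuple d by inserting two bits."""
    out = set()
    n = len(d)
    for i in range(n + 1):
        for b1 in (0, 1):
            s1 = d[:i] + (b1,) + d[i:]
            for j in range(n + 2):
                for b2 in (0, 1):
                    out.add(s1[:j] + (b2,) + s1[j:])
    return out

# Toy check functions standing in for f and h; the real ones are the
# higher-order VT parity checks of this thesis.
def toy_f(c):
    return sum(c) % 3

def toy_h(c):
    return sum(i * x for i, x in enumerate(c, 1)) % 7

def recover(d, f_val, h_val):
    """Return all two-insertion supersequences of d passing both checks."""
    return [c for c in supersequences_two_insertions(d)
            if toy_f(c) == f_val and toy_h(c) == h_val]
```

With sufficiently strong checks (as in Theorem 2.1.2) the returned list contains the unique original sequence; with the toy checks it may contain a few candidates, among them the original.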

π‘˜1βˆ’1

βˆ‘οΈ

𝑑=π‘˜2

(110(cβ€²)𝑑+1βˆ’110(cβ€²)𝑑) Β· (m(𝑒))𝑑

= (110(c)β„“2βˆ’110(cβ€²)β„“2) Β· (m(𝑒))β„“2+ (110(c)π‘˜1 βˆ’110(cβ€²)π‘˜1) Β· (m(𝑒))π‘˜1+

β„“2βˆ’1

βˆ‘οΈ

𝑑=β„“1

110(c)𝑑· (m(𝑒))π‘‘βˆ’

β„“2

βˆ‘οΈ

β„“1+1

110(c)𝑑 Β· (m(𝑒))π‘‘βˆ’1

+

π‘˜1

βˆ‘οΈ

𝑑=π‘˜2+1

110(cβ€²)𝑑· (m(𝑒))π‘‘βˆ’1βˆ’

π‘˜1βˆ’1

βˆ‘οΈ

𝑑=π‘˜2

110(cβ€²)𝑑· (m(𝑒))𝑑

= (110(c)β„“2βˆ’110(cβ€²)β„“2) Β· (m(𝑒))β„“2+ (110(c)π‘˜1 βˆ’110(cβ€²)π‘˜1) Β· (m(𝑒))π‘˜1+

110(c)β„“1 Β· (m(𝑒))β„“1βˆ’110(c)β„“2 Β· (m(𝑒))β„“2βˆ’1+

β„“2βˆ’1

βˆ‘οΈ

𝑑=β„“1+1

110(c)𝑑 ·𝑑𝑒+110(cβ€²)π‘˜1Β· (m(𝑒))π‘˜1βˆ’1

βˆ’110(cβ€²)π‘˜2Β· (m(𝑒))π‘˜2βˆ’

π‘˜1βˆ’1

βˆ‘οΈ

𝑑=π‘˜2+1

110(cβ€²)𝑑·𝑑𝑒

= (βˆ’110(cβ€²)β„“2) Β· (m(𝑒))β„“2+ (110(c)π‘˜1) Β· (m(𝑒))π‘˜1+ 110(c)β„“1 Β· (m(𝑒))β„“1+

β„“2

βˆ‘οΈ

𝑑=β„“1+1

110(c)π‘‘Β·π‘‘π‘’βˆ’

110(cβ€²)π‘˜2 Β· (m(𝑒))π‘˜2 βˆ’

π‘˜1

βˆ‘οΈ

𝑑=π‘˜2+1

110(cβ€²)𝑑 ·𝑑𝑒

=110(c)β„“1 Β· (m(𝑒))β„“1+110(c)π‘˜1Β· (m(𝑒))π‘˜1+

β„“2

βˆ‘οΈ

𝑑=β„“1+1

110(c)𝑑 Β·π‘‘π‘’βˆ’

π‘˜1

βˆ‘οΈ

𝑑=π‘˜2+1

110(cβ€²)π‘‘Β·π‘‘π‘’βˆ’

110(cβ€²)β„“2 Β· (m(𝑒))β„“2 +110(cβ€²)π‘˜2 Β· (m(𝑒))π‘˜2

=𝑔

m(𝑒),β„“1(110(c)β„“1, . . . ,110(c)β„“2,110(cβ€²)β„“2)βˆ’

𝑔m(𝑒), π‘˜2(110(cβ€²)π‘˜2, . . . ,110(cβ€²)π‘˜1,110(c)π‘˜1) Proof of(2.14)(Case (b))

\begin{align*}
&\bigl(1_{10}(c)-1_{10}(c')\bigr)\cdot m(e)\\
&=\sum_{t=\ell_1}^{\ell_2}\bigl(1_{10}(c)_t-1_{10}(c')_t\bigr)\cdot(m(e))_t+\sum_{t=k_1}^{k_2}\bigl(1_{10}(c)_t-1_{10}(c')_t\bigr)\cdot(m(e))_t\\
&=\bigl(1_{10}(c)_{\ell_2}-1_{10}(c')_{\ell_2}\bigr)\cdot(m(e))_{\ell_2}+\bigl(1_{10}(c)_{k_2}-1_{10}(c')_{k_2}\bigr)\cdot(m(e))_{k_2}\\
&\quad+\sum_{t=\ell_1}^{\ell_2-1}\bigl(1_{10}(c)_t-1_{10}(c)_{t+1}\bigr)\cdot(m(e))_t+\sum_{t=k_1}^{k_2-1}\bigl(1_{10}(c)_t-1_{10}(c)_{t+1}\bigr)\cdot(m(e))_t\\
&=\bigl(1_{10}(c)_{\ell_2}-1_{10}(c')_{\ell_2}\bigr)\cdot(m(e))_{\ell_2}+\bigl(1_{10}(c)_{k_2}-1_{10}(c')_{k_2}\bigr)\cdot(m(e))_{k_2}\\
&\quad+\sum_{t=\ell_1}^{\ell_2-1}1_{10}(c)_t\cdot(m(e))_t-\sum_{t=\ell_1+1}^{\ell_2}1_{10}(c)_t\cdot(m(e))_{t-1}\\
&\quad+\sum_{t=k_1}^{k_2-1}1_{10}(c)_t\cdot(m(e))_t-\sum_{t=k_1+1}^{k_2}1_{10}(c)_t\cdot(m(e))_{t-1}\\
&=\bigl(1_{10}(c)_{\ell_2}-1_{10}(c')_{\ell_2}\bigr)\cdot(m(e))_{\ell_2}+\bigl(1_{10}(c)_{k_2}-1_{10}(c')_{k_2}\bigr)\cdot(m(e))_{k_2}\\
&\quad+1_{10}(c)_{\ell_1}\cdot(m(e))_{\ell_1}-1_{10}(c)_{\ell_2}\cdot(m(e))_{\ell_2-1}+\sum_{t=\ell_1+1}^{\ell_2-1}1_{10}(c)_t\cdot t^e\\
&\quad+1_{10}(c)_{k_1}\cdot(m(e))_{k_1}-1_{10}(c)_{k_2}\cdot(m(e))_{k_2-1}+\sum_{t=k_1+1}^{k_2-1}1_{10}(c)_t\cdot t^e\\
&=-1_{10}(c')_{\ell_2}\cdot(m(e))_{\ell_2}-1_{10}(c')_{k_2}\cdot(m(e))_{k_2}+1_{10}(c)_{\ell_1}\cdot(m(e))_{\ell_1}+\sum_{t=\ell_1+1}^{\ell_2}1_{10}(c)_t\cdot t^e\\
&\quad+1_{10}(c)_{k_1}\cdot(m(e))_{k_1}+\sum_{t=k_1+1}^{k_2}1_{10}(c)_t\cdot t^e\\
&=1_{10}(c)_{\ell_1}\cdot(m(e))_{\ell_1}+1_{10}(c)_{k_1}\cdot(m(e))_{k_1}-\bigl(1_{10}(c')_{\ell_2}\cdot(m(e))_{\ell_2}+1_{10}(c')_{k_2}\cdot(m(e))_{k_2}\bigr)\\
&\quad+\sum_{t=\ell_1+1}^{\ell_2}1_{10}(c)_t\cdot t^e+\sum_{t=k_1+1}^{k_2}1_{10}(c)_t\cdot t^e\\
&=g_{m(e),\ell_1}\bigl(1_{10}(c)_{\ell_1},\ldots,1_{10}(c)_{\ell_2},1_{10}(c')_{\ell_2}\bigr)+g_{m(e),k_1}\bigl(1_{10}(c)_{k_1},\ldots,1_{10}(c)_{k_2},1_{10}(c')_{k_2}\bigr)
\end{align*}

Proof of (2.15) (Case (c))

\begin{align*}
&\bigl(1_{10}(c)-1_{10}(c')\bigr)\cdot m(e)\\
&=\sum_{t=\ell_1}^{k_1-2}\bigl(1_{10}(c)_t-1_{10}(c')_t\bigr)\cdot(m(e))_t+\sum_{t=k_1-1}^{\ell_2-1}\bigl(1_{10}(c)_t-1_{10}(c')_t\bigr)\cdot(m(e))_t\\
&\quad+\sum_{t=\ell_2}^{k_2}\bigl(1_{10}(c)_t-1_{10}(c')_t\bigr)\cdot(m(e))_t\\
&=\sum_{t=\ell_1}^{k_1-2}\bigl(1_{10}(c)_t-1_{10}(c)_{t+1}\bigr)\cdot(m(e))_t+\sum_{t=k_1-1}^{\ell_2-1}\bigl(1_{10}(c)_t-1_{10}(c)_{t+2}\bigr)\cdot(m(e))_t\\
&\quad+\bigl(1_{10}(c)_{\ell_2}-1_{10}(c')_{\ell_2}\bigr)\cdot(m(e))_{\ell_2}+\bigl(1_{10}(c)_{k_2}-1_{10}(c')_{k_2}\bigr)\cdot(m(e))_{k_2}\\
&\quad+\sum_{t=\ell_2+1}^{k_2-1}\bigl(1_{10}(c)_t-1_{10}(c)_{t+1}\bigr)\cdot(m(e))_t\\
&=\sum_{t=\ell_1}^{k_1-2}1_{10}(c)_t\cdot(m(e))_t-\sum_{t=\ell_1+1}^{k_1-1}1_{10}(c)_t\cdot(m(e))_{t-1}+\sum_{t=k_1-1}^{\ell_2}1_{10}(c)_t\cdot(m(e))_t\\
&\quad-\sum_{t=k_1+1}^{\ell_2+1}1_{10}(c)_t\cdot(m(e))_{t-2}-1_{10}(c')_{\ell_2}\cdot(m(e))_{\ell_2}-1_{10}(c')_{k_2}\cdot(m(e))_{k_2}\\
&\quad+\sum_{t=\ell_2+1}^{k_2}1_{10}(c)_t\cdot(m(e))_t-\sum_{t=\ell_2+2}^{k_2}1_{10}(c)_t\cdot(m(e))_{t-1}\\
&=1_{10}(c)_{\ell_1}\cdot(m(e))_{\ell_1}-1_{10}(c)_{k_1-1}\cdot(m(e))_{k_1-2}+\sum_{t=\ell_1+1}^{k_1-2}1_{10}(c)_t\cdot t^e+1_{10}(c)_{k_1-1}\cdot(m(e))_{k_1-1}\\
&\quad+1_{10}(c)_{k_1}\cdot(m(e))_{k_1}-1_{10}(c)_{\ell_2+1}\cdot(m(e))_{\ell_2-1}+\sum_{t=k_1+1}^{\ell_2}1_{10}(c)_t\cdot\bigl(t^e+(t-1)^e\bigr)\\
&\quad-1_{10}(c')_{\ell_2}\cdot(m(e))_{\ell_2}-1_{10}(c')_{k_2}\cdot(m(e))_{k_2}+1_{10}(c)_{\ell_2+1}\cdot(m(e))_{\ell_2+1}+\sum_{t=\ell_2+2}^{k_2}1_{10}(c)_t\cdot t^e\\
&=1_{10}(c)_{\ell_1}\cdot(m(e))_{\ell_1}+1_{10}(c)_{k_1}\cdot(m(e))_{k_1}-\bigl(1_{10}(c')_{\ell_2}\cdot(m(e))_{\ell_2}+1_{10}(c')_{k_2}\cdot(m(e))_{k_2}\bigr)\\
&\quad+\sum_{t=\ell_1+1}^{k_1-1}1_{10}(c)_t\cdot t^e+\sum_{t=k_1+1}^{\ell_2+1}1_{10}(c)_t\cdot\bigl(t^e+(t-1)^e\bigr)+\sum_{t=\ell_2+2}^{k_2}1_{10}(c)_t\cdot t^e\\
&=1_{10}(c)_{\ell_1}\cdot(m(e))_{\ell_1}+1_{10}(c)_{k_1}\cdot(m(e))_{k_1}-\bigl(1_{10}(c')_{\ell_2}\cdot(m(e))_{\ell_2}+1_{10}(c')_{k_2}\cdot(m(e))_{k_2}\bigr)\\
&\quad+\sum_{t=\ell_1+1}^{k_1-1}1_{10}(c)_t\cdot t^e+\sum_{t=k_1}^{\ell_2}1_{10}(c)_{t+1}\cdot t^e+\sum_{t=k_1+1}^{k_2}1_{10}(c)_t\cdot t^e\\
&=g_{m(e),\ell_1}\bigl(1_{10}(c)_{\ell_1},\ldots,1_{10}(c)_{k_1-1},1_{10}(c)_{k_1+1},\ldots,1_{10}(c)_{\ell_2+1},1_{10}(c')_{\ell_2}\bigr)\\
&\quad+g_{m(e),k_1}\bigl(1_{10}(c)_{k_1},\ldots,1_{10}(c)_{k_2},1_{10}(c')_{k_2}\bigr)
\end{align*}

Chapter 3

BINARY CODES CORRECTING k DELETIONS/INSERTIONS

Following the construction in Ch. 2, this chapter presents binary codes correcting any constant number of deletion/insertion errors.

3.1 Introduction

The problem of constructing efficient k-deletion codes (defined in Ch. 2) has long been unsettled. Before [12], which proposed a k-deletion correcting code construction with O(k^2 log k log n) redundancy, no deletion code correcting more than a single error with rate approaching 1 was known.

After [12], the work in [44] considered correction with high probability and proposed a k-deletion correcting code construction with redundancy (k+1)(2k+1) log n + o(log n) and encoding/decoding complexity O(n^{k+1}/log^{k-1} n). The result for this randomized coding setting was improved in [41], where redundancy O(k log(n/k)) and complexity poly(n, k) were achieved. However, finding a deterministic k-deletion correcting code construction that achieves the optimal order of redundancy O(k log n) remained elusive. Note that the optimality of the code is redundancy-wise rather than cardinality-wise; namely, the focus is on the asymptotic rather than the exact size of the code.

In this chapter, we provide a solution to this longstanding open problem for constant k: we present a code construction that achieves 8k log n + o(log n) redundancy when k = o(√(log log n)), with O(n^{2k+1}) encoding/decoding computational complexity. Note that the complexity is polynomial in n when k is a constant. The following theorem summarizes our main result.

Theorem 3.1.1. Let k and n be two integers satisfying k = o(√(log log n)). For integer N = n + 8k log n + o(log n), there exists an encoding function E : {0,1}^n → {0,1}^N, computed in O(n^{2k+1}) time, and a decoding function D : {0,1}^{N−k} → {0,1}^n, computed in O(n^{k+1}) = O(N^{k+1}) time, such that for any c ∈ {0,1}^n and any subsequence d ∈ {0,1}^{N−k} of E(c), we have that D(d) = c.

An independent work [22] proposed a k-deletion correcting code with O(k log n) redundancy and better complexity of poly(n, k). In contrast to the redundancy 8k log n in our construction, the redundancy in [22] is not specified, and it is estimated to be at least 200k log n. Moreover, the techniques in [22] and in our construction are different. Our techniques can be applied to obtain a systematic k-deletion code construction with 4k log n + o(log n) bits of redundancy, as well as deletion codes for other settings, which will be given in the next chapter.

Here are the key building blocks in our code construction: (i) generalizing the VT construction to k deletions by considering constrained sequences, (ii) separating the encoded vector into blocks and using concatenated codes, and (iii) a novel strategy to separate the vector into blocks by a single pattern. The following gives a more specific description of these ideas.

In the previous chapter, we generalized the VT construction. In particular, we proved that while the higher order parity checks ∑_{i=1}^{N} i^j c_i mod (N^j + 1), j = 0, 1, …, t, do not work in general, they do work in the two-deletion case when the sequences are constrained to have no consecutive 1's. In this chapter we generalize this idea; specifically, the higher order parity checks can be used to correct k = t/2 deletions in sequences that satisfy the following constraint: the index distance between any two 1's is at least k, i.e., there is a 0-run of length at least k−1 between any two 1's.

The fact that we can correct k deletions using the generalization of the VT construction on constrained sequences enables a concatenated code construction, which is the underlying structure of our k-deletion correcting codes. In concatenated codes, each codeword c is split into blocks of small length by certain markers at the boundaries between adjacent blocks. Each block is protected by an inner deletion code that can be efficiently computed when the block length is small. The block boundary markers are chosen such that k deletions in a codeword result in at most O(k) block errors. Then, it suffices to use an outer code, such as a Reed-Solomon code, to correct the block errors.

Concatenated codes were used in a number of k-deletion correcting code constructions [12, 40, 81]. One of the main differences among these constructions is the choice of the block boundary markers that separate the codewords. The constructions in [81, 40] insert extra bits as markers between blocks, which introduces O(N) bits of redundancy. In [12], occurrences of short subsequences in the codeword, called patterns, were used as markers. The construction in [12] requires O(k log k) different patterns, where each pattern introduces O(k log N) bits of redundancy.

We improve the redundancy in [12] by using a single type of pattern for the block boundary markers. Specifically, we choose a pattern such that its indicator vector, a binary vector in which the 1 entries indicate the occurrences of the pattern in c, is a constrained sequence that is immune to deletions. Then, the generalized VT construction can be used to protect the indicator vector and thus the locations of the block boundary markers. Knowing the boundaries between blocks, we can recover the blocks with at most 2k block errors, which can then be corrected using a combination of short deletion correcting codes and Reed-Solomon codes.

The concatenated code construction consists of short blocks; hence, the pattern has to occur frequently in the codeword, so that every window of consecutive bits of a certain length contains at least one occurrence of the pattern. Namely, the first step of the encoding is a mapping of an arbitrary binary sequence to a sequence with frequent synchronization patterns. The constructions in [12, 22] compute such mappings using randomized algorithms. In this chapter, we provide a deterministic algorithm to generate sequences with frequent synchronization patterns.

We now formally define the synchronization pattern and its indicator vector.

Definition 3.1.1. A synchronization pattern is a length 3k + ⌈log k⌉ + 4 sequence a = (a_1, …, a_{3k+⌈log k⌉+4}) satisfying

• a_{3k+i} = 1 for i ∈ [0, ⌈log k⌉ + 4], where [0, ⌈log k⌉ + 4] = {0, …, ⌈log k⌉ + 4}.

• There does not exist a j ∈ [1, 3k − 1] such that a_{j+i} = 1 for all i ∈ [0, ⌈log k⌉ + 4].

Namely, a synchronization pattern is a length 3k + ⌈log k⌉ + 4 sequence that ends with ⌈log k⌉ + 5 consecutive 1's and contains no other 1-run of length ⌈log k⌉ + 5.

For a sequence c = (c_1, …, c_n), the indicator vector of the synchronization pattern, referred to as the synchronization vector 1_sync(c) ∈ {0,1}^n, is defined by

\[
1_{sync}(c)_i =
\begin{cases}
1, & \text{if } (c_{i-3k+1}, c_{i-3k+2}, \ldots, c_{i+\lceil\log k\rceil+4}) \text{ is a synchronization pattern},\\
0, & \text{else.}
\end{cases}
\tag{3.1}
\]

Note that 1_sync(c)_i = 0 for i ∈ [1, 3k−1] and for i ∈ [n−⌈log k⌉−3, n]. The entry 1_sync(c)_i = 1 if and only if c_i is the first bit of the final 1-run in a synchronization pattern. It can be seen from the definition that any two 1 entries in 1_sync(c) have index distance at least 3k, i.e., any two 1 entries are separated by a 0-run of length at least 3k−1. Hence, 1_sync(c) is a constrained sequence as described above.

Example 3.1.1. Let k = 2 and n = 35. Then the sequence (1,0,0,1,0,1,1,1,1,1,1) is a synchronization pattern. Let the length-n sequence

c = (1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,0,1,0,1,1,1,1,1,1,0,1,1,1,1,1,1).

Then the synchronization vector 1_sync(c) is given by

1_sync(c) = (0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0).

Next, we present the definition of the generalized VT code. Define the integer vectors

\[
m(e) \triangleq \Bigl(1^e,\ 1^e+2^e,\ \ldots,\ \sum_{j=1}^{n} j^e\Bigr) \tag{3.2}
\]

for e ∈ [0, 6k], where the i-th entry m_i(e) of m(e) is the sum of the e-th powers of the first i positive integers. Given a sequence c ∈ {0,1}^n, we compute the generalized VT redundancy f(c) of dimension 6k+1 as follows:

\[
f(c)_e \triangleq c \cdot m(e) \bmod 3kn^{e+1} = \sum_{i=1}^{n} c_i m_i(e) \bmod 3kn^{e+1}, \tag{3.3}
\]

for e ∈ [0, 6k], where f(c)_e is the e-th component of f(c) and · denotes the inner product over the integers. Our generalization of the VT code construction shows that the vector f(1_sync(c)) helps protect the synchronization vector 1_sync(c) from k deletions in c.

The rest of the chapter is organized as follows. Sec. 3.2 provides an outline of our construction and some of the basic lemmas. Based on the results of the basic lemmas, Sec. 3.3 presents the encoding and decoding procedures of our code. Sec. 3.4 presents our VT generalization for recovering the synchronization vector. Sec. 3.5 explains how to correct k deletions based on the synchronization vector, when the synchronization patterns appear frequently. Sec. 3.6 describes an algorithm to transform a sequence into one with frequently occurring synchronization patterns. Sec. 3.7 concludes this chapter.

3.2 Outline and Preliminaries

In this section, we outline the ingredients that constitute our code construction and present notations that will be used throughout this chapter. A summary of these notations is provided in Table 3.1. We begin with an overview of the code construction, which has a concatenated code structure as described in Sec. 3.1. In our construction, each codeword c is split into blocks, with the boundaries between adjacent blocks given by synchronization patterns. Specifically, let t_1, …, t_J be the indices of the synchronization patterns in c, i.e., the indices of the 1 entries in 1_sync(c). Then the blocks are given by (c_{t_{j−1}+1}, …, c_{t_j−1}) for j ∈ [1, J+1], where t_0 = 0 and t_{J+1} = n+1. The key idea of our construction is to use the VT generalization f (see Eq. (3.3) for the definition) as parity checks to protect the synchronization vector 1_sync(c), and thus identify the indices of the block boundaries. The proof details will be given in Lemma 3.2.1. Note that the parity f(c) in (3.3) has size O(k^2 log n). To compress the size of f(c) to the targeted O(k log n), in Lemma 3.2.2 we apply a modulo operation to the function f. Given the block boundaries, Lemma 3.2.3 provides an algorithm to protect the codewords. Specifically, we show that given the synchronization vector 1_sync(c), it is possible to recover most of the blocks in c, with up to 2k block errors. These block errors can be corrected with the help of their deletion correcting hashes, which can be exhaustively computed and will be presented in Lemma 3.2.5. Hence, to correct the block errors, it suffices to protect the sequence of deletion correcting hashes using, for example, Reed-Solomon codes.

Lemma 3.2.2 and Lemma 3.2.3 together define a hash function that protects c from k deletions. However, in order to exhaustively compute the block hash given in Lemma 3.2.5 and obtain the desired size of the redundancy, the block length has to be of order O(poly(k) log n). Hence the synchronization pattern must appear frequently enough in c. Such a sequence c will be defined in the following as a k-dense sequence. To encode any given sequence c ∈ {0,1}^n, in Lemma 3.2.4 we present an invertible mapping that takes any sequence c ∈ {0,1}^n as input and outputs a k-dense sequence. Then, any binary sequence can be encoded into a k-dense sequence and protected.

Finally, the hash function defined by Lemma 3.2.2 and Lemma 3.2.3 is itself subject to deletion errors and has to be protected. To this end, we use an additional hash that encodes the deletion correcting hash of the hash function defined by Lemma 3.2.2 and Lemma 3.2.3. The additional hash is protected by a (k+1)-fold repetition code. A similar additional-hash technique was also given in [12].

Table 3.1: Summary of Notations in Ch. 3

• B_k(c) ≜ the set of sequences that share a length n−k subsequence with c ∈ {0,1}^n.
• R_m ≜ the set of sequences in which any two 1 entries are separated by at least m−1 zeros.
• 1_sync(c) ≜ the synchronization vector, defined in (3.1).
• m(e) ≜ the weights of order e in the VT generalization, defined in (3.2).
• f(c) ≜ the generalized VT redundancy, defined in (3.3).
• L ≜ the maximal length of zero runs in the synchronization vector of a k-dense sequence, defined in (3.4).
• p(c) ≜ the function used to compute the redundancy protecting the synchronization vector 1_sync(c), defined in Lemma 3.2.2.
• Hash_k(c) ≜ the deletion correcting hash function for k-dense sequences, defined in Lemma 3.2.3.
• T(c) ≜ the function that generates k-dense sequences, defined in Lemma 3.2.4.
• H(c) ≜ the deletion correcting hash function for any sequence, defined in Lemma 3.2.5.

[Figure 3.1: Illustrating the encoding procedure of our k-deletion correcting code construction. The sequence c is mapped to the k-dense sequence T(c) (see Lemma 3.2.4 for the definition of T), whose synchronization patterns (00…011111, 01…011111) mark the boundaries of blocks 1, …, J. Redundancy 1 = (f(1_sync(T(c))) mod p(c), p(c)) (see Lemma 3.2.2 for the definitions of f and p) is appended to protect 1_sync(c); Redundancy 2 = Hash_k(c) (see Lemma 3.2.3 for the definition of Hash_k) is appended to protect blocks 1, …, J; Redundancy 3, the (k+1)-fold repetition of H(Redundancy 1, Redundancy 2) (see Lemma 3.2.5 for the definition of H), is appended to protect Redundancy 1 and Redundancy 2.]

The final encoding/decoding algorithm of our code, which combines the results in Lemma 3.2.2, Lemma 3.2.3, and Lemma 3.2.4, is provided in Sec. 3.3. The encoding is illustrated in Fig. 3.1.

Before presenting the lemmas, we give the necessary definitions and notations. For a sequence c ∈ {0,1}^n, define its deletion ball B_k(c) as the collection of sequences that share a length n−k subsequence with c.

Definition 3.2.1. A sequence c ∈ {0,1}^n is said to be k-dense if the length of every 0-run in 1_sync(c) is at most

\[
L \triangleq (\lceil\log k\rceil + 5)\,2^{\lceil\log k\rceil+9}\lceil\log n\rceil + (3k + \lceil\log k\rceil + 4)(\lceil\log n\rceil + 9 + \lceil\log k\rceil). \tag{3.4}
\]

For k-dense c, the index distance between any two consecutive 1 entries in 1_sync(c) is at most L+1, i.e., the 0-run between two consecutive 1 entries has length at most L.

Note that f(c) is an integer vector that can be represented by log((3k)^{6k+1} n^{(3k+1)(6k+1)}) bits, or by an integer in the range [0, (3k)^{6k+1} n^{(3k+1)(6k+1)} − 1]. In this chapter, we interchangeably use f(c) to denote its binary representation or its integer representation. The following lemma shows that the synchronization vector 1_sync(c) can be recovered from k deletions with the help of f(1_sync(c)). Its proof will be given in Sec. 3.4.

Lemma 3.2.1. For integers𝑛andπ‘˜and sequencesc,cβ€², ifcβ€² ∈ Bπ‘˜(c)and 𝑓(1𝑠 𝑦𝑛𝑐(c)) = 𝑓(1𝑠 𝑦𝑛𝑐(cβ€²)), then1𝑠 𝑦𝑛𝑐(c) =1𝑠 𝑦𝑛𝑐(cβ€²).

By virtue of Lemma 3.2.1, the synchronization vector 1_sync(c) can be recovered from k deletions in c with the help of a hash f(1_sync(c)) of size O(k^2 log n) bits. To further reduce the size of the hash to O(k log n) bits, we apply modulo operations to f(1_sync(c)) in the following lemma, which will be proved in Sec. 3.4.

Lemma 3.2.2. For integers n and k = o(√(log log n)), there exists a function p : {0,1}^n → [1, 2^{2k log n + o(log n)}], such that if f(1_sync(c)) ≡ f(1_sync(c')) mod p(c) for two sequences c ∈ {0,1}^n and c' ∈ B_k(c), then 1_sync(c) = 1_sync(c'). Hence if

(f(1_sync(c)) mod p(c), p(c)) = (f(1_sync(c')) mod p(c'), p(c'))

and c' ∈ B_k(c), we have that 1_sync(c) = 1_sync(c').

Lemma 3.2.2 presents a hash of size 4k log n + o(log n) bits for correcting 1_sync(c). With the knowledge of the synchronization vector 1_sync(c), the next lemma shows that the sequence c can be further recovered using another 4k log n + o(log n) bit hash, when c is k-dense, i.e., when the synchronization pattern occurs frequently enough in c. The proof of Lemma 3.2.3 will be given in Sec. 3.5.

Lemma 3.2.3. For integers𝑛andπ‘˜ =π‘œ(√︁

log log𝑛), there exists a function𝐻 π‘Ž 𝑠 β„Žπ‘˜ : {0,1}𝑛 β†’ {0,1}4π‘˜log𝑛+π‘œ(log𝑛), such that everyπ‘˜-dense sequencec∈ {0,1}𝑛can be recovered, given its synchronization vector1𝑠 𝑦𝑛𝑐(c), its lengthπ‘›βˆ’π‘˜subsequenced, and𝐻 π‘Ž 𝑠 β„Žπ‘˜(c).

Combining Lemma 3.2.2 and Lemma 3.2.3, we obtain a hash function of size O(k log n) that corrects deletions for a k-dense sequence. To encode an arbitrary sequence c ∈ {0,1}^n, a mapping that transforms any sequence into a k-dense sequence is given in the following lemma. The details will be given in Sec. 3.6.

Lemma 3.2.4. For integers π‘˜ and 𝑛 > π‘˜, there exists a map 𝑇 : {0,1}𝑛 β†’ {0,1}𝑛+3π‘˜+3⌈logπ‘˜βŒ‰+15, computable in 𝑝 π‘œπ‘™ 𝑦(𝑛, π‘˜) time, such that 𝑇(c) is a π‘˜-dense sequence forc∈ {0,1}𝑛. Moreover, the sequenceccan be recovered from𝑇(c). The next three lemmas (Lemma 8.2.3, Lemma 3.2.6, and Lemma 3.2.7) present existence results that are necessary to prove Lemma 3.2.2 and Lemma 3.2.3.

Lemma 3.2.5 gives a k-deletion correcting hash function for short sequences, which is the block hash described above in the discussion of Lemma 3.2.3. It is an extension of the result in [12]. Lemma 3.2.6 is a slight variation of the result in [64]. It shows the equivalence between correcting deletions and correcting deletions and insertions. Lemma 3.2.6 will be used in the proof of Lemma 3.2.1, where we need an upper bound on the number of deletions/insertions in 1_sync(c) caused by a deletion in c. Lemma 3.2.7 (see [71]) gives an upper bound on the number of divisors of a positive integer n. With Lemma 3.2.7, we show that the VT generalization in Lemma 3.2.1 can be compressed by taking modulo operations. The details will be given in the proof of Lemma 3.2.2.

Lemma 3.2.5. For any integers M, n, and k, there exists a hash function H : {0,1}^M → {0,1}^{⌈M/⌈log n⌉⌉(2k log log n + O(1))}, computable in O_k((M/log n) n log^{2k} n) time, such that any sequence c ∈ {0,1}^M can be recovered from its length M−k subsequence d and the hash H(c).

Proof. We first show by counting arguments the existence of a hash function H' : {0,1}^{⌈log n⌉} → {0,1}^{2k log log n + O(1)}, exhaustively computable in O_k(n log^{2k} n) time, such that H'(s) ≠ H'(s') for all s ∈ {0,1}^{⌈log n⌉} and s' ∈ B_k(s)\{s}. The hash H'(s) protects the sequence s ∈ {0,1}^{⌈log n⌉} from k deletions. Note that |B_k(s)| ≤ \binom{\lceil\log n\rceil}{k}^2 2^k ≤ 2⌈log n⌉^{2k}. Hence it suffices to use brute force and greedily assign a hash value to each sequence s ∈ {0,1}^{⌈log n⌉} such that H'(s) ≠ H'(s') for all s' ∈ B_k(s)\{s}. Since the size of B_k(s) is upper bounded by 2⌈log n⌉^{2k}, there always exists a hash value H'(s) ∈ {0,1}^{⌈log(2⌈log n⌉^{2k}+1)⌉} that has no conflict with the hash values of the sequences in B_k(s). The total complexity is O_k(n log^{2k} n) and the size of the hash value H'(s) is 2k log log n + O(1) bits.

Now split c into ⌈M/⌈log n⌉⌉ blocks (c_{(i−1)⌈log n⌉+1}, …, c_{i⌈log n⌉}), i ∈ [1, ⌈M/⌈log n⌉⌉], of length ⌈log n⌉. If the length of the last block is less than ⌈log n⌉, add zeros to the end of the last block so that its length is ⌈log n⌉. Assign a hash value h_i = H'((c_{(i−1)⌈log n⌉+1}, …, c_{i⌈log n⌉})) for each block, i ∈ [1, ⌈M/⌈log n⌉⌉]. Let H(c) = (h_1, …, h_{⌈M/⌈log n⌉⌉}) be the concatenation of the h_i for i ∈ [1, ⌈M/⌈log n⌉⌉].

We show that H(c) protects c from k deletions. Let d be a length M−k subsequence of c. Note that d_{(i−1)⌈log n⌉+1} and d_{i⌈log n⌉−k} come from bits c_{(i−1)⌈log n⌉+1+x} and c_{i⌈log n⌉−k+y}, respectively, after deletions in c, where the integers x, y ∈ [0, k]. Therefore, (d_{(i−1)⌈log n⌉+1}, …, d_{i⌈log n⌉−k}) is a length ⌈log n⌉ − k subsequence of (c_{(i−1)⌈log n⌉+1+x}, …, c_{i⌈log n⌉−k+y}), and thus a subsequence of the block (c_{(i−1)⌈log n⌉+1}, …, c_{i⌈log n⌉}). Hence the i-th block (c_{(i−1)⌈log n⌉+1}, …, c_{i⌈log n⌉}) can be recovered from h_i = H'((c_{(i−1)⌈log n⌉+1}, …, c_{i⌈log n⌉})) and (d_{(i−1)⌈log n⌉+1}, …, d_{i⌈log n⌉−k}). Therefore, c can be recovered given d and H(c). The length of H(c) is ⌈M/⌈log n⌉⌉ · (2k log log n + O(1)) and the complexity of computing H(c) is O_k((M/log n) n log^{2k} n). □

Lemma 3.2.6. Let r, s, and k be integers satisfying r + s ≤ k. For sequences c, c' ∈ {0,1}^n, if c' and c share a common resulting sequence after r deletions and s insertions in both, then c' ∈ B_k(c).

Lemma 3.2.7. For a positive integer n ≥ 3, the number of divisors of n is upper bounded by 2^{1.6 ln n/(ln ln n)}.

The proofs of Lemma 3.2.1, Lemma 3.2.2, Lemma 3.2.3, and Lemma 3.2.4 rely on several propositions, the details of which will be presented in the next sections. For convenience, a dependency graph of the theorem, lemmas, and propositions is given in Fig. 3.2.

3.3 Proof of Theorem 3.1.1

Based on the lemmas stated in Sec. 3.2, in this section we present the encoding function E and the decoding function D of our k-deletion correcting code, and prove Theorem 3.1.1. Given any sequence c ∈ {0,1}^n, let the function E : {0,1}^n → {0,1}^{n+8k log n+o(log n)} be given by

E(c) = (T(c), R'(c), R''(c)), where

R'(c) = (f(1_sync(T(c))) mod p(T(c)), p(T(c)), Hash_k(T(c))), and

R''(c) = Rep_{k+1}(H(R'(c))).

Function 𝑅 𝑒 π‘π‘˜+1(𝐻(𝑅′(c))) is the (π‘˜+1)-fold repetition of the bits in 𝐻(𝑅′(c)), where function 𝐻(𝑅′(c)) is defined Lemma8.2.3 and protects 𝑅′(c) fromπ‘˜ dele-

Lemma 3.2.6

Prop.3.4.1 Prop.3.4.2

Lemma 3.2.1 Lemma3.2.7 Lemma 8.2.3 Prop.3.6.1 Prop.3.6.3 Prop.3.6.4

Lemma 3.2.2 Lemma 3.2.3 Lemma 3.2.4

Thm.3.1.1

Figure 3.2: Dependencies of the claims in Ch. 3.

tions. Function 𝑇(c) is defined in Lemma 3.2.4 and transforms c into a π‘˜- dense sequence (see Definition 3.2.1for definition of a π‘˜-dense sequence). Func- tion 𝑓(1𝑠 𝑦𝑛𝑐(𝑇(c))) in 𝑅′(c) is represented by an integer and protects the syn- chronization vector 1𝑠 𝑦𝑛𝑐(c) by Lemma 3.2.1. Function 𝑝(𝑇(c)) is defined in Lemma3.2.2, which compresses the hash 𝑓(1𝑠 𝑦𝑛𝑐(𝑇(c))). Function𝐻 π‘Ž 𝑠 β„Žπ‘˜(𝑇(c)) is defined in Lemma3.2.3and protects aπ‘˜-dense sequence fromπ‘˜ deletions.

Note that π‘˜ = π‘œ(√︁

log log𝑛). Hence, according to Lemma 3.2.2, Lemma 3.2.3, and Lemma 3.2.4, the length of 𝑅′(c) is 𝑁1 = 8π‘˜log(𝑛+3π‘˜ +3⌈logπ‘˜βŒ‰ +15) + π‘œ(log(𝑛+3π‘˜ +3⌈logπ‘˜βŒ‰ +15)) =8π‘˜log𝑛+π‘œ(log𝑛). The length of𝑅′′(c) is𝑁2= 2π‘˜(π‘˜+1) (𝑁1/⌈logπ‘›βŒ‰)log log𝑛=π‘œ(log𝑛). The length of𝑇(c)is𝑛+𝑁0=𝑛+3π‘˜+ 3⌈logπ‘˜βŒ‰+15. Therefore, the length ofE (c)is𝑛+𝑁0+𝑁1+𝑁2 =𝑛+8π‘˜log𝑛+π‘œ(log𝑛). The redundancy of the code is 8π‘˜log𝑛+π‘œ(log𝑛).

To show how the sequence c can be recovered from a length N−k subsequence d of E(c), and to implement the computation of the decoding function D(d), we prove the following:

1. Statement 1: The redundancy R'(c) can be recovered given (d_{n+N_0+N_1+1}, …, d_{n+N_0+N_1+N_2−k}).

2. Statement 2: The sequence c can be recovered given (d_1, …, d_{n+N_0−k}) and R'(c).

We first prove Statement 1. Note that d_{n+N_0+N_1+1} and d_{n+N_0+N_1+N_2−k} come from bits E(c)_{n+N_0+N_1+1+x} and E(c)_{n+N_0+N_1+N_2−k+y}, respectively, after deletions in E(c), where x, y ∈ [0, k]. Hence (d_{n+N_0+N_1+1}, …, d_{n+N_0+N_1+N_2−k}) is a length N_2−k subsequence of (E(c)_{n+N_0+N_1+1+x}, …, E(c)_{n+N_0+N_1+N_2−k+y}), and thus a subsequence of (E(c)_{n+N_0+N_1+1}, …, E(c)_{n+N_0+N_1+N_2}) = R''(c). Since R''(c) is a (k+1)-fold repetition code, and thus a k-deletion correcting code that protects H(R'(c)), the hash function H(R'(c)) can be recovered from (d_{n+N_0+N_1+1}, …, d_{n+N_0+N_1+N_2−k}).

Similarly, (d_{n+N_0+1}, …, d_{n+N_0+N_1−k}) is a length N_1−k subsequence of (E(c)_{n+N_0+1}, …, E(c)_{n+N_0+N_1}) = R'(c). From Lemma 3.2.5, the redundancy R'(c) can be recovered from H(R'(c)) and (d_{n+N_0+1}, …, d_{n+N_0+N_1−k}). Hence Statement 1 holds.

We now prove Statement 2. Note that R'(c) contains the hashes (f(1_sync(T(c))) mod p(T(c)), p(T(c))) and Hash_k(T(c)). Moreover, (d_1, …, d_{n+N_0−k}) is a length n+N_0−k subsequence of (E(c)_1, …, E(c)_{n+N_0}) = T(c). According to Lemma 3.2.2, the synchronization vector 1_sync(T(c)) can be recovered from the hash (f(1_sync(T(c))) mod p(T(c)), p(T(c))) and (d_1, …, d_{n+N_0−k}), by exhaustively searching over all length n+N_0 supersequences c' of (d_1, …, d_{n+N_0−k}) such that f(1_sync(c')) ≡ f(1_sync(T(c))) mod p(T(c)); then we have that 1_sync(T(c)) = 1_sync(c'). Since, by Lemma 3.2.4, T(c) is a k-dense sequence, by Lemma 3.2.3 it can be recovered from 1_sync(T(c)), Hash_k(T(c)), and the length n+N_0−k subsequence (d_1, …, d_{n+N_0−k}) of T(c). Finally, the sequence c can be recovered from T(c) by Lemma 3.2.4. Hence Statement 2 holds and c can be recovered.

The encoding complexity of E(c) is O(n^{2k+1}), which comes from the brute force search for the integer p(T(c)). The decoding complexity is O(n^{k+1}), which comes from the brute force search for the correct 1_sync(T(c)), given f(1_sync(T(c))) mod p(T(c)) and p(T(c)).

3.4 Protecting the Synchronization Vectors

In this section we present a hash function of size 4k log n + o(log n) that protects the synchronization vector 1_sync(c) from k deletions in c, and we prove Lemma 3.2.2.

We first prove Lemma 3.2.1, which is decomposed into Proposition 3.4.1 and Proposition 3.4.2. In Proposition 3.4.1 we present an upper bound on the radius of the deletion ball for the synchronization vector. In Proposition 3.4.2, we prove that the higher order parity checks help correct multiple deletions for sequences in which there is a 0-run of length at least 3k−1 between any two 1's. Since 1_sync(c) is such a sequence, we conclude that the higher order parity checks help recover 1_sync(c).