Chapter II: Binary 2-Deletion/Insertion Correcting Codes
2.6 Appendix
• n = 7, s = 14, w(d, p, b) = (0,1,0,0,0,1,1), x_0 = 9, x_1 = 36, x_2 = 180
• n = 7, s = 13, w(d, p, b) = (0,1,0,0,0,1,1), x_0 = 9, x_1 = 36, x_2 = 180
• n = 7, s = 12, w(d, p, b) = (0,1,0,0,1,0,1), x_0 = 8, x_1 = 30, x_2 = 144
Recover the original sequence c
Let (p, b) be the output of Algorithm 2, for which we have that w(d, p, b) = 1_10(c). Let c′ be a length-n supersequence obtained by two insertions into d such that 1_10(c′) = 1_10(c).
If b_1 = 1, then inserting b_1 into 1_10(d) corresponds to either inserting a 0 into d as the (p_1+1)-th bit of c′ or inserting a 1 into d as the p_1-th bit of c′ (see Table 2.1). If b_1 = 0, then inserting b_1 into 1_10(d) corresponds to inserting a 0 or a 1 into the first 0-run or 1-run, respectively, after the j′-th bit of c′, where j′ = max{j : w(d, p, b)_j = 1, j < p_1} is the index of the last 10-pattern that occurs before the p_1-th bit of c′. The same arguments hold for the insertion of b_2.
Therefore, given the pair (p, b) that Algorithm 2 returns, there are at most four possible supersequences c′ of d such that 1_10(c′) = 1_10(c). One can check which of the c′ sequences satisfy h(c). According to Theorem 2.1.2, there is a unique such sequence, namely the original sequence c, which satisfies both f(c) and h(c) simultaneously.
$$
\begin{aligned}
&\quad\;\sum_{t=p_2}^{p_1-1}\bigl(1_{10}(\mathbf{c}')_{t+1}-1_{10}(\mathbf{c}')_{t}\bigr)\cdot(\mathbf{m}(e))_t\\
&=\bigl(1_{10}(\mathbf{c})_{\ell_2}-1_{10}(\mathbf{c}')_{\ell_2}\bigr)\cdot(\mathbf{m}(e))_{\ell_2}+\bigl(1_{10}(\mathbf{c})_{p_1}-1_{10}(\mathbf{c}')_{p_1}\bigr)\cdot(\mathbf{m}(e))_{p_1}\\
&\qquad+\sum_{t=\ell_1}^{\ell_2-1}1_{10}(\mathbf{c})_t\cdot(\mathbf{m}(e))_t-\sum_{t=\ell_1+1}^{\ell_2}1_{10}(\mathbf{c})_t\cdot(\mathbf{m}(e))_{t-1}+\sum_{t=p_2+1}^{p_1}1_{10}(\mathbf{c}')_t\cdot(\mathbf{m}(e))_{t-1}-\sum_{t=p_2}^{p_1-1}1_{10}(\mathbf{c}')_t\cdot(\mathbf{m}(e))_t\\
&=\bigl(1_{10}(\mathbf{c})_{\ell_2}-1_{10}(\mathbf{c}')_{\ell_2}\bigr)\cdot(\mathbf{m}(e))_{\ell_2}+\bigl(1_{10}(\mathbf{c})_{p_1}-1_{10}(\mathbf{c}')_{p_1}\bigr)\cdot(\mathbf{m}(e))_{p_1}\\
&\qquad+1_{10}(\mathbf{c})_{\ell_1}\cdot(\mathbf{m}(e))_{\ell_1}-1_{10}(\mathbf{c})_{\ell_2}\cdot(\mathbf{m}(e))_{\ell_2-1}+\sum_{t=\ell_1+1}^{\ell_2-1}1_{10}(\mathbf{c})_t\cdot t^{e}+1_{10}(\mathbf{c}')_{p_1}\cdot(\mathbf{m}(e))_{p_1-1}-1_{10}(\mathbf{c}')_{p_2}\cdot(\mathbf{m}(e))_{p_2}-\sum_{t=p_2+1}^{p_1-1}1_{10}(\mathbf{c}')_t\cdot t^{e}\\
&=\bigl(-1_{10}(\mathbf{c}')_{\ell_2}\bigr)\cdot(\mathbf{m}(e))_{\ell_2}+1_{10}(\mathbf{c})_{p_1}\cdot(\mathbf{m}(e))_{p_1}+1_{10}(\mathbf{c})_{\ell_1}\cdot(\mathbf{m}(e))_{\ell_1}+\sum_{t=\ell_1+1}^{\ell_2}1_{10}(\mathbf{c})_t\cdot t^{e}-1_{10}(\mathbf{c}')_{p_2}\cdot(\mathbf{m}(e))_{p_2}-\sum_{t=p_2+1}^{p_1}1_{10}(\mathbf{c}')_t\cdot t^{e}\\
&=1_{10}(\mathbf{c})_{\ell_1}\cdot(\mathbf{m}(e))_{\ell_1}+1_{10}(\mathbf{c})_{p_1}\cdot(\mathbf{m}(e))_{p_1}+\sum_{t=\ell_1+1}^{\ell_2}1_{10}(\mathbf{c})_t\cdot t^{e}-\sum_{t=p_2+1}^{p_1}1_{10}(\mathbf{c}')_t\cdot t^{e}-\Bigl(1_{10}(\mathbf{c}')_{\ell_2}\cdot(\mathbf{m}(e))_{\ell_2}+1_{10}(\mathbf{c}')_{p_2}\cdot(\mathbf{m}(e))_{p_2}\Bigr)\\
&=f_{\mathbf{m}(e),\ell_1}\bigl(1_{10}(\mathbf{c})_{\ell_1},\ldots,1_{10}(\mathbf{c})_{\ell_2},1_{10}(\mathbf{c}')_{\ell_2}\bigr)-f_{\mathbf{m}(e),p_2}\bigl(1_{10}(\mathbf{c}')_{p_2},\ldots,1_{10}(\mathbf{c}')_{p_1},1_{10}(\mathbf{c})_{p_1}\bigr)
\end{aligned}
$$

Proof of (2.14) (Case (b))
$$
\begin{aligned}
&\bigl(1_{10}(\mathbf{c})-1_{10}(\mathbf{c}')\bigr)\cdot\mathbf{m}(e)\\
&=\sum_{t=\ell_1}^{\ell_2}\bigl(1_{10}(\mathbf{c})_t-1_{10}(\mathbf{c}')_t\bigr)\cdot(\mathbf{m}(e))_t+\sum_{t=p_1}^{p_2}\bigl(1_{10}(\mathbf{c})_t-1_{10}(\mathbf{c}')_t\bigr)\cdot(\mathbf{m}(e))_t\\
&=\bigl(1_{10}(\mathbf{c})_{\ell_2}-1_{10}(\mathbf{c}')_{\ell_2}\bigr)\cdot(\mathbf{m}(e))_{\ell_2}+\bigl(1_{10}(\mathbf{c})_{p_2}-1_{10}(\mathbf{c}')_{p_2}\bigr)\cdot(\mathbf{m}(e))_{p_2}\\
&\qquad+\sum_{t=\ell_1}^{\ell_2-1}\bigl(1_{10}(\mathbf{c})_t-1_{10}(\mathbf{c})_{t+1}\bigr)\cdot(\mathbf{m}(e))_t+\sum_{t=p_1}^{p_2-1}\bigl(1_{10}(\mathbf{c})_t-1_{10}(\mathbf{c})_{t+1}\bigr)\cdot(\mathbf{m}(e))_t\\
&=\bigl(1_{10}(\mathbf{c})_{\ell_2}-1_{10}(\mathbf{c}')_{\ell_2}\bigr)\cdot(\mathbf{m}(e))_{\ell_2}+\bigl(1_{10}(\mathbf{c})_{p_2}-1_{10}(\mathbf{c}')_{p_2}\bigr)\cdot(\mathbf{m}(e))_{p_2}\\
&\qquad+\sum_{t=\ell_1}^{\ell_2-1}1_{10}(\mathbf{c})_t\cdot(\mathbf{m}(e))_t-\sum_{t=\ell_1+1}^{\ell_2}1_{10}(\mathbf{c})_t\cdot(\mathbf{m}(e))_{t-1}+\sum_{t=p_1}^{p_2-1}1_{10}(\mathbf{c})_t\cdot(\mathbf{m}(e))_t-\sum_{t=p_1+1}^{p_2}1_{10}(\mathbf{c})_t\cdot(\mathbf{m}(e))_{t-1}\\
&=\bigl(1_{10}(\mathbf{c})_{\ell_2}-1_{10}(\mathbf{c}')_{\ell_2}\bigr)\cdot(\mathbf{m}(e))_{\ell_2}+\bigl(1_{10}(\mathbf{c})_{p_2}-1_{10}(\mathbf{c}')_{p_2}\bigr)\cdot(\mathbf{m}(e))_{p_2}\\
&\qquad+1_{10}(\mathbf{c})_{\ell_1}\cdot(\mathbf{m}(e))_{\ell_1}-1_{10}(\mathbf{c})_{\ell_2}\cdot(\mathbf{m}(e))_{\ell_2-1}+\sum_{t=\ell_1+1}^{\ell_2-1}1_{10}(\mathbf{c})_t\cdot t^e+1_{10}(\mathbf{c})_{p_1}\cdot(\mathbf{m}(e))_{p_1}-1_{10}(\mathbf{c})_{p_2}\cdot(\mathbf{m}(e))_{p_2-1}+\sum_{t=p_1+1}^{p_2-1}1_{10}(\mathbf{c})_t\cdot t^e\\
&=\bigl(-1_{10}(\mathbf{c}')_{\ell_2}\bigr)\cdot(\mathbf{m}(e))_{\ell_2}+\bigl(-1_{10}(\mathbf{c}')_{p_2}\bigr)\cdot(\mathbf{m}(e))_{p_2}+1_{10}(\mathbf{c})_{\ell_1}\cdot(\mathbf{m}(e))_{\ell_1}+\sum_{t=\ell_1+1}^{\ell_2}1_{10}(\mathbf{c})_t\cdot t^e+1_{10}(\mathbf{c})_{p_1}\cdot(\mathbf{m}(e))_{p_1}+\sum_{t=p_1+1}^{p_2}1_{10}(\mathbf{c})_t\cdot t^e\\
&=1_{10}(\mathbf{c})_{\ell_1}\cdot(\mathbf{m}(e))_{\ell_1}+1_{10}(\mathbf{c})_{p_1}\cdot(\mathbf{m}(e))_{p_1}-\Bigl(1_{10}(\mathbf{c}')_{\ell_2}\cdot(\mathbf{m}(e))_{\ell_2}+1_{10}(\mathbf{c}')_{p_2}\cdot(\mathbf{m}(e))_{p_2}\Bigr)+\sum_{t=\ell_1+1}^{\ell_2}1_{10}(\mathbf{c})_t\cdot t^e+\sum_{t=p_1+1}^{p_2}1_{10}(\mathbf{c})_t\cdot t^e\\
&=f_{\mathbf{m}(e),\ell_1}\bigl(1_{10}(\mathbf{c})_{\ell_1},\ldots,1_{10}(\mathbf{c})_{\ell_2},1_{10}(\mathbf{c}')_{\ell_2}\bigr)+f_{\mathbf{m}(e),p_1}\bigl(1_{10}(\mathbf{c})_{p_1},\ldots,1_{10}(\mathbf{c})_{p_2},1_{10}(\mathbf{c}')_{p_2}\bigr)
\end{aligned}
$$

Proof of (2.15) (Case (c))
$$
\begin{aligned}
&\bigl(1_{10}(\mathbf{c})-1_{10}(\mathbf{c}')\bigr)\cdot\mathbf{m}(e)\\
&=\sum_{t=\ell_1}^{p_1-2}\bigl(1_{10}(\mathbf{c})_t-1_{10}(\mathbf{c}')_t\bigr)\cdot(\mathbf{m}(e))_t+\sum_{t=p_1-1}^{\ell_2-1}\bigl(1_{10}(\mathbf{c})_t-1_{10}(\mathbf{c}')_t\bigr)\cdot(\mathbf{m}(e))_t+\sum_{t=\ell_2}^{p_2}\bigl(1_{10}(\mathbf{c})_t-1_{10}(\mathbf{c}')_t\bigr)\cdot(\mathbf{m}(e))_t\\
&=\sum_{t=\ell_1}^{p_1-2}\bigl(1_{10}(\mathbf{c})_t-1_{10}(\mathbf{c})_{t+1}\bigr)\cdot(\mathbf{m}(e))_t+\sum_{t=p_1-1}^{\ell_2-1}\bigl(1_{10}(\mathbf{c})_t-1_{10}(\mathbf{c})_{t+2}\bigr)\cdot(\mathbf{m}(e))_t+\bigl(1_{10}(\mathbf{c})_{\ell_2}-1_{10}(\mathbf{c}')_{\ell_2}\bigr)\cdot(\mathbf{m}(e))_{\ell_2}\\
&\qquad+\bigl(1_{10}(\mathbf{c})_{p_2}-1_{10}(\mathbf{c}')_{p_2}\bigr)\cdot(\mathbf{m}(e))_{p_2}+\sum_{t=\ell_2+1}^{p_2-1}\bigl(1_{10}(\mathbf{c})_t-1_{10}(\mathbf{c})_{t+1}\bigr)\cdot(\mathbf{m}(e))_t\\
&=\sum_{t=\ell_1}^{p_1-2}1_{10}(\mathbf{c})_t\cdot(\mathbf{m}(e))_t-\sum_{t=\ell_1+1}^{p_1-1}1_{10}(\mathbf{c})_t\cdot(\mathbf{m}(e))_{t-1}+\sum_{t=p_1-1}^{\ell_2}1_{10}(\mathbf{c})_t\cdot(\mathbf{m}(e))_t-\sum_{t=p_1+1}^{\ell_2+1}1_{10}(\mathbf{c})_t\cdot(\mathbf{m}(e))_{t-2}\\
&\qquad+\bigl(-1_{10}(\mathbf{c}')_{\ell_2}\bigr)\cdot(\mathbf{m}(e))_{\ell_2}+\bigl(-1_{10}(\mathbf{c}')_{p_2}\bigr)\cdot(\mathbf{m}(e))_{p_2}+\sum_{t=\ell_2+1}^{p_2}1_{10}(\mathbf{c})_t\cdot(\mathbf{m}(e))_t-\sum_{t=\ell_2+2}^{p_2}1_{10}(\mathbf{c})_t\cdot(\mathbf{m}(e))_{t-1}\\
&=1_{10}(\mathbf{c})_{\ell_1}(\mathbf{m}(e))_{\ell_1}-1_{10}(\mathbf{c})_{p_1-1}(\mathbf{m}(e))_{p_1-2}+\sum_{t=\ell_1+1}^{p_1-2}1_{10}(\mathbf{c})_t\cdot t^e+1_{10}(\mathbf{c})_{p_1-1}(\mathbf{m}(e))_{p_1-1}+1_{10}(\mathbf{c})_{p_1}(\mathbf{m}(e))_{p_1}-1_{10}(\mathbf{c})_{\ell_2+1}(\mathbf{m}(e))_{\ell_2-1}\\
&\qquad+\sum_{t=p_1+1}^{\ell_2}1_{10}(\mathbf{c})_t\bigl(t^e+(t-1)^e\bigr)+\bigl(-1_{10}(\mathbf{c}')_{\ell_2}\bigr)\cdot(\mathbf{m}(e))_{\ell_2}+\bigl(-1_{10}(\mathbf{c}')_{p_2}\bigr)\cdot(\mathbf{m}(e))_{p_2}+1_{10}(\mathbf{c})_{\ell_2+1}(\mathbf{m}(e))_{\ell_2+1}+\sum_{t=\ell_2+2}^{p_2}1_{10}(\mathbf{c})_t\cdot t^e\\
&=1_{10}(\mathbf{c})_{\ell_1}(\mathbf{m}(e))_{\ell_1}+1_{10}(\mathbf{c})_{p_1}(\mathbf{m}(e))_{p_1}-\Bigl(1_{10}(\mathbf{c}')_{\ell_2}\cdot(\mathbf{m}(e))_{\ell_2}+1_{10}(\mathbf{c}')_{p_2}\cdot(\mathbf{m}(e))_{p_2}\Bigr)\\
&\qquad+\sum_{t=\ell_1+1}^{p_1-1}1_{10}(\mathbf{c})_t\cdot t^e+\sum_{t=p_1+1}^{\ell_2+1}1_{10}(\mathbf{c})_t\bigl(t^e+(t-1)^e\bigr)+\sum_{t=\ell_2+2}^{p_2}1_{10}(\mathbf{c})_t\cdot t^e\\
&=1_{10}(\mathbf{c})_{\ell_1}(\mathbf{m}(e))_{\ell_1}+1_{10}(\mathbf{c})_{p_1}(\mathbf{m}(e))_{p_1}-\Bigl(1_{10}(\mathbf{c}')_{\ell_2}\cdot(\mathbf{m}(e))_{\ell_2}+1_{10}(\mathbf{c}')_{p_2}\cdot(\mathbf{m}(e))_{p_2}\Bigr)\\
&\qquad+\sum_{t=\ell_1+1}^{p_1-1}1_{10}(\mathbf{c})_t\cdot t^e+\sum_{t=p_1}^{\ell_2}1_{10}(\mathbf{c})_{t+1}\cdot t^e+\sum_{t=p_1+1}^{p_2}1_{10}(\mathbf{c})_t\cdot t^e\\
&=f_{\mathbf{m}(e),\ell_1}\bigl(1_{10}(\mathbf{c})_{\ell_1},\ldots,1_{10}(\mathbf{c})_{p_1-1},1_{10}(\mathbf{c})_{p_1+1},\ldots,1_{10}(\mathbf{c})_{\ell_2+1},1_{10}(\mathbf{c}')_{\ell_2}\bigr)+f_{\mathbf{m}(e),p_1}\bigl(1_{10}(\mathbf{c})_{p_1},\ldots,1_{10}(\mathbf{c})_{p_2},1_{10}(\mathbf{c}')_{p_2}\bigr)
\end{aligned}
$$
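All three case derivations rely on the same telescoping property of the weight vectors: since the t-th entry of m(e) is the sum of the e-th powers of the first t positive integers, consecutive entries differ by a single power. A minimal restatement in the notation above, taking (m(e))_0 = 0:

$$
(\mathbf{m}(e))_t-(\mathbf{m}(e))_{t-1}=t^e,\qquad
(\mathbf{m}(e))_t-(\mathbf{m}(e))_{t-2}=t^e+(t-1)^e .
$$

Each step is obtained by shifting summation indices and absorbing the resulting boundary terms with these identities.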
Chapter 3
BINARY CODES CORRECTING k DELETIONS/INSERTIONS
Following the construction in Ch. 2, this chapter presents binary codes that correct any constant number of deletion/insertion errors.
3.1 Introduction
The problem of constructing efficient k-deletion codes (defined in Ch. 2) has long been unsettled. Before [12], which proposed a k-deletion correcting code construction with O(k^2 log k log n) redundancy, no deletion code correcting more than a single error with rate approaching 1 was known.
After [12], the work in [44] considered correction with high probability and proposed a k-deletion correcting code construction with redundancy (k+1)(2k+1) log n + o(log n) and encoding/decoding complexity O(n^{k+1}/log^{k-1} n). The result for this randomized coding setting was improved in [41], where redundancy O(k log(n/k)) and complexity poly(n, k) were achieved. However, finding a deterministic k-deletion correcting code construction that achieves the optimal order redundancy O(k log n) remained elusive. Note that the optimality of the code is redundancy-wise rather than cardinality-wise; that is, the focus is on the asymptotic rather than the exact size of the code.
In this chapter, we provide a solution to this longstanding open problem for constant k: we present a code construction that achieves 8k log n + o(log n) redundancy when k = o(√(log log n)), with O(n^{2k+1}) encoding/decoding computational complexity. Note that the complexity is polynomial in n when k is a constant. The following theorem summarizes our main result.
Theorem 3.1.1. Let n and k be two integers satisfying k = o(√(log log n)). For the integer N = n + 8k log n + o(log n), there exists an encoding function E : {0,1}^n → {0,1}^N, computed in O(n^{2k+1}) time, and a decoding function D : {0,1}^{N-k} → {0,1}^n, computed in O(N^{k+1}) = O(n^{k+1}) time, such that for any c ∈ {0,1}^n and any subsequence d ∈ {0,1}^{N-k} of E(c), we have that D(d) = c.
An independent work [22] proposed a k-deletion correcting code with O(k log n) redundancy and a better complexity of poly(n, k). In contrast to the redundancy 8k log n in our construction, the redundancy in [22] is not specified; it is estimated to be at least 200k log n. Moreover, the techniques in [22] and in our construction are different. Our techniques can be applied to obtain a systematic k-deletion code construction with 4k log n + o(log n) bits of redundancy, as well as deletion codes for other settings, which will be given in the next chapter.
Here are the key building blocks of our code construction: (i) generalizing the VT construction to k deletions by considering constrained sequences, (ii) separating the encoded vector into blocks and using concatenated codes, and (iii) a novel strategy to separate the vector into blocks by a single pattern. The following gives a more specific description of these ideas.
In the previous chapter, we generalized the VT construction. In particular, we proved that while the higher order parity checks ∑_{i=1}^n c_i i^e mod (kn+1), e = 0, 1, …, t, do not work in general, those parity checks do work in the two-deletion case when the sequences are constrained to have no consecutive 1's. In this chapter we generalize this idea: specifically, the higher order parity checks can be used to correct k = t/2 deletions in sequences that satisfy the following constraint: the index distance between any two 1's is at least k, i.e., there is a 0-run of length at least k-1 between any two 1's.
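As a concrete illustration, the sketch below (an illustrative toy, not the chapter's actual decoder; the function name `parity_checks` is ours) computes the higher order parity checks ∑_i c_i i^e mod (kn+1) for e = 0, …, t:

```python
# Higher order VT-style parity checks for a binary sequence.
# This only computes the checks; the argument in the text is what makes
# them useful as redundancy on constrained sequences.

def parity_checks(c, k, t):
    """Return sum_i c_i * i^e mod (k*n + 1) for e = 0, ..., t (1-based i)."""
    n = len(c)
    mod = k * n + 1
    return [sum(bit * (i ** e) for i, bit in enumerate(c, start=1)) % mod
            for e in range(t + 1)]

# A sequence whose 1's are at index distance >= k (here k = 2):
c = [1, 0, 0, 1, 0, 0, 1, 0]
print(parity_checks(c, k=2, t=4))
```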
The fact that we can correct k deletions using the generalization of the VT construction on constrained sequences enables a concatenated code construction, which is the underlying structure of our k-deletion correcting codes. In concatenated codes, each codeword c is split into blocks of small length by certain markers at the boundaries between adjacent blocks. Each block is protected by an inner deletion code that can be efficiently computed when the block length is small. The block boundary markers are chosen such that k deletions in a codeword result in at most O(k) block errors. Then it suffices to use an outer code, such as a Reed-Solomon code, to correct the block errors.
Concatenated codes were used in a number of k-deletion correcting code constructions [12, 40, 81]. One of the main differences among these constructions is the choice of the block boundary markers that separate the codewords. The constructions in [81, 40] insert extra bits as markers between blocks, which introduces O(n) bits of redundancy. In [12], occurrences of short subsequences in the codeword, called patterns, were used as markers. The construction in [12] requires O(k log n) different patterns, where each pattern introduces O(k log n) bits of redundancy.
We improve the redundancy in [12] by using a single type of pattern for the block boundary markers. Specifically, we choose a pattern such that its indicator vector (a binary vector in which the 1 entries indicate the occurrences of the pattern in c) is a constrained sequence that is immune to deletions. Then the generalized VT construction can be used to protect the indicator vector, and thus the locations of the block boundary markers. Knowing the boundaries between blocks, we can recover the blocks with at most 2k block errors, which can then be corrected using a combination of short deletion correcting codes and Reed-Solomon codes.
The concatenated code construction consists of short blocks; hence the pattern has to occur frequently in the codeword, so that every run of consecutive bits of a certain length contains at least one occurrence of the pattern. Namely, the first step of the encoding is a mapping of an arbitrary binary sequence to a sequence with frequent synchronization patterns. The constructions in [12, 22] compute such mappings using randomized algorithms. In this chapter, we provide a deterministic algorithm to generate sequences with frequent synchronization patterns.
We now formally define the synchronization pattern and its indicator vector.
Definition 3.1.1. A synchronization pattern is a length-(3k + ⌈log n⌉ + 4) sequence a = (a_1, …, a_{3k+⌈log n⌉+4}) satisfying:

• a_{3k+i} = 1 for i ∈ [0, ⌈log n⌉ + 4], where [0, ⌈log n⌉ + 4] = {0, …, ⌈log n⌉ + 4}.
• There does not exist a j ∈ [1, 3k - 1] such that a_{j+i} = 1 for all i ∈ [0, ⌈log n⌉ + 4].

Namely, a synchronization pattern is a length-(3k + ⌈log n⌉ + 4) sequence that ends with ⌈log n⌉ + 5 consecutive 1's and contains no other 1-run of length ⌈log n⌉ + 5.
For a sequence c = (c_1, …, c_n), the indicator vector of the synchronization pattern, referred to as the synchronization vector 1_sync(c) ∈ {0,1}^n, is defined by

$$
\mathbb{1}_{\mathrm{sync}}(\mathbf{c})_i=
\begin{cases}
1,&\text{if }(c_{i-3k+1},c_{i-3k+2},\ldots,c_{i+\lceil\log n\rceil+4})\text{ is a synchronization pattern},\\
0,&\text{else.}
\end{cases}\tag{3.1}
$$
Note that 1_sync(c)_i = 0 for i ∈ [1, 3k-1] and for i ∈ [n-⌈log n⌉-3, n]. The entry 1_sync(c)_i = 1 if and only if c_i is the first bit of the final 1-run in a synchronization pattern. It can be seen from the definition that any two 1 entries in 1_sync(c) have index distance at least 3k, i.e., any two 1 entries are separated by a 0-run of length at least 3k-1. Hence, 1_sync(c) is a constrained sequence of the type described above.
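A direct transcription of Definition 3.1.1 and Eq. (3.1) can be sketched as follows (a toy for experimentation: the helper names are ours, ⌈log n⌉ is assumed to mean ⌈log₂ n⌉, and indexing follows the 1-based convention of the definition):

```python
import math

def is_sync_pattern(a, k, n):
    """Check Definition 3.1.1 for a window a of length 3k + ceil(log n) + 4."""
    L = math.ceil(math.log2(n))
    if len(a) != 3 * k + L + 4:
        return False
    # Must end with ceil(log n) + 5 consecutive 1's (positions 3k, ..., 3k+L+4).
    if any(a[3 * k - 1 + i] != 1 for i in range(L + 5)):
        return False
    # No other 1-run of that length: no j in [1, 3k-1] starting a run of L+5 ones.
    for j in range(1, 3 * k):
        if all(a[j - 1 + i] == 1 for i in range(L + 5)):
            return False
    return True

def sync_vector(c, k):
    """Indicator vector 1_sync(c) from Eq. (3.1), with 1-based positions i."""
    n = len(c)
    L = math.ceil(math.log2(n))
    out = [0] * n
    for i in range(1, n + 1):
        lo, hi = i - 3 * k + 1, i + L + 4          # window c_lo .. c_hi
        if lo >= 1 and hi <= n and is_sync_pattern(c[lo - 1:hi], k, n):
            out[i - 1] = 1
    return out
```

For instance, with k = 1 and n = 16 the pattern length is 11 and a valid window is two 0's followed by nine 1's; embedding it at the end of an all-zero prefix puts a single 1 in the synchronization vector.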
Example 3.1.1. Let the integers k = 2 and n = 35. Then the sequence (1,0,0,1,0,1,1,1,1,1,1) is a synchronization pattern. Let the length-n sequence

c = (1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,0,1,0,1,1,1,1,1,1,0,1,1,1,1,1,1).

Then the synchronization vector 1_sync(c) is given by

1_sync(c) = (0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0).
Next, we present the definition of the generalized VT code. Define the integer vectors

$$
\mathbf{m}(e)\triangleq\Bigl(1^e,\;1^e+2^e,\;\ldots,\;\sum_{i=1}^{n}i^e\Bigr)\tag{3.2}
$$

for e ∈ [0, 6k], where the i-th entry m_i(e) of m(e) is the sum of the e-th powers of the first i positive integers. Given a sequence c ∈ {0,1}^n, we compute the generalized VT redundancy f(c) of dimension 6k+1 as follows:

$$
f(\mathbf{c})_e\triangleq\mathbf{c}\cdot\mathbf{m}(e)\bmod 3kn^{e+1}
=\sum_{i=1}^{n}c_i\,m_i(e)\bmod 3kn^{e+1},\tag{3.3}
$$

for e ∈ [0, 6k], where f(c)_e is the e-th component of f(c) and · denotes the inner product over the integers. Our generalization of the VT code construction shows that the vector f(1_sync(c)) helps protect the synchronization vector 1_sync(c) from k deletions in c.
The rest of the chapter is organized as follows. Sec. 3.2 provides an outline of our construction and some of the basic lemmas. Based on the results of these lemmas, Sec. 3.3 presents the encoding and decoding procedures of our code. Sec. 3.4 presents our VT generalization for recovering the synchronization vector. Sec. 3.5 explains how to correct k deletions based on the synchronization vector, when the synchronization patterns appear frequently. Sec. 3.6 describes an algorithm that transforms a sequence into one in which synchronization patterns occur frequently. Sec. 3.7 concludes this chapter.
3.2 Outline and Preliminaries
In this section, we outline the ingredients that constitute our code construction and present notation that will be used throughout this chapter. A summary of these notations is provided in Table 3.1. We begin with an overview of the code construction, which has a concatenated code structure as described in Sec. 3.1. In our construction, each codeword c is split into blocks, with the boundaries between adjacent blocks given by synchronization patterns. Specifically, let t_1, …, t_J be the indices of the synchronization patterns in c, i.e., the indices of the 1 entries in 1_sync(c). Then the blocks are given by (c_{t_{j-1}+1}, …, c_{t_j - 1}) for j ∈ [1, J+1], where t_0 = 0 and t_{J+1} = n+1. The key idea of our construction is to use the VT generalization f (see Eq. (3.3) for the definition) as parity checks to protect the synchronization vector 1_sync(c), and thus identify the indices of the block boundaries. The proof details will be given in Lemma 3.2.1. Note that the parity f(c) in (3.3) has size O(k^2 log n). To compress the size of f(c) to the targeted O(k log n), in Lemma 3.2.2 we apply a modulo operation on the function f. Given the block boundaries, Lemma 3.2.3 provides an algorithm to protect the codewords. Specifically, we show that given the synchronization vector 1_sync(c), it is possible to recover most of the blocks in c, with up to 2k block errors. These block errors can be corrected with the help of their deletion correcting hashes, which can be exhaustively computed and will be presented in Lemma 3.2.5. Hence, to correct the block errors, it suffices to protect the sequence of deletion correcting hashes using, for example, Reed-Solomon codes.
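A minimal sketch of the block-splitting step (assuming the boundary convention t_0 = 0 and t_{J+1} = n+1, so that the marker bits c_{t_j} themselves fall between blocks; the function name is ours):

```python
def split_blocks(c, sync):
    """Split c into blocks (c_{t_{j-1}+1}, ..., c_{t_j - 1}) between 1 entries of sync."""
    n = len(c)
    # 1-based indices t_1, ..., t_J of the 1 entries, padded with t_0 and t_{J+1}.
    ts = [0] + [i + 1 for i, bit in enumerate(sync) if bit == 1] + [n + 1]
    return [c[ts[j - 1]:ts[j] - 1] for j in range(1, len(ts))]

c    = [1, 0, 1, 1, 0, 0, 1, 0]
sync = [0, 0, 0, 1, 0, 0, 0, 0]   # a single boundary at position t_1 = 4
print(split_blocks(c, sync))
```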
Lemma 3.2.2 and Lemma 3.2.3 together define a hash function that protects c from k deletions. However, in order to exhaustively compute the block hash given in Lemma 3.2.5 and obtain the desired size of the redundancy, the block length has to be of order O(poly(k) log n). Hence the synchronization pattern must appear frequently enough in c. Such a sequence c will be defined in the following as a k-dense sequence. To encode any given sequence, in Lemma 3.2.4 we present an invertible mapping that takes any sequence c ∈ {0,1}^n as input and outputs a k-dense sequence. Then any binary sequence can be encoded into a k-dense sequence and protected.
Finally, the hash function defined by Lemma 3.2.2 and Lemma 3.2.3 is itself subject to deletion errors and has to be protected. To this end, we use an additional hash that encodes the deletion correcting hash of the hash function defined in Lemma 3.2.2 and Lemma 3.2.3. The additional hash is protected by a (k+1)-fold repetition code. A similar additional hash technique was also given in [12].
Table 3.1: Summary of Notations in Ch. 3

B_k(c) ≜ The set of sequences that share a length-(n-k) subsequence with c ∈ {0,1}^n.
R_k ≜ The set of sequences in which any two 1 entries are separated by at least k-1 zeros.
1_sync(c) ≜ Synchronization vector, defined in (3.1).
m(e) ≜ Weights of order e in the VT generalization, defined in (3.2).
f(c) ≜ Generalized VT redundancy, defined in (3.3).
L ≜ Maximal length of 0-runs in the synchronization vector of a k-dense sequence, defined in (3.4).
p(c) ≜ The function used to compute the redundancy protecting the synchronization vector 1_sync(c), defined in Lemma 3.2.2.
H_dense(c) ≜ The deletion correcting hash function for k-dense sequences, defined in Lemma 3.2.3.
T(c) ≜ The function that generates k-dense sequences, defined in Lemma 3.2.4.
H(c) ≜ Deletion correcting hash function for any sequence, defined in Lemma 3.2.5.
[Figure: the encoding pipeline. The sequence c is mapped to the k-dense sequence T(c) (see Lemma 3.2.4), whose blocks 1, …, J are delimited by synchronization patterns of the form 0…011111. Redundancy 1 = (f(1_sync(T(c))) mod p(T(c)), p(T(c))) (see Lemma 3.2.2) is appended to protect 1_sync(T(c)). Redundancy 2 = H_dense(T(c)) (see Lemma 3.2.3) is appended to protect blocks 1, …, J. Redundancy 3, the (k+1)-fold repetition of H(Redundancy 1, Redundancy 2) (see Lemma 3.2.5), is appended to protect Redundancy 1 and Redundancy 2.]

Figure 3.1: Illustrating the encoding procedure of our k-deletion correcting code construction.
The final encoding/decoding algorithm of our code, which combines the results of Lemma 3.2.2, Lemma 3.2.3, and Lemma 3.2.4, is provided in Sec. 3.3. The encoding is illustrated in Fig. 3.1.
Before presenting the lemmas, we give the necessary definitions and notation. For a sequence c ∈ {0,1}^n, define its deletion ball B_k(c) as the collection of sequences that share a length-(n-k) subsequence with c.
Definition 3.2.1. A sequence c ∈ {0,1}^n is said to be k-dense if the lengths of the 0-runs in 1_sync(c) are at most

$$
L \triangleq (\lceil\log n\rceil+5)\,2^{\lceil\log n\rceil+9}\,\lceil\log n\rceil+(3k+\lceil\log n\rceil+4)(\lceil\log n\rceil+9+\lceil\log n\rceil).\tag{3.4}
$$

For a k-dense c, the index distance between any two 1 entries in 1_sync(c) is at most L+1, i.e., the 0-run between two 1 entries has length at most L.
Note that f(c) is an integer vector that can be represented by log((3k)^{6k+1} n^{(3k+1)(6k+1)}) bits, or by an integer in the range [0, (3k)^{6k+1} n^{(3k+1)(6k+1)} - 1]. In this chapter, we interchangeably use f(c) to denote its binary representation or its integer representation. The following lemma shows that the synchronization vector 1_sync(c) can be recovered after k deletions with the help of f(1_sync(c)). Its proof will be given in Sec. 3.4.
Lemma 3.2.1. For integers k and n and sequences c, c′, if c′ ∈ B_k(c) and f(1_sync(c)) = f(1_sync(c′)), then 1_sync(c) = 1_sync(c′).
By virtue of Lemma 3.2.1, the synchronization vector 1_sync(c) can be recovered after k deletions in c with the help of a hash f(1_sync(c)) of size O(k^2 log n) bits. To further reduce the size of the hash to O(k log n) bits, we apply modulo operations on f(1_sync(c)) in the following lemma, whose proof will be given in Sec. 3.4.
Lemma 3.2.2. For integers n and k = o(√(log log n)), there exists a function p : {0,1}^n → [1, 2^{2k log n + o(log n)}] such that if f(1_sync(c)) ≡ f(1_sync(c′)) mod p(c) for two sequences c ∈ {0,1}^n and c′ ∈ B_k(c), then 1_sync(c) = 1_sync(c′). Hence, if

(f(1_sync(c)) mod p(c), p(c)) = (f(1_sync(c′)) mod p(c′), p(c′))

and c′ ∈ B_k(c), we have that 1_sync(c) = 1_sync(c′).
Lemma 3.2.2 presents a hash of size 4k log n + o(log n) bits for correcting 1_sync(c). Given the synchronization vector 1_sync(c), the next lemma shows that the sequence c can be further recovered using another 4k log n + o(log n)-bit hash when c is k-dense, i.e., when the synchronization pattern occurs frequently enough in c. The proof of Lemma 3.2.3 will be given in Sec. 3.5.
Lemma 3.2.3. For integers n and k = o(√(log log n)), there exists a function H_dense : {0,1}^n → {0,1}^{4k log n + o(log n)} such that every k-dense sequence c ∈ {0,1}^n can be recovered given its synchronization vector 1_sync(c), its length-(n-k) subsequence d, and H_dense(c).
Combining Lemma 3.2.2 and Lemma 3.2.3, we obtain a hash function of size O(k log n) that corrects deletions in a k-dense sequence. To encode an arbitrary sequence c ∈ {0,1}^n, a mapping that transforms any sequence into a k-dense sequence is given in the following lemma. The details will be given in Sec. 3.6.
Lemma 3.2.4. For integers k and n > k, there exists a map T : {0,1}^n → {0,1}^{n+3k+3⌈log n⌉+15}, computable in poly(n, k) time, such that T(c) is a k-dense sequence for every c ∈ {0,1}^n. Moreover, the sequence c can be recovered from T(c).

The next three lemmas (Lemma 3.2.5, Lemma 3.2.6, and Lemma 3.2.7) present existence results that are needed to prove Lemma 3.2.2 and Lemma 3.2.3. Lemma 3.2.5 gives a k-deletion correcting hash function for short sequences, which is the block hash described above in the discussion of Lemma 3.2.3; it is an extension of a result in [12]. Lemma 3.2.6 is a slight variation of a result in [64]: it shows the equivalence between correcting deletions and correcting deletions and insertions. Lemma 3.2.6 will be used in the proof of Lemma 3.2.1, where we need an upper bound on the number of deletions/insertions in 1_sync(c) caused by a deletion in c. Lemma 3.2.7 (see [71]) gives an upper bound on the number of divisors of a positive integer. With Lemma 3.2.7, we show that the VT generalization in Lemma 3.2.1 can be compressed by taking modulo operations. The details will be given in the proof of Lemma 3.2.2.
Lemma 3.2.5. For any integers w, k, and n, there exists a hash function H : {0,1}^w → {0,1}^{⌈w/⌈log n⌉⌉(2k log log n + O(1))}, computable in O_k((w/log n) n log^{2k} n) time, such that any sequence c ∈ {0,1}^w can be recovered from its length-(w-k) subsequence d and the hash H(c).

Proof. We first show by a counting argument the existence of a hash function H′ : {0,1}^{⌈log n⌉} → {0,1}^{2k log log n + O(1)}, exhaustively computable in O_k(n log^{2k} n) time, such that H′(s) ≠ H′(s′) for all s ∈ {0,1}^{⌈log n⌉} and s′ ∈ B_k(s)\{s}. The hash H′(s) protects the sequence s ∈ {0,1}^{⌈log n⌉} from k deletions. Note that |B_k(s)| ≤ C(⌈log n⌉, k)^2 2^k ≤ 2⌈log n⌉^{2k}. Hence it suffices to use brute force and greedily assign a hash value to each sequence s ∈ {0,1}^{⌈log n⌉} such that H′(s) ≠ H′(s′) for all s′ ∈ B_k(s)\{s}. Since the size of B_k(s) is upper bounded by 2⌈log n⌉^{2k}, there always exists a hash value H′(s) ∈ {0,1}^{⌈log(2⌈log n⌉^{2k}+1)⌉} that has no conflict with the hash values of the sequences in B_k(s). The total complexity is O_k(n log^{2k} n) and the size of the hash value H′(s) is 2k log log n + O(1).

Now split c into ⌈w/⌈log n⌉⌉ blocks (c_{(i-1)⌈log n⌉+1}, …, c_{i⌈log n⌉}), i ∈ [1, ⌈w/⌈log n⌉⌉], of length ⌈log n⌉. If the length of the last block is less than ⌈log n⌉, append zeros to the end of the last block so that its length is ⌈log n⌉. Assign a hash value h_i = H′((c_{(i-1)⌈log n⌉+1}, …, c_{i⌈log n⌉})), i ∈ [1, ⌈w/⌈log n⌉⌉], to each block. Let H(c) = (h_1, …, h_{⌈w/⌈log n⌉⌉}) be the concatenation of the h_i for i ∈ [1, ⌈w/⌈log n⌉⌉].

We show that H(c) protects c from k deletions. Let d be a length-(w-k) subsequence of c. Note that d_{(i-1)⌈log n⌉+1} and d_{i⌈log n⌉-k} come from bits c_{(i-1)⌈log n⌉+1+x} and c_{i⌈log n⌉-k+y}, respectively, after deletions in c, where the integers x, y ∈ [0, k]. Therefore, (d_{(i-1)⌈log n⌉+1}, …, d_{i⌈log n⌉-k}) is a length-(⌈log n⌉-k) subsequence of (c_{(i-1)⌈log n⌉+1+x}, …, c_{i⌈log n⌉-k+y}), and thus a subsequence of the block (c_{(i-1)⌈log n⌉+1}, …, c_{i⌈log n⌉}). Hence the i-th block can be recovered from h_i = H′((c_{(i-1)⌈log n⌉+1}, …, c_{i⌈log n⌉})) and (d_{(i-1)⌈log n⌉+1}, …, d_{i⌈log n⌉-k}). Therefore, c can be recovered given d and H(c). The length of H(c) is ⌈w/⌈log n⌉⌉·(2k log log n + O(1)) and the complexity of computing H(c) is O_k((w/log n) n log^{2k} n). □

Lemma 3.2.6. Let n, s, and r be integers satisfying s + r ≤ n. For sequences c, c′ ∈ {0,1}^n, if c′ and c share a common resulting sequence after s deletions and r insertions in both, then c′ ∈ B_{s+r}(c).
Lemma 3.2.7. For a positive integer n ≥ 3, the number of divisors of n is upper bounded by 2^{1.6 ln n/(ln ln n)}.
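The greedy argument in the proof of Lemma 3.2.5 can be exercised on toy parameters. The sketch below is an illustrative toy of ours (block length m playing the role of ⌈log n⌉, and deletion-ball membership tested via the longest common subsequence): it greedily assigns every length-m sequence a hash value that differs from those of all members of its k-deletion ball.

```python
from itertools import product

def lcs_len(a, b):
    """Length of the longest common subsequence of a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def greedy_hash(m, k):
    """Greedily give each s in {0,1}^m the smallest value avoiding its deletion ball."""
    seqs = list(product((0, 1), repeat=m))
    # s' is in B_k(s) iff s and s' share a length-(m-k) subsequence, i.e. LCS >= m-k.
    ball = {s: [t for t in seqs if t != s and lcs_len(s, t) >= m - k] for s in seqs}
    h = {}
    for s in seqs:
        taken = {h[t] for t in ball[s] if t in h}
        v = 0
        while v in taken:
            v += 1
        h[s] = v
    return h, ball
```

For m = 5 and k = 1, every pair s ≠ s′ with s′ ∈ B_1(s) receives distinct hash values, so H′(s) together with any length-(m-1) subsequence determines s; the counting bound in the proof guarantees the greedy choice never runs out of values.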
The proofs of Lemma 3.2.1, Lemma 3.2.2, Lemma 3.2.3, and Lemma 3.2.4 rely on several propositions, the details of which will be presented in the next sections. For convenience, a dependency graph of the theorem, lemmas, and propositions is given in Fig. 3.2.
3.3 Proof of Theorem 3.1.1
Based on the lemmas stated in Sec. 3.2, in this section we present the encoding function E and the decoding function D of our k-deletion correcting code, and prove Theorem 3.1.1. Given any sequence c ∈ {0,1}^n, let the function E : {0,1}^n → {0,1}^{n+8k log n+o(log n)} be given by

E(c) = (T(c), s′(c), s′′(c)), where
s′(c) = (f(1_sync(T(c))) mod p(T(c)), p(T(c)), H_dense(T(c))), and
s′′(c) = Rep_{k+1}(H(s′(c))).
Function Rep_{k+1}(H(s′(c))) is the (k+1)-fold repetition of the bits in H(s′(c)), where the function H is defined in Lemma 3.2.5 and H(s′(c)) protects s′(c) from k deletions. Function T(c) is defined in Lemma 3.2.4 and transforms c into a k-dense sequence (see Definition 3.2.1 for the definition of a k-dense sequence). Function f(1_sync(T(c))) in s′(c) is represented by an integer and protects the synchronization vector 1_sync(T(c)) by Lemma 3.2.1. Function p(T(c)) is defined in Lemma 3.2.2 and compresses the hash f(1_sync(T(c))). Function H_dense(T(c)) is defined in Lemma 3.2.3 and protects a k-dense sequence from k deletions.

[Figure: dependency graph. Top: Lemma 3.2.6; second row: Prop. 3.4.1 and Prop. 3.4.2; third row: Lemma 3.2.1, Lemma 3.2.7, Lemma 3.2.5, Prop. 3.6.1, Prop. 3.6.3, and Prop. 3.6.4; fourth row: Lemma 3.2.2, Lemma 3.2.3, and Lemma 3.2.4; bottom: Thm. 3.1.1.]

Figure 3.2: Dependencies of the claims in Ch. 3.
Note that k = o(√(log log n)). Hence, according to Lemma 3.2.2, Lemma 3.2.3, and Lemma 3.2.4, the length of s′(c) is n_1 = 8k log(n+3k+3⌈log n⌉+15) + o(log(n+3k+3⌈log n⌉+15)) = 8k log n + o(log n). The length of s′′(c) is n_2 = 2k(k+1)(n_1/⌈log n⌉) log log n = o(log n). The length of T(c) is n + n_0 = n + 3k + 3⌈log n⌉ + 15. Therefore, the length of E(c) is n + n_0 + n_1 + n_2 = n + 8k log n + o(log n), and the redundancy of the code is 8k log n + o(log n).
To show how the sequence c can be recovered from a length-(N-k) subsequence d of E(c), and to describe the computation of the decoding function D(d), we prove the following two statements.

1. Statement 1: The redundancy s′(c) can be recovered given (d_{n+n_0+n_1+1}, …, d_{n+n_0+n_1+n_2-k}).

2. Statement 2: The sequence c can be recovered given (d_1, …, d_{n+n_0-k}) and s′(c).
We first prove Statement 1. Note that d_{n+n_0+n_1+1} and d_{n+n_0+n_1+n_2-k} come from bits E(c)_{n+n_0+n_1+1+x} and E(c)_{n+n_0+n_1+n_2-k+y}, respectively, after deletions in E(c), where x, y ∈ [0, k]. Hence (d_{n+n_0+n_1+1}, …, d_{n+n_0+n_1+n_2-k}) is a length-(n_2-k) subsequence of (E(c)_{n+n_0+n_1+1+x}, …, E(c)_{n+n_0+n_1+n_2-k+y}), and thus a subsequence of (E(c)_{n+n_0+n_1+1}, …, E(c)_{n+n_0+n_1+n_2}) = s′′(c). Since s′′(c) is a (k+1)-fold repetition code, and thus a k-deletion correcting code that protects H(s′(c)), the hash H(s′(c)) can be recovered from (d_{n+n_0+n_1+1}, …, d_{n+n_0+n_1+n_2-k}).
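The repetition-decoding step used for Statement 1 admits a simple sketch (an illustrative toy of ours, assuming at most k deletions in the (k+1)-fold repetition stream: the received bit at 0-based position i(k+1) still lies inside the i-th repetition block):

```python
def rep_encode(h, k):
    """(k+1)-fold repetition of the bits of h."""
    return [bit for bit in h for _ in range(k + 1)]

def rep_decode(d, k, msg_len):
    """Recover h from a version of rep_encode(h, k) with at most k deletions."""
    # After t <= k deletions, d[i*(k+1)] originates from an index in
    # [i*(k+1), i*(k+1)+k], which is inside the i-th block of k+1 copies.
    return [d[i * (k + 1)] for i in range(msg_len)]

h = [1, 0, 1, 1, 0]
enc = rep_encode(h, 2)                    # 15 bits
del_enc = enc[:3] + enc[4:9] + enc[10:]   # delete two bits
assert rep_decode(del_enc, 2, len(h)) == h
```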
Similarly, (d_{n+n_0+1}, …, d_{n+n_0+n_1-k}) is a length-(n_1-k) subsequence of (E(c)_{n+n_0+1}, …, E(c)_{n+n_0+n_1}) = s′(c). From Lemma 3.2.5, the function s′(c) can be recovered from H(s′(c)) and (d_{n+n_0+1}, …, d_{n+n_0+n_1-k}). Hence Statement 1 holds.
We now prove Statement 2. Note that s′(c) contains the hashes (f(1_sync(T(c))) mod p(T(c)), p(T(c))) and H_dense(T(c)). Moreover, (d_1, …, d_{n+n_0-k}) is a length-(n+n_0-k) subsequence of (E(c)_1, …, E(c)_{n+n_0}) = T(c). According to Lemma 3.2.2, the synchronization vector 1_sync(T(c)) can be recovered from the hash (f(1_sync(T(c))) mod p(T(c)), p(T(c))) and (d_1, …, d_{n+n_0-k}), by exhaustively searching over all length-(n+n_0) supersequences c′ of (d_1, …, d_{n+n_0-k}) such that f(1_sync(c′)) ≡ f(1_sync(T(c))) mod p(T(c)); then we have that 1_sync(T(c)) = 1_sync(c′). Since, by Lemma 3.2.4, T(c) is a k-dense sequence, by Lemma 3.2.3 it can be recovered from 1_sync(T(c)), H_dense(T(c)), and the length-(n+n_0-k) subsequence (d_1, …, d_{n+n_0-k}) of T(c). Finally, the sequence c can be recovered from T(c) by Lemma 3.2.4. Hence Statement 2 holds and c can be recovered.
The encoding complexity of E(c) is O(n^{2k+1}), which comes from the brute force search for the integer p(T(c)). The decoding complexity is O(n^{k+1}), which comes from the brute force search for the correct 1_sync(T(c)), given f(1_sync(T(c))) mod p(T(c)) and p(T(c)).
3.4 Protecting the Synchronization Vectors
In this section we present a hash function of size 4k log n + o(log n) to protect the synchronization vector 1_sync(c) from k deletions in c, and we prove Lemma 3.2.2.
We first prove Lemma 3.2.1, which is decomposed into Proposition 3.4.1 and Proposition 3.4.2. In Proposition 3.4.1 we present an upper bound on the radius of the deletion ball for the synchronization vector. In Proposition 3.4.2, we prove that the higher order parity checks help correct multiple deletions for sequences in which there is a 0-run of length at least 3k-1 between any two 1's. Since 1_sync(c) is such a sequence, we conclude that the higher order parity checks help recover 1_sync(c).