Chapter III: Binary Codes Correcting $k$ Deletions/Insertions
3.2 Outline and Preliminaries
Table 3.1: Summary of Notations in Ch. 3
$\mathcal{B}_k(\mathbf{c})$ ≜ The set of sequences that share a length $n-k$ subsequence with $\mathbf{c} \in \{0,1\}^n$.
$\mathcal{R}_m$ ≜ The set of sequences where any two 1 entries are separated by at least $m-1$ zeros.
$\mathbb{1}_{sync}(\mathbf{c})$ ≜ Synchronization vector defined in (3.1).
$\mathbf{m}^{(e)}$ ≜ Weights of order $e$ in the VT generalization, defined in (3.2).
$f(\mathbf{c})$ ≜ Generalized VT redundancy defined in (3.3).
$L$ ≜ Maximal length of zero runs in the synchronization vector of a $k$-dense sequence, defined in (3.4).
$p(\mathbf{c})$ ≜ The function used to compute the redundancy protecting the synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$, defined in Lemma 3.2.2.
$Hash_k(\mathbf{c})$ ≜ The deletion correcting hash function for $k$-dense sequences, defined in Lemma 3.2.3.
$T(\mathbf{c})$ ≜ The function that generates $k$-dense sequences, defined in Lemma 3.2.4.
$H(\mathbf{c})$ ≜ Deletion correcting hash function for any sequence, defined in Lemma 3.2.5.
[Figure 3.1 (flowchart). Step 1: map the sequence $\mathbf{c}$ to the $k$-dense sequence $T(\mathbf{c})$ (see Lemma 3.2.4 for the definition of $T$), which the synchronization patterns divide into blocks $1, \ldots, J$. Step 2: append Redundancy 1 $= (f(\mathbb{1}_{sync}(T(\mathbf{c}))) \bmod p(\mathbf{c}),\ p(\mathbf{c}))$ (see Lemma 3.2.2 for the definitions of $f$ and $p$), protecting $\mathbb{1}_{sync}(\mathbf{c})$. Step 3: append Redundancy 2 $= Hash_k(\mathbf{c})$ (see Lemma 3.2.3 for the definition of $Hash_k$), protecting blocks $1, \ldots, J$. Step 4: append Redundancy 3 $=$ the $(k+1)$-fold repetition of $H(\text{Redundancy 1}, \text{Redundancy 2})$ (see Lemma 3.2.5 for the definition of $H$), protecting Redundancy 1 and Redundancy 2.]
Figure 3.1: Illustrating the encoding procedure of our $k$-deletion correcting code construction.
The final encoding/decoding algorithm of our code, which combines the results in Lemma 3.2.2, Lemma 3.2.3, and Lemma 3.2.4, is provided in Sec. 3.3. The encoding is illustrated in Fig.3.1.
Before presenting the lemmas, we give necessary definitions and notations. For a sequence $\mathbf{c} \in \{0,1\}^n$, define its deletion ball $\mathcal{B}_k(\mathbf{c})$ as the collection of sequences that share a length $n-k$ subsequence with $\mathbf{c}$.
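To make the deletion ball concrete, the following Python sketch tests membership by brute force. It assumes that both sequences have the same length $n$ and are given as strings of `0`/`1` characters; under that assumption, sharing a length $n-k$ subsequence is the same as having a longest common subsequence of length at least $n-k$. The function names `lcs_length` and `in_deletion_ball` are illustrative and do not appear in the construction.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b (standard DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def in_deletion_ball(c: str, c_prime: str, k: int) -> bool:
    """True if c_prime shares a length len(c)-k subsequence with c.

    Assumption: c and c_prime have equal length, as in the text.
    """
    if len(c) != len(c_prime):
        raise ValueError("both sequences must have the same length")
    return lcs_length(c, c_prime) >= len(c) - k


if __name__ == "__main__":
    print(in_deletion_ball("0110", "1100", 1))
    print(in_deletion_ball("0000", "1111", 3))
```

The dynamic program takes time quadratic in $n$, which is enough for experimenting with short sequences.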
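Definition 3.2.1. A sequence $\mathbf{c} \in \{0,1\}^n$ is said to be $k$-dense if the length of every 0-run in $\mathbb{1}_{sync}(\mathbf{c})$ is at most
$$L \triangleq (\lceil \log k \rceil + 5)\, 2^{\lceil \log k \rceil + 9} \lceil \log n \rceil + (3k + \lceil \log k \rceil + 4)(\lceil \log n \rceil + 9 + \lceil \log k \rceil). \tag{3.4}$$
For $k$-dense $\mathbf{c}$, the index distance between any two consecutive 1 entries in $\mathbb{1}_{sync}(\mathbf{c})$ is at most $L+1$, i.e., the 0-runs between two 1 entries have length at most $L$.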
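As a numerical illustration of (3.4), the sketch below evaluates the threshold $L$ and checks the zero-run condition on a given synchronization vector. Two assumptions are made that are not stated in this section: logarithms are taken base 2, and the synchronization vector has the same length $n$ as the sequence. The vector is passed in directly because its definition (3.1) is not reproduced here; all function names are illustrative.

```python
def ceil_log2(x: int) -> int:
    """Return ceil(log2(x)) for a positive integer x, using integer arithmetic."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    return (x - 1).bit_length()


def dense_threshold(n: int, k: int) -> int:
    """Evaluate L from (3.4), assuming base-2 logarithms."""
    log_k = ceil_log2(k)
    log_n = ceil_log2(n)
    return (log_k + 5) * 2 ** (log_k + 9) * log_n + (3 * k + log_k + 4) * (
        log_n + 9 + log_k
    )


def max_zero_run(sync_vector) -> int:
    """Length of the longest run of zeros in a 0/1 vector."""
    longest = current = 0
    for bit in sync_vector:
        current = current + 1 if bit == 0 else 0
        longest = max(longest, current)
    return longest


def is_k_dense(sync_vector, k: int) -> bool:
    """Check the zero-run condition of Definition 3.2.1 on a given sync vector."""
    return max_zero_run(sync_vector) <= dense_threshold(len(sync_vector), k)


if __name__ == "__main__":
    print(dense_threshold(1024, 1))
```

For $n = 1024$ and $k = 1$ the threshold already exceeds $n$, so under these assumptions the condition only becomes restrictive for much longer sequences.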
Note that $f(\mathbf{c})$ is an integer vector that can be represented by $\log\bigl((3k)^{6k+1} n^{(3k+1)(6k+1)}\bigr)$ bits or by an integer in the range $[0, (3k)^{6k+1} n^{(3k+1)(6k+1)} - 1]$. In this chapter, we use $f(\mathbf{c})$ interchangeably to denote its binary representation or its integer representation. The following lemma shows that the synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$ can be recovered from $k$ deletions with the help of $f(\mathbb{1}_{sync}(\mathbf{c}))$. Its proof will be given in Sec. 3.4.
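Lemma 3.2.1. For integers $n$ and $k$ and sequences $\mathbf{c}, \mathbf{c}'$, if $\mathbf{c}' \in \mathcal{B}_k(\mathbf{c})$ and $f(\mathbb{1}_{sync}(\mathbf{c})) = f(\mathbb{1}_{sync}(\mathbf{c}'))$, then $\mathbb{1}_{sync}(\mathbf{c}) = \mathbb{1}_{sync}(\mathbf{c}')$.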
By virtue of Lemma 3.2.1, the synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$ can be recovered from $k$ deletions in $\mathbf{c}$ with the help of a hash $f(\mathbb{1}_{sync}(\mathbf{c}))$ of size $O(k^2 \log n)$ bits.
To further reduce the size of the hash to $O(k \log n)$ bits, we apply modulo operations on $f(\mathbb{1}_{sync}(\mathbf{c}))$ in the following lemma, which will be proved in Sec. 3.4.
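Lemma 3.2.2. For integers $n$ and $k = o(\sqrt{\log\log n})$, there exists a function $p : \{0,1\}^n \to [1, 2^{2k\log n + o(\log n)}]$ such that if $f(\mathbb{1}_{sync}(\mathbf{c})) \equiv f(\mathbb{1}_{sync}(\mathbf{c}')) \bmod p(\mathbf{c})$ for two sequences $\mathbf{c} \in \{0,1\}^n$ and $\mathbf{c}' \in \mathcal{B}_k(\mathbf{c})$, then $\mathbb{1}_{sync}(\mathbf{c}) = \mathbb{1}_{sync}(\mathbf{c}')$. Hence if
$$\bigl(f(\mathbb{1}_{sync}(\mathbf{c})) \bmod p(\mathbf{c}),\ p(\mathbf{c})\bigr) = \bigl(f(\mathbb{1}_{sync}(\mathbf{c}')) \bmod p(\mathbf{c}'),\ p(\mathbf{c}')\bigr)$$
and $\mathbf{c}' \in \mathcal{B}_k(\mathbf{c})$, we have that $\mathbb{1}_{sync}(\mathbf{c}) = \mathbb{1}_{sync}(\mathbf{c}')$.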
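Lemma 3.2.2 presents a hash of size $4k\log n + o(\log n)$ bits for correcting $\mathbb{1}_{sync}(\mathbf{c})$, since each of the two components of the pair takes $2k\log n + o(\log n)$ bits. With the knowledge of the synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$, the next lemma shows that the sequence $\mathbf{c}$ can be further recovered using another $4k\log n + o(\log n)$ bit hash when $\mathbf{c}$ is $k$-dense, i.e., when the synchronization pattern occurs frequently enough in $\mathbf{c}$. The proof of Lemma 3.2.3 will be given in Sec. 3.5.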
Lemma 3.2.3. For integers $n$ and $k = o(\sqrt{\log\log n})$, there exists a function $Hash_k : \{0,1\}^n \to \{0,1\}^{4k\log n + o(\log n)}$ such that every $k$-dense sequence $\mathbf{c} \in \{0,1\}^n$ can be recovered, given its synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$, its length $n-k$ subsequence $\mathbf{d}$, and $Hash_k(\mathbf{c})$.
Combining Lemma 3.2.2 and Lemma 3.2.3, we obtain a hash function of size $O(k\log n)$ bits that corrects deletions in a $k$-dense sequence. To encode an arbitrary sequence $\mathbf{c} \in \{0,1\}^n$, the following lemma gives a mapping that transforms any sequence into a $k$-dense sequence. The details will be given in Sec. 3.6.
Lemma 3.2.4. For integers $k$ and $n > k$, there exists a map $T : \{0,1\}^n \to \{0,1\}^{n + 3k + 3\lceil \log n \rceil + 15}$, computable in $poly(n, k)$ time, such that $T(\mathbf{c})$ is a $k$-dense sequence for $\mathbf{c} \in \{0,1\}^n$. Moreover, the sequence $\mathbf{c}$ can be recovered from $T(\mathbf{c})$.
The next three lemmas (Lemma 3.2.5, Lemma 3.2.6, and Lemma 3.2.7) present existence results that are necessary to prove Lemma 3.2.2 and Lemma 3.2.3.
Lemma 3.2.5 gives a $k$-deletion correcting hash function for short sequences, which is the block hash described above in proving Lemma 3.2.3. It is an extension of the result in [12]. Lemma 3.2.6 is a slight variation of the result in [64]. It shows that correcting deletions is equivalent to correcting combinations of deletions and insertions.
Lemma 3.2.6 will be used in the proof of Lemma 3.2.1, where we need an upper bound on the number of deletions/insertions in $\mathbb{1}_{sync}(\mathbf{c})$ caused by a deletion in $\mathbf{c}$.
Lemma 3.2.7 (see [71]) gives an upper bound on the number of divisors of a positive integer $n$. With Lemma 3.2.7, we show that the VT generalization in Lemma 3.2.1 can be compressed by taking modulo operations. The details will be given in the proof of Lemma 3.2.2.
Lemma 3.2.5. For any integers $w$, $n$, and $k$, there exists a hash function $H : \{0,1\}^w \to \{0,1\}^{\lceil w/\lceil \log n \rceil \rceil (2k\log\log n + O(1))}$, computable in $O_k((w/\log n)\, n \log^{2k} n)$ time, such that any sequence $\mathbf{c} \in \{0,1\}^w$ can be recovered from its length $w-k$ subsequence $\mathbf{d}$ and the hash $H(\mathbf{c})$.
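Proof. We first show by a counting argument the existence of a hash function $H' : \{0,1\}^{\lceil \log n \rceil} \to \{0,1\}^{2k\log\log n + O(1)}$, exhaustively computable in $O_k(n \log^{2k} n)$ time, such that $H'(\mathbf{s}) \neq H'(\mathbf{s}')$ for all $\mathbf{s} \in \{0,1\}^{\lceil \log n \rceil}$ and $\mathbf{s}' \in \mathcal{B}_k(\mathbf{s}) \setminus \{\mathbf{s}\}$. The hash $H'(\mathbf{s})$ protects the sequence $\mathbf{s} \in \{0,1\}^{\lceil \log n \rceil}$ from $k$ deletions. Note that
$$|\mathcal{B}_k(\mathbf{s})| \le \binom{\lceil \log n \rceil}{k}^2 2^k \le 2\lceil \log n \rceil^{2k}.$$
Hence it suffices to use brute force and greedily assign a hash value to each sequence $\mathbf{s} \in \{0,1\}^{\lceil \log n \rceil}$ such that $H'(\mathbf{s}) \neq H'(\mathbf{s}')$ for all $\mathbf{s}' \in \mathcal{B}_k(\mathbf{s}) \setminus \{\mathbf{s}\}$. Since the size of $\mathcal{B}_k(\mathbf{s})$ is upper bounded by $2\lceil \log n \rceil^{2k}$, there always exists such a hash $H'(\mathbf{s}) \in \{0,1\}^{\log(2\lceil \log n \rceil^{2k} + 1)}$ that has no conflict with the hash values of sequences in $\mathcal{B}_k(\mathbf{s})$. The total complexity is $O_k(n \log^{2k} n)$ and the size of the hash value $H'(\mathbf{s})$ is $2k\log\log n + O(1)$.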
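The greedy assignment in this step can be tried out on very short sequences. The sketch below is a brute-force illustration only, not the authors' implementation: it enumerates all binary strings of a small length `m` (standing in for $\lceil \log n \rceil$), gives each string the smallest hash value not already used by an earlier string in its deletion ball, and recovers a string from a length $m-k$ subsequence together with its hash value. The names `greedy_hash` and `recover` are hypothetical.

```python
from itertools import combinations, product


def subsequences(s: str, length: int) -> set:
    """All subsequences of s that have the given length."""
    return {
        "".join(s[i] for i in idx) for idx in combinations(range(len(s)), length)
    }


def greedy_hash(m: int, k: int) -> dict:
    """Assign hash values so that distinct strings in a common deletion ball differ."""
    seqs = ["".join(bits) for bits in product("01", repeat=m)]
    subs = {s: subsequences(s, m - k) for s in seqs}
    hash_of = {}
    for s in seqs:
        # Values already taken by earlier strings sharing a length m-k subsequence with s.
        taken = {h for t, h in hash_of.items() if subs[s] & subs[t]}
        value = 0
        while value in taken:
            value += 1
        hash_of[s] = value
    return hash_of


def recover(d: str, h: int, hash_of: dict) -> str:
    """Find the unique string with hash value h that contains d as a subsequence."""
    matches = [
        s for s, v in hash_of.items() if v == h and d in subsequences(s, len(d))
    ]
    if len(matches) != 1:
        raise ValueError("no unique preimage for this subsequence and hash")
    return matches[0]


if __name__ == "__main__":
    table = greedy_hash(5, 1)
    print(recover("0110", table["01010"], table))
```

Because each string avoids at most $|\mathcal{B}_k(\mathbf{s})| - 1$ values, the largest value assigned never exceeds the ball-size bound, which is what keeps the hash short.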
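Now split $\mathbf{c}$ into $\lceil w/\lceil \log n \rceil \rceil$ blocks $(c_{(i-1)\lceil \log n \rceil + 1}, \ldots, c_{i\lceil \log n \rceil})$, $i \in [1, \lceil w/\lceil \log n \rceil \rceil]$, of length $\lceil \log n \rceil$. If the length of the last block is less than $\lceil \log n \rceil$, add zeros to the end of the last block such that its length is $\lceil \log n \rceil$. Assign a hash value $\mathbf{h}_i = H'((c_{(i-1)\lceil \log n \rceil + 1}, \ldots, c_{i\lceil \log n \rceil}))$, $i \in [1, \lceil w/\lceil \log n \rceil \rceil]$, to each block. Let $H(\mathbf{c}) = (\mathbf{h}_1, \ldots, \mathbf{h}_{\lceil w/\lceil \log n \rceil \rceil})$ be the concatenation of $\mathbf{h}_i$ for $i \in [1, \lceil w/\lceil \log n \rceil \rceil]$.
We show that $H(\mathbf{c})$ protects $\mathbf{c}$ from $k$ deletions. Let $\mathbf{d}$ be a length $w-k$ subsequence of $\mathbf{c}$. Note that $d_{(i-1)\lceil \log n \rceil + 1}$ and $d_{i\lceil \log n \rceil - k}$ come from bits $c_{(i-1)\lceil \log n \rceil + 1 + x}$ and $c_{i\lceil \log n \rceil - k + y}$, respectively, after deletions in $\mathbf{c}$, where the integers $x, y \in [0, k]$. Therefore, $(d_{(i-1)\lceil \log n \rceil + 1}, \ldots, d_{i\lceil \log n \rceil - k})$ is a length $\lceil \log n \rceil - k$ subsequence of $(c_{(i-1)\lceil \log n \rceil + 1 + x}, \ldots, c_{i\lceil \log n \rceil - k + y})$, and thus a subsequence of the block $(c_{(i-1)\lceil \log n \rceil + 1}, \ldots, c_{i\lceil \log n \rceil})$. Hence the $i$-th block $(c_{(i-1)\lceil \log n \rceil + 1}, \ldots, c_{i\lceil \log n \rceil})$ can be recovered from $\mathbf{h}_i = H'((c_{(i-1)\lceil \log n \rceil + 1}, \ldots, c_{i\lceil \log n \rceil}))$ and $(d_{(i-1)\lceil \log n \rceil + 1}, \ldots, d_{i\lceil \log n \rceil - k})$. Therefore, $\mathbf{c}$ can be recovered given $\mathbf{d}$ and $H(\mathbf{c})$. The length of $H(\mathbf{c})$ is $\lceil w/\lceil \log n \rceil \rceil \cdot (2k\log\log n + O(1))$ and the complexity of computing $H(\mathbf{c})$ is $O_k((w/\log n)\, n \log^{2k} n)$. □
Lemma 3.2.6. Let $r$, $s$, and $k$ be integers satisfying $r + s \le k$. For sequences $\mathbf{c}, \mathbf{c}' \in \{0,1\}^n$, if $\mathbf{c}'$ and $\mathbf{c}$ share a common resulting sequence after $r$ deletions and $s$ insertions in both, then $\mathbf{c}' \in \mathcal{B}_k(\mathbf{c})$.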
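The statement of Lemma 3.2.6 can be sanity-checked exhaustively for very small parameters. The Python sketch below is only such a check and proves nothing in general. It assumes that the $r$ deletions are applied before the $s$ insertions, and it tests membership in $\mathcal{B}_k(\mathbf{c})$ by intersecting the sets of length $n-k$ subsequences. All function names are illustrative.

```python
from itertools import combinations, product


def delete_ball(c: str, r: int) -> set:
    """All sequences obtained from c by deleting exactly r symbols."""
    n = len(c)
    return {"".join(c[i] for i in keep) for keep in combinations(range(n), n - r)}


def insert_ball(c: str, s: int) -> set:
    """All sequences obtained from c by inserting exactly s bits."""
    result = {c}
    for _ in range(s):
        result = {
            x[:i] + b + x[i:] for x in result for i in range(len(x) + 1) for b in "01"
        }
    return result


def reachable(c: str, r: int, s: int) -> set:
    """Sequences reachable from c by r deletions followed by s insertions."""
    return {z for d in delete_ball(c, r) for z in insert_ball(d, s)}


def in_deletion_ball(c: str, c_prime: str, k: int) -> bool:
    """True if c and c_prime share a length len(c)-k subsequence."""
    return bool(delete_ball(c, k) & delete_ball(c_prime, k))


def check_lemma(n: int, r: int, s: int, k: int) -> bool:
    """Check the implication of Lemma 3.2.6 over all pairs of length-n sequences."""
    seqs = ["".join(bits) for bits in product("01", repeat=n)]
    reach = {c: reachable(c, r, s) for c in seqs}
    return all(
        in_deletion_ball(c, c2, k)
        for c in seqs
        for c2 in seqs
        if reach[c] & reach[c2]
    )


if __name__ == "__main__":
    print(check_lemma(4, 1, 1, 2))
```

With $n = 4$ and $r = s = 1$, the pair `0110` and `1001` both reach `0101` but share no length-3 subsequence, so the implication fails for $k = 1$; this shows that the condition $r + s \le k$ cannot simply be dropped.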
Lemma 3.2.7. For a positive integer $n \ge 3$, the number of divisors of $n$ is upper bounded by $2^{1.6 \ln n/(\ln\ln n)}$.
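The bound in Lemma 3.2.7 can be compared with exact divisor counts for small integers. The sketch below is a numerical illustration under the assumption that floating-point evaluation of the bound is accurate enough in the tested range; the function names are illustrative.

```python
import math


def num_divisors(n: int) -> int:
    """Count the positive divisors of n by trial division up to sqrt(n)."""
    count = 0
    for d in range(1, math.isqrt(n) + 1):
        if n % d == 0:
            # d and n // d are both divisors; they coincide when d * d == n.
            count += 1 if d * d == n else 2
    return count


def divisor_bound(n: int) -> float:
    """Evaluate 2^(1.6 ln n / ln ln n); requires n >= 3 so that ln ln n > 0."""
    if n < 3:
        raise ValueError("the bound is stated for n >= 3")
    return 2 ** (1.6 * math.log(n) / math.log(math.log(n)))


if __name__ == "__main__":
    for n in (12, 360, 5040):
        print(n, num_divisors(n), divisor_bound(n))
```

The bound is loose for small $n$ because $\ln\ln n$ is close to zero there; it becomes meaningful only for large integers.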
The proofs of Lemma 3.2.1, Lemma 3.2.2, Lemma 3.2.3, and Lemma 3.2.4 rely on several propositions, the details of which will be presented in the next sections.
For convenience, a dependency graph for the theorem, lemmas, and propositions is given in Fig.3.2.