
Chapter III: Binary Codes Correcting $k$ Deletions/Insertions

3.2 Outline and Preliminaries

Table 3.1: Summary of Notations in Ch. 3

$\mathcal{B}_k(\mathbf{c})$ ≜ The set of sequences that share a length $n-k$ subsequence with $\mathbf{c} \in \{0,1\}^n$.

$\mathcal{R}_m$ ≜ The set of sequences where any two 1 entries are separated by at least $m-1$ zeros.

$\mathbb{1}_{sync}(\mathbf{c})$ ≜ Synchronization vector defined in (3.1).

$\mathbf{m}^{(e)}$ ≜ Weights of order $e$ in the VT generalization, defined in (3.2).

$f(\mathbf{c})$ ≜ Generalized VT redundancy defined in (3.3).

$L$ ≜ Maximal length of zero runs in the synchronization vector of a $k$-dense sequence, defined in (3.4).

$p(\mathbf{c})$ ≜ The function used to compute the redundancy protecting the synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$, defined in Lemma 3.2.2.

$Hash_k(\mathbf{c})$ ≜ The deletion correcting hash function for $k$-dense sequences, defined in Lemma 3.2.3.

$T(\mathbf{c})$ ≜ The function that generates $k$-dense sequences, defined in Lemma 3.2.4.

$H(\mathbf{c})$ ≜ Deletion correcting hash function for any sequence, defined in Lemma 3.2.5.

[Figure 3.1 (flowchart). A sequence $\mathbf{c}$ is mapped to the $k$-dense sequence $T(\mathbf{c})$ (see Lemma 3.2.4 for the definition of $T$), drawn as blocks $1, \ldots, J$ marked by the synchronization pattern. Redundancy 1 $= (f(\mathbb{1}_{sync}(T(\mathbf{c}))) \bmod p(\mathbf{c}),\ p(\mathbf{c}))$ is appended to protect $\mathbb{1}_{sync}(\mathbf{c})$ (see Lemma 3.2.2 for the definitions of $f$ and $p$). Redundancy 2 $= Hash_k(\mathbf{c})$ is appended to protect blocks $1, \ldots, J$ (see Lemma 3.2.3 for the definition of $Hash_k$). Redundancy 3, the $(k+1)$-fold repetition of $H(\text{Redundancy 1}, \text{Redundancy 2})$, is appended to protect Redundancy 1 and Redundancy 2 (see Lemma 3.2.5 for the definition of $H$).]

Figure 3.1: Illustrating the encoding procedure of our $k$-deletion correcting code construction.

The final encoding/decoding algorithm of our code, which combines the results in Lemma 3.2.2, Lemma 3.2.3, and Lemma 3.2.4, is provided in Sec. 3.3. The encoding is illustrated in Fig. 3.1.

Before presenting the lemmas, we give the necessary definitions and notation. For a sequence $\mathbf{c} \in \{0,1\}^n$, define its deletion ball $\mathcal{B}_k(\mathbf{c})$ as the collection of sequences that share a length $n-k$ subsequence with $\mathbf{c}$.
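For very short sequences the deletion ball can be enumerated by brute force, which may help make the definition concrete. The sketch below is illustrative only: the function names are hypothetical, sequences are modelled as Python strings over "0"/"1", and the ball is assumed to contain only sequences of the same length $n$ as $\mathbf{c}$.

```python
from itertools import combinations, product


def subsequences(c: str, length: int) -> set[str]:
    """All distinct subsequences of c that have exactly the given length."""
    return {"".join(c[i] for i in idx)
            for idx in combinations(range(len(c)), length)}


def deletion_ball(c: str, k: int) -> set[str]:
    """Length-n binary strings sharing a length n-k subsequence with c.

    Assumption: ball members are restricted to length n = len(c).
    Exponential in n, so only usable for tiny examples.
    """
    n = len(c)
    targets = subsequences(c, n - k)
    ball = set()
    for bits in product("01", repeat=n):
        candidate = "".join(bits)
        if subsequences(candidate, n - k) & targets:
            ball.add(candidate)
    return ball
```

For example, `deletion_ball("000", 1)` consists of the length-3 strings with at least two zeros, since each of them contains "00" as a subsequence.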

Definition 3.2.1. A sequence $\mathbf{c} \in \{0,1\}^n$ is said to be $k$-dense if the length of every 0-run in $\mathbb{1}_{sync}(\mathbf{c})$ is at most

$$L \triangleq (\lceil \log k \rceil + 5)\, 2^{\lceil \log k \rceil + 9} \lceil \log n \rceil + (3k + \lceil \log k \rceil + 4)(\lceil \log n \rceil + 9 + \lceil \log k \rceil). \tag{3.4}$$

For $k$-dense $\mathbf{c}$, the index distance between any two consecutive 1 entries in $\mathbb{1}_{sync}(\mathbf{c})$ is at most $L+1$, i.e., the 0-runs between two 1 entries have length at most $L$.
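The bound $L$ in (3.4) is easy to evaluate numerically. The following sketch assumes that $\log$ denotes the base-2 logarithm, and it takes the synchronization vector as an input list of bits because its definition (3.1) lies outside this section. All function names are hypothetical.

```python
def ceil_log2(x: int) -> int:
    """Exact ceil(log2(x)) for a positive integer x (no floating point)."""
    return (x - 1).bit_length()


def dense_run_bound(n: int, k: int) -> int:
    """Evaluate L from (3.4), assuming logarithms are base 2."""
    lk, ln = ceil_log2(k), ceil_log2(n)
    return (lk + 5) * 2 ** (lk + 9) * ln + (3 * k + lk + 4) * (ln + 9 + lk)


def max_zero_run(bits: list[int]) -> int:
    """Length of the longest run of zeros in a 0/1 list."""
    longest = current = 0
    for b in bits:
        current = current + 1 if b == 0 else 0
        longest = max(longest, current)
    return longest


def is_k_dense(sync_vector: list[int], n: int, k: int) -> bool:
    """Check Definition 3.2.1 on a precomputed synchronization vector."""
    return max_zero_run(sync_vector) <= dense_run_bound(n, k)
```

Under these assumptions `dense_run_bound(1024, 2)` evaluates to 61660, which already exceeds $n$, so the density condition only becomes restrictive for considerably larger $n$.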

Note that $f(\mathbf{c})$ is an integer vector that can be represented by $\log\big((3k)^{6k+1} n^{(3k+1)(6k+1)}\big)$ bits or by an integer in the range $[0, (3k)^{6k+1} n^{(3k+1)(6k+1)} - 1]$. In this chapter, we use $f(\mathbf{c})$ interchangeably to denote its binary representation or its integer representation. The following lemma shows that the synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$ can be recovered from $k$ deletions with the help of $f(\mathbb{1}_{sync}(\mathbf{c}))$. Its proof will be given in Sec. 3.4.

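Lemma 3.2.1. For integers $n$ and $k$ and sequences $\mathbf{c}, \mathbf{c}'$, if $\mathbf{c}' \in \mathcal{B}_k(\mathbf{c})$ and $f(\mathbb{1}_{sync}(\mathbf{c})) = f(\mathbb{1}_{sync}(\mathbf{c}'))$, then $\mathbb{1}_{sync}(\mathbf{c}) = \mathbb{1}_{sync}(\mathbf{c}')$.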

By virtue of Lemma 3.2.1, the synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$ can be recovered from $k$ deletions in $\mathbf{c}$, with the help of a hash $f(\mathbb{1}_{sync}(\mathbf{c}))$ of size $O(k^2 \log n)$ bits.
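To see where the $O(k^2 \log n)$ figure comes from, the bit count stated earlier for $f$ can be expanded term by term. This is a short derivation, assuming $k \le n$ so that $\log(3k) = O(\log n)$:

```latex
\begin{align*}
\log\!\big((3k)^{6k+1}\, n^{(3k+1)(6k+1)}\big)
  &= (6k+1)\log(3k) + (3k+1)(6k+1)\log n \\
  &= (6k+1)\log(3k) + (18k^2 + 9k + 1)\log n \\
  &= O(k^2 \log n).
\end{align*}
```

The dominant term is $18k^2 \log n$, which is what the modulo step in the next lemma is meant to reduce.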

To further reduce the size of the hash to $O(k \log n)$ bits, we apply modulo operations on $f(\mathbb{1}_{sync}(\mathbf{c}))$ in the following lemma, which will be proved in Sec. 3.4.

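Lemma 3.2.2. For integers $n$ and $k = o(\sqrt{\log\log n})$, there exists a function $p : \{0,1\}^n \to [1, 2^{2k\log n + o(\log n)}]$, such that if $f(\mathbb{1}_{sync}(\mathbf{c})) \equiv f(\mathbb{1}_{sync}(\mathbf{c}')) \bmod p(\mathbf{c})$ for two sequences $\mathbf{c} \in \{0,1\}^n$ and $\mathbf{c}' \in \mathcal{B}_k(\mathbf{c})$, then $\mathbb{1}_{sync}(\mathbf{c}) = \mathbb{1}_{sync}(\mathbf{c}')$. Hence if

$$\big(f(\mathbb{1}_{sync}(\mathbf{c})) \bmod p(\mathbf{c}),\ p(\mathbf{c})\big) = \big(f(\mathbb{1}_{sync}(\mathbf{c}')) \bmod p(\mathbf{c}'),\ p(\mathbf{c}')\big)$$

and $\mathbf{c}' \in \mathcal{B}_k(\mathbf{c})$, we have that $\mathbb{1}_{sync}(\mathbf{c}) = \mathbb{1}_{sync}(\mathbf{c}')$.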

Lemma 3.2.2 presents a hash of size $4k\log n + o(\log n)$ bits for correcting $\mathbb{1}_{sync}(\mathbf{c})$. With the knowledge of the synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$, the next lemma shows that the sequence $\mathbf{c}$ can be further recovered using another $4k\log n + o(\log n)$ bit hash, when $\mathbf{c}$ is $k$-dense, i.e., when the synchronization pattern occurs frequently enough in $\mathbf{c}$. The proof of Lemma 3.2.3 will be given in Sec. 3.5.

Lemma 3.2.3. For integers $n$ and $k = o(\sqrt{\log\log n})$, there exists a function $Hash_k : \{0,1\}^n \to \{0,1\}^{4k\log n + o(\log n)}$, such that every $k$-dense sequence $\mathbf{c} \in \{0,1\}^n$ can be recovered, given its synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$, its length $n-k$ subsequence $\mathbf{d}$, and $Hash_k(\mathbf{c})$.

Combining Lemma 3.2.2 and Lemma 3.2.3, we obtain a hash function of size $O(k\log n)$ that corrects deletions for a $k$-dense sequence. To encode an arbitrary sequence $\mathbf{c} \in \{0,1\}^n$, a mapping that transforms any sequence to a $k$-dense sequence is given in the following lemma. The details will be given in Sec. 3.6.

Lemma 3.2.4. For integers $k$ and $n > k$, there exists a map $T : \{0,1\}^n \to \{0,1\}^{n+3k+3\lceil\log k\rceil+15}$, computable in $poly(n, k)$ time, such that $T(\mathbf{c})$ is a $k$-dense sequence for $\mathbf{c} \in \{0,1\}^n$. Moreover, the sequence $\mathbf{c}$ can be recovered from $T(\mathbf{c})$.

The next three lemmas (Lemma 3.2.5, Lemma 3.2.6, and Lemma 3.2.7) present existence results that are necessary to prove Lemma 3.2.2 and Lemma 3.2.3.

Lemma 3.2.5 gives a $k$-deletion correcting hash function for short sequences, which is the block hash used in proving Lemma 3.2.3. It is an extension of the result in [12]. Lemma 3.2.6 is a slight variation of the result in [64]. It shows the equivalence between correcting deletions and correcting a combination of deletions and insertions.

Lemma 3.2.6 will be used in the proof of Lemma 3.2.1, where we need an upper bound on the number of deletions/insertions in $\mathbb{1}_{sync}(\mathbf{c})$ caused by a deletion in $\mathbf{c}$.

Lemma 3.2.7 (see [71]) gives an upper bound on the number of divisors of a positive integer $n$. With Lemma 3.2.7, we show that the VT generalization in Lemma 3.2.1 can be compressed by taking modulo operations. The details will be given in the proof of Lemma 3.2.2.

Lemma 3.2.5. For any integers $M$, $n$, and $k$, there exists a hash function $H : \{0,1\}^M \to \{0,1\}^{\lceil M/\lceil\log n\rceil\rceil (2k\log\log n + O(1))}$, computable in $O_k((M/\log n)\, n \log^{2k} n)$ time, such that any sequence $\mathbf{c} \in \{0,1\}^M$ can be recovered from its length $M-k$ subsequence $\mathbf{d}$ and the hash $H(\mathbf{c})$.

Proof. We first show by a counting argument the existence of a hash function $H' : \{0,1\}^{\lceil\log n\rceil} \to \{0,1\}^{2k\log\log n + O(1)}$, exhaustively computable in $O_k(n \log^{2k} n)$ time, such that $H'(\mathbf{s}) \ne H'(\mathbf{s}')$ for all $\mathbf{s} \in \{0,1\}^{\lceil\log n\rceil}$ and $\mathbf{s}' \in \mathcal{B}_k(\mathbf{s}) \setminus \{\mathbf{s}\}$. The hash $H'(\mathbf{s})$ protects the sequence $\mathbf{s} \in \{0,1\}^{\lceil\log n\rceil}$ from $k$ deletions. Note that

$$|\mathcal{B}_k(\mathbf{s})| \le \binom{\lceil\log n\rceil}{k}^2 2^k \le 2\lceil\log n\rceil^{2k}.$$

Hence it suffices to use brute force and greedily assign a hash value to each sequence $\mathbf{s} \in \{0,1\}^{\lceil\log n\rceil}$ such that $H'(\mathbf{s}) \ne H'(\mathbf{s}')$ for all $\mathbf{s}' \in \mathcal{B}_k(\mathbf{s}) \setminus \{\mathbf{s}\}$. Since the size of $\mathcal{B}_k(\mathbf{s})$ is upper bounded by $2\lceil\log n\rceil^{2k}$, there always exists such a hash $H'(\mathbf{s}) \in \{0,1\}^{\log(2\lceil\log n\rceil^{2k}+1)}$ that has no conflict with the hash values of the sequences in $\mathcal{B}_k(\mathbf{s})$. The total complexity is $O_k(n \log^{2k} n)$ and the size of the hash value $H'(\mathbf{s})$ is $2k\log\log n + O(1)$.
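The greedy assignment described in this paragraph can be tried out on tiny parameters. The sketch below is one way to realize it, not necessarily the exact procedure intended in the proof: `m` stands in for $\lceil\log n\rceil$, hash values are kept as small integers rather than bit strings, and all function names are hypothetical.

```python
from itertools import combinations, product


def subsequences(s: str, length: int) -> set[str]:
    """All distinct subsequences of s that have exactly the given length."""
    return {"".join(s[i] for i in idx)
            for idx in combinations(range(len(s)), length)}


def greedy_hash(m: int, k: int) -> dict[str, int]:
    """Assign each s in {0,1}^m the smallest value not used by any
    previously processed s' that shares a length m-k subsequence with s."""
    seqs = ["".join(b) for b in product("01", repeat=m)]
    subs = {s: subsequences(s, m - k) for s in seqs}
    table: dict[str, int] = {}
    for s in seqs:
        taken = {table[t] for t in table if subs[s] & subs[t]}
        value = 0
        while value in taken:
            value += 1
        table[s] = value
    return table


def recover(d: str, value: int, table: dict[str, int]) -> str:
    """Recover s from a length m-k subsequence d and its hash value.

    Any two candidates containing d lie in each other's deletion ball,
    so the hash value singles out exactly one of them.
    """
    (s,) = [t for t, v in table.items()
            if v == value and d in subsequences(t, len(d))]
    return s
```

With `table = greedy_hash(5, 1)`, calling `recover(d, table[s], table)` returns `s` for every `s` and every length-4 subsequence `d` of `s`.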

Now split $\mathbf{c}$ into $\lceil M/\lceil\log n\rceil\rceil$ blocks $(c_{(i-1)\lceil\log n\rceil+1}, \ldots, c_{i\lceil\log n\rceil})$, $i \in [1, \lceil M/\lceil\log n\rceil\rceil]$, of length $\lceil\log n\rceil$. If the length of the last block is less than $\lceil\log n\rceil$, add zeros to the end of the last block such that its length is $\lceil\log n\rceil$. Assign a hash value $\mathbf{h}_i = H'((c_{(i-1)\lceil\log n\rceil+1}, \ldots, c_{i\lceil\log n\rceil}))$, $i \in [1, \lceil M/\lceil\log n\rceil\rceil]$, to each block. Let $H(\mathbf{c}) = (\mathbf{h}_1, \ldots, \mathbf{h}_{\lceil M/\lceil\log n\rceil\rceil})$ be the concatenation of $\mathbf{h}_i$ for $i \in [1, \lceil M/\lceil\log n\rceil\rceil]$.
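The block construction, together with the window-based recovery argued in the proof, can be sketched for toy sizes as follows. This is an illustration under stated assumptions rather than the construction's actual implementation: `B` plays the role of $\lceil\log n\rceil$ and must exceed `k`, the block hash $H'$ is built by the same greedy rule as in the proof, the decoder is assumed to know $M$, and all names are hypothetical.

```python
from itertools import combinations, product


def subsequences(s: str, length: int) -> set[str]:
    """All distinct subsequences of s that have exactly the given length."""
    return {"".join(s[i] for i in idx)
            for idx in combinations(range(len(s)), length)}


def greedy_hash(m: int, k: int) -> dict[str, int]:
    """Greedy H' on {0,1}^m: conflicting sequences get distinct values."""
    seqs = ["".join(b) for b in product("01", repeat=m)]
    subs = {s: subsequences(s, m - k) for s in seqs}
    table: dict[str, int] = {}
    for s in seqs:
        taken = {table[t] for t in table if subs[s] & subs[t]}
        value = 0
        while value in taken:
            value += 1
        table[s] = value
    return table


def block_hash(c: str, B: int, table: dict[str, int]) -> list[int]:
    """H(c): zero-pad c to a multiple of B and hash each length-B block."""
    padded = c + "0" * (-len(c) % B)
    return [table[padded[i:i + B]] for i in range(0, len(padded), B)]


def block_recover(d: str, M: int, B: int, k: int,
                  hashes: list[int], table: dict[str, int]) -> str:
    """Recover c (length M) from a length M-k subsequence d and H(c)."""
    d = d + "0" * (-M % B)  # mirror the encoder's zero padding
    blocks = []
    for i, h in enumerate(hashes):
        # These B-k symbols of d can only originate from block i of c.
        window = d[i * B:(i + 1) * B - k]
        match = [s for s, v in table.items()
                 if v == h and window in subsequences(s, B - k)]
        blocks.append(match[0])
    return "".join(blocks)[:M]
```

For instance, with `B = 4`, `k = 1` and `c = "1011001110"`, deleting any single symbol of `c` and calling `block_recover` with `block_hash(c, 4, table)` returns `c` again.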

We show that $H(\mathbf{c})$ protects $\mathbf{c}$ from $k$ deletions. Let $\mathbf{d}$ be a length $M-k$ subsequence of $\mathbf{c}$. Note that $d_{(i-1)\lceil\log n\rceil+1}$ and $d_{i\lceil\log n\rceil-k}$ come from bits $c_{(i-1)\lceil\log n\rceil+1+x}$ and $c_{i\lceil\log n\rceil-k+y}$ respectively after deletions in $\mathbf{c}$, where the integers $x, y \in [0, k]$. Therefore, $(d_{(i-1)\lceil\log n\rceil+1}, \ldots, d_{i\lceil\log n\rceil-k})$ is a length $\lceil\log n\rceil - k$ subsequence of $(c_{(i-1)\lceil\log n\rceil+1+x}, \ldots, c_{i\lceil\log n\rceil-k+y})$, and thus a subsequence of the block $(c_{(i-1)\lceil\log n\rceil+1}, \ldots, c_{i\lceil\log n\rceil})$. Hence the $i$-th block $(c_{(i-1)\lceil\log n\rceil+1}, \ldots, c_{i\lceil\log n\rceil})$ can be recovered from $\mathbf{h}_i = H'(c_{(i-1)\lceil\log n\rceil+1}, \ldots, c_{i\lceil\log n\rceil})$ and $(d_{(i-1)\lceil\log n\rceil+1}, \ldots, d_{i\lceil\log n\rceil-k})$. Therefore, $\mathbf{c}$ can be recovered given $\mathbf{d}$ and $H(\mathbf{c})$. The length of $H(\mathbf{c})$ is $\lceil M/\lceil\log n\rceil\rceil \cdot (2k\log\log n + O(1))$ and the complexity of computing $H(\mathbf{c})$ is $O_k((M/\log n)\, n \log^{2k} n)$. □

Lemma 3.2.6. Let $r$, $s$, and $k$ be integers satisfying $r + s \le k$. For sequences $\mathbf{c}, \mathbf{c}' \in \{0,1\}^n$, if $\mathbf{c}'$ and $\mathbf{c}$ share a common resulting sequence after $r$ deletions and $s$ insertions in both, then $\mathbf{c}' \in \mathcal{B}_k(\mathbf{c})$.
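The statement of Lemma 3.2.6 can be sanity-checked exhaustively for very small $n$. The sketch below is only a brute-force check, not a proof. It assumes that the $r$ deletions are applied before the $s$ insertions and takes $k = r + s$; the function names are hypothetical.

```python
from itertools import combinations, product


def subsequences(c: str, length: int) -> set[str]:
    """All distinct subsequences of c that have exactly the given length."""
    return {"".join(c[i] for i in idx)
            for idx in combinations(range(len(c)), length)}


def delete_then_insert(c: str, r: int, s: int) -> set[str]:
    """Binary strings reachable from c by r deletions and then s insertions.

    A string y of length n-r+s is reachable exactly when it contains
    some length n-r subsequence of c as a subsequence.
    """
    n = len(c)
    kept = subsequences(c, n - r)
    reachable = set()
    for bits in product("01", repeat=n - r + s):
        y = "".join(bits)
        if subsequences(y, n - r) & kept:
            reachable.add(y)
    return reachable


def lemma_holds(n: int, r: int, s: int) -> bool:
    """Check the lemma's implication for all pairs in {0,1}^n, k = r+s."""
    k = r + s
    seqs = ["".join(b) for b in product("01", repeat=n)]
    reach = {c: delete_then_insert(c, r, s) for c in seqs}
    short = {c: subsequences(c, n - k) for c in seqs}
    return all(short[a] & short[b]
               for a in seqs for b in seqs
               if reach[a] & reach[b])
```

Calling `lemma_holds(5, 1, 1)` checks every pair of length-5 sequences against the deletion ball with $k = 2$.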

Lemma 3.2.7. For a positive integer $n \ge 3$, the number of divisors of $n$ is upper bounded by $2^{1.6 \ln n/(\ln\ln n)}$.
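The divisor bound can be compared against exact divisor counts for small integers. The following sketch uses trial division and floating-point evaluation of the bound. It is a numerical illustration only, and the function names are hypothetical.

```python
from math import log


def num_divisors(n: int) -> int:
    """Count the positive divisors of n by trial division up to sqrt(n)."""
    count = 0
    d = 1
    while d * d <= n:
        if n % d == 0:
            # d and n // d are both divisors; count a square root once.
            count += 1 if d * d == n else 2
        d += 1
    return count


def divisor_bound(n: int) -> float:
    """Evaluate 2^(1.6 ln n / ln ln n) for an integer n >= 3."""
    return 2.0 ** (1.6 * log(n) / log(log(n)))
```

For example, 5040 has 60 divisors, while `divisor_bound(5040)` is a little above 82.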

The proofs of Lemma 3.2.1, Lemma 3.2.2, Lemma 3.2.3, and Lemma 3.2.4 rely on several propositions, the details of which will be presented in the next sections.

For convenience, a dependency graph for the theorem, lemmas, and propositions is given in Fig. 3.2.