
Chapter III: Binary Codes Correcting 𝑘 Deletions/Insertions

3.2 Outline and Preliminaries

Table 3.1: Summary of Notations in Ch. 3

$\mathcal{B}_k(\mathbf{c})$ ≜ The set of sequences that share a length-$(n-k)$ subsequence with $\mathbf{c} \in \{0,1\}^n$.

$\mathcal{R}_m$ ≜ The set of sequences where any two 1 entries are separated by at least $m-1$ zeros.

$\mathbb{1}_{sync}(\mathbf{c})$ ≜ Synchronization vector defined in (3.1).

$\mathbf{m}^{(e)}$ ≜ Weights of order $e$ in the VT generalization, defined in (3.2).

$f(\mathbf{c})$ ≜ Generalized VT redundancy defined in (3.3).

$L$ ≜ Maximal length of zero runs in the synchronization vector of a $k$-dense sequence, defined in (3.4).

$p(\mathbf{c})$ ≜ The function used to compute the redundancy protecting the synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$, defined in Lemma 3.2.2.

$Hash_k(\mathbf{c})$ ≜ The deletion correcting hash function for $k$-dense sequences, defined in Lemma 3.2.3.

$T(\mathbf{c})$ ≜ The function that generates $k$-dense sequences, defined in Lemma 3.2.4.

$H(\mathbf{c})$ ≜ Deletion correcting hash function for any sequence, defined in Lemma 3.2.5.

[Figure 3.1 (flow diagram). Starting from a sequence $\mathbf{c}$: (i) map $\mathbf{c}$ to the $k$-dense sequence $T(\mathbf{c})$ (see Lemma 3.2.4 for the definition of $T$), in which the synchronization patterns delimit blocks $1, \ldots, J$; (ii) append Redundancy 1 $= (f(\mathbb{1}_{sync}(T(\mathbf{c}))) \bmod p(\mathbf{c}),\ p(\mathbf{c}))$, protecting $\mathbb{1}_{sync}(\mathbf{c})$ (see Lemma 3.2.2 for the definitions of $f$ and $p$); (iii) append Redundancy 2 $= Hash_k(\mathbf{c})$, protecting blocks $1, \ldots, J$ (see Lemma 3.2.3 for the definition of $Hash_k$); (iv) append Redundancy 3, the $(k+1)$-fold repetition of $H(\text{Redundancy 1}, \text{Redundancy 2})$, protecting Redundancy 1 and Redundancy 2 (see Lemma 3.2.5 for the definition of $H$).]

Figure 3.1: Illustrating the encoding procedure of our $k$-deletion correcting code construction.

The final encoding/decoding algorithm of our code, which combines the results in Lemma 3.2.2, Lemma 3.2.3, and Lemma 3.2.4, is provided in Sec. 3.3. The encoding is illustrated in Fig. 3.1.

Before presenting the lemmas, we give necessary definitions and notations. For a sequence $\mathbf{c} \in \{0,1\}^n$, define its deletion ball $\mathcal{B}_k(\mathbf{c})$ as the collection of sequences that share a length-$(n-k)$ subsequence with $\mathbf{c}$.
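As a minimal sketch of the deletion ball definition, membership can be tested with a longest-common-subsequence computation: two sequences share a length-$(n-k)$ subsequence exactly when their longest common subsequence has length at least $n-k$. The function names below are illustrative, and the restriction to sequences of the same length $n$ is an assumption of this sketch.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (standard DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def in_deletion_ball(c: str, c_prime: str, k: int) -> bool:
    """True if c_prime has the same length as c (assumed here) and the two
    sequences share a subsequence of length len(c) - k."""
    return len(c) == len(c_prime) and lcs_length(c, c_prime) >= len(c) - k


print(in_deletion_ball("10110", "01101", 1))  # both contain 0110
print(in_deletion_ball("0011", "1100", 1))
```

The quadratic-time dynamic program is chosen only for clarity; it is not part of the construction in this chapter.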

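Definition 3.2.1. A sequence $\mathbf{c} \in \{0,1\}^n$ is said to be $k$-dense if the lengths of the 0-runs in $\mathbb{1}_{sync}(\mathbf{c})$ are at most

$$L \triangleq (\lceil \log k \rceil + 5)\, 2^{\lceil \log k \rceil + 9} \lceil \log n \rceil + (3k + \lceil \log k \rceil + 4)(\lceil \log n \rceil + 9 + \lceil \log k \rceil). \tag{3.4}$$

For $k$-dense $\mathbf{c}$, the index distance between any two consecutive 1 entries in $\mathbb{1}_{sync}(\mathbf{c})$ is at most $L+1$, i.e., the 0-runs between two 1 entries have length at most $L$.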
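To get a feel for the size of $L$ in (3.4), the expression can be evaluated directly. This is only a sketch: it assumes the logarithms are base 2 (not restated in this section), and the helper names are illustrative.

```python
def ceil_log2(x: int) -> int:
    """Exact ceil(log2(x)) for a positive integer x."""
    return (x - 1).bit_length()


def dense_run_bound(n: int, k: int) -> int:
    """Evaluate L from (3.4), assuming base-2 logarithms."""
    lk, ln = ceil_log2(k), ceil_log2(n)
    return (lk + 5) * 2 ** (lk + 9) * ln + (3 * k + lk + 4) * (ln + 9 + lk)


print(dense_run_bound(1024, 1))     # 5*512*10 + 7*19 = 25733
print(dense_run_bound(2 ** 20, 2))  # 6*1024*20 + 11*30 = 123210
```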

Note that $f(\mathbf{c})$ is an integer vector that can be represented by $\log\big((3k)^{6k+1} n^{(3k+1)(6k+1)}\big)$ bits or by an integer in the range $[0, (3k)^{6k+1} n^{(3k+1)(6k+1)} - 1]$. In this chapter, we use $f(\mathbf{c})$ interchangeably to denote its binary representation or its integer representation.

The following lemma shows that the synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$ can be recovered from $k$ deletions with the help of $f(\mathbb{1}_{sync}(\mathbf{c}))$. Its proof will be given in Sec. 3.4.

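Lemma 3.2.1. For integers $n$ and $k$ and sequences $\mathbf{c}, \mathbf{c}'$, if $\mathbf{c}' \in \mathcal{B}_k(\mathbf{c})$ and $f(\mathbb{1}_{sync}(\mathbf{c})) = f(\mathbb{1}_{sync}(\mathbf{c}'))$, then $\mathbb{1}_{sync}(\mathbf{c}) = \mathbb{1}_{sync}(\mathbf{c}')$.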

By virtue of Lemma 3.2.1, the synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$ can be recovered from $k$ deletions in $\mathbf{c}$, with the help of a hash $f(\mathbb{1}_{sync}(\mathbf{c}))$ of size $O(k^2 \log n)$ bits.

To further reduce the size of the hash to $O(k \log n)$ bits, we apply modulo operations on $f(\mathbb{1}_{sync}(\mathbf{c}))$ in the following lemma, which will be proved in Sec. 3.4.

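Lemma 3.2.2. For integers $n$ and $k = o(\sqrt{\log\log n})$, there exists a function $p : \{0,1\}^n \to [1, 2^{2k\log n + o(\log n)}]$, such that if $f(\mathbb{1}_{sync}(\mathbf{c})) \equiv f(\mathbb{1}_{sync}(\mathbf{c}')) \bmod p(\mathbf{c})$ for two sequences $\mathbf{c} \in \{0,1\}^n$ and $\mathbf{c}' \in \mathcal{B}_k(\mathbf{c})$, then $\mathbb{1}_{sync}(\mathbf{c}) = \mathbb{1}_{sync}(\mathbf{c}')$. Hence, if

$$\big(f(\mathbb{1}_{sync}(\mathbf{c})) \bmod p(\mathbf{c}),\ p(\mathbf{c})\big) = \big(f(\mathbb{1}_{sync}(\mathbf{c}')) \bmod p(\mathbf{c}'),\ p(\mathbf{c}')\big)$$

and $\mathbf{c}' \in \mathcal{B}_k(\mathbf{c})$, we have that $\mathbb{1}_{sync}(\mathbf{c}) = \mathbb{1}_{sync}(\mathbf{c}')$.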

Lemma 3.2.2 presents a hash of size $4k\log n + o(\log n)$ bits for correcting $\mathbb{1}_{sync}(\mathbf{c})$. With the knowledge of the synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$, the next lemma shows that the sequence $\mathbf{c}$ can be further recovered using another hash of $4k\log n + o(\log n)$ bits when $\mathbf{c}$ is $k$-dense, i.e., when the synchronization pattern occurs frequently enough in $\mathbf{c}$. The proof of Lemma 3.2.3 will be given in Sec. 3.5.

Lemma 3.2.3. For integers $n$ and $k = o(\sqrt{\log\log n})$, there exists a function $Hash_k : \{0,1\}^n \to \{0,1\}^{4k\log n + o(\log n)}$, such that every $k$-dense sequence $\mathbf{c} \in \{0,1\}^n$ can be recovered, given its synchronization vector $\mathbb{1}_{sync}(\mathbf{c})$, its length-$(n-k)$ subsequence $\mathbf{d}$, and $Hash_k(\mathbf{c})$.

Combining Lemma 3.2.2 and Lemma 3.2.3, we obtain a hash function of size $O(k\log n)$ that corrects deletions in a $k$-dense sequence. To encode an arbitrary sequence $\mathbf{c} \in \{0,1\}^n$, a mapping that transforms any sequence into a $k$-dense sequence is given in the following lemma. The details will be given in Sec. 3.6.

Lemma 3.2.4. For integers $k$ and $n > k$, there exists a map $T : \{0,1\}^n \to \{0,1\}^{n + 3k + 3\lceil\log k\rceil + 15}$, computable in $poly(n, k)$ time, such that $T(\mathbf{c})$ is a $k$-dense sequence for $\mathbf{c} \in \{0,1\}^n$. Moreover, the sequence $\mathbf{c}$ can be recovered from $T(\mathbf{c})$.

The next three lemmas (Lemma 3.2.5, Lemma 3.2.6, and Lemma 3.2.7) present existence results that are necessary to prove Lemma 3.2.2 and Lemma 3.2.3.

Lemma 3.2.5 gives a $k$-deletion correcting hash function for short sequences, which is the block hash described above in proving Lemma 3.2.3. It is an extension of the result in [12]. Lemma 3.2.6 is a slight variation of the result in [64]. It shows that correcting deletions is equivalent to correcting combinations of deletions and insertions.

Lemma 3.2.6 will be used in the proof of Lemma 3.2.1, where we need an upper bound on the number of deletions/insertions in $\mathbb{1}_{sync}(\mathbf{c})$ caused by a deletion in $\mathbf{c}$.

Lemma 3.2.7 (see [71]) gives an upper bound on the number of divisors of a positive integer $n$. With Lemma 3.2.7, we show that the VT generalization in Lemma 3.2.1 can be compressed by taking modulo operations. The details will be given in the proof of Lemma 3.2.2.
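The role of a divisor bound in such a compression can be illustrated with a toy sketch. A modulus $p$ fails to separate two distinct integer hash values only if $p$ divides their difference, so when the differences have few divisors, a small separating modulus must exist. The code below is one plausible reading of that idea and not the construction proved in Sec. 3.4; the function name and the integer values are hypothetical stand-ins for integer representations of $f$.

```python
def smallest_separating_modulus(value: int, confusable: list[int]) -> int:
    """Smallest p >= 1 such that value % p != v % p for every v in
    `confusable` that differs from `value` (hypothetical helper)."""
    # p separates value from v exactly when p does not divide |value - v|.
    diffs = {abs(value - v) for v in confusable if v != value}
    p = 1
    while any(d % p == 0 for d in diffs):
        p += 1
    return p


# Hypothetical integer hashes: the sequence itself and its confusable neighbors.
p = smallest_separating_modulus(100, [100, 94, 88, 130])
print(p, 100 % p, [v % p for v in (94, 88, 130)])
```

The loop always terminates, since any $p$ larger than every difference divides none of them; the point of a divisor bound is that a much smaller $p$ already works.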

Lemma 3.2.5. For any integers $w$, $n$, and $k$, there exists a hash function $H : \{0,1\}^w \to \{0,1\}^{\lceil w/\lceil\log n\rceil\rceil (2k\log\log n + O(1))}$, computable in $O_k((w/\log n)\, n\log^{2k} n)$ time, such that any sequence $\mathbf{c} \in \{0,1\}^w$ can be recovered from its length-$(w-k)$ subsequence $\mathbf{d}$ and the hash $H(\mathbf{c})$.

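Proof. We first show by a counting argument the existence of a hash function $H' : \{0,1\}^{\lceil\log n\rceil} \to \{0,1\}^{2k\log\log n + O(1)}$, computable by exhaustive search in $O_k(n\log^{2k} n)$ time, such that $H'(\mathbf{s}) \ne H'(\mathbf{s}')$ for all $\mathbf{s} \in \{0,1\}^{\lceil\log n\rceil}$ and $\mathbf{s}' \in \mathcal{B}_k(\mathbf{s}) \setminus \{\mathbf{s}\}$. The hash $H'(\mathbf{s})$ protects the sequence $\mathbf{s} \in \{0,1\}^{\lceil\log n\rceil}$ from $k$ deletions. Note that

$$|\mathcal{B}_k(\mathbf{s})| \le \binom{\lceil\log n\rceil}{k}^2 2^k \le 2\lceil\log n\rceil^{2k}.$$

Hence it suffices to use brute force and greedily assign a hash value to each sequence $\mathbf{s} \in \{0,1\}^{\lceil\log n\rceil}$ such that $H'(\mathbf{s}) \ne H'(\mathbf{s}')$ for all $\mathbf{s}' \in \mathcal{B}_k(\mathbf{s}) \setminus \{\mathbf{s}\}$. Since the size of $\mathcal{B}_k(\mathbf{s})$ is upper bounded by $2\lceil\log n\rceil^{2k}$, there always exists a hash value $H'(\mathbf{s}) \in \{0,1\}^{\log(2\lceil\log n\rceil^{2k} + 1)}$ that has no conflict with the hash values of the sequences in $\mathcal{B}_k(\mathbf{s})$. The total complexity is $O_k(n\log^{2k} n)$ and the size of the hash value $H'(\mathbf{s})$ is $2k\log\log n + O(1)$.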
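The greedy assignment can be sketched for toy parameters as follows. This is an illustration under assumptions rather than the thesis's implementation: the block length `m` stands in for $\lceil\log n\rceil$, hash values are plain integers instead of bit strings, and all function names are hypothetical.

```python
from itertools import combinations, product


def subsequences(s, k):
    """All strings obtained from s by deleting exactly k symbols."""
    m = len(s)
    return {"".join(s[i] for i in keep) for keep in combinations(range(m), m - k)}


def greedy_hash(m, k):
    """Greedily give each length-m binary string a value that differs from the
    values of all previously processed strings in its deletion ball."""
    seqs = ["".join(bits) for bits in product("01", repeat=m)]
    subs = {s: subsequences(s, k) for s in seqs}
    table = {}
    for s in seqs:
        # Values already used by strings sharing a length-(m-k) subsequence with s.
        taken = {v for t, v in table.items() if subs[s] & subs[t]}
        h = 0
        while h in taken:
            h += 1
        table[s] = h
    return table


def recover(d, h, table, k):
    """Return the unique string with hash h that has d as a subsequence."""
    candidates = [s for s, v in table.items() if v == h and d in subsequences(s, k)]
    assert len(candidates) == 1
    return candidates[0]


table = greedy_hash(6, 1)
print(recover("10110", table["101101"], table, 1))  # prints 101101
```

Because the ball relation is symmetric, comparing each string only against those already processed is enough to separate every pair within a common ball, and the largest value used never exceeds the maximum ball size.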

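Now split $\mathbf{c}$ into $\lceil w/\lceil\log n\rceil\rceil$ blocks $(c_{(i-1)\lceil\log n\rceil + 1}, \ldots, c_{i\lceil\log n\rceil})$, $i \in [1, \lceil w/\lceil\log n\rceil\rceil]$, of length $\lceil\log n\rceil$. If the length of the last block is less than $\lceil\log n\rceil$, append zeros to the end of the last block so that its length is $\lceil\log n\rceil$. Assign a hash value

$$\mathbf{h}_i = H'\big((c_{(i-1)\lceil\log n\rceil + 1}, \ldots, c_{i\lceil\log n\rceil})\big), \quad i \in [1, \lceil w/\lceil\log n\rceil\rceil],$$

to each block. Let $H(\mathbf{c}) = (\mathbf{h}_1, \ldots, \mathbf{h}_{\lceil w/\lceil\log n\rceil\rceil})$ be the concatenation of $\mathbf{h}_i$ for $i \in [1, \lceil w/\lceil\log n\rceil\rceil]$.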

We show that $H(\mathbf{c})$ protects $\mathbf{c}$ from $k$ deletions. Let $\mathbf{d}$ be a length-$(w-k)$ subsequence of $\mathbf{c}$. Note that $d_{(i-1)\lceil\log n\rceil + 1}$ and $d_{i\lceil\log n\rceil - k}$ come from bits $c_{(i-1)\lceil\log n\rceil + 1 + x}$ and $c_{i\lceil\log n\rceil - k + y}$, respectively, after deletions in $\mathbf{c}$, where the integers $x, y \in [0, k]$. Therefore, $(d_{(i-1)\lceil\log n\rceil + 1}, \ldots, d_{i\lceil\log n\rceil - k})$ is a length-$(\lceil\log n\rceil - k)$ subsequence of $(c_{(i-1)\lceil\log n\rceil + 1 + x}, \ldots, c_{i\lceil\log n\rceil - k + y})$, and thus a subsequence of the block $(c_{(i-1)\lceil\log n\rceil + 1}, \ldots, c_{i\lceil\log n\rceil})$. Hence the $i$-th block $(c_{(i-1)\lceil\log n\rceil + 1}, \ldots, c_{i\lceil\log n\rceil})$ can be recovered from $\mathbf{h}_i = H'\big((c_{(i-1)\lceil\log n\rceil + 1}, \ldots, c_{i\lceil\log n\rceil})\big)$ and $(d_{(i-1)\lceil\log n\rceil + 1}, \ldots, d_{i\lceil\log n\rceil - k})$. Therefore, $\mathbf{c}$ can be recovered given $\mathbf{d}$ and $H(\mathbf{c})$. The length of $H(\mathbf{c})$ is $\lceil w/\lceil\log n\rceil\rceil \cdot (2k\log\log n + O(1))$ and the complexity of computing $H(\mathbf{c})$ is $O_k((w/\log n)\, n\log^{2k} n)$. □

Lemma 3.2.6. Let $r$, $s$, and $k$ be integers satisfying $r + s \le k$. For sequences $\mathbf{c}, \mathbf{c}' \in \{0,1\}^n$, if $\mathbf{c}'$ and $\mathbf{c}$ share a common resulting sequence after $r$ deletions and $s$ insertions in both, then $\mathbf{c}' \in \mathcal{B}_k(\mathbf{c})$.
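The statement of Lemma 3.2.6 can be sanity-checked exhaustively for very small lengths. The sketch below is not a proof: it takes $k = r + s$, applies the deletions before the insertions (an assumption of this sketch about how the resulting sequences are generated), and uses illustrative function names.

```python
from itertools import combinations, product


def delete_any(seq, r):
    """All strings obtained from seq by deleting exactly r symbols."""
    m = len(seq)
    return {"".join(seq[i] for i in keep) for keep in combinations(range(m), m - r)}


def insert_any(seq, t):
    """All strings obtained from seq by inserting exactly t binary symbols."""
    out = {seq}
    for _ in range(t):
        out = {u[:i] + b + u[i:] for u in out for i in range(len(u) + 1) for b in "01"}
    return out


def reachable(c, n_del, n_ins):
    """Strings reachable from c by n_del deletions followed by n_ins insertions."""
    return {z for d in delete_any(c, n_del) for z in insert_any(d, n_ins)}


def in_ball(c, c2, k):
    """True if c and c2 share a subsequence of length len(c) - k."""
    return bool(delete_any(c, k) & delete_any(c2, k))


def check_lemma(n, n_del, n_ins):
    """Check the implication of Lemma 3.2.6 for all pairs of length-n strings."""
    k = n_del + n_ins
    seqs = ["".join(bits) for bits in product("01", repeat=n)]
    reach = {c: reachable(c, n_del, n_ins) for c in seqs}
    return all(in_ball(c, c2, k)
               for c in seqs for c2 in seqs if reach[c] & reach[c2])


print(check_lemma(5, 1, 1))
```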

Lemma 3.2.7. For a positive integer $n \ge 3$, the number of divisors of $n$ is upper bounded by $2^{1.6\ln n/(\ln\ln n)}$.
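The bound in Lemma 3.2.7 can be compared numerically against exact divisor counts for small $n$. The sketch below uses trial division and floating-point arithmetic; the range checked is arbitrary and the function names are illustrative.

```python
from math import log


def num_divisors(n: int) -> int:
    """Number of positive divisors of n, by trial division up to sqrt(n)."""
    count = 0
    d = 1
    while d * d <= n:
        if n % d == 0:
            # d and n // d are both divisors; count once if they coincide.
            count += 1 if d * d == n else 2
        d += 1
    return count


def divisor_bound(n: int) -> float:
    """The upper bound 2^(1.6 ln n / ln ln n) of Lemma 3.2.7, for n >= 3."""
    return 2 ** (1.6 * log(n) / log(log(n)))


print(all(num_divisors(n) <= divisor_bound(n) for n in range(3, 5000)))
```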

The proofs of Lemma 3.2.1, Lemma 3.2.2, Lemma 3.2.3, and Lemma 3.2.4 rely on several propositions, the details of which will be presented in the next sections.

For convenience, a dependency graph for the theorem, lemmas, and propositions is given in Fig. 3.2.
