Robust Indexing for Deletion/Insertion Errors

Chapter VI: Robust Indexing: Optimal Codes Correcting Deletion/Insertion

6.5 Robust Indexing for Deletion/Insertion Errors

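Let $\ell^*$ be the most significant bit where $\mathbf{a}_{i_1}$ and $\mathbf{a}_{i_2}$ differ, i.e., $(a_{i_1,1},\ldots,a_{i_1,\ell^*-1}) = (a_{i_2,1},\ldots,a_{i_2,\ell^*-1})$, $a_{i_1,\ell^*}=1$, and $a_{i_2,\ell^*}=0$. Then according to the if statement in the encoding procedure, we have that

\[
q_{i_1} - \sum_{\ell:\, a_{i_1,\ell}=1 \text{ and } \ell\in[\ell^*]} \Big(2^{L'-\ell} - N_H\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_1-1}\big)\Big) > 0
\]

and

\[
q_{i_2} - \sum_{\ell:\, a_{i_1,\ell}=1 \text{ and } \ell\in[\ell^*]} \Big(2^{L'-\ell} - N_H\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_2-1}\big)\Big) \le 0,
\]

which implies that

\begin{align*}
q_{i_2}-q_{i_1} &< \sum_{\ell:\, a_{i_1,\ell}=1 \text{ and } \ell\in[\ell^*]} \Big(2^{L'-\ell} - N_H\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_2-1}\big)\Big) \\
&\quad - \sum_{\ell:\, a_{i_1,\ell}=1 \text{ and } \ell\in[\ell^*]} \Big(2^{L'-\ell} - N_H\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_1-1}\big)\Big) \\
&= \sum_{\ell:\, a_{i_1,\ell}=1 \text{ and } \ell\in[\ell^*]} \Big(N_H\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_1-1}\big) - N_H\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_2-1}\big)\Big) \\
&= \sum_{\ell:\, a_{i_1,\ell}=1 \text{ and } \ell\in[\ell^*]} N_H\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=i_2}^{i_1-1}\big) \\
&\overset{(a)}{\le} \sum_{j=i_2}^{i_1-1} \big|\{\mathbf{c} : d_H(\mathbf{c},\mathbf{a}_j) \le 2k\}\big| \\
&= (i_1-i_2)Q, \tag{6.13}
\end{align*}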

where $(a)$ follows from the definition of $N_H(\mathbf{a},A)$ and the fact that the strings having $(a_{i_1,1},\ldots,a_{i_1,\ell_1-1},0)$ and $(a_{i_1,1},\ldots,a_{i_1,\ell_2-1},0)$ as prefixes, respectively, where $a_{i_1,\ell_1}=1$, $a_{i_1,\ell_2}=1$, and $\ell_1 \ne \ell_2$, are different. Eq. (6.13) contradicts the fact that the integers $(q_1,\ldots,q_M) = F_Q^H(d)$ satisfy $q_i - q_{i+1} > Q$ for $i \in [M-1]$, which implies $q_{i_2} - q_{i_1} \ge (i_1-i_2)Q$ for $i_1 > i_2$.

Since the calculation of $N_H(\mathbf{a},A)$ has polynomial complexity, the complexity of the encoding/decoding procedure is polynomial in $M$ and $L'$.

Theorem 6.5.1. Let $M$, $L$, and $k$ be integers and let $L' \triangleq 3\log M + 4k^2 + 1$. If $L' + 4kL' + 2k\log(4kL') \le L$, there exists a $k$-deletion code, computable in $\mathrm{poly}(M,L)$ time, that has redundancy $8k\log ML + (12k+2)\log M + O(k^3) + o(\log ML)$.
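The length condition in Theorem 6.5.1 is easy to evaluate numerically. The following is a minimal sketch, assuming base-2 logarithms and rounding $3\log M$ up to an integer (both are assumptions made here for illustration; the function names are hypothetical).

```python
from math import ceil, log2

def index_length(M: int, k: int) -> int:
    """L' = 3 log M + 4 k^2 + 1, with log taken base 2 and rounded up (assumption)."""
    return ceil(3 * log2(M)) + 4 * k * k + 1

def length_condition_holds(M: int, L: int, k: int) -> bool:
    """Check L' + 4 k L' + 2 k log(4 k L') <= L for the given parameters."""
    Lp = index_length(M, k)
    return Lp + 4 * k * Lp + 2 * k * log2(4 * k * Lp) <= L

# Example: M = 1024 strings, k = 1 deletion per string gives L' = 35,
# and the left-hand side of the condition is roughly 189.3.
print(index_length(1024, 1), length_condition_holds(1024, 200, 1))
```

For these illustrative values, strings of length 200 satisfy the condition while strings of length 180 do not.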

Similar to the construction in Sec. 6.3, we use the first $L'$ bits $(x_{i,1},\ldots,x_{i,L'})$, $i \in [M]$, of each string $\mathbf{x}_i$ as indexing bits and sort the strings $\{\mathbf{x}_i\}$ according to the lexicographic order of $\{(x_{i,1},\ldots,x_{i,L'})\}_{i=1}^{M}$. To protect the ordering, we use a Reed-Solomon code to protect the characteristic vector $\mathbb{1}(\{(x_{i,1},\ldots,x_{i,L'})\}_{i=1}^{M})$. The difference is that in this section, we construct the indexing bits $\{(x_{i,1},\ldots,x_{i,L'})\}_{i=1}^{M}$ such that the mutual deletion distance among $\{(x_{i,1},\ldots,x_{i,L'})\}_{i=1}^{M}$, rather than the mutual Hamming distance considered in Sec. 6.3, is at least $2k+1$, i.e., the deletion balls $\mathcal{D}_k((x_{i,1},\ldots,x_{i,L'}))$ and $\mathcal{D}_k((x_{j,1},\ldots,x_{j,L'}))$ do not intersect for $i \ne j$, where the deletion ball $\mathcal{D}_k(\mathbf{u})$ of a string $\mathbf{u} \in \{0,1\}^n$ is the set of all length-$(n-k)$ subsequences of $\mathbf{u}$. Define

\[
\mathcal{S}^D = \big\{ \{\mathbf{a}_1,\ldots,\mathbf{a}_M\} : \mathcal{D}_k(\mathbf{a}_i) \cap \mathcal{D}_k(\mathbf{a}_j) = \emptyset \text{ for } i \ne j \big\}.
\]
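To make the definitions of the deletion ball and of $\mathcal{S}^D$ concrete, here is a small brute-force sketch. It enumerates all deletion patterns, so it is only meant for short strings; the function names are hypothetical and not taken from the text.

```python
from itertools import combinations

def deletion_ball(u: str, k: int) -> set[str]:
    """All subsequences of u of length len(u) - k, i.e. the deletion ball D_k(u)."""
    ball = set()
    for deleted in combinations(range(len(u)), k):
        drop = set(deleted)
        ball.add("".join(bit for pos, bit in enumerate(u) if pos not in drop))
    return ball

def in_S_D(strings: list[str], k: int) -> bool:
    """True if the deletion balls of the given strings are pairwise disjoint."""
    balls = [deletion_ball(s, k) for s in strings]
    return all(balls[i].isdisjoint(balls[j])
               for i, j in combinations(range(len(balls)), 2))

print(deletion_ball("0110", 1))      # the three distinct length-3 subsequences
print(in_S_D(["1111", "0000"], 1))   # disjoint balls
print(in_S_D(["1100", "1000"], 1))   # both balls contain "100"
```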

The construction is based on the following two lemmas. The first is robust indexing for deletion/insertion errors, which is proved later in this section; the second is a deletion code construction, which we presented in Ch. 4.

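Lemma 6.5.1. For $P = 2^k\binom{L'}{k}^2$, there exists an invertible mapping

\[
F_S^D : \left[\binom{2^{L'}-(M-1)P+M-1}{M-1}\right] \to \binom{\{0,1\}^{L'}}{M},
\]

computable in $\mathrm{poly}(M,L)$ time, such that for any $d \in \left[\left\lceil \frac{(2^{L'}-MP)^{M-1}}{(M-1)!} \right\rceil\right]$, we have that $F_S^D(d) \in \mathcal{S}^D$.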

Lemma 6.5.2. (Corollary of Theorem 4.1.2) For any integer $n$ and $N = n + 4k\log n + o(\log n)$, there exists a systematic encoding function $Enc : \{0,1\}^n \to \{0,1\}^N$, computed in $O(n^{2k+1})$ time, and a decoding function $Dec : \{0,1\}^{N-k} \to \{0,1\}^n$, computed in $O(n^{k+1})$ time, such that for any $\mathbf{c} \in \{0,1\}^n$ and any subsequence $\mathbf{d} \in \{0,1\}^{N-k}$ of $Enc(\mathbf{c})$, we have that $Dec(\mathbf{d}) = \mathbf{c}$.

Code Constructions

The code construction is the same as that in Sec. 6.3 except that here, the indexing bits $\{(x_{i,1},\ldots,x_{i,L'})\}_{i=1}^{M}$ are generated using the map $F_S^D$. In addition, the deletion code of Lemma 6.5.2 is used to protect the concatenated string.

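Let the data $\mathbf{d} \in D$ to be encoded be a tuple $\mathbf{d} = (d_1,\mathbf{d}_2)$, where

\[
d_1 \in \left[\left\lceil \frac{(2^{L'}-MP)^{M-1}}{(M-1)!} \right\rceil\right]
\quad\text{and}\quad
\mathbf{d}_2 \in \{0,1\}^n
\]

such that $n + 4k\log n + o(\log n) = M(L-L') - 4kL'$, which implies that $n = M(L-L') - 4kL' - 4k\lceil\log ML\rceil - o(\log ML)$. We briefly present the encoding/decoding procedure as follows.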

Encoding:

(1) Let $F_S^D(d_1) = \{\mathbf{a}_1,\ldots,\mathbf{a}_M\} \in \mathcal{S}^D$ such that $\mathbf{a}_1 = 1^{L'}$ and $\mathbf{a}_1 > \mathbf{a}_2 > \ldots > \mathbf{a}_M$. Let $(x_{i,1},\ldots,x_{i,L'}) = \mathbf{a}_i$ for $i \in [M]$.

(2) Let

\[
(x_{1,L'+1},\ldots,x_{1,L'+4kL'+4k\log(4kL')+o(\log(4kL'))}) = Enc\big(RS_{2k}(\mathbb{1}(\{\mathbf{a}_1,\ldots,\mathbf{a}_M\}))\big).
\]

(3) Place the deletion code $Enc(\mathbf{d}_2)$ in bits $(x_{1,L'+4kL'+4k\log(4kL')+o(\log(4kL'))+1},\ldots,x_{1,L})$ and $(x_{i,L'+1},\ldots,x_{i,L})$ for $i \in [2,M]$.

Upon receiving $\{\mathbf{x}'_i\}_{i=1}^{M}$, the decoding procedure is as follows.

Decoding:

(1) Find the unique string $\mathbf{x}'_{i_0}$ such that $(x'_{i_0,1},\ldots,x'_{i_0,L'-k}) = 1^{L'-k}$. Then $\mathbf{x}'_{i_0}$ is an erroneous copy of $\mathbf{x}_1$, and the string $(x'_{i_0,L'+1},\ldots,x'_{i_0,L'+4kL'+4k\log(4kL')+o(\log(4kL'))-k})$ is an erroneous copy of

\[
(x_{1,L'+1},\ldots,x_{1,L'+4kL'+4k\log(4kL')+o(\log(4kL'))}) = Enc\big(RS_{2k}(\mathbb{1}(\{\mathbf{a}_i\}_{i=1}^{M}))\big).
\]

Correct the vector $RS_{2k}(\mathbb{1}(\{\mathbf{a}_i\}_{i=1}^{M}))$ and use it to recover $\mathbb{1}(\{\mathbf{a}_i\}_{i=1}^{M})$, and thus the indexing bits $\{(x_{i,1},\ldots,x_{i,L'})\}_{i=1}^{M}$. Recover $d_1 = (F_S^D)^{-1}(\{\mathbf{a}_i\}_{i=1}^{M})$.

(2) For each $i \in [M]$, find the unique $\pi(i) \in [M]$ such that $(x'_{\pi(i),1},\ldots,x'_{\pi(i),L'-k})$ is a length-$(L'-k)$ subsequence of $(x_{i,1},\ldots,x_{i,L'})$ (note that $\pi(1) = i_0$). Checking whether one string is a subsequence of another can be done in linear time using a greedy algorithm.

(3) Since $\mathbf{x}'_{\pi(i)}$ is an erroneous copy of $\mathbf{x}_i$, $i \in [M]$, the concatenation

\[
\mathbf{m}' = \big((x'_{\pi(1),L'+4kL'+4k\log(4kL')+o(\log(4kL'))+1},\ldots,x'_{\pi(1),L_1}),\mathbf{x}'_{\pi(2)},\ldots,\mathbf{x}'_{\pi(M)}\big),
\]

where $L_1$ is the length of $\mathbf{x}'_{\pi(1)}$, is an erroneous copy of $Enc(\mathbf{d}_2)$. Use the decoder to obtain $Dec(\mathbf{m}') = \mathbf{d}_2$.

(4) Output $(d_1,\mathbf{d}_2)$.
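The matching step of the decoder relies on a greedy test: scan the longer string once and match each symbol of the shorter string at its earliest possible position. A minimal sketch follows; the helper `match_index` and the example strings are hypothetical and only illustrate how a received prefix would be matched to an index string.

```python
def is_subsequence(short: str, long: str) -> bool:
    """Greedy linear scan: each symbol of `short` is matched as early as possible in `long`."""
    remaining = iter(long)
    # `symbol in remaining` advances the iterator past the first match.
    return all(symbol in remaining for symbol in short)

def match_index(received_prefix: str, index_strings: list[str]) -> int:
    """Return the unique position i such that received_prefix is a subsequence of index_strings[i]."""
    hits = [i for i, a in enumerate(index_strings)
            if is_subsequence(received_prefix, a)]
    if len(hits) != 1:
        raise ValueError("prefix does not identify a unique index string")
    return hits[0]

# Two index strings with disjoint 1-deletion balls; a received prefix of length 3.
print(match_index("000", ["1111", "0000"]))
```

Uniqueness of the match is what the disjointness of the deletion balls provides; the sketch raises an error when that assumption is violated.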

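The proof of correctness is similar to that in Sec. 6.3. The redundancy of the code is

\begin{align*}
r(\mathcal{C}) &= \log\binom{2^L}{M} - \log\left\lceil \frac{\prod_{i=1}^{M-1}(2^{L'}-iP)}{(M-1)!} \right\rceil \\
&\quad - \big[M(L-L') - 4kL' - 8k\log(4kL') - o(\log(4kL')) - 4k\lceil\log ML\rceil - o(\log ML)\big] \\
&\le 8k\log ML + (12k+2)\log M + O(k^3) + o(k\log ML).
\end{align*}

Computing $F_S^D$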

We now prove Lemma 6.5.1. The robust indexing algorithm for generating the indexing strings $\{(x_{i,1},\ldots,x_{i,L'})\}$ is the same as in Sec. 6.4, except that we replace the quantities $N_H(\mathbf{a},A)$ and $Q$, which are based on Hamming distance, with their deletion distance counterparts. For a string $\mathbf{c} \in \{0,1\}^{\ell}$ and a set of indices $\Delta = \{\delta_1,\ldots,\delta_r\} \subset [\ell]$, let $\mathbf{c}(\Delta)$ be the length-$(\ell-r)$ subsequence of $\mathbf{c}$ obtained by deleting bits $(c_{\delta_1},c_{\delta_2},\ldots,c_{\delta_r})$ in $\mathbf{c}$.

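For sequences $\mathbf{c}_1 \in \{0,1\}^{\ell_1}$ and $\mathbf{c}_2 \in \{0,1\}^{\ell_2}$ and nonnegative integers $r_1, r_2$, define the set

\[
\mathcal{I}(\mathbf{c}_1,\mathbf{c}_2,r_1,r_2) = \{(\Delta_1,\Delta_2) : \Delta_1 \subseteq [\ell_1],\ |\Delta_1| \le r_1,\ \Delta_2 \subseteq [\ell_2],\ |\Delta_2| \le r_2,\ \mathbf{c}_1(\Delta_1) = \mathbf{c}_2(\Delta_2)\}
\]

and the number

\[
N(\mathbf{c}_1,\mathbf{c}_2,r_1,r_2) = |\mathcal{I}(\mathbf{c}_1,\mathbf{c}_2,r_1,r_2)|, \tag{6.14}
\]

which is the number of ways to delete no more than $r_1$ and $r_2$ bits in $\mathbf{c}_1$ and $\mathbf{c}_2$, respectively, such that the resulting subsequences are the same. For a sequence $\mathbf{a} \in \{0,1\}^{\ell}$ of length $\ell \in [0,L']$ and a set of sequences $A \subset \{0,1\}^{L'}$, define

\[
N_D(\mathbf{a},A) = \sum_{\mathbf{c} \in A}\ \sum_{\mathbf{c}' \in \{0,1\}^{L'} :\ (c'_1,\ldots,c'_{\ell}) = \mathbf{a}} N(\mathbf{c}',\mathbf{c},k,k).
\]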

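For the empty sequence $\mathbf{a}$ and a sequence $\mathbf{c} \in \{0,1\}^{L'}$, we have that

\[
N_D(\mathbf{a},\mathbf{c}) = P \triangleq \sum_{r=0}^{k} \binom{L'}{r}^2 2^r, \tag{6.15}
\]

since $N_D(\mathbf{a},\mathbf{c})$ is the number of tuples $(\mathbf{c}',\Delta_1,\Delta_2)$ of sequences $\mathbf{c}' \in \{0,1\}^{L'}$ and index sets $\Delta_1,\Delta_2 \subset [L']$ such that after no more than $k$ deletions at indices $\Delta_1$ and $\Delta_2$ in $\mathbf{c}$ and $\mathbf{c}'$, respectively, we obtain the same subsequence $\mathbf{c}(\Delta_1) = \mathbf{c}'(\Delta_2)$. Since $\mathbf{c}$ and $\mathbf{c}'$ have the same length, the two index sets have a common size $r \le k$; each can be chosen in $\binom{L'}{r}$ ways, and the $r$ deleted bits of $\mathbf{c}'$ are free.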
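The counts $N$ and $N_D$ can be checked by exhaustive enumeration for short strings. The sketch below is a brute-force illustration (exponential in the string length, not the polynomial-time method of this section); it compares $N_D$ for the empty prefix against the sum over $r \le k$ of $\binom{L'}{r}^2 2^r$, obtained by counting the choices of the two deletion sets and the free deleted bits. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product
from math import comb

def subseq_counts(c, r):
    """For each subsequence of c, count the deletion sets of size <= r that produce it."""
    counts = Counter()
    for size in range(min(r, len(c)) + 1):
        for deleted in combinations(range(len(c)), size):
            drop = set(deleted)
            counts[tuple(b for i, b in enumerate(c) if i not in drop)] += 1
    return counts

def N(c1, c2, r1, r2):
    """|I(c1, c2, r1, r2)|: pairs (Delta1, Delta2) with c1(Delta1) == c2(Delta2)."""
    left, right = subseq_counts(c1, r1), subseq_counts(c2, r2)
    return sum(m * right[s] for s, m in left.items())

def N_D(a, A, L, k):
    """Sum N(c', c, k, k) over c in A and over all c' in {0,1}^L with prefix a."""
    a = tuple(a)
    return sum(N(a + tail, tuple(c), k, k)
               for tail in product((0, 1), repeat=L - len(a))
               for c in A)

def P_closed_form(L, k):
    """Sum over r <= k of C(L, r)^2 * 2^r."""
    return sum(comb(L, r) ** 2 * 2 ** r for r in range(k + 1))

print(N_D((), [(0, 0)], 2, 1), P_closed_form(2, 1))
print(N_D((), [(1, 0, 1, 1)], 4, 2), P_closed_form(4, 2))
```

The value does not depend on the particular choice of $\mathbf{c}$, which is what allows $P$ to be used as a uniform gap between consecutive integers $q_i$.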

The algorithm for computing $F_S^D$ is the same as that for computing $F_S^H$, with the numbers $N_H((a_{i,1},\ldots,a_{i,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i-1})$ and $Q$ replaced by the numbers $N_D((a_{i,1},\ldots,a_{i,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i-1})$ and $P$. To prove the correctness of the algorithm, we need to show that $N_D(\mathbf{a},A)$ satisfies two properties similar to the ones in Eq. (6.5) and Eq. (6.6). The first is that

\[
N_D(\mathbf{a},A) = N_D((\mathbf{a},0),A) + N_D((\mathbf{a},1),A) \tag{6.16}
\]

for a sequence $\mathbf{a} \in \{0,1\}^{\ell}$ of length $\ell \in [0,L'-1]$ and a set $A \subset \{0,1\}^{L'}$, which is a deletion counterpart of Eq. (6.5). This can be proved by noticing that

\[
N_D(\mathbf{a},A) = \sum_{\mathbf{c}' \in \{0,1\}^{L'} :\ (c'_1,\ldots,c'_{\ell}) = \mathbf{a}}\ \sum_{\mathbf{c} \in A} N(\mathbf{c}',\mathbf{c},k,k)
\]

and that every sequence $\mathbf{c}' \in \{0,1\}^{L'}$ that satisfies $(c'_1,\ldots,c'_{\ell}) = \mathbf{a}$ has either $c'_{\ell+1} = 1$ or $c'_{\ell+1} = 0$.

The second property is that the number $N_D(\mathbf{a},A)$ is computable in polynomial time.

Since obtaining an explicit expression as in Eq. (6.6) is challenging, we compute the number $N_D(\mathbf{a},\mathbf{c})$ using dynamic programming for two sequences $\mathbf{a} \in \{0,1\}^{\ell}$ and $\mathbf{c} \in \{0,1\}^{L'}$ such that $\ell \in [0,L']$. Given $\mathbf{a}$ and $\mathbf{c}$, we compute

\[
n(k_1,k_2,r_1,r_2) = \sum_{\mathbf{c}' \in \{0,1\}^{L'-\ell+k_1} :\ (c'_1,\ldots,c'_{k_1}) = (a_{\ell-k_1+1},\ldots,a_{\ell})} N\big(\mathbf{c}',(c_{L'-k_2+1},\ldots,c_{L'}),r_1,r_2\big).
\]

Note that $N_D(\mathbf{a},\mathbf{c}) = n(\ell,L',k,k)$. In addition, by definition of $N_D(\mathbf{a},A)$, we have that $N_D(\mathbf{a},A) = \sum_{\mathbf{c} \in A} N_D(\mathbf{a},\mathbf{c})$. Hence, $N_D(\mathbf{a},A)$ can be computed efficiently once $N_D(\mathbf{a},\mathbf{c})$ is computed.

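For $k_1 = 0$, we have that

\[
n(0,k_2,r_1,r_2) = \sum_{\mathbf{c}' \in \{0,1\}^{L'-\ell}} N\big(\mathbf{c}',(c_{L'-k_2+1},\ldots,c_{L'}),r_1,r_2\big), \tag{6.17}
\]

which by Eq. (6.14) equals $0$ when $L'-\ell-r_1 > k_2$ or $k_2-r_2 > L'-\ell$. When $L'-\ell-r_1 \le k_2$ and $k_2-r_2 \le L'-\ell$, we show that

\[
n(0,k_2,r_1,r_2) = \sum_{i=k_2-(L'-\ell)}^{r_2} \binom{k_2}{i}\binom{L'-\ell}{L'-\ell-(k_2-i)}\, 2^{L'-\ell-(k_2-i)} \tag{6.18}
\]

for $k_2 \ge L'-\ell$, and that

\[
n(0,k_2,r_1,r_2) = \sum_{i=L'-\ell-k_2}^{r_1} \binom{k_2}{k_2-(L'-\ell-i)}\binom{L'-\ell}{i}\, 2^{i} \tag{6.19}
\]

for $k_2 < L'-\ell$. For $k_2 \ge L'-\ell$ and sets $(\Delta_1,\Delta_2) \in \mathcal{I}(\mathbf{c}',\mathbf{c} = (c_{L'-k_2+1},\ldots,c_{L'}),r_1,r_2)$, the cardinality $|\Delta_2|$ satisfies $k_2-(L'-\ell) \le |\Delta_2| \le r_2$ because $\mathbf{c}'(\Delta_1) = (c_{L'-k_2+1},\ldots,c_{L'})(\Delta_2)$. For given $|\Delta_2|$, there are $\binom{k_2}{|\Delta_2|}$ ways to select $\Delta_2$ and $\binom{L'-\ell}{L'-\ell-(k_2-|\Delta_2|)}$ choices of $\Delta_1$. Moreover, given $\mathbf{c}$, $\Delta_1$, and $\Delta_2$, there are $2^{L'-\ell-(k_2-|\Delta_2|)}$ choices of $\mathbf{c}'$ such that $\mathbf{c}(\Delta_2) = \mathbf{c}'(\Delta_1)$. Hence we have Eq. (6.18). Similarly, we have Eq. (6.19). Therefore, the number $n(k_1,k_2,r_1,r_2)$ can be computed when $k_1 = 0$.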
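As a quick sanity check of Eq. (6.18), consider the illustrative values $L'-\ell = 2$, $k_2 = 2$, and $r_1 = r_2 = 1$ (chosen here only for the example, not taken from the text). The lower summation limit is $k_2-(L'-\ell) = 0$, and the formula evaluates as follows.

```latex
\begin{align*}
n(0,2,1,1) &= \sum_{i=0}^{1} \binom{2}{i}\binom{2}{2-(2-i)}\, 2^{\,2-(2-i)}
            = \sum_{i=0}^{1} \binom{2}{i}^{2} 2^{i} \\
           &= \underbrace{1}_{i=0:\ \Delta_1=\Delta_2=\emptyset,\ \mathbf{c}'=(c_{L'-1},c_{L'})}
            \;+\; \underbrace{2\cdot 2\cdot 2}_{i=1:\ \text{one deletion on each side, one free bit in } \mathbf{c}'}
            = 9 .
\end{align*}
```

Direct enumeration over the four candidate strings $\mathbf{c}' \in \{0,1\}^2$ gives the same total of 9 for the suffix $(0,0)$.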

For $k_1 > 0$, we compute $n(k_1,k_2,r_1,r_2)$ iteratively from $k_1 = 0$ to $k_1 = \ell$ using the following recursion:

\[
n(k_1,k_2,r_1,r_2) = \sum_{k:\ k \in [L'-k_2+1,L'],\ c_k = a_{\ell-k_1+1}} n(k_1-1,L'-k,r_1,r_2-k+L'-k_2+1) + 2n(k_1-1,k_2,r_1-1,r_2), \tag{6.20}
\]

where $n(k',k'',r',r'') = 0$ if $r' < 0$ or $r'' < 0$. Note that for any $(\Delta_1,\Delta_2) \in \mathcal{I}(\mathbf{c}',\mathbf{c} = (c_{L'-k_2+1},\ldots,c_{L'}),r_1,r_2)$, we have either $1 \in \Delta_1$ or $1 \notin \Delta_1$.

When $1 \in \Delta_1$, we have $\mathbf{c}''(\Delta_1\backslash\{1\}-1) = (c_{L'-k_2+1},\ldots,c_{L'})(\Delta_2)$, where $\mathbf{c}'' = (c'_2,\ldots,c'_{L'-\ell+k_1})$ and $\Delta - i = \{j-i : j \in \Delta\}$ for any set $\Delta$ and integer $i$. Note that there are $n(k_1-1,k_2,r_1-1,r_2)$ choices of $(\mathbf{c}'',\Delta_1\backslash\{1\}-1,\Delta_2)$ such that $(c''_1,\ldots,c''_{k_1-1}) = (a_{\ell-k_1+2},\ldots,a_{\ell})$ and $\mathbf{c}''(\Delta_1\backslash\{1\}-1) = (c_{L'-k_2+1},\ldots,c_{L'})(\Delta_2)$. Since $c'_1$ can be either $0$ or $1$ when $1 \in \Delta_1$, we have $2n(k_1-1,k_2,r_1-1,r_2)$ choices of $(\mathbf{c}',\Delta_1,\Delta_2)$ such that $(c'_2,\ldots,c'_{k_1}) = (c''_1,\ldots,c''_{k_1-1}) = (a_{\ell-k_1+2},\ldots,a_{\ell})$ and $\mathbf{c}'(\Delta_1) = (c_{L'-k_2+1},\ldots,c_{L'})(\Delta_2)$ when $1 \in \Delta_1$.

When $1 \notin \Delta_1$, let $k$ be the minimum index such that $k \in [L'-k_2+1,L']$ and $(k-L'+k_2) \notin \Delta_2$. Then, we have that $c_k = c'_1 = a_{\ell-k_1+1}$, $[1,k-L'+k_2-1] \subseteq \Delta_2$, and $\mathbf{c}''(\Delta_1-1) = (c_{k+1},\ldots,c_{L'})(\Delta_2\backslash[1,k-L'+k_2-1]-k+L'-k_2)$, where $\mathbf{c}'' = (c'_2,\ldots,c'_{L'-\ell+k_1})$. There are $n(k_1-1,L'-k,r_1,r_2-k+L'-k_2+1)$ choices of $(\mathbf{c}'',\Delta_1-1,\Delta_2\backslash[1,k-L'+k_2-1]-k+L'-k_2)$ such that $\mathbf{c}''(\Delta_1-1) = (c_{k+1},\ldots,c_{L'})(\Delta_2\backslash[1,k-L'+k_2-1]-k+L'-k_2)$ and $(c''_1,\ldots,c''_{k_1-1}) = (a_{\ell-k_1+2},\ldots,a_{\ell})$. Therefore, there are $n(k_1-1,L'-k,r_1,r_2-k+L'-k_2+1)$ choices of $(\mathbf{c}',\Delta_1,\Delta_2)$ such that $(c'_1,\ldots,c'_{k_1}) = (a_{\ell-k_1+1},\ldots,a_{\ell})$ and $\mathbf{c}'(\Delta_1) = (c_{L'-k_2+1},\ldots,c_{L'})(\Delta_2)$. Note that for each $k$ satisfying $k \in [L'-k_2+1,L']$ and $c_k = c'_1 = a_{\ell-k_1+1}$, there are $n(k_1-1,L'-k,r_1,r_2-k+L'-k_2+1)$ choices of such $(\mathbf{c}',\Delta_1,\Delta_2)$. In addition, different values of $k$ correspond to different choices, since $k$ is the minimum index such that $(k-L'+k_2) \notin \Delta_2$. Hence, we have Eq. (6.20).

By Eq. (6.17), (6.18), (6.19), and (6.20), the number $N_D(\mathbf{a},\mathbf{c}) = n(\ell,L',k,k)$ can be recursively computed for any $\mathbf{a} \in \{0,1\}^{\ell}$ and $\mathbf{c} \in \{0,1\}^{L'}$. Therefore, the encoding/decoding can be computed in $\mathrm{poly}(M,L')$ time.

We are now ready to present the algorithm that computes $F_S^D(d)$ for an integer $d \in \left[\binom{2^{L'}-(M-1)P+M-1}{M-1}\right]$. The algorithm is the same as the encoding procedure in Sec. 6.4, with $N_H(\mathbf{a},A)$ replaced by $N_D(\mathbf{a},A)$ for any sequence $\mathbf{a}$ and set of sequences $A$. In addition, the integers $q_i$ are generated such that $q_1 = 2^{L'}$ and $q_i - q_{i+1} > P$ for $i \in [M-1]$. Such $q_i$, $i \in [M]$, can be generated following the same argument as in Lemma 6.4.1, since $d \in \left[\binom{2^{L'}-(M-1)P+M-1}{M-1}\right]$. Given integers $q_i$, $i \in [M]$, satisfying $q_1 = 2^{L'}$ and $q_i - q_{i+1} > P$ for $i \in [M-1]$, the encoding procedure for generating $\{\mathbf{a}_1,\ldots,\mathbf{a}_M\}$ is given as follows.

Encoding:

for $i \in [M]$ do
    $q = q_i$
    for $\ell \in [L']$ do
        if $2^{L'-\ell} - N_D((a_{i,1},\ldots,a_{i,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i-1}) \ge q$ then
            $a_{i,\ell} = 0$
        else
            $q = q - \big(2^{L'-\ell} - N_D((a_{i,1},\ldots,a_{i,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i-1})\big)$, $a_{i,\ell} = 1$
        end if
    end for
end for
return $\{\mathbf{a}_1,\ldots,\mathbf{a}_M\}$
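The encoding loop can be sketched directly in Python. To keep the example short, $N_D$ is computed here by brute-force enumeration rather than by the dynamic program described earlier, so this is an illustration for small $L'$ only; the names `subseq_counts`, `n_d`, and `encode_indices` are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

def subseq_counts(c, k):
    """For each subsequence of c, count the deletion sets of size <= k that produce it."""
    counts = Counter()
    for size in range(k + 1):
        for deleted in combinations(range(len(c)), size):
            drop = set(deleted)
            counts[tuple(b for i, b in enumerate(c) if i not in drop)] += 1
    return counts

def n_d(prefix, A, L, k):
    """Brute-force N_D(prefix, A): sum over all completions c' of prefix and all c in A."""
    targets = [subseq_counts(c, k) for c in A]
    total = 0
    for tail in product((0, 1), repeat=L - len(prefix)):
        mine = subseq_counts(prefix + tail, k)
        total += sum(m * t[s] for t in targets for s, m in mine.items())
    return total

def encode_indices(q, L, k):
    """Run the nested loops on q = (q_1, ..., q_M); returns a_1, ..., a_M as bit tuples."""
    A = []
    for q_i in q:
        rem, a = q_i, ()
        for l in range(1, L + 1):
            # Number of still-available strings below the branch (a, 0).
            free = 2 ** (L - l) - n_d(a + (0,), A, L, k)
            if free >= rem:
                a += (0,)
            else:
                rem -= free
                a += (1,)
        A.append(a)
    return A

# L' = 8, k = 1: P = 1 + 8^2 * 2 = 129, and q_1 - q_2 = 135 > P.
print(encode_indices([256, 121], 8, 1))
```

In this example the first string is the all-ones string, as expected from $q_1 = 2^{L'}$, and the second string shares no length-7 subsequence with it.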

The correctness of the encoding procedure follows an argument similar to the one in Sec. 6.4. We prove that the input $(q_1,\ldots,q_M)$ and output $\{\mathbf{a}_1,\ldots,\mathbf{a}_M\}$ satisfy

\[
\mathrm{decimal}(\mathbf{a}_i) = q_i - 1 + \sum_{\ell:\ a_{i,\ell}=1 \text{ and } \ell \in [L']} N_D\big((a_{i,1},\ldots,a_{i,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i-1}\big) \tag{6.21}
\]

and $\{\mathbf{a}_1,\ldots,\mathbf{a}_M\} \in \mathcal{S}^D$. The following is a deletion metric version of Lemma 6.4.3, obtained by replacing $N_H(\mathbf{a},A)$ with $N_D(\mathbf{a},A)$ for any sequence $\mathbf{a} \in \{0,1\}^{\ell}$ and set $A \subset \{0,1\}^{L'}$.

Lemma 6.5.3. After the $\ell$-th, $\ell \in [L']$, inner for loop in the $i$-th, $i \in [M]$, outer for loop in the encoding procedure, we have that

\[
0 < q \le 2^{L'-\ell} - N_D\big((a_{i,1},\ldots,a_{i,\ell}),\{\mathbf{a}_j\}_{j=1}^{i-1}\big). \tag{6.22}
\]

At the end of the $i$-th outer for loop, we have that $q = 1$.

Proof. The proof is the same as that of Lemma 6.4.3, by noticing that

\[
N_D\big(0,\{\mathbf{a}_j\}_{j=1}^{i-1}\big) + N_D\big(1,\{\mathbf{a}_j\}_{j=1}^{i-1}\big) = \sum_{j=1}^{i-1} N_D(\emptyset,\mathbf{a}_j) \overset{(a)}{=} (i-1)P,
\]

where $\emptyset$ is the empty sequence and $(a)$ follows from (6.15) and the fact that $N_D(\mathbf{a},A) = \sum_{\mathbf{c} \in A} N_D(\mathbf{a},\mathbf{c})$. In addition, we have (6.16), which is the deletion metric version of (6.5). The rest of the proof is the same as in Lemma 6.4.3. □

From Lemma 6.5.3, we have

\[
q = 2^{L'-L'} - N_D\big(\mathbf{a}_i,\{\mathbf{a}_j\}_{j=1}^{i-1}\big) = 1
\]

at the end of the $i$-th outer for loop, $i \in [M]$. Hence, $N_D(\mathbf{a}_i,\{\mathbf{a}_j\}_{j=1}^{i-1}) = 0$ for $i \in [M]$ and $\mathcal{D}_k(\mathbf{a}_i) \cap \mathcal{D}_k(\mathbf{a}_j) = \emptyset$ for any $i \ne j$, $i,j \in [M]$. Then, we have that $\{\mathbf{a}_i\}_{i=1}^{M} \in \mathcal{S}^D$. In addition, similar to Lemma 6.4.4, we can use Lemma 6.5.3 to show that the output $\{\mathbf{a}_i\}_{i=1}^{M}$ satisfies Eq. (6.21).

Therefore, we have the following decoding algorithm, similar to the one in Sec. 6.4.

Decoding:

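(1) Order the strings $\{\mathbf{a}_i\}_{i=1}^{M}$ such that $\mathbf{a}_1 > \mathbf{a}_2 > \ldots > \mathbf{a}_M$.

(2) For $i \in [M]$, compute

\[
q_i = \mathrm{decimal}(\mathbf{a}_i) + 1 - \sum_{\ell:\ a_{i,\ell}=1 \text{ and } \ell \in [L']} N_D\big((a_{i,1},\ldots,a_{i,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i-1}\big). \tag{6.23}
\]

Finally, the correctness of decoding is guaranteed by (6.21) and the fact that $\mathbf{a}_1 > \mathbf{a}_2 > \ldots > \mathbf{a}_M$, where $\mathbf{a}_i$ is the output generated in the $i$-th outer loop. The latter follows from a proof similar to the one in Sec. 6.4. Suppose there exist $i_1 > i_2$ such that $\mathbf{a}_{i_1} > \mathbf{a}_{i_2}$ lexicographically. Then we have that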

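\begin{align*}
q_{i_2}-q_{i_1} &< \sum_{\ell:\, a_{i_1,\ell}=1 \text{ and } \ell\in[\ell^*]} \Big(2^{L'-\ell} - N_D\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_2-1}\big)\Big) \\
&\quad - \sum_{\ell:\, a_{i_1,\ell}=1 \text{ and } \ell\in[\ell^*]} \Big(2^{L'-\ell} - N_D\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_1-1}\big)\Big) \\
&= \sum_{\ell:\, a_{i_1,\ell}=1 \text{ and } \ell\in[\ell^*]} \Big(N_D\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_1-1}\big) - N_D\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_2-1}\big)\Big) \\
&= \sum_{\ell:\, a_{i_1,\ell}=1 \text{ and } \ell\in[\ell^*]} N_D\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=i_2}^{i_1-1}\big) \\
&\overset{(a)}{\le} N_D\big(\emptyset,\{\mathbf{a}_j\}_{j=i_2}^{i_1-1}\big) \\
&\overset{(b)}{\le} (i_1-i_2)P, \tag{6.24}
\end{align*}

where $\emptyset$ is the empty sequence and $(a)$ follows from the definition of $N_D(\mathbf{a},A)$ and the fact that the strings having $(a_{i_1,1},\ldots,a_{i_1,\ell_1-1},0)$ and $(a_{i_1,1},\ldots,a_{i_1,\ell_2-1},0)$ as prefixes, respectively, are different. Inequality $(b)$ follows from (6.15) and the fact that $N_D(\mathbf{a},A) = \sum_{\mathbf{c} \in A} N_D(\mathbf{a},\mathbf{c})$.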
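The decoding step can be sketched in the same brute-force style. The function below sorts the index strings in decreasing lexicographic order and solves Eq. (6.21) for $q_i$, i.e. it subtracts the accumulated $N_D$ terms from $\mathrm{decimal}(\mathbf{a}_i) + 1$. The helper names are hypothetical, and the enumeration is only practical for small $L'$.

```python
from collections import Counter
from itertools import combinations, product

def subseq_counts(c, k):
    """For each subsequence of c, count the deletion sets of size <= k that produce it."""
    counts = Counter()
    for size in range(k + 1):
        for deleted in combinations(range(len(c)), size):
            drop = set(deleted)
            counts[tuple(b for i, b in enumerate(c) if i not in drop)] += 1
    return counts

def n_d(prefix, A, L, k):
    """Brute-force N_D(prefix, A): sum over all completions c' of prefix and all c in A."""
    targets = [subseq_counts(c, k) for c in A]
    total = 0
    for tail in product((0, 1), repeat=L - len(prefix)):
        mine = subseq_counts(prefix + tail, k)
        total += sum(m * t[s] for t in targets for s, m in mine.items())
    return total

def recover_q(strings, k):
    """Recover (q_1, ..., q_M) from the index strings by solving Eq. (6.21) for q_i."""
    L = len(strings[0])
    ordered = sorted(strings, reverse=True)  # a_1 > a_2 > ... lexicographically
    q = []
    for i, a in enumerate(ordered):
        earlier = ordered[:i]
        # One N_D term per position l where a has a 1, using the prefix (a_1..a_{l-1}, 0).
        correction = sum(n_d(a[:l] + (0,), earlier, L, k)
                         for l in range(L) if a[l] == 1)
        q.append(int("".join(map(str, a)), 2) + 1 - correction)
    return q

# Index strings for L' = 8, k = 1, given in arbitrary order.
print(recover_q([(1, 0, 0, 0, 0, 0, 0, 0), (1,) * 8], 1))
```

For the second string in this example, the only correction term is $N_D((0),\{1^8\}) = 8$, so $q_2 = 128 + 1 - 8 = 121$.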

From the document: Correcting Errors in DNA Storage, California Institute of Technology (pages 189-197)