Chapter VI: Robust Indexing: Optimal Codes Correcting Deletion/Insertion

6.3 Robust Indexing for Codes over Sets: Substitution Errors

In this section we describe our code constructions, which use the idea of robust indexing. These codes deal with substitution errors in the $M$ strings. Our codes have redundancy $O(k \log ML)$, which is order-wise optimal whenever $k$ is at most $O(\min\{L^{1/3}, L/\log M\})$.

Theorem 6.3.1. For integers $M$, $L$, and $k$, let $L' \triangleq 3\log M + 4k^2 + 1$. If $L' + 4kL' + 2k\log(4kL') \le L$, then there exists an explicit $k$-substitution code, computable in $\mathrm{poly}(M, L, k)$ time, that has redundancy $2k\log ML + (12k+2)\log M + O(k^3) + O(k\log\log ML)$.

Since the codewords consist of unordered strings, we assign indexing bits to each string such that the strings are ordered. However, instead of directly assigning the indices $1, \ldots, M$ to each string, we embed information into the indexing bits. In other words, we use the information bits themselves for the purpose of indexing. This provides more efficiency in sending information.

Specifically, for a codeword $W = \{\mathbf{x}_i\}_{i=1}^{M}$, we choose the first $L'$ bits $(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})$, $i \in [M]$, in each string $\mathbf{x}_i$ as the indexing bits, and encode information in them. Then, the strings $\{\mathbf{x}_i\}_{i=1}^{M}$ are sorted according to the lexicographic order $\pi$ of the indexing bits $(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})$, $i \in [M]$, where $(x_{\pi(i),1}, x_{\pi(i),2}, \ldots, x_{\pi(i),L'}) < (x_{\pi(j),1}, x_{\pi(j),2}, \ldots, x_{\pi(j),L'})$ for $i < j$. Once $\{\mathbf{x}_i\}_{i=1}^{M}$ are ordered, it suffices to use a Reed-Solomon code to protect the concatenated string $(\mathbf{x}_{\pi(1)}, \ldots, \mathbf{x}_{\pi(M)})$, and thus the codeword $\{\mathbf{x}_i\}_{i=1}^{M}$, from $k$ substitution errors.

One of the key issues with this approach is that the indexing bits and their lexicographic order can be disrupted by substitution errors. To deal with this, we present a technique referred to as robust indexing, which protects the indexing bits from substitution errors. The basic ideas of robust indexing are as follows:

(1) Construct the indexing bits $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ such that the Hamming distance between any two distinct $(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})$ and $(x_{j,1}, x_{j,2}, \ldots, x_{j,L'})$ is at least $2k+1$, i.e., the strings $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ form an error-correcting code in the classical coding-theoretic sense. Then we can identify which string among $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ results in the erroneous version $(x'_{i,1}, x'_{i,2}, \ldots, x'_{i,L'})$ by using a minimum Hamming distance criterion.

(2) Use additional redundancy to protect the set of indexing bits $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ from substitution errors.

Note that we encode data in the code $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ through different choices of the code. After substitution errors, two choices of the code, which represent different messages, might result in the same read $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$.

Example 6.3.1. For $k = M = 2$ and $L = 8$, consider the two codes $\{11111111, 00000000\}$ and $\{11111111, 00010010\}$. Both have minimum Hamming distance greater than $2k+1 = 5$, and both can result in the same set $\{11111111, 00000011\}$ after $k = 2$ substitutions.
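The ambiguity in Example 6.3.1 can be checked mechanically. The following is a minimal sketch; the helper names `hamming`, `min_distance`, and `reachable` are illustrative and not part of the construction, and the brute-force pairing is only meant for tiny sets.

```python
from itertools import combinations, permutations

def hamming(a, b):
    """Hamming distance between two equal-length bit strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

def min_distance(code):
    """Minimum pairwise Hamming distance of a set of bit strings."""
    return min(hamming(a, b) for a, b in combinations(sorted(code), 2))

def reachable(code, read, k):
    """True if `read` can arise from `code` by at most k substitutions in total.
    Brute force over all pairings of strings (hypothetical helper, tiny sets only)."""
    code = sorted(code)
    if len(code) != len(read):
        return False
    return any(
        sum(hamming(c, r) for c, r in zip(code, pairing)) <= k
        for pairing in permutations(sorted(read))
    )

k = 2
code_a = {"11111111", "00000000"}
code_b = {"11111111", "00010010"}
read = {"11111111", "00000011"}

print(min_distance(code_a), min_distance(code_b))            # 8 6
print(reachable(code_a, read, k), reachable(code_b, read, k))  # True True
```

Both codes satisfy the distance requirement, yet the same read is consistent with either of them, which is why the decoder needs to learn the code itself.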

Hence, to recover the indexing bits $(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})$, $i \in [M]$, we need to know the code $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ to which the erroneous strings $(x'_{i,1}, x'_{i,2}, \ldots, x'_{i,L'})$, $i \in [M]$, are corrected.

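For an integer $\ell$, let $\mathbf{1}_{\ell}$ be the all-ones vector of length $\ell$. Define $\mathcal{S}^H$ as the set of all length-$L'$ codes with cardinality $M$ and minimum Hamming distance at least $2k+1$ that contain $\mathbf{1}_{L'}$, that is,

$$\mathcal{S}^H \triangleq \left\{ \{\mathbf{a}_1, \ldots, \mathbf{a}_M\} \in \binom{\{0,1\}^{L'}}{M} : \mathbf{a}_1 = \mathbf{1}_{L'} \text{ and } d_H(\mathbf{a}_i, \mathbf{a}_j) \ge 2k+1 \text{ for every distinct } i, j \in [M] \right\}.$$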

The following lemma gives a lower bound on the size of $\mathcal{S}^H$ and is obtained using a counting argument.

Lemma 6.3.1. Let $Q = \sum_{i=0}^{2k} \binom{L'}{i}$ be the size of a Hamming ball of radius $2k$ centered at a vector in $\{0,1\}^{L'}$. We have that

$$|\mathcal{S}^H| \ge \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!}. \tag{6.1}$$

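Proof. Define the set of ordered tuples

$$\mathcal{S}_T^H = \left\{ (\mathbf{a}_1, \ldots, \mathbf{a}_M) : \mathbf{a}_1 = \mathbf{1}_{L'} \text{ and } d_H(\mathbf{a}_i, \mathbf{a}_j) \ge 2k+1 \text{ for distinct } i, j \in [M] \right\},$$

so that for each tuple $(\mathbf{a}_1, \ldots, \mathbf{a}_M) \in \mathcal{S}_T^H$ we have $\{\mathbf{a}_1, \ldots, \mathbf{a}_M\} \in \mathcal{S}^H$. We show that $|\mathcal{S}_T^H| \ge \prod_{i=2}^{M} [2^{L'} - (i-1)Q]$ by finding $\prod_{i=2}^{M} [2^{L'} - (i-1)Q]$ tuples in $\mathcal{S}_T^H$. Let $\mathbf{a}_1 = \mathbf{1}_{L'}$. We select $\mathbf{a}_2, \ldots, \mathbf{a}_M$ sequentially such that each selected string $\mathbf{a}_i$, $i \in [2, M] \triangleq \{2, \ldots, M\}$, is at Hamming distance at least $2k+1$ from each of $\mathbf{a}_1, \ldots, \mathbf{a}_{i-1}$. The tuple $(\mathbf{a}_1, \ldots, \mathbf{a}_M)$ selected in this way has pairwise Hamming distance at least $2k+1$ and thus belongs to $\mathcal{S}_T^H$.

Since the number of strings at Hamming distance at most $2k$ from at least one of $\mathbf{a}_1, \ldots, \mathbf{a}_{i-1}$ is at most $(i-1)Q$, there are at least $2^{L'} - (i-1)Q$ possible choices of $\mathbf{a}_i$ that have Hamming distance at least $2k+1$ from each of $\mathbf{a}_1, \ldots, \mathbf{a}_{i-1}$. Therefore, the total number of ways of selecting tuples $(\mathbf{a}_1, \ldots, \mathbf{a}_M)$ is at least $\prod_{i=2}^{M} [2^{L'} - (i-1)Q]$. Since in the above selection of tuples $(\mathbf{a}_1 = \mathbf{1}_{L'}, \ldots, \mathbf{a}_M) \in \mathcal{S}_T^H$ there are $(M-1)!$ tuples that correspond to the same set $\{\mathbf{a}_1 = \mathbf{1}_{L'}, \mathbf{a}_2, \ldots, \mathbf{a}_M\}$ in $\mathcal{S}^H$, we have that

$$|\mathcal{S}^H| = \frac{|\mathcal{S}_T^H|}{(M-1)!} \ge \frac{\prod_{i=2}^{M} [2^{L'} - (i-1)Q]}{(M-1)!} \ge \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!}.$$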

□

According to (6.1), there exists an invertible mapping

$$F_S^H : \left[ \left\lceil \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rceil \right] \to \binom{\{0,1\}^{L'}}{M},$$

computable in $O(2^{ML'})$ time by brute force, that maps an integer $d \in \left[ \left\lceil \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rceil \right]$ to a code $F_S^H(d) \in \mathcal{S}^H$. In the next section, we will present a polynomial-time algorithm that computes a map $F_S^H(d)$ for any $d \in \left[ \left\lceil \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rceil \right]$ such that $F_S^H(d) \in \mathcal{S}^H$ and $F_S^H(d_1) \ne F_S^H(d_2)$ for $d_1 \ne d_2$ and $d_1, d_2 \in \left[ \left\lceil \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rceil \right]$. Let us assume for now that the mapping $F_S^H$ is given.
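For very small parameters, both Lemma 6.3.1 and the brute-force version of the mapping can be illustrated directly. The sketch below enumerates $\mathcal{S}^H$ exhaustively, compares its size with the bound (6.1), and uses the enumeration order as a stand-in for $F_S^H$. The function names, the 0-based message range, and the choice of enumeration order are assumptions made for illustration; this is not the polynomial-time algorithm of the next section.

```python
from itertools import combinations
from math import ceil, comb, factorial

def hamming(u, v):
    """Hamming distance between two bit vectors stored as integers."""
    return bin(u ^ v).count("1")

def enumerate_codes(Lp, M, k):
    """List every code in S^H as a tuple sorted in descending order.
    The first entry is always the all-ones vector of length Lp."""
    ones = (1 << Lp) - 1
    d = 2 * k + 1
    # Candidates for a_2..a_M: far enough from the all-ones vector, in descending order.
    candidates = [v for v in range(ones - 1, -1, -1) if hamming(v, ones) >= d]
    codes = []
    for rest in combinations(candidates, M - 1):
        if all(hamming(a, b) >= d for a, b in combinations(rest, 2)):
            codes.append((ones,) + rest)
    return codes

def lemma_bound(Lp, M, k):
    """Right-hand side of (6.1); clamped at 0 when 2^Lp - M*Q is negative."""
    Q = sum(comb(Lp, i) for i in range(2 * k + 1))
    return max(2**Lp - M * Q, 0) ** (M - 1) / factorial(M - 1)

# Toy parameters (assumed for illustration): L' = 7, M = 3, k = 1.
codes = enumerate_codes(7, 3, 1)
n_messages = ceil(lemma_bound(7, 3, 1))
index_of = {code: d for d, code in enumerate(codes)}

def F(d):
    """Brute-force stand-in for F_S^H: the d-th code in enumeration order (0-based)."""
    return codes[d]

def F_inv(code):
    """Inverse of F on its image; accepts the code as any collection of integers."""
    return index_of[tuple(sorted(code, reverse=True))]

print(len(codes) >= n_messages)
print(F_inv(set(F(5))) == 5)
```

Because every message index below `n_messages` is mapped to a distinct code, the enumeration has the two properties required of $F_S^H$, at a cost that grows exponentially in $ML'$.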

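For a set $S \in \binom{\{0,1\}^{L'}}{\le M}$, define the characteristic vector $\mathbb{1}(S) \in \{0,1\}^{2^{L'}}$ of $S$ by

$$\mathbb{1}(S)_i = \begin{cases} 1 & \text{if the binary representation of } i \text{ is in } S, \\ 0 & \text{otherwise.} \end{cases}$$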

Notice that the Hamming weight of $\mathbb{1}(S)$ is $M$ for every $S \in \binom{\{0,1\}^{L'}}{M}$. The following lemma states that $k$ substitution errors in a set of strings $S$ result in at most $2k$ bit flips in $\mathbb{1}(S)$.

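Lemma 6.3.2. For $S_1, S_2 \in \binom{\{0,1\}^{L'}}{\le M}$, if $S_1 \in \mathcal{B}_k^H(S_2)$, then $d_H(\mathbb{1}(S_1), \mathbb{1}(S_2)) \le 2k$, where $d_H(\mathbb{1}(S_1), \mathbb{1}(S_2))$ is the Hamming distance between $\mathbb{1}(S_1)$ and $\mathbb{1}(S_2)$.

Proof. Note that $|S_1 \setminus S_2| \le k$ and $|S_2 \setminus S_1| \le k$. Hence $d_H(\mathbb{1}(S_1), \mathbb{1}(S_2)) = |(S_1 \setminus S_2) \cup (S_2 \setminus S_1)| \le 2k$. □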
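As a small illustration of the characteristic vector and of Lemma 6.3.2, the sketch below builds $\mathbb{1}(S)$ for the sets of Example 6.3.1 and compares them. The function name `characteristic_vector` and the most-significant-bit-first reading of each string are assumptions made for this example.

```python
def characteristic_vector(S, Lp):
    """Return 1(S) as a list of 2**Lp bits: entry i is 1 exactly when the
    Lp-bit binary representation of i (most significant bit first) is in S."""
    vec = [0] * (1 << Lp)
    for s in S:
        vec[int(s, 2)] = 1
    return vec

def hamming(u, v):
    """Hamming distance between two equal-length sequences."""
    return sum(a != b for a, b in zip(u, v))

k = 2
S2 = {"11111111", "00000000"}   # original set of indexing strings
S1 = {"11111111", "00000011"}   # the same set after two substitutions in one string

v1 = characteristic_vector(S1, 8)
v2 = characteristic_vector(S2, 8)
distance = hamming(v1, v2)

print(sum(v2))    # Hamming weight of 1(S2): 2, the number of strings
print(distance)   # 2, which is at most 2k = 4
```

One corrupted string removes a single 1 from the characteristic vector and adds a single 1 elsewhere, so each affected string contributes at most two bit flips.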

We are ready to present the code construction. We use a set $S \in \mathcal{S}^H$ as indexing bits and protect the vector $\mathbb{1}(S)$ from substitution errors. Note that any two strings in the set $S$ have Hamming distance at least $2k+1$. Hence, knowing the set $S$, each string of indexing bits can be extracted from its erroneous version using a minimum distance decoder, which finds the unique string in $S$ that is within Hamming distance $k$ from it. The details are given as follows.

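Consider the data $\mathbf{d} \in D$ to be encoded as a tuple $\mathbf{d} = (d_1, \mathbf{d}_2)$, where $d_1 \in \left[ \left\lceil \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rceil \right]$ and $\mathbf{d}_2 \in \{0,1\}^{M(L-L') - 4kL' - 2k\lceil \log ML \rceil}$. Given $(d_1, \mathbf{d}_2)$, the codeword $\{\mathbf{x}_i\}_{i=1}^{M}$ is generated by the following procedure.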

Encoding:

(1) Let $F_S^H(d_1) = \{\mathbf{a}_1, \ldots, \mathbf{a}_M\} \in \mathcal{S}^H$ such that $\mathbf{a}_1 = \mathbf{1}_{L'}$ and the $\mathbf{a}_i$'s are sorted in descending lexicographic order. Let $(x_{i,1}, \ldots, x_{i,L'}) = \mathbf{a}_i$ for $i \in [M]$.

(2) Let $(x_{1,L'+1}, \ldots, x_{1,L'+4kL'}) = RS_{2k}(\mathbb{1}(\{\mathbf{a}_1, \ldots, \mathbf{a}_M\}))$, where $RS_{2k}(\mathbb{1}(\{\mathbf{a}_1, \ldots, \mathbf{a}_M\}))$ is the redundancy of a systematic Reed-Solomon code that corrects $2k$ substitutions in $\mathbb{1}(\{\mathbf{a}_1, \ldots, \mathbf{a}_M\})$.

(3) Place the information bits of $\mathbf{d}_2$ in bits $(x_{1,L'+4kL'+1}, \ldots, x_{1,L})$; $(x_{M,L'+1}, \ldots, x_{M,L-2k\lceil \log ML \rceil})$; and $(x_{i,L'+1}, \ldots, x_{i,L})$ for $i \in [2, M-1]$.

(4) Define

$$\mathbf{m} = (\mathbf{x}_1, \ldots, \mathbf{x}_{M-1}, (x_{M,1}, \ldots, x_{M,L-2k\lceil \log ML \rceil}))$$

and let $(x_{M,L-2k\lceil \log ML \rceil+1}, \ldots, x_{M,L}) = RS_k(\mathbf{m})$, where $RS_k(\mathbf{m})$ is the Reed-Solomon redundancy that corrects $k$ substitution errors in $\mathbf{m}$. Note that $(\mathbf{x}_1, \ldots, \mathbf{x}_M) = (\mathbf{m}, RS_k(\mathbf{m}))$ is a codeword of a $k$-substitution-correcting Reed-Solomon code.
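The bit budget implied by the encoding steps can be tabulated with a short script. This is a sketch under stated assumptions: logarithms are taken base 2, $\log M$ in the definition of $L'$ and $\log(4kL')$ in the condition of Theorem 6.3.1 are rounded up to integers, and the function and key names are hypothetical.

```python
def ceil_log2(n):
    """Smallest integer e with 2**e >= n, for n >= 1."""
    return (n - 1).bit_length()

def layout(M, L, k):
    """Bit budget of one codeword with M strings of length L.
    Assumption: log M and log(4kL') are rounded up to integers."""
    Lp = 3 * ceil_log2(M) + 4 * k * k + 1          # indexing bits per string, L'
    rs_2k_bits = 4 * k * Lp                        # RS_2k redundancy stored in x_1
    rs_k_bits = 2 * k * ceil_log2(M * L)           # RS_k redundancy stored in x_M
    feasible = Lp + rs_2k_bits + 2 * k * ceil_log2(4 * k * Lp) <= L
    payload_bits = M * (L - Lp) - rs_2k_bits - rs_k_bits   # length of d_2
    return {
        "L_prime": Lp,
        "rs_2k_bits": rs_2k_bits,
        "rs_k_bits": rs_k_bits,
        "payload_bits": payload_bits,
        "feasible": feasible,
    }

# Illustrative parameters: M = 16 strings of length L = 128, k = 1 substitution.
plan = layout(16, 128, 1)
print(plan)
```

For these parameters the first string carries 17 indexing bits and 68 bits of $RS_{2k}$ redundancy, the last string ends with 22 bits of $RS_k$ redundancy, and 1686 bits remain for $\mathbf{d}_2$.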

Upon receiving the erroneous version¹ $\{\mathbf{x}'_1, \ldots, \mathbf{x}'_M\}$, the decoding procedure is as follows.

Decoding:

(1) Note that during the encoding process, the redundancy bits that correct the vector $\mathbb{1}(\{\mathbf{a}_i\}_{i=1}^{M})$, i.e., the characteristic vector of the set of indexing bits $\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M}$, are stored in $\mathbf{x}_1$. Hence we must first identify the erroneous copy of $\mathbf{x}_1$. To this end, find the unique string $\mathbf{x}'_{i_0}$ such that $(x'_{i_0,1}, \ldots, x'_{i_0,L'})$ has at least $L' - k$ entries equal to 1. Since the indexing bits of the strings $\{\mathbf{x}_i\}_{i=1}^{M}$ have pairwise Hamming distance at least $2k+1$, there is a unique such string, and its indexing bits are the erroneous copy of $(x_{1,1}, \ldots, x_{1,L'}) = \mathbf{1}_{L'}$. Hence $\mathbf{x}'_{i_0}$ is an erroneous copy of $\mathbf{x}_1$, and the string

$$(x'_{i_0,L'+1}, \ldots, x'_{i_0,L'+4kL'})$$

is an erroneous copy of $(x_{1,L'+1}, \ldots, x_{1,L'+4kL'}) = RS_{2k}(\mathbb{1}(\{\mathbf{a}_i\}_{i=1}^{M}))$.

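(2) Let $k_1 \le k$ be the number of substitution errors that occur in the indexing bits $\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M}$. According to Lemma 6.3.2, the vector $\mathbb{1}(\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M})$ is within Hamming distance $2k_1$ of the vector $\mathbb{1}(\{(x'_{i,1}, \ldots, x'_{i,L'})\}_{i=1}^{M})$, and at most $k - k_1$ of the remaining errors can affect the redundancy bits. Hence the Hamming distance between

$$\mathbf{s}_1 = \left( \mathbb{1}(\{(x'_{i,1}, \ldots, x'_{i,L'})\}_{i=1}^{M}),\ (x'_{i_0,L'+1}, \ldots, x'_{i_0,L'+4kL'}) \right) \quad \text{and} \quad \mathbf{s}_2 = \left( \mathbb{1}(\{\mathbf{a}_i\}_{i=1}^{M}),\ RS_{2k}(\mathbb{1}(\{\mathbf{a}_i\}_{i=1}^{M})) \right)$$

is at most $2k_1 + (k - k_1) \le 2k$. Since $\mathbf{s}_2$ is a codeword of a $2k$-error-correcting Reed-Solomon code, it can be recovered from $\mathbf{s}_1$ using the Reed-Solomon decoder. Recover $d_1 = (F_S^H)^{-1}(\{\mathbf{a}_i\}_{i=1}^{M})$.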

¹ Since the strings $\{\mathbf{x}_i\}_{i=1}^{M}$ have distance at least $2k+1$ from each other, the strings $\{\mathbf{x}'_i\}_{i=1}^{M}$ are distinct.

(3) Since $\mathbf{s}_2$ is recovered, the strings $\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M} = \{\mathbf{a}_i\}_{i=1}^{M}$ are known. Sort $\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M}$ lexicographically in descending order. For each $i \in [M]$, find the unique $\pi(i) \in [M]$ such that $d_H((x'_{\pi(i),1}, \ldots, x'_{\pi(i),L'}), (x_{i,1}, \ldots, x_{i,L'})) \le k$ (note that $i_0 = \pi(1)$). As in Step (1), we conclude that the string $\mathbf{x}'_{\pi(i)}$ is an erroneous copy of $\mathbf{x}_i$, $i \in [M]$, since the Hamming distance between the indexing bits of $\mathbf{x}_j$ and $\mathbf{x}_i$ is at least $2k+1$ for $j \ne i$. Hence, the identities of $\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M}$ are determined from $\{(x'_{i,1}, \ldots, x'_{i,L'})\}_{i=1}^{M}$.

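(4) Since $\mathbf{x}'_{\pi(i)}$ is an erroneous copy of $\mathbf{x}_i$, $i \in [M]$, it follows that the concatenation $\mathbf{s}' = (\mathbf{x}'_{\pi(1)}, \ldots, \mathbf{x}'_{\pi(M)})$ is an erroneous copy of $(\mathbf{x}_1, \ldots, \mathbf{x}_M) = (\mathbf{m}, RS_k(\mathbf{m}))$, where $\mathbf{m}$ is defined in Step (4) of the encoding procedure. Therefore, $(\mathbf{x}_1, \ldots, \mathbf{x}_M)$, and thus $\mathbf{d}_2$, can be recovered from $(\mathbf{x}'_{\pi(1)}, \ldots, \mathbf{x}'_{\pi(M)})$ by using the Reed-Solomon decoder.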

(5) Output $(d_1, \mathbf{d}_2)$.
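The two matching operations in the decoding procedure, locating the erroneous copy of $\mathbf{x}_1$ in Step (1) and restoring the order in Step (3), can be sketched as follows. The sketch assumes the set of indexing strings has already been recovered (that is, Step (2) succeeded) and omits both Reed-Solomon decoders; the function names and the toy strings are illustrative only.

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

def find_first(received, Lp, k):
    """Step (1): the unique read whose first Lp bits contain at least Lp - k ones."""
    hits = [r for r in received if r[:Lp].count("1") >= Lp - k]
    if len(hits) != 1:
        raise ValueError("more than k substitutions, or invalid indexing bits")
    return hits[0]

def reorder(received, codebook, k):
    """Step (3): for each index string a_i, taken in descending lexicographic
    order, pick the unique read whose indexing bits are within distance k."""
    Lp = len(next(iter(codebook)))
    ordered = []
    for a in sorted(codebook, reverse=True):
        hits = [r for r in received if hamming(r[:Lp], a) <= k]
        if len(hits) != 1:
            raise ValueError("no unique match for index " + a)
        ordered.append(hits[0])
    return ordered

# Toy instance: L' = 8, k = 2, M = 2, with 4 payload bits per string.
k = 2
codebook = {"11111111", "00000000"}
read_1 = "11011111" + "0101"   # x_1 with one substitution in its indexing bits
read_2 = "00000010" + "1100"   # x_2 with one substitution in its indexing bits
received = [read_2, read_1]    # reads arrive unordered

print(find_first(received, 8, k) == read_1)
print(reorder(received, codebook, k) == [read_1, read_2])
```

Uniqueness of each match relies on the minimum distance $2k+1$ of the indexing strings: a read within distance $k$ of one index string is at distance at least $k+1$ from every other.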

Therefore, the codeword $\{\mathbf{x}_i\}_{i=1}^{M}$ can be recovered. The redundancy of the code is

$$r(\mathcal{C}) = \log \binom{2^L}{M} - \log \left\lceil \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rceil - \left[ M(L - L') - 4kL' - 2k\lceil \log ML \rceil \right] \tag{6.2}$$

$$\overset{(a)}{\le} 2k\log ML + (12k+2)\log M + O(k^3) + O(k\log\log ML), \tag{6.3}$$

where $(a)$ will be proved in Appendix 6.7. The complexity of the encoding/decoding is that of computing the function $F_S^H$, which, as will be discussed in Sec. 6.4, is $\mathrm{poly}(M, L, k)$.
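Expression (6.2) can be evaluated exactly for concrete parameters, which gives a feel for the size of the redundancy next to the explicit terms of (6.3). The following sketch assumes base-2 logarithms and rounds $\log M$ up in the definition of $L'$; the function names are hypothetical, and the $O(\cdot)$ terms of (6.3) are not evaluated.

```python
from math import comb, factorial, log2

def ceil_log2(n):
    """Smallest integer e with 2**e >= n, for n >= 1."""
    return (n - 1).bit_length()

def redundancy(M, L, k):
    """Evaluate (6.2) in bits. Assumes 2^L' - M*Q > 0 and rounds log M up in L'."""
    Lp = 3 * ceil_log2(M) + 4 * k * k + 1
    Q = sum(comb(Lp, i) for i in range(2 * k + 1))      # Hamming ball of radius 2k
    numerator = (2**Lp - M * Q) ** (M - 1)
    n_codes = -(-numerator // factorial(M - 1))         # exact integer ceiling
    payload = M * (L - Lp) - 4 * k * Lp - 2 * k * ceil_log2(M * L)
    # math.log2 accepts arbitrarily large Python integers.
    return log2(comb(2**L, M)) - log2(n_codes) - payload

def explicit_terms(M, L, k):
    """The two explicit terms of (6.3), without the O(k^3) and O(k log log ML) terms."""
    return 2 * k * log2(M * L) + (12 * k + 2) * log2(M)

M, L, k = 16, 128, 1
r = redundancy(M, L, k)
print(round(r, 1), explicit_terms(M, L, k))
```

For these illustrative parameters the exact redundancy is a little over 100 bits, against 78 bits from the two explicit terms, which is consistent with (6.3) once the lower-order terms are accounted for.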

In the document: Correcting Errors in DNA Storage - California Institute of Technology (pages 178-183)