

Chapter VI: Robust Indexing: Optimal Codes Correcting Deletion/Insertion

6.3 Robust Indexing for Codes over Sets: Substitution Errors

In this section we describe our code constructions, which use the idea of robust indexing. These codes correct substitution errors in the $M$ strings. Our codes have redundancy $O(k \log ML)$, which is order-wise optimal whenever $k$ is at most $O(\min\{L^{1/3}, L/\log M\})$.

Theorem 6.3.1. For integers $M$, $L$, and $k$, let $L' \triangleq 3\log M + 4k^2 + 1$. If $L' + 4kL' + 2k\log(4kL') \le L$, then there exists an explicit $k$-substitution code, computable in $\mathrm{poly}(M, L, k)$ time, that has redundancy $2k\log ML + (12k+2)\log M + O(k^3) + O(k\log\log ML)$.
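The parameter condition of the theorem is easy to evaluate numerically. The following is a minimal sketch, assuming that logarithms are base 2 and that $\log M$ is rounded up so that $L'$ is an integer; both are assumptions, as are the function names.

```python
import math

def index_length(M: int, k: int) -> int:
    """L' = 3 log M + 4k^2 + 1, with log2(M) rounded up (an assumption)."""
    return 3 * math.ceil(math.log2(M)) + 4 * k * k + 1

def theorem_condition_holds(M: int, L: int, k: int) -> bool:
    """Check the condition L' + 4kL' + 2k log(4kL') <= L of Theorem 6.3.1."""
    Lp = index_length(M, k)
    return Lp + 4 * k * Lp + 2 * k * math.log2(4 * k * Lp) <= L

print(index_length(4, 1))                  # 11
print(theorem_condition_holds(4, 100, 1))  # True
print(theorem_condition_holds(4, 60, 1))   # False
```

For $M = 4$ and $k = 1$ the left-hand side is roughly 65.9, so strings of length 100 satisfy the condition while strings of length 60 do not.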

Since the codewords consist of unordered strings, we assign indexing bits to each string such that the strings are ordered. However, instead of directly assigning the indices $1, \ldots, M$ to the strings, we embed information into the indexing bits. In other words, we use the information bits themselves for the purpose of indexing. This allows information to be sent more efficiently.

Specifically, for a codeword $W = \{\mathbf{x}_i\}_{i=1}^{M}$, we choose the first $L'$ bits $(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})$, $i \in [M]$, in each string $\mathbf{x}_i$ as the indexing bits, and encode information in them. Then, the strings $\{\mathbf{x}_i\}_{i=1}^{M}$ are sorted according to the lexicographic order $\pi$ of the indexing bits $(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})$, $i \in [M]$, where $(x_{\pi(i),1}, x_{\pi(i),2}, \ldots, x_{\pi(i),L'}) < (x_{\pi(j),1}, x_{\pi(j),2}, \ldots, x_{\pi(j),L'})$ for $i < j$. Once $\{\mathbf{x}_i\}_{i=1}^{M}$ are ordered, it suffices to use a Reed-Solomon code to protect the concatenated string $(\mathbf{x}_{\pi(1)}, \ldots, \mathbf{x}_{\pi(M)})$, and thus the codeword $\{\mathbf{x}_i\}_{i=1}^{M}$, from $k$ substitution errors.

One of the key issues with this approach is that the indexing bits and their lexicographic order can be disrupted by substitution errors. To deal with this, we present a technique referred to as robust indexing, which protects the indexing bits from substitution errors. The basic ideas of robust indexing are as follows:

(1) Constructing the indexing bits $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ such that the Hamming distance between any two distinct $(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})$ and $(x_{j,1}, x_{j,2}, \ldots, x_{j,L'})$ is at least $2k+1$, i.e., the strings $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ form an error-correcting code in the classical coding-theoretic sense. Then, we can identify which string among $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ results in the erroneous version $(x'_{i,1}, x'_{i,2}, \ldots, x'_{i,L'})$ by using a minimum Hamming distance criterion.

(2) Using additional redundancy to protect the set of indexing bits $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ from substitution errors.

Note that we encode data in the code $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ through different choices of the code. After substitution errors, two choices of the code, which represent different messages, might result in the same read $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$.

Example 6.3.1. For $k = M = 2$ and $L = 8$, consider the two codes $\{11111111, 00000000\}$ and $\{11111111, 00010010\}$. Both have minimum Hamming distance greater than $2k+1 = 5$ and can result in the same set $\{11111111, 00000011\}$ after $k = 2$ substitutions.

Hence, to recover the indexing bits $(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})$, $i \in [M]$, we need to know the code $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ to which the erroneous string $(x'_{i,1}, x'_{i,2}, \ldots, x'_{i,L'})$ is corrected, $i \in [M]$.

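For an integer $\ell$, let $\mathbf{1}_{\ell}$ be the all-ones vector of length $\ell$. Define $\mathcal{S}^H$ as the set of all length-$L'$ codes with cardinality $M$ and minimum Hamming distance at least $2k+1$ that contain $\mathbf{1}_{L'}$, that is,

$$\mathcal{S}^H \triangleq \left\{ \{\mathbf{a}_1, \ldots, \mathbf{a}_M\} \in \binom{\{0,1\}^{L'}}{M} : \mathbf{a}_1 = \mathbf{1}_{L'} \text{ and } d_H(\mathbf{a}_i, \mathbf{a}_j) \ge 2k+1 \text{ for every distinct } i, j \in [M] \right\}.$$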

The following lemma gives a lower bound on the size of $\mathcal{S}^H$ and is obtained by a counting argument.

Lemma 6.3.1. Let $Q = \sum_{i=0}^{2k} \binom{L'}{i}$ be the size of a Hamming ball of radius $2k$ centered at a vector in $\{0,1\}^{L'}$. We have that

$$|\mathcal{S}^H| \ge \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!}. \tag{6.1}$$

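Proof. Define the set of ordered tuples

$$\mathcal{S}^H_T = \left\{ (\mathbf{a}_1, \ldots, \mathbf{a}_M) : \mathbf{a}_1 = \mathbf{1}_{L'} \text{ and } d_H(\mathbf{a}_i, \mathbf{a}_j) \ge 2k+1 \text{ for distinct } i, j \in [M] \right\},$$

such that for each tuple $(\mathbf{a}_1, \ldots, \mathbf{a}_M) \in \mathcal{S}^H_T$, we have that $\{\mathbf{a}_1, \ldots, \mathbf{a}_M\} \in \mathcal{S}^H$. We show that $|\mathcal{S}^H_T| \ge \prod_{i=2}^{M} [2^{L'} - (i-1)Q]$ by finding $\prod_{i=2}^{M} [2^{L'} - (i-1)Q]$ tuples in $\mathcal{S}^H_T$. Let $\mathbf{a}_1 = \mathbf{1}_{L'}$. We select $\mathbf{a}_2, \ldots, \mathbf{a}_M$ sequentially such that each selected string $\mathbf{a}_i$, $i \in [2, M] \triangleq \{2, \ldots, M\}$, is of Hamming distance at least $2k+1$ from each one of $\mathbf{a}_1, \ldots, \mathbf{a}_{i-1}$. The tuple $(\mathbf{a}_1, \ldots, \mathbf{a}_M)$ selected in this way has pairwise Hamming distance at least $2k+1$ and thus belongs to $\mathcal{S}^H_T$.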

Since the number of strings having Hamming distance at most $2k$ from at least one of $\mathbf{a}_1, \ldots, \mathbf{a}_{i-1}$ is at most $(i-1)Q$, there are at least $2^{L'} - (i-1)Q$ possible choices of $\mathbf{a}_i$ that have Hamming distance at least $2k+1$ from each one of $\mathbf{a}_1, \ldots, \mathbf{a}_{i-1}$. Therefore, the total number of ways of selecting tuples $(\mathbf{a}_1, \ldots, \mathbf{a}_M)$ is at least $\prod_{i=2}^{M} [2^{L'} - (i-1)Q]$. Since in the above selection of tuples $(\mathbf{a}_1 = \mathbf{1}_{L'}, \ldots, \mathbf{a}_M) \in \mathcal{S}^H_T$ there are $(M-1)!$ tuples that correspond to the same set $\{\mathbf{a}_1 = \mathbf{1}_{L'}, \mathbf{a}_2, \ldots, \mathbf{a}_M\}$ in $\mathcal{S}^H$, we have that

$$|\mathcal{S}^H| = \frac{|\mathcal{S}^H_T|}{(M-1)!} \ge \frac{\prod_{i=2}^{M} [2^{L'} - (i-1)Q]}{(M-1)!} \ge \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!}.$$

□

According to (6.1), there exists an invertible mapping $F^H_S : \left[ \left\lceil \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rceil \right] \to \binom{\{0,1\}^{L'}}{M}$, computed in $O(2^{ML'})$ time using brute force, that maps an integer $d \in \left[ \left\lceil \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rceil \right]$ to a code $F^H_S(d) \in \mathcal{S}^H$. In the next section, we will present a polynomial-time algorithm that computes a map $F^H_S(d)$ for any $d \in \left[ \left\lceil \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rceil \right]$ such that $F^H_S(d) \in \mathcal{S}^H$ and $F^H_S(d_1) \ne F^H_S(d_2)$ for $d_1 \ne d_2$ and $d_1, d_2 \in \left[ \left\lceil \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rceil \right]$. Let us assume for now that the mapping $F^H_S$ is given.

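For a set $S \in \binom{\{0,1\}^{L'}}{\le M}$, define the characteristic vector $\mathbb{1}(S) \in \{0,1\}^{2^{L'}}$ of $S$ by

$$\mathbb{1}(S)_i = \begin{cases} 1 & \text{if the binary representation of } i \text{ is in } S, \\ 0 & \text{otherwise.} \end{cases}$$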

Notice that the Hamming weight of $\mathbb{1}(S)$ is $M$ for every $S \in \binom{\{0,1\}^{L'}}{M}$. The following lemma states that $k$ substitution errors in a set of strings $S$ result in at most $2k$ bit flips in $\mathbb{1}(S)$.

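Lemma 6.3.2. For $S_1, S_2 \in \binom{\{0,1\}^{L'}}{\le M}$, if $S_1 \in \mathcal{B}^H_k(S_2)$, then $d_H(\mathbb{1}(S_1), \mathbb{1}(S_2)) \le 2k$, where $d_H(\mathbb{1}(S_1), \mathbb{1}(S_2))$ is the Hamming distance between $\mathbb{1}(S_1)$ and $\mathbb{1}(S_2)$.

Proof. Note that $|S_1 \setminus S_2| \le k$ and $|S_2 \setminus S_1| \le k$. Hence $d_H(\mathbb{1}(S_1), \mathbb{1}(S_2)) = |(S_1 \setminus S_2) \cup (S_2 \setminus S_1)| \le 2k$. □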

We are ready to present the code construction. We use a set $S \in \mathcal{S}^H$ as indexing bits and protect the vector $\mathbb{1}(S)$ from substitution errors. Note that any two strings in the set $S$ have Hamming distance at least $2k+1$. Hence, knowing the set $S$, each string of indexing bits can be extracted from its erroneous version using a minimum distance decoder, which finds the unique string in $S$ that is within Hamming distance $k$ from it. The details are given as follows.

Consider the data $\mathbf{d} \in D$ to be encoded as a tuple $\mathbf{d} = (d_1, \mathbf{d}_2)$, where $d_1 \in \left[ \left\lceil \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rceil \right]$ and $\mathbf{d}_2 \in \{0,1\}^{M(L-L') - 4kL' - 2k\lceil \log ML \rceil}$. Given $(d_1, \mathbf{d}_2)$, the codeword $\{\mathbf{x}_i\}_{i=1}^{M}$ is generated by the following procedure.

Encoding:

(1) Let $F^H_S(d_1) = \{\mathbf{a}_1, \ldots, \mathbf{a}_M\} \in \mathcal{S}^H$ such that $\mathbf{a}_1 = \mathbf{1}_{L'}$ and the $\mathbf{a}_i$'s are sorted in descending lexicographic order. Let $(x_{i,1}, \ldots, x_{i,L'}) = \mathbf{a}_i$ for $i \in [M]$.

(2) Let $(x_{1,L'+1}, \ldots, x_{1,L'+4kL'}) = RS_{2k}(\mathbb{1}(\{\mathbf{a}_1, \ldots, \mathbf{a}_M\}))$, where $RS_{2k}(\mathbb{1}(\{\mathbf{a}_1, \ldots, \mathbf{a}_M\}))$ is the redundancy of a systematic Reed-Solomon code that corrects $2k$ substitutions in $\mathbb{1}(\{\mathbf{a}_1, \ldots, \mathbf{a}_M\})$.

(3) Place the information bits of $\mathbf{d}_2$ in the bits $(x_{1,L'+4kL'+1}, \ldots, x_{1,L})$, $(x_{M,L'+1}, \ldots, x_{M,L-2k\lceil \log ML \rceil})$, and $(x_{i,L'+1}, \ldots, x_{i,L})$ for $i \in [2, M-1]$.

(4) Define

$$\mathbf{m} = (\mathbf{x}_1, \ldots, \mathbf{x}_{M-1}, (x_{M,1}, \ldots, x_{M,L-2k\lceil \log ML \rceil}))$$

and let $(x_{M,L-2k\lceil \log ML \rceil+1}, \ldots, x_{M,L}) = RS_k(\mathbf{m})$, where $RS_k(\mathbf{m})$ is the Reed-Solomon redundancy that corrects $k$ substitution errors in $\mathbf{m}$. Note that $(\mathbf{x}_1, \ldots, \mathbf{x}_M) = (\mathbf{m}, RS_k(\mathbf{m}))$ is a codeword of a $k$-substitution-correcting Reed-Solomon code.
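As a consistency check on the layout, the three regions filled in Step (3) account for exactly the length of $\mathbf{d}_2$. The bookkeeping below assumes $M \ge 2$, so that strings $1$ and $M$ are distinct:

```latex
\begin{align*}
\underbrace{(L - L' - 4kL')}_{\text{string } 1}
 + \underbrace{(M-2)(L - L')}_{\text{strings } 2, \ldots, M-1}
 + \underbrace{(L - L' - 2k\lceil \log ML \rceil)}_{\text{string } M}
 = M(L - L') - 4kL' - 2k\lceil \log ML \rceil .
\end{align*}
```

The two subtracted terms are the $4kL'$ redundancy bits stored in $\mathbf{x}_1$ in Step (2) and the $2k\lceil \log ML \rceil$ redundancy bits stored in $\mathbf{x}_M$ in Step (4).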

Upon receiving the erroneous version $\{\mathbf{x}'_1, \ldots, \mathbf{x}'_M\}$ (see footnote 1), the decoding procedure is as follows.

Decoding:

(1) Note that during the encoding process, the redundancy bits that correct the vector $\mathbb{1}(\{\mathbf{a}_i\}_{i=1}^{M})$, i.e., the characteristic vector of the set of indexing bits $\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M}$, are stored in $\mathbf{x}_1$. Hence we must first identify the erroneous copy of $\mathbf{x}_1$. To this end, find the unique string $\mathbf{x}'_{i_0}$ such that $(x'_{i_0,1}, \ldots, x'_{i_0,L'})$ has at least $L' - k$ entries equal to 1. Since the indexing bits of the strings $\{\mathbf{x}_i\}_{i=1}^{M}$ have pairwise Hamming distance at least $2k+1$, there is a unique such string, and its indexing bits are the erroneous copy of $(x_{1,1}, \ldots, x_{1,L'}) = \mathbf{1}_{L'}$. Hence $\mathbf{x}'_{i_0}$ is an erroneous copy of $\mathbf{x}_1$, and the string

$$(x'_{i_0,L'+1}, \ldots, x'_{i_0,L'+4kL'})$$

is an erroneous copy of $(x_{1,L'+1}, \ldots, x_{1,L'+4kL'}) = RS_{2k}(\mathbb{1}(\{\mathbf{a}_i\}_{i=1}^{M}))$.

(2) Recall that at most $k$ substitution errors occur in the indexing bits $\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M}$. According to Lemma 6.3.2, the vector $\mathbb{1}(\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M})$ is within Hamming distance $2k$ from the vector $\mathbb{1}(\{(x'_{i,1}, \ldots, x'_{i,L'})\}_{i=1}^{M})$. Hence the Hamming distance between

$$\mathbf{s}_1 = \left(\mathbb{1}(\{(x'_{i,1}, \ldots, x'_{i,L'})\}_{i=1}^{M}), (x'_{i_0,L'+1}, \ldots, x'_{i_0,L'+4kL'})\right) \quad \text{and} \quad \mathbf{s}_2 = \left(\mathbb{1}(\{\mathbf{a}_i\}_{i=1}^{M}), RS_{2k}(\mathbb{1}(\{\mathbf{a}_i\}_{i=1}^{M}))\right)$$

is at most $2k$. Since $\mathbf{s}_2$ is a codeword of a $2k$-error-correcting Reed-Solomon code, it can be recovered from $\mathbf{s}_1$ using the Reed-Solomon decoder. Recover $d_1 = (F^H_S)^{-1}(\{\mathbf{a}_i\}_{i=1}^{M})$.

Footnote 1: Since the strings $\{\mathbf{x}_i\}_{i=1}^{M}$ have pairwise distance at least $2k+1$, the strings $\{\mathbf{x}'_i\}_{i=1}^{M}$ are distinct.

(3) Since $\mathbf{s}_2$ is recovered, the strings $\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M} = \{\mathbf{a}_i\}_{i=1}^{M}$ are known. Sort $\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M}$ lexicographically in descending order. For each $i \in [M]$, find the unique $\pi(i) \in [M]$ such that $d_H((x'_{\pi(i),1}, \ldots, x'_{\pi(i),L'}), (x_{i,1}, \ldots, x_{i,L'})) \le k$ (note that $i_0 = \pi(1)$). Similar to Step (1), we conclude that the string $\mathbf{x}'_{\pi(i)}$ is an erroneous copy of $\mathbf{x}_i$, $i \in [M]$, since the Hamming distance between $\mathbf{x}_j$ and $\mathbf{x}_i$ is at least $2k+1$ for $j \ne i$. Hence, the identities of $\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M}$ are determined from $\{(x'_{i,1}, \ldots, x'_{i,L'})\}_{i=1}^{M}$.

(4) Since $\mathbf{x}'_{\pi(i)}$ is an erroneous copy of $\mathbf{x}_i$, $i \in [M]$, it follows that the concatenation $\mathbf{s}' = (\mathbf{x}'_{\pi(1)}, \ldots, \mathbf{x}'_{\pi(M)})$ is an erroneous copy of $(\mathbf{x}_1, \ldots, \mathbf{x}_M) = (\mathbf{m}, RS_k(\mathbf{m}))$, where $\mathbf{m}$ is defined in Step (4) of the encoding procedure. Therefore, $(\mathbf{x}_1, \ldots, \mathbf{x}_M)$, and thus $\mathbf{d}_2$, can be recovered from $(\mathbf{x}'_{\pi(1)}, \ldots, \mathbf{x}'_{\pi(M)})$ by using the Reed-Solomon decoder.

(5) Output $(d_1, \mathbf{d}_2)$.
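The two matching steps of the decoder, Steps (1) and (3), can be sketched without the Reed-Solomon parts. The code below is a simplified illustration that assumes the index code $\{\mathbf{a}_i\}$ has already been recovered in Step (2); the function names and the toy parameters are hypothetical.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def find_first_string(received, Lp, k):
    """Step (1): the unique read whose index prefix has at least L' - k ones."""
    hits = [r for r in received if r[:Lp].count("1") >= Lp - k]
    if len(hits) != 1:
        raise ValueError("expected exactly one read with a near-all-ones prefix")
    return hits[0]

def reorder_reads(received, index_code, Lp, k):
    """Step (3): match each index a_i (descending order) to the read within distance k."""
    ordered = []
    for a in sorted(index_code, reverse=True):
        matches = [r for r in received if hamming(r[:Lp], a) <= k]
        if len(matches) != 1:
            raise ValueError("minimum distance matching is not unique")
        ordered.append(matches[0])
    return ordered

# Toy instance: L' = 5, k = 1, index code with minimum distance 3.
Lp, k = 5, 1
index_code = {"11111", "01100", "00011"}
# Reads arrive unordered; the read for index 00011 has one flipped bit (10011).
received = ["01100101", "10011110", "11111010"]

print(find_first_string(received, Lp, k))          # 11111010
print(reorder_reads(received, index_code, Lp, k))  # ['11111010', '01100101', '10011110']
```

After reordering, the concatenation of the reads is a corrupted version of $(\mathbf{m}, RS_k(\mathbf{m}))$ and can be passed to the outer Reed-Solomon decoder as in Step (4).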

Therefore, the codeword $\{\mathbf{x}_i\}_{i=1}^{M}$ can be recovered. The redundancy of the code is

$$r(\mathcal{C}) = \log \binom{2^L}{M} - \log \left\lceil \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rceil - \left[ M(L - L') - 4kL' - 2k\lceil \log ML \rceil \right] \tag{6.2}$$

$$\overset{(a)}{\le} 2k \log ML + (12k+2) \log M + O(k^3) + O(k \log\log ML), \tag{6.3}$$

where $(a)$ will be proved in Appendix 6.7. The complexity of the encoding/decoding is that of computing the function $F^H_S$, which, as will be discussed in Sec. 6.4, is $\mathrm{poly}(M, L, k)$.