Chapter VI: Robust Indexing: Optimal Codes Correcting Deletion/Insertion
6.3 Robust Indexing for Codes over Sets: Substitution Errors
In this section we describe our code constructions, which use the idea of robust indexing. These codes deal with substitution errors in the $M$ strings. Our codes have redundancy $O(K \log ML)$, which is order-wise optimal whenever $K$ is at most $O(\min\{L^{1/3}, L/\log M\})$.
Theorem 6.3.1. For integers $M$, $L$, and $K$, let $L' \triangleq 3\log M + 4K^2 + 1$. If $L' + 4KL' + 2K\log(4KL') \le L$, then there exists an explicit $K$-substitution code, computable in $\mathrm{poly}(M, L, K)$ time, that has redundancy $2K\log ML + (12K+2)\log M + O(K^3) + O(K\log\log ML)$.
Since the codewords consist of unordered strings, we assign indexing bits to each string so that the strings are ordered. However, instead of directly assigning the indices $1, \ldots, M$ to the strings, we embed information into the indexing bits. In other words, we use the information bits themselves for the purpose of indexing. This makes sending information more efficient.
Specifically, for a codeword $S = \{\mathbf{x}_i\}_{i=1}^{M}$, we choose the first $L'$ bits $(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})$, $i \in [M]$, of each string $\mathbf{x}_i$ as the indexing bits, and encode information in them. Then, the strings $\{\mathbf{x}_i\}_{i=1}^{M}$ are sorted according to the lexicographic order $\pi$ of the indexing bits $(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})$, $i \in [M]$, where $(x_{\pi(i),1}, x_{\pi(i),2}, \ldots, x_{\pi(i),L'}) < (x_{\pi(j),1}, x_{\pi(j),2}, \ldots, x_{\pi(j),L'})$ for $i < j$. Once $\{\mathbf{x}_i\}_{i=1}^{M}$ are ordered, it suffices to use a Reed-Solomon code to protect the concatenated string $(\mathbf{x}_{\pi(1)}, \ldots, \mathbf{x}_{\pi(M)})$, and thus the codeword $\{\mathbf{x}_i\}_{i=1}^{M}$, from $K$ substitution errors.
One of the key issues with this approach is that the indexing bits and their lexicographic order can be disrupted by substitution errors. To deal with this, we present a technique referred to as robust indexing, which protects the indexing bits from substitution errors. The basic ideas of robust indexing are as follows:
(1) Constructing the indexing bits $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ such that the Hamming distance between any two distinct $(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})$ and $(x_{j,1}, x_{j,2}, \ldots, x_{j,L'})$ is at least $2K+1$, i.e., the strings $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ form an error-correcting code under the classic coding-theoretic definition. Then, we can identify which string among $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ results in the erroneous version $(x'_{i,1}, x'_{i,2}, \ldots, x'_{i,L'})$, by using a minimum Hamming distance criterion;
(2) Using additional redundancy to protect the set of indexing bits $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ from substitution errors.
Note that we encode data in the code $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ through different choices of the code. After substitution errors, two choices of the code, which represent different messages, might result in the same read $\{(x'_{i,1}, x'_{i,2}, \ldots, x'_{i,L'})\}_{i=1}^{M}$.
Example 6.3.1. For $M = K = 2$ and $L = 8$, consider two codes $\{11111111, 00000000\}$ and $\{11111111, 00010010\}$. Both have minimum Hamming distance greater than $2K+1 = 5$ and can result in the same set $\{11111111, 00000011\}$ after $K = 2$ substitutions.
Hence, to recover the indexing bits $(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})$, $i \in [M]$, we need to know the code $\{(x_{i,1}, x_{i,2}, \ldots, x_{i,L'})\}_{i=1}^{M}$ to which the erroneous string $(x'_{i,1}, x'_{i,2}, \ldots, x'_{i,L'})$ is corrected, $i \in [M]$.
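The ambiguity in Example 6.3.1 can be checked mechanically. The following is a minimal sketch, assuming bit strings are represented as Python `str` values; the helper names `hamming`, `min_distance`, and `total_errors` are hypothetical and introduced only for illustration.

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def min_distance(code):
    """Minimum pairwise Hamming distance of a list of bit strings."""
    return min(hamming(a, b) for a, b in combinations(code, 2))

def total_errors(code, read):
    """Substitutions needed to turn `code` into `read`, matching strings
    position by position (the first string is untouched in the example)."""
    return sum(hamming(c, r) for c, r in zip(code, read))

K = 2
code_a = ["11111111", "00000000"]
code_b = ["11111111", "00010010"]
read = ["11111111", "00000011"]

# Both codes satisfy the distance requirement 2K + 1 = 5 ...
print(min_distance(code_a), min_distance(code_b))              # 8 6
# ... and both reach the same read with exactly K = 2 substitutions.
print(total_errors(code_a, read), total_errors(code_b, read))  # 2 2
```

Since both codes explain the same read, the distance property alone does not identify the transmitted code, which is why the set of indexing bits needs its own protection.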
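For an integer $\ell$, let $\mathbf{1}^{\ell}$ be the all 1's vector of length $\ell$. Define $\mathcal{S}_H$ as the set of all length $L'$ codes with cardinality $M$ and minimum Hamming distance at least $2K+1$, which contain $\mathbf{1}^{L'}$, that is,
\[
\mathcal{S}_H \triangleq \Big\{ \{\mathbf{a}_1, \ldots, \mathbf{a}_M\} \in \binom{\{0,1\}^{L'}}{M} : \mathbf{a}_1 = \mathbf{1}^{L'} \text{ and } d_H(\mathbf{a}_i, \mathbf{a}_j) \ge 2K+1 \text{ for every distinct } i, j \in [M] \Big\}.
\]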
The following lemma gives a lower bound on the size of $\mathcal{S}_H$ and is obtained using a counting argument.
Lemma 6.3.1. Let $Q = \sum_{i=0}^{2K} \binom{L'}{i}$ be the size of a Hamming ball of radius $2K$ centered at a vector in $\{0,1\}^{L'}$. We have that
\[
|\mathcal{S}_H| \ge \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!}. \tag{6.1}
\]
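Proof. Define the set of ordered tuples
\[
\mathcal{S}_H^T = \big\{ (\mathbf{a}_1, \ldots, \mathbf{a}_M) : \mathbf{a}_1 = \mathbf{1}^{L'} \text{ and } d_H(\mathbf{a}_i, \mathbf{a}_j) \ge 2K+1 \text{ for distinct } i, j \in [M] \big\},
\]
such that for each tuple $(\mathbf{a}_1, \ldots, \mathbf{a}_M) \in \mathcal{S}_H^T$, we have that $\{\mathbf{a}_1, \ldots, \mathbf{a}_M\} \in \mathcal{S}_H$. We show that $|\mathcal{S}_H^T| \ge \prod_{i=2}^{M} [2^{L'} - (i-1)Q]$, by finding $\prod_{i=2}^{M} [2^{L'} - (i-1)Q]$ tuples in $\mathcal{S}_H^T$. Let $\mathbf{a}_1 = \mathbf{1}^{L'}$. We select $\mathbf{a}_2, \ldots, \mathbf{a}_M$ sequentially such that each selected string $\mathbf{a}_i$, $i \in [2, M] \triangleq \{2, \ldots, M\}$, is of Hamming distance at least $2K+1$ from each one of $\mathbf{a}_1, \ldots, \mathbf{a}_{i-1}$. The tuple $(\mathbf{a}_1, \ldots, \mathbf{a}_M)$ selected in this way has pairwise Hamming distance at least $2K+1$ and thus belongs to $\mathcal{S}_H^T$.
Since the number of strings having Hamming distance at most $2K$ from at least one of $\mathbf{a}_1, \ldots, \mathbf{a}_{i-1}$ is at most $(i-1)Q$, there are at least $2^{L'} - (i-1)Q$ possible choices of $\mathbf{a}_i$ that have Hamming distance at least $2K+1$ from each one of $\mathbf{a}_1, \ldots, \mathbf{a}_{i-1}$. Therefore, the total number of ways of selecting tuples $(\mathbf{a}_1, \ldots, \mathbf{a}_M)$ is at least $\prod_{i=2}^{M} [2^{L'} - (i-1)Q]$. Since in the above selection of tuples $(\mathbf{a}_1 = \mathbf{1}^{L'}, \ldots, \mathbf{a}_M) \in \mathcal{S}_H^T$, there are $(M-1)!$ tuples that correspond to the same set $\{\mathbf{a}_1 = \mathbf{1}^{L'}, \mathbf{a}_2, \ldots, \mathbf{a}_M\}$ in $\mathcal{S}_H$, we have that
\[
|\mathcal{S}_H| = \frac{|\mathcal{S}_H^T|}{(M-1)!} \ge \frac{\prod_{i=2}^{M} [2^{L'} - (i-1)Q]}{(M-1)!} \ge \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!}.
\]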
□
According to (6.1), there exists an invertible mapping $F_S^H : \left[ \left\lfloor \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rfloor \right] \to \binom{\{0,1\}^{L'}}{M}$, computed in $O(2^{ML'})$ time using brute force, that maps an integer $i \in \left[ \left\lfloor \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rfloor \right]$ to a code $F_S^H(i) \in \mathcal{S}_H$. In the next section, we will present a polynomial time algorithm that computes a map $F_S^H(i)$ for any $i \in \left[ \left\lfloor \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rfloor \right]$ such that $F_S^H(i) \in \mathcal{S}_H$ and $F_S^H(i_1) \ne F_S^H(i_2)$ for $i_1 \ne i_2$ and $i_1, i_2 \in \left[ \left\lfloor \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rfloor \right]$. Let us assume for now that the mapping $F_S^H$ is given.
For a set $S \in \binom{\{0,1\}^{L'}}{\le M}$, define the characteristic vector $\mathbf{1}(S) \in \{0,1\}^{2^{L'}}$ of $S$ by
\[
\mathbf{1}(S)_i =
\begin{cases}
1 & \text{if the binary representation of } i \text{ is in } S, \\
0 & \text{else.}
\end{cases}
\]
Notice that the Hamming weight of $\mathbf{1}(S)$ is $M$ for every $S \in \binom{\{0,1\}^{L'}}{M}$. The following lemma states that $K$ substitution errors in a set of strings $S$ result in at most $2K$ bit flips in $\mathbf{1}(S)$.
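Lemma 6.3.2. For $S_1, S_2 \in \binom{\{0,1\}^{L'}}{\le M}$, if $S_1 \in \mathcal{B}_K^H(S_2)$, then $d_H(\mathbf{1}(S_1), \mathbf{1}(S_2)) \le 2K$, where $d_H(\mathbf{1}(S_1), \mathbf{1}(S_2))$ is the Hamming distance between $\mathbf{1}(S_1)$ and $\mathbf{1}(S_2)$.
Proof. Note that $|S_1 \setminus S_2| \le K$ and $|S_2 \setminus S_1| \le K$. Hence $d_H(\mathbf{1}(S_1), \mathbf{1}(S_2)) = |(S_1 \setminus S_2) \cup (S_2 \setminus S_1)| \le 2K$. □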
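Lemma 6.3.2 can be illustrated on a toy set. The following is a minimal sketch, assuming strings are Python `str` values and that position $i$ of the characteristic vector is indexed from 0 by the integer value of the string; `char_vector` is a hypothetical helper name.

```python
def char_vector(S, Lp):
    """Characteristic vector of a set S of length-Lp bit strings:
    entry i is 1 exactly when the binary representation of i is in S."""
    vec = [0] * (2 ** Lp)
    for s in S:
        vec[int(s, 2)] = 1
    return vec

K = 1
S2 = {"111", "000", "100"}
S1 = {"111", "010", "100"}   # one substitution: 000 -> 010

v1, v2 = char_vector(S1, 3), char_vector(S2, 3)
flips = sum(a != b for a, b in zip(v1, v2))
print(flips)  # 2, which equals 2K
```

A single substitution removes one string from the set and adds another, so it clears one entry of the characteristic vector and sets one entry, which is where the factor 2 comes from.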
We are ready to present the code construction. We use a set $S \in \mathcal{S}_H$ as indexing bits and protect the vector $\mathbf{1}(S)$ from substitution errors. Note that any two strings in the set $S$ have Hamming distance at least $2K+1$. Hence, knowing the set $S$, each string of indexing bits can be extracted from its erroneous version using a minimum distance decoder, which finds the unique string in $S$ that is within Hamming distance $K$ from it. The details are given as follows.
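Consider the data $\mathbf{d} \in D$ to be encoded as a tuple $\mathbf{d} = (d_1, \mathbf{d}_2)$, where $d_1 \in \left[ \left\lfloor \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rfloor \right]$ and $\mathbf{d}_2 \in \{0,1\}^{M(L-L') - 4KL' - 2K\lceil \log ML \rceil}$. Given $(d_1, \mathbf{d}_2)$, the codeword $\{\mathbf{x}_i\}_{i=1}^{M}$ is generated by the following procedure.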
Encoding:
(1) Let $F_S^H(d_1) = \{\mathbf{a}_1, \ldots, \mathbf{a}_M\} \in \mathcal{S}_H$ such that $\mathbf{a}_1 = \mathbf{1}^{L'}$ and the $\mathbf{a}_i$'s are sorted in descending lexicographic order. Let $(x_{i,1}, \ldots, x_{i,L'}) = \mathbf{a}_i$, for $i \in [M]$.
(2) Let $(x_{1,L'+1}, \ldots, x_{1,L'+4KL'}) = RS_{2K}(\mathbf{1}(\{\mathbf{a}_1, \ldots, \mathbf{a}_M\}))$, where $RS_{2K}(\mathbf{1}(\{\mathbf{a}_1, \ldots, \mathbf{a}_M\}))$ is the redundancy of a systematic Reed-Solomon code that corrects $2K$ substitutions in $\mathbf{1}(\{\mathbf{a}_1, \ldots, \mathbf{a}_M\})$.
(3) Place the information bits of $\mathbf{d}_2$ in bits $(x_{1,L'+4KL'+1}, \ldots, x_{1,L})$, $(x_{M,L'+1}, \ldots, x_{M,L-2K\lceil \log ML \rceil})$, and $(x_{i,L'+1}, \ldots, x_{i,L})$ for $i \in [2, M-1]$.
(4) Define
\[
\mathbf{m} = (\mathbf{x}_1, \ldots, \mathbf{x}_{M-1}, (x_{M,1}, \ldots, x_{M,L-2K\lceil \log ML \rceil}))
\]
and let $(x_{M,L-2K\lceil \log ML \rceil+1}, \ldots, x_{M,L}) = RS_K(\mathbf{m})$, where $RS_K(\mathbf{m})$ is the Reed-Solomon redundancy that corrects $K$ substitution errors in $\mathbf{m}$. Note that $(\mathbf{x}_1, \ldots, \mathbf{x}_M) = (\mathbf{m}, RS_K(\mathbf{m}))$ is a codeword of a $K$-substitution correcting Reed-Solomon code.
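The bit budget implied by the encoding steps can be tallied directly. The following is a minimal sketch of that arithmetic only (it does not implement the Reed-Solomon encoders or the index mapping); the function name `layout` is hypothetical, logarithms are taken base 2, $\log M$ is rounded up when computing $L'$ (an assumption for the case where $M$ is not a power of two), and $M \ge 2$ is assumed so that the first and last strings are distinct.

```python
from math import ceil, log2

def layout(M: int, L: int, K: int):
    """Return L' and the number of free information bits (for d_2) per string."""
    Lp = 3 * ceil(log2(M)) + 4 * K * K + 1       # L' (log M rounded up: assumption)
    if Lp + 4 * K * Lp + 2 * K * log2(4 * K * Lp) > L:
        raise ValueError("the condition of Theorem 6.3.1 is not satisfied")
    rs_index = 4 * K * Lp                        # redundancy for 1(S), stored in x_1
    rs_all = 2 * K * ceil(log2(M * L))           # redundancy for m, stored in x_M
    free = {
        "x_1": L - Lp - rs_index,                # after indexing bits and RS_2K
        "x_M": L - Lp - rs_all,                  # before RS_K at the end of x_M
        "middle": (M - 2) * (L - Lp),            # x_2, ..., x_{M-1}
    }
    # The three regions together hold exactly the bits of d_2.
    assert sum(free.values()) == M * (L - Lp) - rs_index - rs_all
    return Lp, free

print(layout(16, 128, 1))
# (17, {'x_1': 43, 'x_M': 89, 'middle': 1554})
```

For $M = 16$, $L = 128$, $K = 1$ this gives $L' = 17$ and $43 + 89 + 1554 = 1686$ bits for the second data part, matching $M(L-L') - 4KL' - 2K\lceil \log ML \rceil$.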
Upon receiving the erroneous version$^{1}$ $\{\mathbf{x}'_1, \ldots, \mathbf{x}'_M\}$, the decoding procedure is as follows.
Decoding:
(1) Note that during the encoding process, the redundancy bits that correct the vector $\mathbf{1}(\{\mathbf{a}_i\}_{i=1}^{M})$, i.e., the characteristic vector of the set of indexing bits $\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M}$, are stored in $\mathbf{x}_1$. Hence we must first identify the erroneous copy of $\mathbf{x}_1$. To this end, find the unique string $\mathbf{x}'_{i_0}$ such that $(x'_{i_0,1}, \ldots, x'_{i_0,L'})$ has at least $L' - K$ many 1-entries. Since the indexing bits of the strings $\{\mathbf{x}_i\}_{i=1}^{M}$ have pairwise Hamming distance at least $2K+1$, there is a unique such string, which is the erroneous copy of $(x_{1,1}, \ldots, x_{1,L'})$. Hence $\mathbf{x}'_{i_0}$ is an erroneous copy of $\mathbf{x}_1$ and the string
\[
(x'_{i_0,L'+1}, \ldots, x'_{i_0,L'+4KL'})
\]
is an erroneous copy of $(x_{1,L'+1}, \ldots, x_{1,L'+4KL'}) = RS_{2K}(\mathbf{1}(\{\mathbf{a}_i\}_{i=1}^{M}))$.
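(2) Let $K' \le K$ be the number of substitution errors that occur in the indexing bits $\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M}$. According to Lemma 6.3.2, the vector $\mathbf{1}(\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M})$ is within Hamming distance $2K'$ from the vector $\mathbf{1}(\{(x'_{i,1}, \ldots, x'_{i,L'})\}_{i=1}^{M})$. Since at most $K - K'$ errors remain for the redundancy bits, the Hamming distance between
\[
\mathbf{s}_1 = \big(\mathbf{1}(\{(x'_{i,1}, \ldots, x'_{i,L'})\}_{i=1}^{M}),\ (x'_{i_0,L'+1}, \ldots, x'_{i_0,L'+4KL'})\big) \quad \text{and} \quad \mathbf{s}_2 = \big(\mathbf{1}(\{\mathbf{a}_i\}_{i=1}^{M}),\ RS_{2K}(\mathbf{1}(\{\mathbf{a}_i\}_{i=1}^{M}))\big)
\]
is at most $2K' + (K - K') \le 2K$. Since $\mathbf{s}_2$ is a codeword of a $2K$-error-correcting Reed-Solomon code, it can be recovered from $\mathbf{s}_1$ using the Reed-Solomon decoder. Recover $d_1 = (F_S^H)^{-1}(\{\mathbf{a}_i\}_{i=1}^{M})$.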
$^{1}$ Since the strings $\{\mathbf{x}_i\}_{i=1}^{M}$ have distance at least $2K+1$ from each other, the strings $\{\mathbf{x}'_i\}_{i=1}^{M}$ are distinct.
(3) Since $\mathbf{s}_2$ is recovered, the strings $\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M} = \{\mathbf{a}_i\}_{i=1}^{M}$ are known. Sort $\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M}$ lexicographically in descending order. For each $i \in [M]$, find the unique $\pi(i) \in [M]$ such that $d_H\big((x'_{\pi(i),1}, \ldots, x'_{\pi(i),L'}), (x_{i,1}, \ldots, x_{i,L'})\big) \le K$ (note that $i_0 = \pi(1)$). Similar to Step (1), we conclude that the string $\mathbf{x}'_{\pi(i)}$ is an erroneous copy of $\mathbf{x}_i$, $i \in [M]$, since the Hamming distance between $\mathbf{x}_i$ and $\mathbf{x}_j$ is at least $2K+1$ for $i \ne j$. Hence, the identities of $\{(x_{i,1}, \ldots, x_{i,L'})\}_{i=1}^{M}$ are determined from $\{(x'_{i,1}, \ldots, x'_{i,L'})\}_{i=1}^{M}$.
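(4) Since $\mathbf{x}'_{\pi(i)}$ is an erroneous copy of $\mathbf{x}_i$, $i \in [M]$, it follows that the concatenation $\mathbf{s}' = (\mathbf{x}'_{\pi(1)}, \ldots, \mathbf{x}'_{\pi(M)})$ is an erroneous copy of $(\mathbf{x}_1, \ldots, \mathbf{x}_M) = (\mathbf{m}, RS_K(\mathbf{m}))$, where $\mathbf{m}$ is defined in Step (4) of the encoding procedure. Therefore, $(\mathbf{x}_1, \ldots, \mathbf{x}_M)$ and thus $\mathbf{d}_2$ can be recovered from $(\mathbf{x}'_{\pi(1)}, \ldots, \mathbf{x}'_{\pi(M)})$ by using the Reed-Solomon decoder.
(5) Output $(d_1, \mathbf{d}_2)$.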
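The minimum distance matching used in Steps (1) and (3) of the decoding can be sketched as follows. This is a minimal illustration under the assumption that the index code has already been recovered (Step (2)) and is given as a list sorted in descending lexicographic order; it covers only the matching, not the Reed-Solomon decoding. The names `hamming` and `match_strings` are hypothetical, and indices are 0-based.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def match_strings(code, received, K):
    """For each index string a_i in `code`, find the unique received indexing
    prefix within Hamming distance K. Returns pi with received[pi[i]] being
    the erroneous copy of code[i]."""
    pi = []
    for a in code:
        hits = [j for j, r in enumerate(received) if hamming(a, r) <= K]
        if len(hits) != 1:
            raise ValueError("no unique match: more than K errors occurred")
        pi.append(hits[0])
    return pi

K = 1
code = ["11111", "11000", "00100"]        # minimum distance 3 = 2K + 1
received = ["00100", "11111", "11001"]    # unordered, one substitution in 11000
print(match_strings(code, received, K))   # [1, 2, 0]
```

The first entry of the result plays the role of $i_0$: it points to the received string whose indexing bits are closest to the all-ones word.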
Therefore, the codeword $\{\mathbf{x}_i\}_{i=1}^{M}$ can be recovered. The redundancy of the code is
\begin{align}
r(\mathcal{C}) &= \log \binom{2^{L}}{M} - \log \left\lfloor \frac{(2^{L'} - MQ)^{M-1}}{(M-1)!} \right\rfloor - \big[ M(L-L') - 4KL' - 2K\lceil \log ML \rceil \big] \tag{6.2} \\
&\overset{(a)}{\le} 2K\log ML + (12K+2)\log M + O(K^3) + O(K\log\log ML), \tag{6.3}
\end{align}
where $(a)$ will be proved in Appendix 6.7. The complexity of the encoding/decoding is that of computing the function $F_S^H$, which, as will be discussed in Sec. 6.4, is $\mathrm{poly}(M, L, K)$.
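Expression (6.2) can be evaluated exactly for concrete parameters using integer arithmetic. The following is a minimal sketch, assuming base-2 logarithms and rounding $\log M$ up when computing $L'$; the function name `redundancy` is hypothetical, and the parameters are chosen only for illustration.

```python
from math import ceil, comb, factorial, log2

def redundancy(M: int, L: int, K: int) -> float:
    """Evaluate the right-hand side of (6.2) in bits."""
    Lp = 3 * ceil(log2(M)) + 4 * K * K + 1            # L' (log M rounded up: assumption)
    Q = sum(comb(Lp, i) for i in range(2 * K + 1))    # Hamming ball of radius 2K
    index_codes = (2 ** Lp - M * Q) ** (M - 1) // factorial(M - 1)
    info_bits = M * (L - Lp) - 4 * K * Lp - 2 * K * ceil(log2(M * L))
    # log2 of the number of size-M sets of length-L strings, minus the
    # bits carried by the index code choice and by d_2.
    return log2(comb(2 ** L, M)) - log2(index_codes) - info_bits

r = redundancy(16, 128, 1)
print(round(r, 1))
```

For $M = 16$, $L = 128$, $K = 1$ the value is a little over 103 bits, to be compared with the leading terms $2K\log ML + (12K+2)\log M = 22 + 56 = 78$ of (6.3) before the $O(K^3)$ and $O(K\log\log ML)$ terms are added.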