Chapter VI: Robust Indexing: Optimal Codes Correcting Deletion/Insertion
6.4 Computing $F^H_S$
(3) Since $\mathbf{s}_2$ is recovered, the strings $\{(x_{i,1},\ldots,x_{i,L'})\}_{i=1}^{M}=\{\mathbf{a}_i\}_{i=1}^{M}$ are known. Sort $\{(x_{i,1},\ldots,x_{i,L'})\}_{i=1}^{M}$ lexicographically in descending order. For each $i\in[M]$, find the unique $\pi(i)\in[M]$ such that $d_H\big((x'_{\pi(i),1},\ldots,x'_{\pi(i),L'}),(x_{i,1},\ldots,x_{i,L'})\big)\le K$ (note that $i_0=\pi(1)$). Similar to Step (1), we conclude that the string $\mathbf{x}'_{\pi(i)}$ is an erroneous copy of $\mathbf{x}_i$, $i\in[M]$, since the Hamming distance between $\mathbf{x}_i$ and $\mathbf{x}_j$ is at least $2K+1$ for $i\ne j$. Hence, the identities of the strings $\{(x_{i,1},\ldots,x_{i,L'})\}_{i=1}^{M}$ are determined from $\{(x'_{i,1},\ldots,x'_{i,L'})\}_{i=1}^{M}$.
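(4) Since $\mathbf{x}'_{\pi(i)}$ is an erroneous copy of $\mathbf{x}_i$, $i\in[M]$, it follows that the concatenation $\mathbf{s}'=(\mathbf{x}'_{\pi(1)},\ldots,\mathbf{x}'_{\pi(M)})$ is an erroneous copy of $(\mathbf{x}_1,\ldots,\mathbf{x}_M)=(\mathbf{m},RS_K(\mathbf{m}))$, where $\mathbf{m}$ is defined in Step (4) in the encoding procedure. Therefore, $(\mathbf{x}_1,\ldots,\mathbf{x}_M)$ and thus $\mathbf{d}_2$ can be recovered from $(\mathbf{x}'_{\pi(1)},\ldots,\mathbf{x}'_{\pi(M)})$ by using the Reed-Solomon decoder.
(5) Output $(d_1,\mathbf{d}_2)$.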
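The matching in Step (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the index prefixes are represented as bit tuples, and the function name `match_noisy_prefixes` and the toy values are hypothetical, not part of the construction itself.

```python
def hamming(u, v):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(u, v))

def match_noisy_prefixes(clean, noisy, K):
    """For each clean prefix a_i, return the (0-based) index pi(i) of the
    unique noisy prefix within Hamming distance K of it.

    Uniqueness relies on the clean prefixes being pairwise at distance
    >= 2K + 1 and on at most K substitutions in total."""
    pi = []
    for a in clean:
        hits = [j for j, y in enumerate(noisy) if hamming(a, y) <= K]
        if len(hits) != 1:
            raise ValueError("prefix cannot be matched uniquely")
        pi.append(hits[0])
    return pi

# Toy example (hypothetical values): K = 1, prefixes at pairwise distance >= 3.
clean = sorted([(1, 1, 1, 1, 1), (0, 0, 0, 1, 1), (1, 1, 0, 0, 0)], reverse=True)
noisy = [(0, 1, 0, 1, 1), (1, 1, 0, 0, 0), (1, 1, 1, 1, 1)]  # one bit flipped in total
print(match_noisy_prefixes(clean, noisy, K=1))  # [2, 1, 0]
```

The brute-force search over all pairs is quadratic in $M$; it is used here only because it mirrors the statement of Step (3) directly.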
Therefore, the codeword $\{\mathbf{x}_i\}_{i=1}^{M}$ can be recovered. The redundancy of the code is
$$r(\mathcal{C})=\log\binom{2^{L}}{M}-\log\left\lfloor\frac{(2^{L'}-MQ)^{M-1}}{(M-1)!}\right\rfloor-\Big[M(L-L')-4KL'-2K\lceil\log ML\rceil\Big] \tag{6.2}$$
$$\overset{(a)}{\le} 2K\log ML+(12K+2)\log M+O(K^{3})+O(K\log\log ML), \tag{6.3}$$
where $(a)$ will be proved in Appendix 6.7. The complexity of the encoding/decoding is that of computing the function $F^H_S$, which, as will be discussed in Sec. 6.4, is $\mathrm{poly}(M,L,K)$.
The computation of $F^H_S$ consists of two steps. In the first step we map the integer $d\in\left[\left\lfloor\frac{(2^{L'}-MQ)^{M-1}}{(M-1)!}\right\rfloor\right]$ into integers $d_1,\ldots,d_M\in[2^{L'}]$, where $d_1=2^{L'}$ is fixed and the remaining $M-1$ integers satisfy $d_{i+1}\le d_i-Q$ for $i\in[M-1]$. In the second step, we use $d_i$ to generate $\mathbf{a}_i$ sequentially for $i\in[2,M]$. The first step is given in the following lemma.
Lemma 6.4.1. There exists an invertible map $F^H_P:\left[\left\lfloor\frac{(2^{L'}-MQ)^{M-1}}{(M-1)!}\right\rfloor\right]\to[2^{L'}]^{M}$, computable in $\mathrm{poly}(L',M)$ time, that maps an integer $d\in\left[\left\lfloor\frac{(2^{L'}-MQ)^{M-1}}{(M-1)!}\right\rfloor\right]$ to an integer tuple $(d_1,\ldots,d_M)$ such that $d_1=2^{L'}$ and $d_i\ge d_{i+1}+Q$ for $i\in[M-1]$.
Proof. Recall the combinatorial numbering map $F_{com}$ that maps an integer in the range $\left[\binom{n}{k}\right]$ to a set of $k$ different and unordered integers in the range $[n]$, for integers $n$ and $k\le n$. Since
$$\frac{(2^{L'}-MQ)^{M-1}}{(M-1)!}\le\binom{2^{L'}-MQ+M-1}{M-1},$$
we can map $d\in\left[\left\lfloor\frac{(2^{L'}-MQ)^{M-1}}{(M-1)!}\right\rfloor\right]$ to $M-1$ integers $F_{com}(d)=\{d'_2,\ldots,d'_M\}$ such that $2^{L'}-MQ+M-1\ge d'_2>d'_3>\ldots>d'_M$. Let $d_1=2^{L'}$, $d_i=d'_i+(M-i+1)(Q-1)$ for $i\in[2,M]$, and $F^H_P(d)=(d_1,\ldots,d_M)$. Then we have that $d_2\le 2^{L'}-Q$ and that $d_i\ge d_{i+1}+Q$ for $i\in[2,M-1]$. Since the map $F_{com}$ is invertible and computable in $\mathrm{poly}(L',M)$ time, so is the map $F^H_P$. $\square$
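The construction in this proof can be sketched in Python. This is a minimal sketch, not the authors' implementation: the combinatorial numbering map is realized here with the standard combinatorial number system (one possible choice of $F_{com}$), and all function names are hypothetical. `Lp` stands for $L'$, and `ball_size` assumes $Q=\sum_{i=0}^{2K}\binom{L'}{i}$, the size of a Hamming ball of radius $2K$.

```python
from math import comb, factorial

def ball_size(Lp, K):
    """Q: number of length-Lp strings within Hamming distance 2K of a fixed string."""
    return sum(comb(Lp, i) for i in range(2 * K + 1))

def unrank_combination(rank, n, k):
    """Combinatorial number system: map rank in [0, C(n, k)) to k distinct
    integers in [1, n], returned in strictly decreasing order."""
    chosen = []
    for j in range(k, 0, -1):
        lo, hi = j - 1, n - 1              # comb(j - 1, j) == 0 <= rank
        while lo < hi:                     # largest c with comb(c, j) <= rank
            mid = (lo + hi + 1) // 2
            if comb(mid, j) <= rank:
                lo = mid
            else:
                hi = mid - 1
        rank -= comb(lo, j)
        chosen.append(lo + 1)
    return chosen

def rank_combination(chosen):
    """Inverse of unrank_combination (input in strictly decreasing order)."""
    k = len(chosen)
    return sum(comb(c - 1, k - idx) for idx, c in enumerate(chosen))

def integer_to_tuple(d, Lp, M, K):
    """Map d in [1, floor((2^Lp - M*Q)^(M-1) / (M-1)!)] to (d_1, ..., d_M)."""
    Q = ball_size(Lp, K)
    limit = (2**Lp - M * Q) ** (M - 1) // factorial(M - 1)
    if not 1 <= d <= limit:
        raise ValueError("d out of range")
    n = 2**Lp - M * Q + M - 1
    d_prime = unrank_combination(d - 1, n, M - 1)        # d'_2 > ... > d'_M
    shifted = [dp + (M - i + 1) * (Q - 1)                # d_i = d'_i + (M-i+1)(Q-1)
               for i, dp in zip(range(2, M + 1), d_prime)]
    return [2**Lp] + shifted

def tuple_to_integer(ds, Lp, M, K):
    """Inverse of integer_to_tuple."""
    Q = ball_size(Lp, K)
    d_prime = [di - (M - i + 1) * (Q - 1)
               for i, di in zip(range(2, M + 1), ds[1:])]
    return rank_combination(d_prime) + 1

print(integer_to_tuple(1, Lp=8, M=3, K=1))  # [256, 74, 37]
```

The binary search keeps each unranking step polynomial in $L'$ even though the candidates range over up to $2^{L'}$ values; a linear scan would not.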
We now turn to the second step. Given the integers $F^H_P(d)=(d_1,\ldots,d_M)$, we generate the indexing bits $\{\mathbf{a}_i=(x_{i,1},\ldots,x_{i,L'})\}_{i=1}^{M}\in\mathcal{S}^H$. First, we have that $\mathbf{a}_1=1^{L'}$. The algorithm generates the indexing string $\mathbf{a}_i$ sequentially for $i\in[2,M]$. Each indexing string $\mathbf{a}_i$ is generated bit by bit in a recursive manner. We first give the following definition, on which the algorithm is based.
For a set of strings $A\subseteq\{0,1\}^{L'}$ and a string $\mathbf{a}\in\{0,1\}^{\ell}$ of length $\ell\in[L']$, denote
$$N_H(\mathbf{a},A)=\sum_{\mathbf{c}\in A}\big|\{\mathbf{c}':(c'_1,\ldots,c'_\ell)=\mathbf{a}\text{ and }d_H(\mathbf{c}',\mathbf{c})\le 2K\}\big|$$
as the sum, over $\mathbf{c}\in A$, of the number of sequences that have prefix $\mathbf{a}$ and have Hamming distance at most $2K$ from $\mathbf{c}$. The number $N_H(\mathbf{a},A)$ has the following properties that will be useful in our proof. The first property implies that
$$2^{L'-\ell}-N_H(\mathbf{a},A)=\big(2^{L'-\ell-1}-N_H((\mathbf{a},0),A)\big)+\big(2^{L'-\ell-1}-N_H((\mathbf{a},1),A)\big), \tag{6.4}$$
which enables a recursion to generate each sequence $\mathbf{a}_i$. The second property provides a way to compute $N_H(\mathbf{a},A)$.
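Lemma 6.4.2.
1. For any sequence $\mathbf{a}\in\{0,1\}^{\ell}$ of length $\ell\in[L'-1]$ and set $A\subseteq\{0,1\}^{L'}$, we have
$$N_H(\mathbf{a},A)=N_H((\mathbf{a},0),A)+N_H((\mathbf{a},1),A), \tag{6.5}$$
where $(\mathbf{a},0)$ or $(\mathbf{a},1)$ is the concatenation of $\mathbf{a}$ and a $0$ or $1$ bit, respectively.
2. For any $\mathbf{a}\in\{0,1\}^{\ell}$ and $A\subseteq\{0,1\}^{L'}$, we have
$$N_H(\mathbf{a},A)=\sum_{\mathbf{c}\in A}\ \sum_{i=0}^{2K-d_H(\mathbf{a},(c_1,\ldots,c_\ell))}\binom{L'-\ell}{i}. \tag{6.6}$$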
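Proof. Note that for any sequence $\mathbf{c}$, the $(\ell+1)$-th bit of any sequence $\mathbf{c}'$ satisfying $(c'_1,\ldots,c'_\ell)=\mathbf{a}$ is either $0$ or $1$. Hence
$$\begin{aligned}
\big|\{\mathbf{c}':(c'_1,\ldots,c'_\ell)=\mathbf{a}\text{ and }d_H(\mathbf{c}',\mathbf{c})\le 2K\}\big|
&=\big|\{\mathbf{c}':(c'_1,\ldots,c'_{\ell+1})=(\mathbf{a},0)\text{ and }d_H(\mathbf{c}',\mathbf{c})\le 2K\}\big|\\
&\quad+\big|\{\mathbf{c}':(c'_1,\ldots,c'_{\ell+1})=(\mathbf{a},1)\text{ and }d_H(\mathbf{c}',\mathbf{c})\le 2K\}\big|,
\end{aligned}$$
which implies Eq. (6.5). Moreover, for any sequence $\mathbf{c}\in\{0,1\}^{L'}$, we have that
$$\big|\{\mathbf{c}':(c'_1,\ldots,c'_\ell)=\mathbf{a}\text{ and }d_H(\mathbf{c}',\mathbf{c})\le 2K\}\big|=\sum_{i=0}^{2K-d_H(\mathbf{a},(c_1,\ldots,c_\ell))}\binom{L'-\ell}{i}.$$
Hence the number $N_H(\mathbf{a},A)$ can be computed by Eq. (6.6). $\square$
Next, we present the algorithm that takes $F^H_P(d)=(d_1,\ldots,d_M)$ as input and outputs $\mathbf{a}_i$ such that $\{\mathbf{a}_1,\ldots,\mathbf{a}_M\}\in\mathcal{S}^H$ and such that the decimal representation $\mathrm{decimal}(\mathbf{a}_i)$ of $\mathbf{a}_i$, $i\in[M]$, satisfies
$$\mathrm{decimal}(\mathbf{a}_i)=d_i-1+\sum_{\ell:\,a_{i,\ell}=1\text{ and }\ell\in[L']}N_H\big((a_{i,1},\ldots,a_{i,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i-1}\big). \tag{6.7}$$
We then show that the sequences $\mathbf{a}_i$, $i\in[M]$, satisfying (6.7) are decodable, i.e., we can recover the tuple $(d_1,\ldots,d_M)$ from $\{\mathbf{a}_1,\ldots,\mathbf{a}_M\}$.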
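The closed form in Eq. (6.6) is easy to evaluate directly. The following sketch computes the prefix count both ways, via Eq. (6.6) and by brute-force enumeration from the definition, so that the two can be compared. The function names `n_h` and `n_h_bruteforce` and the toy set `A` are assumptions made for illustration; `Lp` stands for $L'$.

```python
from itertools import product
from math import comb

def n_h(prefix, A, Lp, K):
    """Prefix count of Eq. (6.6): strings with the given prefix that lie
    within Hamming distance 2K of c, summed over c in A."""
    ell = len(prefix)
    total = 0
    for c in A:
        dist = sum(p != q for p, q in zip(prefix, c))  # distance on the first ell bits
        budget = 2 * K - dist                          # substitutions left for the tail
        if budget >= 0:
            total += sum(comb(Lp - ell, i) for i in range(budget + 1))
    return total

def n_h_bruteforce(prefix, A, Lp, K):
    """Same count, obtained by enumerating every completion of the prefix."""
    ell = len(prefix)
    total = 0
    for c in A:
        for tail in product((0, 1), repeat=Lp - ell):
            cand = tuple(prefix) + tail
            if sum(x != y for x, y in zip(cand, c)) <= 2 * K:
                total += 1
    return total

# Toy parameters (hypothetical): L' = 6, K = 1.
Lp, K = 6, 1
A = [(1, 1, 1, 1, 1, 1), (0, 0, 0, 1, 1, 1)]
print(n_h((1,), A, Lp, K))             # 22 = (1 + 5 + 10) + (1 + 5)
print(n_h_bruteforce((1,), A, Lp, K))  # 22
```

The closed form needs $O(|A|\cdot K)$ binomial evaluations per prefix, whereas the enumeration is exponential in $L'-\ell$ and serves only as a cross-check.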
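Encoding:
for $i\in[M]$ do
    $d=d_i$.
    for $\ell\in[L']$ do
        if $2^{L'-\ell}-N_H\big((a_{i,1},\ldots,a_{i,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i-1}\big)\ge d$, then $a_{i,\ell}=0$.
        else
            $d=d-\Big(2^{L'-\ell}-N_H\big((a_{i,1},\ldots,a_{i,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i-1}\big)\Big)$, $a_{i,\ell}=1$.
        end if
    end for
end for
return $\{\mathbf{a}_1,\ldots,\mathbf{a}_M\}$.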
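The encoding loop above translates almost line by line into Python. This is a minimal sketch under stated assumptions: strings are bit tuples, `Lp` stands for $L'$, the helper `n_h` evaluates the prefix count via Eq. (6.6), and the input tuple is assumed to satisfy the gap conditions of Lemma 6.4.1. All names are hypothetical.

```python
from math import comb

def n_h(prefix, A, Lp, K):
    """Prefix count of Eq. (6.6), summed over the strings c in A."""
    ell, total = len(prefix), 0
    for c in A:
        budget = 2 * K - sum(p != q for p, q in zip(prefix, c))
        if budget >= 0:
            total += sum(comb(Lp - ell, i) for i in range(budget + 1))
    return total

def encode_index_strings(ds, Lp, K):
    """Generate a_1, ..., a_M bit by bit from (d_1, ..., d_M)."""
    strings = []
    for d_i in ds:
        d, bits = d_i, []
        for ell in range(1, Lp + 1):
            # Free strings below the 0-child: 2^(L'-ell) minus the prefix count.
            free0 = 2 ** (Lp - ell) - n_h(bits + [0], strings, Lp, K)
            if free0 >= d:
                bits.append(0)
            else:
                d -= free0
                bits.append(1)
        if d != 1:  # violated only if the input tuple is invalid
            raise ValueError("input tuple violates the gap conditions")
        strings.append(tuple(bits))
    return strings

# Toy example (hypothetical values): L' = 4, K = 1, so Q = 1 + 4 + 6 = 11.
print(encode_index_strings((16, 4), Lp=4, K=1))  # [(1, 1, 1, 1), (0, 1, 0, 0)]
```

In the toy example the second string $0100$ is the fourth smallest string at Hamming distance at least $3$ from $1111$, which matches $d_2=4$.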
The generation of $\mathbf{a}_i$, $i\in[M]$, in the encoding procedure can be intuitively characterized as walking on a complete binary tree of $L'+1$ layers. The walk starts at layer 1, i.e., the root of the binary tree, and ends at layer $L'+1$ at one of the leaf nodes. At each step, it goes to one of its two child nodes, which represent the bits $0$ and $1$, respectively. Each string $\mathbf{a}_i$, $i\in[M]$, is represented by the path of a walk. For each path $\mathbf{a}_i=(a_{i,1},\ldots,a_{i,L'})$ and each layer $\ell\in[L']$, assign the weight $w(a_{i,\ell})=2^{L'-\ell}-N_H\big((a_{i,1},\ldots,a_{i,\ell}),\{\mathbf{a}_j\}_{j=1}^{i-1}\big)$ to node $a_{i,\ell}$ in the $\ell$-th layer, and the weight $w(\bar{a}_{i,\ell})=2^{L'-\ell}-N_H\big((a_{i,1},\ldots,a_{i,\ell-1},1-a_{i,\ell}),\{\mathbf{a}_j\}_{j=1}^{i-1}\big)$ to the brother node of node $a_{i,\ell}$, i.e., the node that shares the same parent node with $a_{i,\ell}$. From Eq. (6.5) we have that $w(a_{i,\ell})=w(a_{i,\ell+1})+w(\bar{a}_{i,\ell+1})$ for $\ell\in[L'-1]$. Moreover, we have that $0<d\le w(a_{i,\ell})$ after the $\ell$-th inner for loop in the $i$-th outer for loop. This is formalized in the following lemma, which can be used to prove that Eq. (6.7) holds and that $\{\mathbf{a}_1,\ldots,\mathbf{a}_M\}\in\mathcal{S}^H$.
Lemma 6.4.3. After the $\ell$-th, $\ell\in[L']$, inner for loop in the $i$-th, $i\in[M]$, outer for loop in the encoding procedure, we have that
$$0<d\le 2^{L'-\ell}-N_H\big((a_{i,1},\ldots,a_{i,\ell}),\{\mathbf{a}_j\}_{j=1}^{i-1}\big). \tag{6.8}$$
At the end of the $i$-th outer for loop, we have that $d=1$.
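Proof. We prove Eq. (6.8) by induction on $\ell$. For $\ell=1$, according to Lemma 6.4.1, we have $0<d=d_i\le 2^{L'}-(i-1)Q$ at the beginning of the $i$-th outer for loop. If $a_{i,1}=0$, then according to the if condition in the encoding procedure, we have that $0<d\le 2^{L'-1}-N_H\big(0,\{\mathbf{a}_j\}_{j=1}^{i-1}\big)$, which proves (6.8). Otherwise, if $a_{i,1}=1$, the if condition fails, so the updated $d$ is positive and we have
$$\begin{aligned}
0<d&=d_i-\Big(2^{L'-1}-N_H\big(0,\{\mathbf{a}_j\}_{j=1}^{i-1}\big)\Big)\\
&\le 2^{L'}-(i-1)Q-\Big(2^{L'-1}-N_H\big(0,\{\mathbf{a}_j\}_{j=1}^{i-1}\big)\Big)\\
&\overset{(a)}{=}2^{L'-1}-N_H\big(1,\{\mathbf{a}_j\}_{j=1}^{i-1}\big),
\end{aligned}$$
where $(a)$ holds since, by definition of $N_H(\mathbf{a},A)$, we have that
$$N_H\big(0,\{\mathbf{a}_j\}_{j=1}^{i-1}\big)+N_H\big(1,\{\mathbf{a}_j\}_{j=1}^{i-1}\big)=\sum_{j=1}^{i-1}\big|\{\mathbf{c}:d_H(\mathbf{c},\mathbf{a}_j)\le 2K\}\big|=\sum_{j=1}^{i-1}Q=(i-1)Q.$$
Hence the claim holds for $\ell=1$. Suppose Eq. (6.8) holds for $\ell=k$, and let $\tilde{d}$ denote the value of $d$ after the $k$-th inner for loop. For $\ell=k+1$, if $a_{i,k+1}=0$, then from the if condition in the encoding procedure, we have $0<d\le 2^{L'-k-1}-N_H\big((a_{i,1},\ldots,a_{i,k},0),\{\mathbf{a}_j\}_{j=1}^{i-1}\big)$. Otherwise, if $a_{i,k+1}=1$, we have that
$$\begin{aligned}
0<d&=\tilde{d}-\Big(2^{L'-k-1}-N_H\big((a_{i,1},\ldots,a_{i,k},0),\{\mathbf{a}_j\}_{j=1}^{i-1}\big)\Big)\\
&\le 2^{L'-k}-N_H\big((a_{i,1},\ldots,a_{i,k}),\{\mathbf{a}_j\}_{j=1}^{i-1}\big)-\Big(2^{L'-k-1}-N_H\big((a_{i,1},\ldots,a_{i,k},0),\{\mathbf{a}_j\}_{j=1}^{i-1}\big)\Big)\\
&\overset{(a)}{=}2^{L'-k-1}-N_H\big((a_{i,1},\ldots,a_{i,k},1),\{\mathbf{a}_j\}_{j=1}^{i-1}\big),
\end{aligned}$$
where the inequality is the induction hypothesis applied to $\tilde{d}$ and $(a)$ follows from Eq. (6.5). Therefore, Eq. (6.8) holds for $\ell=k+1$ and thus holds for $\ell\in[L']$. Hence at the end of the $i$-th outer for loop we have that
$$0<d\le 2^{L'-L'}-N_H\big(\mathbf{a}_i,\{\mathbf{a}_j\}_{j=1}^{i-1}\big)\le 1. \tag{6.9}$$
Hence $d$ equals $1$ at the end of the $i$-th outer for loop. $\square$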
We now show that the strings $\{\mathbf{a}_1,\ldots,\mathbf{a}_M\}$ generated in the encoding procedure belong to $\mathcal{S}^H$. By Lemma 6.4.3, we have
$$d=2^{L'-L'}-N_H\big(\mathbf{a}_i,\{\mathbf{a}_j\}_{j=1}^{i-1}\big)=1$$
at the end of each outer for loop in the encoding procedure. This implies that $N_H\big(\mathbf{a}_i,\{\mathbf{a}_j\}_{j=1}^{i-1}\big)=0$ and thus $d_H(\mathbf{a}_i,\mathbf{a}_j)\ge 2K+1$ for $i\in[2,M]$ and $j\in[i-1]$. Moreover, since $d_1=2^{L'}$, we have that $\mathbf{a}_1=1^{L'}$. Therefore, $\{\mathbf{a}_i\}_{i=1}^{M}\in\mathcal{S}^H$. Next, we use Lemma 6.4.3 to show that the strings $\{\mathbf{a}_i\}_{i=1}^{M}$ satisfy Eq. (6.7).
Lemma 6.4.4. The output $\{\mathbf{a}_i\}_{i=1}^{M}$ of the encoding algorithm satisfies Eq. (6.7).
Proof. Note that in each inner for loop, the number $d$ is decreased by $2^{L'-\ell}-N_H\big((a_{i,1},\ldots,a_{i,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i-1}\big)$ only when $a_{i,\ell}=1$, $\ell\in[L']$. Since the number $d$ equals $d_i$ at the beginning of each outer for loop and, by Lemma 6.4.3, equals $1$ at the end of each outer for loop, we have that
$$d_i-\sum_{\ell:\,a_{i,\ell}=1\text{ and }\ell\in[L']}\Big(2^{L'-\ell}-N_H\big((a_{i,1},\ldots,a_{i,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i-1}\big)\Big)=1.$$
Since $\sum_{\ell:\,a_{i,\ell}=1}2^{L'-\ell}=\mathrm{decimal}(\mathbf{a}_i)$, this implies (6.7). $\square$
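Remark 6.4.1. By definition of $N_H(\mathbf{a},A)$, we have the following alternative characterization of $\mathrm{decimal}(\mathbf{a}_i)$, $i\in[M]$:
$$\mathrm{decimal}(\mathbf{a}_i)=d_i-1+\sum_{j=1}^{i-1}\big|\{\mathbf{c}:\mathrm{decimal}(\mathbf{c})<\mathrm{decimal}(\mathbf{a}_i)\text{ and }d_H(\mathbf{c},\mathbf{a}_j)\le 2K\}\big|, \tag{6.10}$$
which is $d_i-1$ plus the sum, over $j<i$, of the number of strings that are lexicographically less than $\mathbf{a}_i$ and have Hamming distance at most $2K$ from $\mathbf{a}_j$.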
Lemma 6.4.4 immediately implies a decoding algorithm that transforms $\{\mathbf{a}_i\}_{i=1}^{M}$ back to $(d_1,\ldots,d_M)$.
Decoding:
(1) Order the strings $\{\mathbf{a}_i\}_{i=1}^{M}$ such that $\mathbf{a}_1>\mathbf{a}_2>\ldots>\mathbf{a}_M$.
(2) For $i\in[M]$,
$$d_i=\mathrm{decimal}(\mathbf{a}_i)+1-\sum_{\ell:\,a_{i,\ell}=1\text{ and }\ell\in[L']}N_H\big((a_{i,1},\ldots,a_{i,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i-1}\big). \tag{6.11}$$
Eq. (6.11) is Eq. (6.7) solved for $d_i$. To show that the decoding is correct, we prove that the strings $\mathbf{a}_i$, $i\in[M]$, generated in the encoding procedure satisfy
$$\mathbf{a}_1>\mathbf{a}_2>\ldots>\mathbf{a}_M. \tag{6.12}$$
Then we conclude that the strings $\mathbf{a}_i$ obtained by ordering $\{\mathbf{a}_i\}_{i=1}^{M}$ in Step (1) of the decoding procedure satisfy Eq. (6.7). Hence we have Eq. (6.11) and thus $d_i$, $i\in[M]$, can be recovered. Suppose on the contrary that there exist $\mathbf{a}_{i_1}>\mathbf{a}_{i_2}$ for some $i_1>i_2$.
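Let $\ell^*$ be the most significant bit where $\mathbf{a}_{i_1}$ and $\mathbf{a}_{i_2}$ differ, i.e., $(a_{i_1,1},\ldots,a_{i_1,\ell^*-1})=(a_{i_2,1},\ldots,a_{i_2,\ell^*-1})$, $a_{i_1,\ell^*}=1$, and $a_{i_2,\ell^*}=0$. Then according to the if statement in the encoding procedure, we have that
$$d_{i_1}-\sum_{\ell:\,a_{i_1,\ell}=1\text{ and }\ell\in[\ell^*]}\Big(2^{L'-\ell}-N_H\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_1-1}\big)\Big)>0$$
and
$$d_{i_2}-\sum_{\ell:\,a_{i_1,\ell}=1\text{ and }\ell\in[\ell^*]}\Big(2^{L'-\ell}-N_H\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_2-1}\big)\Big)\le 0,$$
which implies that
$$\begin{aligned}
d_{i_2}-d_{i_1}&<\sum_{\ell:\,a_{i_1,\ell}=1\text{ and }\ell\in[\ell^*]}\Big(2^{L'-\ell}-N_H\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_2-1}\big)\Big)\\
&\quad-\sum_{\ell:\,a_{i_1,\ell}=1\text{ and }\ell\in[\ell^*]}\Big(2^{L'-\ell}-N_H\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_1-1}\big)\Big)\\
&=\sum_{\ell:\,a_{i_1,\ell}=1\text{ and }\ell\in[\ell^*]}\Big(N_H\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_1-1}\big)-N_H\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=1}^{i_2-1}\big)\Big)\\
&=\sum_{\ell:\,a_{i_1,\ell}=1\text{ and }\ell\in[\ell^*]}N_H\big((a_{i_1,1},\ldots,a_{i_1,\ell-1},0),\{\mathbf{a}_j\}_{j=i_2}^{i_1-1}\big)\\
&\overset{(a)}{\le}\sum_{j=i_2}^{i_1-1}\big|\{\mathbf{c}:d_H(\mathbf{c},\mathbf{a}_j)\le 2K\}\big|\\
&=(i_1-i_2)Q,
\end{aligned} \tag{6.13}$$
where $(a)$ follows from the definition of $N_H(\mathbf{a},A)$ and the fact that the strings which have $(a_{i_1,1},\ldots,a_{i_1,\ell_1-1},0)$ and $(a_{i_1,1},\ldots,a_{i_1,\ell_2-1},0)$ as prefixes, respectively, where $a_{i_1,\ell_1}=1$, $a_{i_1,\ell_2}=1$ and $\ell_1\ne\ell_2$, are different. Eq. (6.13) contradicts the fact that the integers $(d_1,\ldots,d_M)=F^H_P(d)$ satisfy $d_i-d_{i+1}\ge Q$ for $i\in[M-1]$, which implies $d_{i_2}-d_{i_1}\ge(i_1-i_2)Q$.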
Since the calculation of $N_H(\mathbf{a},A)$ has polynomial complexity, the complexity of the encoding/decoding procedure is polynomial in $M$ and $L'$.
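The decoding procedure can be sketched in the same style. This is a minimal sketch under stated assumptions: strings are bit tuples, `Lp` stands for $L'$, the helper `n_h` evaluates the prefix count via Eq. (6.6), and $d_i$ is recovered by solving Eq. (6.7) for $d_i$, so the correction term is subtracted. The function names and toy values are hypothetical.

```python
from math import comb

def n_h(prefix, A, Lp, K):
    """Prefix count of Eq. (6.6), summed over the strings c in A."""
    ell, total = len(prefix), 0
    for c in A:
        budget = 2 * K - sum(p != q for p, q in zip(prefix, c))
        if budget >= 0:
            total += sum(comb(Lp - ell, i) for i in range(budget + 1))
    return total

def decode_index_strings(strings, Lp, K):
    """Recover (d_1, ..., d_M) from the unordered set of index strings."""
    ordered = sorted(strings, reverse=True)          # Step (1): a_1 > a_2 > ... > a_M
    ds = []
    for i, a in enumerate(ordered):
        prev = ordered[:i]                           # {a_j : j < i}
        decimal = int("".join(map(str, a)), 2)
        # Sum the prefix counts over the positions where a has a 1 bit;
        # position pos (0-based) corresponds to the prefix (a_1..a_pos, 0).
        correction = sum(n_h(list(a[:pos]) + [0], prev, Lp, K)
                         for pos in range(Lp) if a[pos] == 1)
        ds.append(decimal + 1 - correction)          # Step (2)
    return ds

# Toy example (hypothetical values): L' = 4, K = 1.
print(decode_index_strings([(0, 1, 0, 0), (1, 1, 1, 1)], Lp=4, K=1))  # [16, 4]
```

Each string requires at most $L'$ evaluations of the prefix count, each of which costs $O(MK)$ binomial evaluations, which is consistent with the polynomial complexity stated above.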