
Chapter V: Coding over Sets for DNA Storage: Substitution Errors

5.5 Codes for a Single Substitution

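$$
\begin{aligned}
&\ge \log\left(\frac{\left(\frac{\sqrt{\epsilon}}{2}\, ML\right)^{k}}{k^{k}}\right) - k\log(k) - O(1)\\
&= k\left(\log\left(\frac{\sqrt{\epsilon}}{2}\right) + \log(ML)\right) - 2k\log(k) - O(1)\\
&\ge k\log(ML) - 2k\log(k) - O(k). \qquad \square
\end{aligned}
$$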

[Figure 5.1 shows the same codeword $\{\mathbf{x}_1, \dots, \mathbf{x}_M\}$ in three views: ordered by $\mathbf{a}_i$, ordered by $\mathbf{b}_i$, and ordered by $\mathbf{c}_i$. Each view marks the positions $L/3$, $2L/3$, and $L$. When ordered by $\mathbf{a}_i$, the column at $L/3$ reads as the string $\mathbf{z}_1$ and the columns at $2L/3$ and $L$ hold scrambled bits; when ordered by $\mathbf{b}_i$, the column at $2L/3$ reads as $\mathbf{z}_2$; when ordered by $\mathbf{c}_i$, the column at $L$ reads as $\mathbf{z}_3$.]

Figure 5.1: Illustration of single-substitution correcting codes over unordered sets.

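$$
\begin{aligned}
F(d_3) &= (\{\mathbf{b}_1, \dots, \mathbf{b}_M\}, \sigma),\\
F(d_5) &= (\{\mathbf{c}_1, \dots, \mathbf{c}_M\}, \pi),
\end{aligned}
\tag{5.6}
$$

where $\mathbf{a}_i, \mathbf{b}_i, \mathbf{c}_i \in \{0,1\}^{L/3-1}$ for every $i \in [M]$, the permutations $\sigma$ and $\pi$ are in $S_M$, and the indexing of $\{\mathbf{a}_i\}_{i=1}^{M}$, $\{\mathbf{b}_i\}_{i=1}^{M}$, and $\{\mathbf{c}_i\}_{i=1}^{M}$ is lexicographic. Further, let $\mathbf{d}_2$, $\mathbf{d}_4$, and $\mathbf{d}_6$ be the binary strings that correspond to $d_2$, $d_4$, and $d_6$, respectively, and let

$$
\begin{aligned}
\mathbf{s}_1 &= (\mathbf{a}_1, \dots, \mathbf{a}_M,\ \mathbf{b}_{\sigma(1)}, \dots, \mathbf{b}_{\sigma(M)},\ \mathbf{c}_{\pi(1)}, \dots, \mathbf{c}_{\pi(M)}),\\
\mathbf{s}_2 &= (\mathbf{a}_{\sigma^{-1}(1)}, \dots, \mathbf{a}_{\sigma^{-1}(M)},\ \mathbf{b}_1, \dots, \mathbf{b}_M,\ \mathbf{c}_{\sigma^{-1}\pi(1)}, \dots, \mathbf{c}_{\sigma^{-1}\pi(M)}), \text{ and}\\
\mathbf{s}_3 &= (\mathbf{a}_{\pi^{-1}(1)}, \dots, \mathbf{a}_{\pi^{-1}(M)},\ \mathbf{b}_{\pi^{-1}\sigma(1)}, \dots, \mathbf{b}_{\pi^{-1}\sigma(M)},\ \mathbf{c}_1, \dots, \mathbf{c}_M).
\end{aligned}
\tag{5.7}
$$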

Without loss of generality⁴, assume that there exists an integer $t$ for which the length $|\mathbf{s}_i| = (L-3)M = 2^t - t - 1$ for all $i \in [3]$. Then each $\mathbf{s}_i$ can be encoded by a systematic $[2^t - 1, 2^t - t - 1]_2$ Hamming code, which introduces $t$ redundant bits. That is, the encoding function is of the form $\mathbf{s}_i \mapsto (\mathbf{s}_i, E_H(\mathbf{s}_i))$, where $E_H(\mathbf{s}_i)$ are the $t$ redundant bits, and $t \le \lceil \log(ML) \rceil$. Similarly, we assume that there exists an integer $h$ for which the length $|\mathbf{d}_i| = 2^h - h - 1$ for $i \in \{2,4,6\}$, and let $E_H(\mathbf{d}_i)$ be the corresponding $h$ bits of redundancy that result from encoding $\mathbf{d}_i$ with a $[2^h - 1, 2^h - h - 1]$ Hamming code. By the properties of a Hamming code, and by the definition of $h$, we have that $h \le \lceil \log(M) \rceil$.
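As a rough illustration of the systematic map $\mathbf{s} \mapsto (\mathbf{s}, E_H(\mathbf{s}))$, the following Python sketch computes the $t$ redundant bits of a Hamming code and corrects a single substitution in the pair. The function names and the particular assignment of parity-check labels to positions are assumptions made for this example; the text only requires some systematic $[2^t - 1, 2^t - t - 1]_2$ Hamming code.

```python
from typing import List, Tuple


def _labels(t: int) -> Tuple[List[int], List[int]]:
    """Assign a distinct nonzero t-bit label to each of the 2^t - 1 positions.

    Hypothetical convention for this sketch: data positions get the labels
    that are not powers of two, parity positions get the powers of two.
    """
    parity = [1 << j for j in range(t)]
    data = [v for v in range(1, 1 << t) if v & (v - 1)]
    return data, parity


def hamming_redundancy(s: List[int], t: int) -> List[int]:
    """Return the t redundant bits E_H(s) for a string s of length 2^t - t - 1."""
    data, _ = _labels(t)
    assert len(s) == len(data), "pad s with zeros to length 2^t - t - 1 first"
    acc = 0
    for bit, label in zip(s, data):
        if bit:
            acc ^= label
    # Parity bit j carries the label 2^j, so it must equal bit j of acc
    # for the labels of all 1-positions to XOR to zero.
    return [(acc >> j) & 1 for j in range(t)]


def hamming_correct(s: List[int], red: List[int], t: int) -> Tuple[List[int], List[int]]:
    """Correct at most one substitution anywhere in the pair (s, E_H(s))."""
    data, parity = _labels(t)
    word = list(s) + list(red)
    labels = data + parity
    syndrome = 0
    for bit, label in zip(word, labels):
        if bit:
            syndrome ^= label
    if syndrome:
        # A nonzero syndrome equals the label of the flipped position.
        word[labels.index(syndrome)] ^= 1
    return word[:len(data)], word[len(data):]
```

With $t = 3$ this is the familiar $[7,4]$ code: a 4-bit string receives 3 redundant bits, and flipping any one of the 7 bits is undone by `hamming_correct`.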

The data $d \in D$ is mapped to a codeword $C = \{\mathbf{x}_1, \dots, \mathbf{x}_M\}$ as follows; the reader is encouraged to refer to Figure 5.1 for clarification. First, we place $\{\mathbf{a}_i\}_{i=1}^{M}$, $\{\mathbf{b}_i\}_{i=1}^{M}$, and $\{\mathbf{c}_i\}_{i=1}^{M}$ in the different thirds of the $\mathbf{x}_i$'s, ordered according to $\sigma$ and $\pi$. That is, denoting $\mathbf{x}_i = (x_{i,1}, \dots, x_{i,L})$, we define

$$
\begin{aligned}
(x_{i,1}, \dots, x_{i,L/3-1}) &= \mathbf{a}_i,\\
(x_{i,L/3+1}, \dots, x_{i,2L/3-1}) &= \mathbf{b}_{\sigma(i)}, \text{ and}\\
(x_{i,2L/3+1}, \dots, x_{i,L-1}) &= \mathbf{c}_{\pi(i)}.
\end{aligned}
\tag{5.8}
$$

The remaining bits $\{x_{i,L/3}\}_{i=1}^{M}$, $\{x_{i,2L/3}\}_{i=1}^{M}$, and $\{x_{i,L}\}_{i=1}^{M}$ are used to accommodate the information bits of $\mathbf{d}_2$, $\mathbf{d}_4$, $\mathbf{d}_6$, and the redundancy bits $\{E_H(\mathbf{s}_i)\}_{i=1}^{3}$ and $\{E_H(\mathbf{d}_i)\}_{i \in \{2,4,6\}}$, in the following manner.

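$$
x_{i,L/3} =
\begin{cases}
d_{2,i}, & \text{if } i \in [M - \lceil \log ML \rceil - \lceil \log M \rceil],\\
E_H(\mathbf{d}_2)_{i-(M-\lceil \log ML \rceil-\lceil \log M \rceil)}, & \text{if } i \in [M - \lceil \log ML \rceil - \lceil \log M \rceil + 1,\ M - \lceil \log ML \rceil],\\
E_H(\mathbf{s}_1)_{i-(M-\lceil \log ML \rceil)}, & \text{if } i \in [M - \lceil \log ML \rceil + 1,\ M],
\end{cases}
$$

$$
x_{i,2L/3} =
\begin{cases}
d_{4,\sigma^{-1}(i)}, & \text{if } \sigma^{-1}(i) \in [M - \lceil \log ML \rceil - \lceil \log M \rceil],\\
E_H(\mathbf{d}_4)_{\sigma^{-1}(i)-(M-\lceil \log ML \rceil-\lceil \log M \rceil)}, & \text{if } \sigma^{-1}(i) \in [M - \lceil \log ML \rceil - \lceil \log M \rceil + 1,\ M - \lceil \log ML \rceil],\\
E_H(\mathbf{s}_2)_{\sigma^{-1}(i)-(M-\lceil \log ML \rceil)}, & \text{if } \sigma^{-1}(i) \in [M - \lceil \log ML \rceil + 1,\ M],
\end{cases}
$$

$$
x_{i,L} =
\begin{cases}
d_{6,\pi^{-1}(i)}, & \text{if } \pi^{-1}(i) \in [M - \lceil \log ML \rceil - \lceil \log M \rceil],\\
E_H(\mathbf{d}_6)_{\pi^{-1}(i)-(M-\lceil \log ML \rceil-\lceil \log M \rceil)}, & \text{if } \pi^{-1}(i) \in [M - \lceil \log ML \rceil - \lceil \log M \rceil + 1,\ M - \lceil \log ML \rceil],\\
E_H(\mathbf{s}_3)_{\pi^{-1}(i)-(M-\lceil \log ML \rceil)}, & \text{if } \pi^{-1}(i) \in [M - \lceil \log ML \rceil + 1,\ M].
\end{cases}
\tag{5.9}
$$

⁴ Every string can be padded with zeros to extend its length to $2^t - t - 1$ for some $t$. It is readily verified that this operation extends the string by at most a factor of two, and by the properties of the Hamming code, this will increase the number of redundant bits by at most 1.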

That is, if the strings $\{\mathbf{x}_i\}_{i=1}^{M}$ are sorted according to the content of the bits $(x_{i,1}, \dots, x_{i,L/3-1}) = \mathbf{a}_i$, then the top $M - \lceil \log ML \rceil - \lceil \log M \rceil$ bits of the $(L/3)$-th column⁵ contain $\mathbf{d}_2$, the middle $\lceil \log M \rceil$ bits contain $E_H(\mathbf{d}_2)$, and the bottom $\lceil \log ML \rceil$ bits contain $E_H(\mathbf{s}_1)$. Similarly, if the strings are sorted according to $(x_{i,L/3+1}, \dots, x_{i,2L/3-1}) = \mathbf{b}_i$, then the top $M - \lceil \log ML \rceil - \lceil \log M \rceil$ bits of the $(2L/3)$-th column contain $\mathbf{d}_4$, the middle $\lceil \log M \rceil$ bits contain $E_H(\mathbf{d}_4)$, and the bottom $\lceil \log ML \rceil$ bits contain $E_H(\mathbf{s}_2)$, and so on. Equations (5.8) and (5.9) conclude the encoding function $E$ of Theorem 5.5.1. It can be readily verified that $E$ is injective, since different messages result in either different $(\{\mathbf{a}_i\}_{i=1}^{M}, \{\mathbf{b}_i\}_{i=1}^{M}, \{\mathbf{c}_i\}_{i=1}^{M})$ or the same $(\{\mathbf{a}_i\}_{i=1}^{M}, \{\mathbf{b}_i\}_{i=1}^{M}, \{\mathbf{c}_i\}_{i=1}^{M})$ with different $(\mathbf{d}_2, \mathbf{d}_4, \mathbf{d}_6)$. In either case, the resulting codewords $\{\mathbf{x}_i\}_{i=1}^{M}$ of the two messages are different.
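The layout described by (5.8) and the sorted-view reading of the three extra columns can be sketched in Python as follows. This is a minimal sketch under stated assumptions: indices are 0-based, strings are Python `str` objects over `'0'`/`'1'`, $L$ is divisible by 3, and the names `assemble`, `read_column`, `min_distance`, and `col_a`/`col_b`/`col_c` are hypothetical. Each `col_*` list stands for the concatenation of the information bits and the two redundancy blocks, listed top to bottom in the order in which the corresponding third is sorted; the sketch follows that sorted-view description rather than a particular index convention.

```python
from itertools import combinations
from typing import List


def assemble(a: List[str], b: List[str], c: List[str],
             sigma: List[int], pi: List[int],
             col_a: List[int], col_b: List[int], col_c: List[int]) -> List[str]:
    """Build x_1..x_M: three data thirds, each followed by one extra bit.

    a, b, c are lexicographically sorted lists of distinct strings of
    length L/3 - 1; sigma and pi are 0-based permutations of range(M).
    """
    rows = []
    for i in range(len(a)):
        j, k = sigma[i], pi[i]
        # The extra bit travels with its data third, so that sorting by
        # that third restores the intended top-to-bottom column content.
        rows.append(a[i] + str(col_a[i]) + b[j] + str(col_b[j]) + c[k] + str(col_c[k]))
    return rows


def read_column(rows: List[str], part: int) -> List[int]:
    """Sort rows by data third `part` (0, 1, 2) and read its extra bit column."""
    w = len(rows[0]) // 3
    ordered = sorted(rows, key=lambda r: r[part * w:(part + 1) * w - 1])
    return [int(r[(part + 1) * w - 1]) for r in ordered]


def min_distance(rows: List[str]) -> int:
    """Smallest Hamming distance between two distinct rows."""
    return min(sum(p != q for p, q in zip(u, v)) for u, v in combinations(rows, 2))
```

Because the three data thirds of two different strings each differ in at least one position, `min_distance` on the output of `assemble` is at least 3 whenever `a`, `b`, and `c` each contain distinct strings.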

To verify that the image of $E$ is a 1-substitution code, observe first that since $\{\mathbf{a}_i\}_{i=1}^{M}$, $\{\mathbf{b}_i\}_{i=1}^{M}$, and $\{\mathbf{c}_i\}_{i=1}^{M}$ are sets, any two strings in the same set are distinct. Hence, according to (5.8), $d_H(\mathbf{x}_i, \mathbf{x}_j) \ge 3$ for every distinct $i$ and $j$ in $[M]$. Therefore, no 1-substitution error can cause one $\mathbf{x}_i$ to be equal to another, and consequently, the result of a 1-substitution error is always in $\binom{\{0,1\}^L}{M}$.

⁵ Sorting the strings $\{\mathbf{x}_i\}_{i=1}^{M}$ by any ordering method yields a matrix in a natural way, and one can consider the columns of this matrix.

In what follows, a decoding algorithm is presented whose input is a codeword that was distorted by at most a single substitution and whose output is $d$. The algorithm is summarized in Algorithm 3.

Algorithm 3: Decoding

Input: A word $C' \in \mathcal{B}_1(C)$ for some codeword $C$.

Output: The message $d$ encoded as $C$.

Sort and index the strings in $C' = \{\mathbf{x}'_1, \dots, \mathbf{x}'_M\}$ lexicographically;

Compute the strings $\hat{\mathbf{a}}_i$, $\hat{\mathbf{b}}_i$, and $\hat{\mathbf{c}}_i$ for $i \in [M]$, according to (5.10);

Compute the strings $\mathbf{s}'_1$, $\mathbf{s}'_2$, and $\mathbf{s}'_3$ according to (5.12);

Compute the strings $E_H(\mathbf{s}_1)'$, $E_H(\mathbf{s}_2)'$, and $E_H(\mathbf{s}_3)'$ according to (5.11);

Use a Hamming decoder to decode $(\mathbf{s}'_i, E_H(\mathbf{s}_i)')$ and obtain $\mathbf{s}_i$ for $i \in [3]$;

According to Lemma 5.5.1, apply a majority vote on the recovered $\{\mathbf{s}_i\}_{i=1}^{3}$ to obtain the correct strings $\{\mathbf{a}_i\}_{i=1}^{M}$, $\{\mathbf{b}_i\}_{i=1}^{M}$, and $\{\mathbf{c}_i\}_{i=1}^{M}$, and the permutations $\sigma$ and $\pi$. Then determine $d_1$, $d_3$, $d_5$ using the combinatorial map (5.6);

Compute $(\mathbf{d}'_i, E_H(\mathbf{d}_i)')$ for $i \in \{2,4,6\}$ according to (5.11), and use a Hamming decoder to decode $(\mathbf{d}'_i, E_H(\mathbf{d}_i)')$ and obtain $\mathbf{d}_i$ for $i \in \{2,4,6\}$;

Output $d = (d_1, d_2, d_3, d_4, d_5, d_6)$.

Upon receiving a word $C' = \{\mathbf{x}'_1, \dots, \mathbf{x}'_M\} \in \mathcal{B}_1(C)$ for some codeword $C$ (once again, the indexing of the elements of $C'$ is lexicographic), we define

$$
\begin{aligned}
\hat{\mathbf{a}}_i &= (x'_{i,1}, \dots, x'_{i,L/3-1}),\\
\hat{\mathbf{b}}_i &= (x'_{\tau^{-1}(i),L/3+1}, \dots, x'_{\tau^{-1}(i),2L/3-1}),\\
\hat{\mathbf{c}}_i &= (x'_{\rho^{-1}(i),2L/3+1}, \dots, x'_{\rho^{-1}(i),L-1}),
\end{aligned}
\tag{5.10}
$$

where $\tau$ is the permutation by which $\{\mathbf{x}'_i\}_{i=1}^{M}$ are sorted according to their $L/3+1, \dots, 2L/3-1$ entries, and $\rho$ is the permutation by which they are sorted according to their $2L/3+1, \dots, L-1$ entries (we emphasize that $\tau$ and $\rho$ are unrelated to the original $\pi$ and $\sigma$, which will be decoded later). Further, when ordering $\{\mathbf{x}'_i\}_{i=1}^{M}$ by either the lexicographic ordering, by $\tau$, or by $\rho$, we obtain candidates for each one of $\mathbf{d}_2$, $\mathbf{d}_4$, $\mathbf{d}_6$, $E_H(\mathbf{d}_2)$, $E_H(\mathbf{d}_4)$, $E_H(\mathbf{d}_6)$, $E_H(\mathbf{s}_1)$, $E_H(\mathbf{s}_2)$, and $E_H(\mathbf{s}_3)$, which we similarly denote with an additional apostrophe⁶, as follows for $i \in [M]$.

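$$
\begin{aligned}
d'_{2,i} &= x'_{i,L/3}, && \text{for } i \in [M - \lceil \log ML \rceil - \lceil \log M \rceil],\\
E_H(\mathbf{d}_2)'_i &= x'_{i+(M-\lceil \log ML \rceil-\lceil \log M \rceil),L/3}, && \text{for } i \in [\lceil \log M \rceil],\\
E_H(\mathbf{s}_1)'_i &= x'_{i+(M-\lceil \log ML \rceil),L/3}, && \text{for } i \in [\lceil \log ML \rceil],\\
d'_{4,i} &= x'_{\tau(i),2L/3}, && \text{for } i \in [M - \lceil \log ML \rceil - \lceil \log M \rceil],\\
E_H(\mathbf{d}_4)'_i &= x'_{\tau(i+(M-\lceil \log ML \rceil-\lceil \log M \rceil)),2L/3}, && \text{for } i \in [\lceil \log M \rceil],\\
E_H(\mathbf{s}_2)'_i &= x'_{\tau(i+(M-\lceil \log ML \rceil)),2L/3}, && \text{for } i \in [\lceil \log ML \rceil],\\
d'_{6,i} &= x'_{\rho(i),L}, && \text{for } i \in [M - \lceil \log ML \rceil - \lceil \log M \rceil],\\
E_H(\mathbf{d}_6)'_i &= x'_{\rho(i+(M-\lceil \log ML \rceil-\lceil \log M \rceil)),L}, && \text{for } i \in [\lceil \log M \rceil],\\
E_H(\mathbf{s}_3)'_i &= x'_{\rho(i+(M-\lceil \log ML \rceil)),L}, && \text{for } i \in [\lceil \log ML \rceil].
\end{aligned}
\tag{5.11}
$$

⁶ That is, each one of $\mathbf{d}'_2$, $\mathbf{d}'_4$, etc., is obtained from $\mathbf{d}_2$, $\mathbf{d}_4$, etc., by at most a single substitution.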

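For example, if we order $\{\mathbf{x}'_i\}_{i=1}^{M}$ according to $\tau$, then the bottom $\lceil \log(ML) \rceil$ bits of the $(2L/3)$-th column are $E_H(\mathbf{s}_2)'$, the middle $\lceil \log M \rceil$ bits are $E_H(\mathbf{d}_4)'$, and the top $M - \lceil \log ML \rceil - \lceil \log M \rceil$ bits are $\mathbf{d}'_4$ (see Eq. (5.9)). Now, let

$$
\begin{aligned}
\mathbf{s}'_1 &= (\hat{\mathbf{a}}_1, \dots, \hat{\mathbf{a}}_M,\ \hat{\mathbf{b}}_{\tau(1)}, \dots, \hat{\mathbf{b}}_{\tau(M)},\ \hat{\mathbf{c}}_{\rho(1)}, \dots, \hat{\mathbf{c}}_{\rho(M)}),\\
\mathbf{s}'_2 &= (\hat{\mathbf{a}}_{\tau^{-1}(1)}, \dots, \hat{\mathbf{a}}_{\tau^{-1}(M)},\ \hat{\mathbf{b}}_1, \dots, \hat{\mathbf{b}}_M,\ \hat{\mathbf{c}}_{\tau^{-1}\rho(1)}, \dots, \hat{\mathbf{c}}_{\tau^{-1}\rho(M)}), \text{ and}\\
\mathbf{s}'_3 &= (\hat{\mathbf{a}}_{\rho^{-1}(1)}, \dots, \hat{\mathbf{a}}_{\rho^{-1}(M)},\ \hat{\mathbf{b}}_{\rho^{-1}\tau(1)}, \dots, \hat{\mathbf{b}}_{\rho^{-1}\tau(M)},\ \hat{\mathbf{c}}_1, \dots, \hat{\mathbf{c}}_M).
\end{aligned}
\tag{5.12}
$$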

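The following lemma shows that at least two of the above $\mathbf{s}'_i$, together with their received redundancies $E_H(\mathbf{s}_i)'$, are close in Hamming distance to their encoded counterparts $(\mathbf{s}_i, E_H(\mathbf{s}_i))$.

Lemma 5.5.1. There exist distinct integers $k, \ell \in [3]$ such that $d_H((\mathbf{s}'_k, E_H(\mathbf{s}_k)'), (\mathbf{s}_k, E_H(\mathbf{s}_k))) \le 1$ and $d_H((\mathbf{s}'_\ell, E_H(\mathbf{s}_\ell)'), (\mathbf{s}_\ell, E_H(\mathbf{s}_\ell))) \le 1$.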

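Proof. If the substitution did not occur in any of the index sets $\{1, \dots, L/3-1\}$, $\{L/3+1, \dots, 2L/3-1\}$, or $\{2L/3+1, \dots, L-1\}$ (which correspond to the values of the strings $\{\mathbf{a}_i\}_{i=1}^{M}$, $\{\mathbf{b}_i\}_{i=1}^{M}$, and $\{\mathbf{c}_i\}_{i=1}^{M}$, respectively), then the orders among the strings $\{\mathbf{a}_i\}_{i=1}^{M}$, $\{\mathbf{b}_i\}_{i=1}^{M}$, and $\{\mathbf{c}_i\}_{i=1}^{M}$ are maintained, respectively. That is, we have that $\tau = \sigma$ and $\rho = \pi$. This implies that

$$
\begin{aligned}
\mathbf{s}'_1 &= (\mathbf{a}_1, \dots, \mathbf{a}_M,\ \mathbf{b}_{\sigma(1)}, \dots, \mathbf{b}_{\sigma(M)},\ \mathbf{c}_{\pi(1)}, \dots, \mathbf{c}_{\pi(M)}),\\
\mathbf{s}'_2 &= (\mathbf{a}_{\sigma^{-1}(1)}, \dots, \mathbf{a}_{\sigma^{-1}(M)},\ \mathbf{b}_1, \dots, \mathbf{b}_M,\ \mathbf{c}_{\sigma^{-1}\pi(1)}, \dots, \mathbf{c}_{\sigma^{-1}\pi(M)}),\\
\mathbf{s}'_3 &= (\mathbf{a}_{\pi^{-1}(1)}, \dots, \mathbf{a}_{\pi^{-1}(M)},\ \mathbf{b}_{\pi^{-1}\sigma(1)}, \dots, \mathbf{b}_{\pi^{-1}\sigma(M)},\ \mathbf{c}_1, \dots, \mathbf{c}_M),
\end{aligned}
$$

and that $d_H(E_H(\mathbf{s}_i), E_H(\mathbf{s}_i)') \le 1$ for $i \in [3]$, according to (5.9) and (5.11). In this case, the claim is clear. It remains to show the other cases, and due to symmetry, assume without loss of generality that the substitution occurred in one of the $\mathbf{a}_i$'s, i.e., in an entry which is indexed by an integer in $[L/3-1]$.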

Let $A \in \{0,1\}^{M \times L}$ be a matrix whose rows are the $\mathbf{x}_i$'s, in any order. Let $A_{\text{left}}$ be the result of ordering the rows of $A$ according to the lexicographic order of their $1, \dots, L/3-1$ entries. Similarly, let $A_{\text{mid}}$ and $A_{\text{right}}$ be the results of ordering the rows of $A$ by their $L/3+1, \dots, 2L/3-1$ and $2L/3+1, \dots, L-1$ entries, respectively, and let $A'_{\text{left}}$, $A'_{\text{mid}}$, and $A'_{\text{right}}$ be defined analogously with $\{\mathbf{x}'_i\}_{i=1}^{M}$ instead of $\{\mathbf{x}_i\}_{i=1}^{M}$.

It is readily verified that there exist permutation matrices $P_1$ and $P_2$ such that $A_{\text{mid}} = P_1 A_{\text{left}}$ and $A_{\text{right}} = P_2 A_{\text{left}}$. Moreover, since $\{\mathbf{b}_i\}_{i=1}^{M} = \{\hat{\mathbf{b}}_i\}_{i=1}^{M}$ and $\{\mathbf{c}_i\}_{i=1}^{M} = \{\hat{\mathbf{c}}_i\}_{i=1}^{M}$, it follows that $A'_{\text{mid}} = P_1(A_{\text{left}} + R)$ and $A'_{\text{right}} = P_2(A_{\text{left}} + R)$, where $R \in \{0,1\}^{M \times L}$ is a matrix of Hamming weight 1; this clearly implies that $A'_{\text{mid}} = A_{\text{mid}} + P_1 R$ and that $A'_{\text{right}} = A_{\text{right}} + P_2 R$. Now, notice that $\mathbf{s}_2$ results from vectorizing some submatrix $M_2$ of $A_{\text{mid}}$, and $\mathbf{s}'_2$ results from vectorizing some submatrix $M'_2$ of $A'_{\text{mid}}$. Moreover, the matrices $M_2$ and $M'_2$ are taken from their mother matrices by omitting the same rows and columns, and both vectorizing operations consider the entries of $M_2$ and $M'_2$ in the same order. In addition, no substitution occurs in the $L/3, \dots, L$ entries of the $\mathbf{x}_i$'s, which implies that $x'_{\tau(i),2L/3} = x_{\sigma(i),2L/3}$. Then, the redundancies $E_H(\mathbf{s}_2)' = E_H(\mathbf{s}_2)$ and $E_H(\mathbf{s}_3)' = E_H(\mathbf{s}_3)$ can be identified from (5.11).

Therefore, it follows from $A'_{\text{mid}} = A_{\text{mid}} + P_1 R$ that $d_H((\mathbf{s}'_2, E_H(\mathbf{s}_2)'), (\mathbf{s}_2, E_H(\mathbf{s}_2))) \le 1$. The claim for $\mathbf{s}_3$ is similar. □
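The data part of Lemma 5.5.1 can be checked exhaustively on a toy codeword. The Python sketch below builds the three views $\mathbf{s}'_1, \mathbf{s}'_2, \mathbf{s}'_3$ by sorting the received strings by each data third, and counts how many of them lie within Hamming distance 1 of the error-free views. It is only an illustration under assumptions: indices are 0-based, $L$ is divisible by 3, the helper names `views`, `flip`, and `hamming` are hypothetical, and the redundancy parts $E_H(\mathbf{s}_k)'$ are not modeled.

```python
from typing import List


def hamming(u: str, v: str) -> int:
    """Number of positions in which two equal-length strings differ."""
    return sum(p != q for p, q in zip(u, v))


def views(rows: List[str]) -> List[str]:
    """Return the three strings s_1, s_2, s_3 read off a set of rows.

    View k sorts the rows by data third k, then concatenates all first
    thirds, all second thirds, and all last thirds in that row order.
    The extra bit at the end of each third is skipped.
    """
    w = len(rows[0]) // 3

    def part(r: str, k: int) -> str:
        return r[k * w:(k + 1) * w - 1]

    out = []
    for k in range(3):
        ordered = sorted(rows, key=lambda r: part(r, k))
        out.append("".join(part(r, j) for j in range(3) for r in ordered))
    return out


def flip(rows: List[str], i: int, j: int) -> List[str]:
    """Return a copy of rows with bit j of row i substituted."""
    r = rows[i]
    changed = r[:j] + ("1" if r[j] == "0" else "0") + r[j + 1:]
    return rows[:i] + [changed] + rows[i + 1:]


def close_views(rows: List[str], i: int, j: int) -> int:
    """How many views stay within distance 1 after substituting bit (i, j)."""
    clean = views(rows)
    noisy = views(flip(rows, i, j))
    return sum(hamming(n, c) <= 1 for n, c in zip(noisy, clean))
```

A substitution inside one data third can reorder the view sorted by that third, but the two views sorted by the other thirds keep their row order and therefore change in at most one position, so `close_views` returns at least 2 for every single substitution.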

By applying a Hamming decoder to each of the pairs $(\mathbf{s}'_i, E_H(\mathbf{s}_i)')$, the decoder obtains possible candidates for $\{\mathbf{a}_i\}_{i=1}^{M}$, $\{\mathbf{b}_i\}_{i=1}^{M}$, and $\{\mathbf{c}_i\}_{i=1}^{M}$, and by Lemma 5.5.1, these sets of candidates coincide in at least two cases. Therefore, the decoder can apply a majority vote among the candidates from the decoding of each $\mathbf{s}'_i$, and the winning values are $\{\mathbf{a}_i\}_{i=1}^{M}$, $\{\mathbf{b}_i\}_{i=1}^{M}$, and $\{\mathbf{c}_i\}_{i=1}^{M}$. Having these correct values, the decoder can sort $\{\mathbf{x}'_i\}_{i=1}^{M}$ according to their $\mathbf{a}_i$ columns, and deduce the values of $\sigma$ and $\pi$ by observing the resulting permutation in the $\mathbf{b}_i$ and $\mathbf{c}_i$ columns, with respect to their lexicographic ordering. This concludes the decoding of the values $d_1$, $d_3$, and $d_5$ of the data $d$.
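Deducing $\sigma$ and $\pi$ from strings whose data thirds are already correct amounts to comparing ranks, as the following Python sketch shows. It assumes 0-based indices, $L$ divisible by 3, and rows given as `'0'`/`'1'` strings with one extra bit after each data third; the name `recover_permutations` is hypothetical.

```python
from typing import List, Tuple


def recover_permutations(rows: List[str]) -> Tuple[List[int], List[int]]:
    """Recover sigma and pi from rows with error-free data thirds.

    After sorting the rows by their first third, sigma[i] is the
    lexicographic rank of the second third of row i, and pi[i] is the
    lexicographic rank of its last third.
    """
    w = len(rows[0]) // 3

    def part(r: str, k: int) -> str:
        return r[k * w:(k + 1) * w - 1]

    by_a = sorted(rows, key=lambda r: part(r, 0))
    rank_b = {v: j for j, v in enumerate(sorted(part(r, 1) for r in rows))}
    rank_c = {v: j for j, v in enumerate(sorted(part(r, 2) for r in rows))}
    sigma = [rank_b[part(r, 1)] for r in by_a]
    pi = [rank_c[part(r, 2)] for r in by_a]
    return sigma, pi
```

The rank dictionaries are well defined because the strings within each third are distinct.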

We are left to extract $d_2$, $d_4$, and $d_6$. To this end, observe that since the correct values of $\{\mathbf{a}_i\}_{i=1}^{M}$, $\{\mathbf{b}_i\}_{i=1}^{M}$, and $\{\mathbf{c}_i\}_{i=1}^{M}$ are known at this point, the decoder can extract the true positions of $\mathbf{d}_2$, $\mathbf{d}_4$, and $\mathbf{d}_6$, as well as their respective redundancy bits $E_H(\mathbf{d}_2)$, $E_H(\mathbf{d}_4)$, $E_H(\mathbf{d}_6)$. Hence, we have that

$$
d_H((\mathbf{d}'_i, E_H(\mathbf{d}_i)'), (\mathbf{d}_i, E_H(\mathbf{d}_i))) \le 1
$$

for $i \in \{2,4,6\}$, and thus the decoding is completed by applying a Hamming decoder.

We now turn to compute the redundancy of the above code $\mathcal{C}$. Note that there are two sources of redundancy: the Hamming code redundancy, which is at most $3(\log ML + \log M + 2)$, and the fact that the sets $\{\mathbf{a}_i\}_{i=1}^{M}$, $\{\mathbf{b}_i\}_{i=1}^{M}$, and $\{\mathbf{c}_i\}_{i=1}^{M}$ contain distinct strings. By a straightforward computation, for $4 \le M \le 2^{L/6}$ we have

$$
\begin{aligned}
r(\mathcal{C}) &= \log\binom{2^L}{M} - \log\left[\binom{2^{L/3-1}}{M}^{3} \cdot (M!)^2 \cdot 2^{3(M - \log ML - \log M - 2)}\right]\\
&= \log\prod_{i=0}^{M-1}(2^L - i) - \log\prod_{i=0}^{M-1}(2^{L/3-1} - i)^3 - 3M + 3\log ML + 3\log M + 6\\
&= \log\prod_{i=0}^{M-1}\frac{2^L - i}{(2^{L/3} - 2i)^3} + 3\log ML + 3\log M + 6\\
&\le 3M\log\frac{2^{L/3}}{2^{L/3} - 2M} + 3\log ML + 3\log M + 6\\
&\overset{(a)}{\le} 12\log e + 3\log ML + 3\log M + 6,
\end{aligned}
\tag{5.13}
$$

where inequality $(a)$ is derived in Appendix 5.8.
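As a numerical sanity check of (5.13), the first line of the derivation can be evaluated exactly for small parameters and compared with the final bound. The Python sketch below does this; it uses real-valued base-2 logarithms as in (5.13), assumes $L$ is divisible by 3, and the function names `redundancy` and `bound` are hypothetical.

```python
import math


def redundancy(L: int, M: int) -> float:
    """Evaluate the first line of (5.13): log of all M-subsets of {0,1}^L
    minus log of the number of messages carried by the construction."""
    total = math.log2(math.comb(2 ** L, M))
    # Three sets of M distinct strings of length L/3 - 1.
    sets = 3 * math.log2(math.comb(2 ** (L // 3 - 1), M))
    # The two permutations sigma and pi.
    perms = 2 * math.log2(math.factorial(M))
    # Information bits left in the three extra columns.
    payload = 3 * (M - math.log2(M * L) - math.log2(M) - 2)
    return total - (sets + perms + payload)


def bound(L: int, M: int) -> float:
    """Right-hand side of (5.13)."""
    return 12 * math.log2(math.e) + 3 * math.log2(M * L) + 3 * math.log2(M) + 6


if __name__ == "__main__":
    L = 24
    for M in (4, 8, 16):  # the range 4 <= M <= 2^(L/6) = 16
        print(M, round(redundancy(L, M), 3), round(bound(L, M), 3))
```

For these parameters the computed redundancy stays below the bound, with most of the gap coming from the constant $12 \log e$.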

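For the case when $M < \log ML + \log M$, we generate $\{\mathbf{a}_i\}_{i=1}^{M}$, $\{\mathbf{b}_i\}_{i=1}^{M}$, and $\{\mathbf{c}_i\}_{i=1}^{M}$ with length $L/3 - \left\lceil \frac{\log ML + \log M}{M} \right\rceil$. As a result, in each part we have $\left\lceil \frac{\log ML + \log M}{M} \right\rceil$ bits $x_{i,j}$ per string, where $i \in [M]$ and

$$
j \in \left\{ \tfrac{L}{3} - \left\lceil \tfrac{\log ML + \log M}{M} \right\rceil + 1, \dots, \tfrac{L}{3} \right\} \cup \left\{ \tfrac{2L}{3} - \left\lceil \tfrac{\log ML + \log M}{M} \right\rceil + 1, \dots, \tfrac{2L}{3} \right\} \cup \left\{ L - \left\lceil \tfrac{\log ML + \log M}{M} \right\rceil + 1, \dots, L \right\},
$$

to accommodate the information bits $\mathbf{d}_2$, $\mathbf{d}_4$, $\mathbf{d}_6$ and the redundancy bits $\{E_H(\mathbf{s}_i)\}_{i=1}^{3}$ and $\{E_H(\mathbf{d}_i)\}_{i \in \{2,4,6\}}$ in each part.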

Remark 5.5.1. The above construction is valid whenever $M \le 2^{L/3-1}$. However, an asymptotically optimal amount of redundancy is achieved for $M \le 2^{L/6}$.

Remark 5.5.2. In this construction, the separate storage of the Hamming code redundancies $E_H(\mathbf{d}_2)$, $E_H(\mathbf{d}_4)$, and $E_H(\mathbf{d}_6)$ is not necessary. Instead, storing $E_H(\mathbf{d}_2, \mathbf{d}_4, \mathbf{d}_6)$ is sufficient, since the true positions of those bits can be inferred after $\{\mathbf{a}_i\}_{i=1}^{M}$, $\{\mathbf{b}_i\}_{i=1}^{M}$, and $\{\mathbf{c}_i\}_{i=1}^{M}$ have been successfully decoded. This approach results in a redundancy of $3\log ML + \log 3M + O(1)$, and a similar approach can be utilized in the next section as well.

Source document: Correcting Errors in DNA Storage, California Institute of Technology (pages 159-167)