Chapter V: Coding over Sets for DNA Storage: Substitution Errors
5.2 Preliminaries
To discuss the problem in its most general form and illustrate the ideas, we restrict our attention to binary strings. Most of the techniques in this chapter can be extended to non-binary cases. Generally, we denote scalars by lower-case letters $x, y, \ldots$, vectors by bold symbols $\mathbf{x}, \mathbf{y}, \ldots$, integers by capital letters $K, L, \ldots$, and $[n] \triangleq \{1, 2, \ldots, n\}$.
For integers $M$ and $L$ such that$^1$ $M \le 2^L$, we denote by $\binom{\{0,1\}^L}{M}$ the family of all subsets of size $M$ of $\{0,1\}^L$, and by $\binom{\{0,1\}^L}{\le M}$ the family of subsets of size at most $M$ of $\{0,1\}^L$. In our channel model, a word is an element $W \in \binom{\{0,1\}^L}{M}$, and a code $\mathcal{C} \subseteq \binom{\{0,1\}^L}{M}$ is a set of words (for clarity, we refer to words in a given code as codewords). To prevent ambiguity with classical coding-theoretic terms, the elements in a word $W = \{\mathbf{x}_1, \ldots, \mathbf{x}_M\}$ are referred to as strings. We emphasize that the indexing in $W$ is merely a notational convenience, e.g., by the lexicographic order of the strings, and this information is not available at the decoder.
$^1$We occasionally also assume that $M \le 2^{cL}$ for some $0 < c < 1$. This is in accordance with typical values of $M$ and $L$ in contemporary DNA storage prototypes (see Sec. 5.1).
For $K \le ML$, a $K$-substitution error ($K$-substitution, in short) is an operation that changes the values of at most $K$ different positions in a word. Notice that the result of a $K$-substitution is not necessarily an element of $\binom{\{0,1\}^L}{M}$, and might be an element of $\binom{\{0,1\}^L}{T}$ for some $M - K \le T \le M$. This gives rise to the following definition.
Definition 5.2.1. For a word $W \in \binom{\{0,1\}^L}{M}$, a ball $\mathcal{B}_K(W) \subseteq \bigcup_{j=M-K}^{M} \binom{\{0,1\}^L}{j}$ centered at $W$ is the collection of all subsets of $\{0,1\}^L$ that can be obtained by a $K$-substitution in $W$.
Example 5.2.1. For $M = 2$, $L = 3$, $K = 1$, and $W = \{001, 011\}$, we have that
$\mathcal{B}_K(W) = \{\{001,011\}, \{101,011\}, \{011\}, \{000,011\}, \{001,111\}, \{001\}, \{001,010\}\}$.
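The ball in Example 5.2.1 can be reproduced by brute force. The following is a minimal sketch rather than part of the constructions in this chapter; the helper names `flip` and `ball` are illustrative assumptions, and strings are represented as Python `str` objects over the characters 0 and 1.

```python
from itertools import combinations


def flip(s: str, i: int) -> str:
    """Return the binary string s with the bit at index i flipped."""
    return s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]


def ball(word, K):
    """Enumerate all sets obtainable from `word` by at most K substitutions.

    `word` is a set of equal-length binary strings. The result is a set of
    frozensets, since two strings may coincide after a substitution.
    """
    strings = sorted(word)  # an arbitrary fixed order, used only for enumeration
    length = len(strings[0])
    positions = [(j, i) for j in range(len(strings)) for i in range(length)]
    result = set()
    for k in range(K + 1):
        for chosen in combinations(positions, k):
            mutated = list(strings)
            for j, i in chosen:
                mutated[j] = flip(mutated[j], i)
            result.add(frozenset(mutated))  # equal strings merge into one
    return result


W = {"001", "011"}
B = ball(W, K=1)
for S in sorted(B, key=lambda S: (len(S), sorted(S))):
    print(sorted(S))
print(len(B))  # 7 sets, two of which contain a single string
```

The enumeration is exponential in $K$ and is meant only for checking small examples by hand.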
In this chapter, we discuss bounds and constructions of codes in $\binom{\{0,1\}^L}{M}$ that can correct $K$ substitutions ($K$-substitution codes, for short), for various values of $K$. The size of a code, which is denoted by $|\mathcal{C}|$, is the number of codewords (that is, sets) in it. The redundancy of the code, a quantity that measures the amount of redundant information that is to be added to the data to guarantee successful decoding, is defined as $r(\mathcal{C}) \triangleq \log \binom{2^L}{M} - \log(|\mathcal{C}|)$, where the logarithms are in base 2.
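As a numerical illustration of the redundancy formula, the sketch below evaluates $\log \binom{2^L}{M} - \log(|\mathcal{C}|)$ directly. The function name `redundancy` and the toy parameters are assumptions made for the example only.

```python
from math import comb, log2


def redundancy(L: int, M: int, code_size: int) -> float:
    """Redundancy in bits of a code with `code_size` codewords.

    Each codeword is a set of M distinct binary strings of length L,
    so there are comb(2**L, M) possible words in total.
    """
    if not 1 <= M <= 2 ** L:
        raise ValueError("M must satisfy 1 <= M <= 2**L")
    total_words = comb(2 ** L, M)
    if not 1 <= code_size <= total_words:
        raise ValueError("code size must be between 1 and the number of words")
    return log2(total_words) - log2(code_size)


# With L = 3 and M = 2 there are comb(8, 2) = 28 words.
print(redundancy(3, 2, 28))  # the trivial code of all words has zero redundancy
print(redundancy(3, 2, 2))   # a two-codeword code: log2(28) - 1 bits
```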
A code $\mathcal{C}$ is used in our channel as follows. First, the data to be stored (or transmitted) is mapped by a bijective encoding function to a codeword $C \in \mathcal{C}$. This codeword passes through a channel that might introduce up to $K$ substitutions, and as a result a word $W \in \mathcal{B}_K(C)$ is obtained at the decoder. In turn, the decoder applies some decoding function to extract the original data. The code $\mathcal{C}$ is called a $K$-substitution code if the decoding process always recovers the original data successfully. Having settled the channel model, we are now in a position to formally state our contribution.
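For very small parameters, the channel model can be checked exhaustively. Unique decoding is possible exactly when no received word can originate from two different codewords, so a lookup-table decoder either exists or a collision is found. The sketch below assumes this brute-force view; `ball` and `build_decoder` are hypothetical helper names, and the toy codes are chosen by hand rather than taken from the constructions in this chapter.

```python
from itertools import combinations


def ball(word, K):
    """All sets obtainable from `word` by flipping at most K bits in total."""
    strings = sorted(word)
    positions = [(j, i) for j, s in enumerate(strings) for i in range(len(s))]
    out = set()
    for k in range(K + 1):
        for chosen in combinations(positions, k):
            mutated = [list(s) for s in strings]
            for j, i in chosen:
                mutated[j][i] = "1" if mutated[j][i] == "0" else "0"
            out.add(frozenset("".join(s) for s in mutated))
    return out


def build_decoder(code, K):
    """Map every possible received word to its unique codeword.

    Raises ValueError if two codewords can produce the same received word,
    in which case `code` cannot correct K substitutions.
    """
    table = {}
    for codeword in code:
        for received in ball(codeword, K):
            if received in table and table[received] != codeword:
                raise ValueError("balls overlap: not a K-substitution code")
            table[received] = codeword
    return table


good = [frozenset({"000", "001"}), frozenset({"110", "111"})]
decoder = build_decoder(good, K=1)
received = frozenset({"000", "101"})  # {000, 001} after one substitution
print(sorted(decoder[received]))      # ['000', '001']
```

The table grows with the ball size, so this approach is useful only as a sanity check and not as a practical decoder.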
Theorem 5.2.1. (Main) For any integers $M$, $L$, and $K$ such that $M \le 2^{L/(4K+2)}$, there exists an explicit code construction with redundancy $O(K^2 \log(ML))$$^2$ (Sec. 5.6).
For $K = 1$, the redundancy of this construction is asymptotically at most six times larger than the optimal one (Sec. 5.5), when $L$ goes to infinity and $M \ge 4$. Furthermore, an improved construction for $K = 1$ achieves redundancy which is asymptotically at most three times the optimal one (Appendix 5.8), when $L$ goes to infinity and $4 \le M \le 2^{L/4}$.
$^2$Throughout this chapter, we write $f(n) = O(g(n))$ for any functions $f, g : \mathbb{N} \to \mathbb{R}$ and integer $n$, if $\limsup_{n \to \infty} \frac{f(n)}{g(n)} < \infty$.
Remark 5.2.1. We note that under current technological restrictions, our code constructions apply to a limited parameter range. For example, when $K = 1$ and $L = 100$, our code requires that $M \le 10^5$. The upper bound on $M$ becomes stricter as $K$ increases. Yet, our constructions provide the following insights.
First, as the technology advances, it is reasonable to hope that the error rate will decrease and the synthesis length $L$ will increase, in which case we have smaller $K$ and larger $L$. As $K$ gets smaller, the range of $M$ increases exponentially with $L$. Second, while $L$ is limited under current technology, the number of strings $M$ can be more freely chosen. One can get smaller $K$ by storing information in more DNA pools.
Third, the application of this work is not limited to DNA storage. Our techniques apply to other cases such as network packet transmission.
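The parameter range in the remark follows from the condition $M \le 2^{L/(4K+2)}$ of Theorem 5.2.1. The sketch below computes the largest admissible $M$ exactly, using the equivalent integer condition $M^{4K+2} \le 2^L$; the function name `max_strings` is an assumption made for illustration.

```python
def max_strings(L: int, K: int) -> int:
    """Largest integer M with M <= 2**(L / (4K + 2)).

    Uses the equivalent integer test M**(4K+2) <= 2**L and a binary
    search, which avoids floating-point rounding issues.
    """
    e = 4 * K + 2
    bound = 2 ** L
    lo, hi = 1, 2 ** (L // e + 1)  # lo satisfies the test, hi does not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid ** e <= bound:
            lo = mid
        else:
            hi = mid
    return lo


for K in (1, 2, 3):
    print(K, max_strings(100, K))
# For K = 1 and L = 100 the bound is slightly above 10**5.
```

The printed values shrink quickly as $K$ grows, which is the stricter upper bound on $M$ noted in the remark.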
A few auxiliary notions are used throughout this chapter, and are introduced herein.
For two strings $\mathbf{s}, \mathbf{t} \in \{0,1\}^L$, the Hamming distance $d_H(\mathbf{s}, \mathbf{t})$ is the number of entries in which they differ. To prevent confusion with common terms, a subset of $\{0,1\}^L$ is called a vector-code, and the set $\mathcal{B}^H_D(\mathbf{s})$ of all strings within Hamming distance $D$ or less of a given string $\mathbf{s}$ is called the Hamming ball of radius $D$ centered at $\mathbf{s}$. A linear vector-code is called an $[n, k]_q$ code if the strings in it form a subspace of dimension $k$ in $\mathbb{F}_q^n$, where $\mathbb{F}_q$ is the finite field with $q$ elements.
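The Hamming distance and the Hamming ball defined above can be sketched in a few lines. The function names `hamming_distance` and `hamming_ball` are illustrative, and binary strings are again represented as Python `str` objects.

```python
from itertools import combinations


def hamming_distance(s: str, t: str) -> int:
    """Number of positions in which the equal-length strings s and t differ."""
    if len(s) != len(t):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(s, t))


def hamming_ball(s: str, D: int) -> set:
    """All binary strings within Hamming distance D of s."""
    out = set()
    for d in range(D + 1):
        for positions in combinations(range(len(s)), d):
            chars = list(s)
            for i in positions:
                chars[i] = "1" if chars[i] == "0" else "0"
            out.add("".join(chars))
    return out


print(hamming_distance("001", "011"))  # 1
print(sorted(hamming_ball("000", 1)))  # ['000', '001', '010', '100']
```

The ball of radius $D$ around a string of length $L$ contains $\sum_{d=0}^{D} \binom{L}{d}$ strings, which gives a quick way to check the output size.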
Several well-known vector-codes are used in the sequel, such as Reed-Solomon codes or Hamming codes. For an integer $t$, the Hamming code is a $[2^t - 1, 2^t - t - 1]_2$ code (i.e., there are $t$ redundant bits in every codeword), and its minimum Hamming distance is 3. Reed-Solomon (RS) codes over $\mathbb{F}_q$ exist for every length $n$ and dimension $k$, as long as $q \ge n - 1$ [79, Sec. 5], and require $n - k$ redundant symbols in $\mathbb{F}_q$. Whenever $q$ is a power of two, RS codes can be made binary by representing each element of $\mathbb{F}_q$ as a binary string of length $\log_2(q)$. In the sequel we use this form of RS code, which requires $\log(q)(n - k)$ redundant bits.
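To make the Hamming code parameters concrete, the sketch below uses the textbook layout in which parity bits sit at positions that are powers of two and the syndrome is the XOR of the (1-based) positions holding a 1. This is one standard presentation, assumed here for illustration, and not necessarily the form used by the constructions later in the chapter; the names `syndrome`, `encode`, and `correct` are hypothetical.

```python
def syndrome(word):
    """XOR of the 1-based positions that hold a 1; zero for codewords."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


def encode(data, t):
    """Encode 2**t - t - 1 data bits into a Hamming codeword of length 2**t - 1."""
    n = 2 ** t - 1
    if len(data) != n - t:
        raise ValueError("expected 2**t - t - 1 data bits")
    word = [0] * n
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # position is not a power of two: data bit
            word[pos - 1] = next(bits)
    s = syndrome(word)
    for j in range(t):       # set parity bits so that the syndrome becomes zero
        if (s >> j) & 1:
            word[(1 << j) - 1] = 1
    return word


def correct(word):
    """Correct at most one substitution; the syndrome is the error position."""
    fixed = list(word)
    s = syndrome(fixed)
    if s:
        fixed[s - 1] ^= 1
    return fixed


codeword = encode([1, 0, 1, 1], t=3)  # a [7, 4] codeword with 3 redundant bits
corrupted = list(codeword)
corrupted[4] ^= 1                     # one substitution at position 5
print(correct(corrupted) == codeword) # True
```

Because the minimum distance is 3, any single substitution yields a nonzero syndrome that identifies the flipped position.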
Finally, our encoding algorithms make use of combinatorial numbering maps [53], which are functions that map a number to an element in some structured set. Specifically, $F_{com} : [\binom{n}{k}] \to \{S : S \subseteq [n], |S| = k\}$ maps a number to a set of distinct elements, and $F_{perm} : [n!] \to S_n$ maps a number to a permutation in the symmetric group $S_n$. The function $F_{com}$ can be computed using a greedy algorithm with complexity $O(kn \log n)$, and the function $F_{perm}$ can be computed in a straightforward manner with complexity $O(n \log n)$. Using $F_{com}$ and $F_{perm}$ together, we define a map $F : [\binom{n}{k} k!] \to \{S : S \subseteq [n], |S| = k\} \times S_k$ that maps a number into an unordered set of size $k$ together with a permutation.
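One possible realization of such numbering maps is sketched below, using lexicographic unranking for subsets and the factorial number system for permutations. These are standard techniques chosen for illustration and are not claimed to be the algorithms of [53]; in particular, the permutation routine uses `list.pop` and is therefore not optimized to the complexity quoted above. The combined map pairs a $k$-subset with a permutation of $k$ elements, which is an assumption about the intended domain.

```python
from math import comb, factorial


def F_com(num: int, n: int, k: int) -> list:
    """Map num in {1, ..., comb(n, k)} to a k-subset of {1, ..., n}."""
    if not 1 <= num <= comb(n, k):
        raise ValueError("num out of range")
    r, subset, x = num - 1, [], 1
    while k > 0:
        c = comb(n - x, k - 1)  # subsets whose next smallest element is x
        if r < c:
            subset.append(x)
            k -= 1
        else:
            r -= c
        x += 1
    return subset


def F_perm(num: int, n: int) -> list:
    """Map num in {1, ..., n!} to a permutation of {1, ..., n}."""
    if not 1 <= num <= factorial(n):
        raise ValueError("num out of range")
    r, items, perm = num - 1, list(range(1, n + 1)), []
    for i in range(n, 0, -1):
        idx, r = divmod(r, factorial(i - 1))
        perm.append(items.pop(idx))
    return perm


def F(num: int, n: int, k: int):
    """Map num in {1, ..., comb(n, k) * k!} to a (k-subset, permutation) pair."""
    if not 1 <= num <= comb(n, k) * factorial(k):
        raise ValueError("num out of range")
    a, b = divmod(num - 1, factorial(k))
    return F_com(a + 1, n, k), F_perm(b + 1, k)


print(F_com(4, 4, 2))  # [2, 3], the fourth 2-subset of {1, 2, 3, 4}
print(F_perm(6, 3))    # [3, 2, 1], the last permutation of {1, 2, 3}
print(F(1, 4, 2))      # ([1, 2], [1, 2])
```

Each map is a bijection onto its range, so distinct input numbers always yield distinct outputs.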