Restriction Mapping - AN INTRODUCTION TO BIOINFORMATICS ALGORITHMS

4 Exhaustive Search

Exhaustive search algorithms require little effort to design but for many problems of interest cannot process inputs of any reasonable size within your lifetime. Despite this problem, exhaustive search, orbrute forcealgorithms are often the ﬁrst step in designing more efﬁcient algorithms.

We introduce two biological problems:DNA restriction mappingandregula- tory motif finding, whose brute force solutions are not practical. We further describe thebranch-and-boundtechnique to transform an inefficient brute force algorithm into a practical one. In chapter 5 we will see how to improve our motif finding algorithm to arrive at an algorithm very similar to the popular CONSENSUSmotif finding tool. In chapter 12 we describe two randomized algorithms,GibbsSamplerandRandomProjections, that use coin-tossing to find motifs.

for many years biologists were forced to build restriction maps for genomes without prior knowledge of the genomes’ DNA sequence.¹

Several experimental approaches to restriction mapping exist, each with advantages and disadvantages. The distance between two individual restriction sites corresponds to the length of the restriction fragment between those two sites and can be measured by thegel electrophoresistechnique described in chapter 2. This requires no knowledge of the DNA sequence. Biologists can vary experimental conditions to produce either acompletedigest [ﬁg. 4.1 (a)] or apartialdigest [ﬁg. 4.1 (b)] of DNA.² The restriction mapping problem can be formulated in terms of recovering positions of points when only pairwise distances between those points are known.

To formulate the restriction mapping problem, we will introduce some notation. Amultisetis a set that allows duplicate elements (e.g.,{2,2,2,3,3,4,5}

is a multiset with duplicate elements2and3). IfX ={x1= 0, x2, . . . , xn}is a set ofnpoints on a line segment in increasing order, thenΔXdenotes the multiset ofall_n

pairwise distances³between points inX:

ΔX={x_j−x_i: 1≤i < j≤n}.

For example, ifX={0,2,4,7,10}, thenΔX={2,2,3,3,4,5,6,7,8,10}, which are the ten pairwise distances between these points (table 4.1). In restriction mapping, we are givenΔX, the experimental data about fragment lengths.

The problem is to reconstructX from ΔX. For example, could you infer

1. While restriction maps were popular research tools in the late 1980s, their role has been some- what reduced in the last ten to ﬁfteen years with the development of efﬁcient DNA sequencing technologies. Though biologists rarely have to solve the DNA mapping problems in current research, we present them here as illustrations of branch-and-bound techniques for algorithm development.

2. Biologists typically work with billions of identical DNA molecules in solution. A complete digest corresponds to experimental conditions under whicheveryDNA molecule ateveryre- striction site is cut (i.e., the probability of cut at every restriction site is 1). Every linear DNA molecule withnrestriction sites is cut inton+1fragments that are recorded by gel electrophoresis as in ﬁgure 4.1 (a). A partial digest corresponds to experimental conditions that cut every DNA molecule at a given restriction site with probability less than 1. As a result, with some probability, the interval between any two (not necessarily consecutive) sites remains uncut, thus generating all fragments shown in ﬁgure 4.1 (b).

3. The notation`_n

´, read “nchoosek,” means “the number of distinct subsets ofkelements taken from a (larger) set ofnelements,” and is given by the expression_(n−k)!k!^n! . In particular,

`_n

´=ⁿ⁽ⁿ⁻¹⁾₂ is the number of different pairs of elements from ann-element set. For example, ifn = 5, the set{1,2,3,4,5}has`₅

´= 10subsets formed by two elements: {1,2},{1,3}, {1,4},{1,5},{2,3},{2,4},{2,5},{3,4},{3,5}, and{4,5}. These ten subsets give rise to the elementsx2−x1,x3−x1,x4−x1,x5−x1,x3−x2,x4−x2,x5−x2,x4−x3,x5−x3, and x5−x4.

4.1 Restriction Mapping 85

0 2 4 7 10

2 2

3 3

(a) Complete digest.

0 2 4 7 10

2 2

3 3

4 5

7 6

10 8

(b) Partial digest.

Figure 4.1 Different methods of digesting a DNA molecule. A complete digest pro- duces only fragments between consecutive restriction sites, while a partial digest yields fragments between any two restriction sites. Each of the dots represents a restriction site.

that the setΔX ={2,2,3,3,4,5,6,7,8,10}was derived from{0,2,4,7,10}?

Though gel electrophoresis allows one to determine the lengths of DNA fragments easily, it is often difﬁcult to judge their multiplicity. That is, the number of different fragments of a given length can be difﬁcult to determine.

However, it is experimentally possible to do so with a lot of work, and we assume for the sake of simpliﬁng the problem that this information is given to us as an input to the problem.

Table 4.1 Representation ofΔX ={2,2,3,3,4,5,6,7,8,10}as a two-dimensional table, with the elements ofX ={0,2,4,7,10}along both the top and right side. The element at(i, j)in the table is the valuex_j−x_ifor1≤i < j≤n.

0 2 4 7 10

2 2 5 8

4 3 6

7 3

Partial Digest Problem:

Given all pairwise distances between points on a line, reconstruct the positions of those points.

Input: The multiset of pairwise distancesL, containing_n

integers.

Output:A setX, ofnintegers, such thatΔX =L

ThisPartial Digest problem, or PDP, is sometimes called theTurnpike problem in computer science. Suppose you knew the set of distances between every (not necessarily consecutive) pair of exits on a highway leading from one town to another. Could you reconstruct the geography of the highway from this information? That is, could you ﬁnd the distance from the ﬁrst town to each exit? Here, the “highway exits” are the restrictions sites in DNA; the lengths of the resulting DNA restriction fragments correspond to distances between highway exits. Computationally, the only difference between the Turnpike problem and the PDP is that the distances between exits are given in miles in the Turnpike problem, while the distances between restriction sites are given in nucleotides in the PDP.

We remark that it is not always possible to uniquely reconstruct a setX based only onΔX. For example, for any integervand setA, one can see that ΔAis equal toΔ(A⊕ {v}), whereA⊕ {v}is deﬁned to be{a+v:a∈A}, a shiftof every element inAbyv. AlsoΔA= Δ(−A), where−A={−a:a∈A}

is thereﬂection ofA. For example, setsA = {0,2,4,7,10},Δ(A⊕ {100}) = {100,102,104,107,110}, and−A={−10,−7,−4,−2,0}all produce the same partial digest. The sets{0,1,3,8,9,11,12,13,15}and{0,1,3,4,5,7,12,13,15}

Dalam dokumen AN INTRODUCTION TO BIOINFORMATICS ALGORITHMS (Halaman 85-89)