AN INTRODUCTION TO BIOINFORMATICS ALGORITHMS

Bob tries to analyze the game and finds that there are too many variants in the game with two piles of ten stones (which we will call the 10+10 game). Below are brief descriptions of the basic commands used in pseudocode throughout this book.

Biological Algorithms versus Computer Algorithms

DNA polymerase (another molecular machine) binds to each freshly separated template strand of the DNA; DNA polymerase traverses the parental strands only in the 3′ → 5′ direction. Therefore, the DNA polymerases attached to the two DNA strands move in opposite directions.

The Change Problem

While this problem is not particularly relevant to biology, it serves as a useful tool to illustrate a number of different algorithmic approaches. Output: the smallest number of quarters q, dimes d, nickels n, and pennies p whose values add to M (i.e., 25q + 10d + 5n + p = M and q + d + n + p is as small as possible).
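The problem statement above can be made concrete with a minimal Python sketch of the greedy cashier strategy. It assumes the four U.S. denominations named in the output specification; the function name `us_change` and the tuple layout are illustrative choices, not taken from the text.

```python
def us_change(amount):
    """Greedy change for U.S. coins.

    Returns (quarters, dimes, nickels, pennies) whose values add up to
    `amount` cents. The denominations are assumed to be 25, 10, 5, and 1.
    """
    counts = []
    for value in (25, 10, 5, 1):
        # Take as many coins of this denomination as fit, then move on.
        count, amount = divmod(amount, value)
        counts.append(count)
    return tuple(counts)


print(us_change(40))  # (1, 1, 1, 0)
```

For these particular denominations the greedy choice happens to give the smallest number of coins; that is a property of the coin system, not of the strategy in general.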

Figure 2.2 The subtle difference between a problem (top) and an instance of a problem (bottom).

Correct versus Incorrect Algorithms

We say an algorithm is correct if it can translate each input instance into the correct output. An algorithm is incorrect if there is at least one input instance for which the algorithm does not produce the correct output.

Recursive Algorithms

To move a stack of size n − 1 from the middle to the right, first move a stack of size n − 2 from the middle to the left, then move the (n − 1)-th disk to the right, and then move the stack of n − 2 from the left to the right peg, and so on. The subsequent statements (lines 5-7) then solve the smaller problem of moving the stack of size n − 1 first to the temporary space, moving the largest disk, and then moving the n − 1 small disks to the final destination.

Table 2.1 The result of 6 − fromPeg − toPeg for all possible values of fromPeg and toPeg.
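The recursion and the 6 − fromPeg − toPeg trick can be sketched in Python as follows. This is a minimal illustration, assuming pegs are numbered 1, 2, and 3 (so the unused peg is whatever remains of 1 + 2 + 3 = 6); the function name and the choice to return a list of moves are assumptions made for the example.

```python
def hanoi(n, from_peg, to_peg):
    """Return the list of (from, to) moves that transfers n disks.

    Pegs are assumed to be numbered 1, 2, 3, so the peg that is neither
    the source nor the destination is 6 - from_peg - to_peg.
    """
    if n == 0:
        return []
    unused_peg = 6 - from_peg - to_peg
    return (
        hanoi(n - 1, from_peg, unused_peg)   # clear the n - 1 smaller disks
        + [(from_peg, to_peg)]               # move the largest disk
        + hanoi(n - 1, unused_peg, to_peg)   # put the smaller disks back on top
    )


for move in hanoi(2, 1, 3):
    print(move)
```

Moving n disks this way takes 2^n − 1 single-disk moves, which is why the output grows so quickly with n.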

Iterative versus Recursive Algorithms

The number of adult rabbits in the period is equal to the number of rabbits (adults and babies) in the previous period, or F_{n−1}. The number of baby rabbits in a given period is equal to the number of adult rabbits in F_{n−1}, namely F_{n−2}.
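The recurrence F_n = F_{n−1} + F_{n−2} that follows from this argument can be computed either recursively or iteratively. The sketch below assumes the convention F_1 = F_2 = 1; the function names are illustrative.

```python
def recursive_fibonacci(n):
    """Direct translation of the recurrence; recomputes the same values many times."""
    if n <= 2:
        return 1
    return recursive_fibonacci(n - 1) + recursive_fibonacci(n - 2)


def fibonacci(n):
    """Iterative version: keeps only the last two values, one pass over 3..n."""
    previous, current = 1, 1  # F_1 and F_2
    for _ in range(n - 2):
        previous, current = current, previous + current
    return current


print(fibonacci(10))  # 55
```

Both functions return the same values, but the recursive one performs an exponentially growing number of calls, while the iterative one performs a number of additions proportional to n.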

Fast versus Slow Algorithms

We have seen that the running time of an algorithm is often related to the size of its input. As we'll see in the next section, when we talk about an algorithm's running time as a function of input size, we're referring to the one input—or set of inputs—of a particular size that the algorithm will take the longest to process.

Figure 2.5 The recursion tree for RECURSIVEFIBONACCI(n). Vertices enclosed in dashed circles represent duplicated effort: the same value had been calculated in another vertex in the tree at a higher level.

Big-O Notation

Just as Big-O notation defines an upper bound on the growth of a function, we can define a relationship that reflects a lower bound on the growth of a function. Since SELECTIONSORT always performs the same operations on a list of size n, we can be sure that this is a tight analysis of the running time of the algorithm.
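The claim that selection sort does the same amount of work on every list of a given size can be checked with a small Python sketch that counts comparisons. This is an illustrative implementation, not the book's SELECTIONSORT pseudocode; the comparison counter is an addition made for the example.

```python
def selection_sort(items):
    """Sort a copy of `items`; also return how many comparisons were made."""
    a = list(items)
    n = len(a)
    comparisons = 0
    for i in range(n - 1):
        smallest = i
        for j in range(i + 1, n):
            comparisons += 1          # one comparison per inner-loop step
            if a[j] < a[smallest]:
                smallest = j
        a[i], a[smallest] = a[smallest], a[i]
    return a, comparisons


print(selection_sort([5, 2, 4, 1, 3]))  # ([1, 2, 3, 4, 5], 10)
```

The count is n(n − 1)/2 regardless of the input order, so the quadratic upper bound is also a lower bound for this algorithm.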

Algorithm Design Techniques

For example, if you used a brute force algorithm to find a ringing phone, you would ignore the phone ringing as if you didn't hear it and simply walk over every square inch of your home to see if the phone was there. Another approach to the phone-finding problem is to collect statistics over the course of a year on where you leave your phone to see where the phone most often ends up.

Tractable versus Intractable Problems

However, no one seems to be able to prove that polynomial-time algorithms for these problems are impossible, so no one can rule out the possibility that these problems are actually efficiently solvable. The critical property of NP-complete problems is that, if one NP-complete problem can be solved by a polynomial-time algorithm, then all NP-complete problems can be solved by slight modifications of the same algorithm.

Notes

Richard Karp, born 1935 in Boston, is a professor at the University of California at Berkeley, with a principal appointment in computer science and additional appointments in mathematics, bioengineering and operations research. He has been a faculty member at the University of California at Berkeley since 1968 (with the exception of the period 1995–99, when he was a professor at the University of Washington).

Problems

The player who can place the king on the bottom right square of the chessboard wins. In the second minute, each of the viruses kills a bacterium and produces a new copy of itself (resulting in 4 viruses and 2(2(n − 1) − 2) = 4n − 8 bacteria; again, the remaining bacteria reproduce).

What Is Life Made Of?

It would be safe to say that the minimum biological background one needs to digest a typical bioinformatics book can fit into ten pages. In this chapter we give a brief introduction to biology that covers most of the computational concepts discussed in bioinformatics books. The incredibly reliable and complex algorithm that governs the life of the cell is still beyond our understanding.

What Is the Genetic Material?

Sturtevant crossed double-mutant b and vg flies with normal flies and found that about 17% of the offspring had only one mutation. But when Sturtevant crossed double mutants for a different pair of genes, he found that 9% of the offspring had only one mutation.

What Do Genes Do?

Morgan's student Alfred Sturtevant followed Morgan's chromosome theory and produced the first genetic map of a chromosome showing the order of genes. By studying many genes in this way, gene order can be determined.

What Molecule Codes for Genes?

What Is the Structure of DNA?

Watson and Crick faced a three-dimensional puzzle: find a helical structure made of DNA subunits that explains the Chargaff rule. After tinkering with paper and metal Tinkertoy representations of the bases, Watson and Crick arrived at the very simple and elegant double-stranded helical structure of DNA.

Figure 3.1 Watson and Crick puzzling about the structure of DNA. (Photo courtesy of Photo Researchers, Inc.)

What Carries Information between DNA and Proteins?

Thus DNA served as a template used to copy a particular gene into messenger RNA (mRNA) that carries the gene's genetic information to the ribosome to make a particular protein. To do so, these cells must cut out the introns from the RNA transcript and splice all the exons together before the mRNA enters the ribosome.

How Are Proteins Made?

Many chemical systems in the cell require protein complexes, which are groups of proteins that assemble together into a large structure. This short molecule is then attacked by large molecular complexes known as ribosomes, which read successive codons and locate the corresponding amino acid for incorporation into the growing polypeptide chain.

Table 3.1 The genetic code, from the perspective of mRNA. The codon for methionine, or AUG, also acts as a “start” codon that initiates translation.
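How a ribosome reads successive codons can be sketched in a few lines of Python. The sketch assumes the standard genetic code, written here as a 64-letter string with codons ordered by the bases U, C, A, G in each position and `*` marking stop codons; the names `CODON_TABLE` and `translate` are illustrative.

```python
BASES = "UCAG"
# Standard genetic code; codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna):
    """Translate from the first AUG to the first stop codon (one-letter amino acid codes)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":      # stop codon: release the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("GGAUGGCCUAAGG"))  # MA
```

Real translation involves far more machinery (tRNAs, initiation factors, reading-frame context); the sketch only mirrors the table lookup.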

How Can We Analyze DNA?

Restriction enzymes first bind to the recognition site on double-stranded DNA and then cut the DNA. The speed at which a fragment migrates is related to the size of the fragment, so measuring the migration distance over a given time allows one to estimate the size of a DNA fragment.

Figure 3.3 The three main operations in the polymerase chain reaction. Denaturation (top) is performed by heating the solution of DNA until the strands separate (which happens around 70°C).

How Do Individuals of a Species Differ?

How Do Different Species Differ?

Why Bioinformatics?

This showed that the Cretan civilization of the Linear B tablets had been part of the Greek civilization. He was so pleased with the result that he added Doolittle's name as one of the co-authors.

Restriction Mapping

4.1 (b)] of DNA. The restriction mapping problem can be formulated in terms of recovering the positions of points when only pairwise distances between these points are known. A complete digest corresponds to experimental conditions under which the DNA molecule is cut at every restriction site (i.e., the probability of cutting at each restriction site is 1).
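The "pairwise distances" formulation can be illustrated with a short Python sketch that computes the multiset of distances for a given set of points. The function name and the example positions are illustrative assumptions, chosen only to show the shape of the data.

```python
from itertools import combinations


def pairwise_distances(points):
    """Sorted multiset of distances between every pair of points."""
    return sorted(b - a for a, b in combinations(sorted(points), 2))


# Hypothetical restriction-site positions, including both ends of the molecule.
print(pairwise_distances([0, 2, 4, 7, 10]))
# [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]
```

Restriction mapping is the inverse task: given only such a multiset, recover a set of positions that produces it.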

Figure 4.1 Different methods of digesting a DNA molecule. A complete digest produces only fragments between consecutive restriction sites, while a partial digest yields fragments between any two restriction sites.

Impractical Restriction Mapping Algorithms

For example, if L does not contain the number 5, it really doesn't make sense to choose any x_i = 5, even though the above algorithm will do that. For example, BRUTEFORCEPDP takes a very long time to execute when called on input L, but ANOTHERBRUTEFORCEPDP takes very little time.

A Practical Restriction Mapping Algorithm

After each recursive call in PLACE, we undo our modifications to the sets X and L in order to restore them for the next recursive call. At first glance, this algorithm looks efficient: at each point we examine two alternatives ("left" or "right"), and exclude the obviously wrong decisions that lead to inconsistent distances.
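The place-then-undo pattern described here can be sketched in Python as a small backtracking search. This is a hedged illustration rather than the book's PLACE pseudocode: it returns only the first consistent solution, and the helper names (`partial_digest`, `place`, `distances_to_placed`) are assumptions made for the example.

```python
from collections import Counter


def partial_digest(L):
    """Return one sorted point set whose pairwise distances equal the multiset L, or None."""
    remaining = Counter(L)
    width = max(remaining)
    remaining[width] -= 1
    X = [0, width]

    def distances_to_placed(y):
        # Multiset of distances from candidate y to every point placed so far.
        return Counter(abs(y - x) for x in X)

    def place():
        if sum(remaining.values()) == 0:
            return sorted(X)
        y = max(d for d, count in remaining.items() if count > 0)
        # The largest remaining distance is measured from the left or the right end.
        for candidate in (y, width - y):
            needed = distances_to_placed(candidate)
            if all(remaining[d] >= count for d, count in needed.items()):
                remaining.subtract(needed)
                X.append(candidate)
                solution = place()
                if solution is not None:
                    return solution
                # Undo the modifications before trying the other alternative.
                X.pop()
                remaining.update(needed)
        return None

    return place()


print(partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]))
```

The input list is a hypothetical example. Because a point set and its mirror image have the same pairwise distances, the search may return either one.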

Regulatory Motifs in DNA Sequences

It turns out that many immunity genes in the fruit fly genome have strings reminiscent of TCGGGGATTTCC located upstream of the genes' start. Ideally, the fly infection experiment would result in a set of upstream regions from genes in the genome, each region containing at least one NF-κB binding site.

Profiles

In fact, the string ATGCAACT does not even appear in figure 4.2 (d), but its seven mutated versions appear at position 8 in the first row, position 19 in the second row, 3 in the third, 5 in the fourth, 31 in the fifth, 27 in the sixth and 15 in the seventh. Relying on a single string to represent a motif often fails to capture the variation of the pattern in actual biological sequences, as in figure 4.2 (d).

Figure 4.3 From DNA sample, to alignment matrix, to profile, and, finally, to consensus string.
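The pipeline from alignment matrix to profile to consensus string can be sketched in Python. The three aligned strings below are hypothetical, not the data from the figure, and the function names are illustrative; ties between equally frequent nucleotides are broken in the order A, C, G, T, which is an arbitrary choice for the example.

```python
from collections import Counter


def profile(alignment):
    """One Counter of nucleotide frequencies per column of the alignment matrix."""
    return [Counter(column) for column in zip(*alignment)]


def consensus(alignment):
    """Most frequent nucleotide in each column (ties broken in A, C, G, T order)."""
    return "".join(
        max("ACGT", key=lambda base: column[base]) for column in profile(alignment)
    )


def consensus_score(alignment):
    """Sum over columns of the count of the most frequent nucleotide."""
    return sum(max(column.values()) for column in profile(alignment))


alignment = ["ATGC", "ATGA", "TTGC"]   # hypothetical aligned l-mers
print(consensus(alignment), consensus_score(alignment))  # ATGC 10
```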

The Motif Finding Problem

The problem on the left is the Median String problem while the problem on the right is the Motif Finding problem. In other words, the consensus string for solving the Motif Finding problem is the median string for the input DNA sample.

Figure 4.4 Calculating the total Hamming distance for the consensus string ATGCAACT (the alignment is the same as in figure 4.3).

Search Trees

At each vertex we calculate a bound: the most optimistic score of all leaves in the subtree rooted at that vertex. Scores on internal vertices represent the maximum score in the subtree rooted at that vertex.

Figure 4.5 All 4-mers in the alphabet of {1, 2}.

Finding Motifs

There are n − l + 1 choices for the first index (s1) and, for each of these, there are n − l + 1 choices for the second index (s2). Therefore, the total number of positions is (n − l + 1)^t, which is exponential in t, the number of sequences.

Finding a Median String

word ← nucleotide string corresponding to (s1, s2, ..., sl)
if TOTALDISTANCE(word, DNA) < bestDistance
    bestDistance ← TOTALDISTANCE(word, DNA)
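The pseudocode fragment above is the core of a brute-force median string search. A minimal Python sketch of that search follows, assuming that the total distance of a word is the sum, over all sequences, of the smallest Hamming distance between the word and any l-mer of the sequence. The input sequences are hypothetical and the function names are illustrative.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def total_distance(word, dna):
    """Sum over sequences of the best (smallest) distance from word to any l-mer."""
    l = len(word)
    return sum(
        min(hamming(word, seq[i:i + l]) for i in range(len(seq) - l + 1))
        for seq in dna
    )


def median_string(dna, l):
    """Try all 4^l words and keep the one with the smallest total distance."""
    best_word, best_distance = None, float("inf")
    for letters in product("ACGT", repeat=l):
        word = "".join(letters)
        distance = total_distance(word, dna)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word, best_distance


dna = ["TTACGT", "ACGTTT", "GGACGA"]   # hypothetical sample
print(median_string(dna, 3))           # ('ACG', 0)
```

The loop visits 4^l words regardless of the number or length of the sequences, which is the trade-off against the (n − l + 1)^t starting-position search.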

Figure 4.9 A search tree for the Median String problem. Each branching point can give rise to only four children, as opposed to the n − l +1 children in the Motif Finding problem.

Notes

Twenty years later, it was shown that these transcription factors bind specifically in the upstream regions of the genes they regulate and recognize certain patterns (motifs) in DNA. By combining the protein's binding parameters with analysis of its sequence, including a comparison with other gene sequences, they were able to provide a model for the protein's activity in gene regulation.

Problems

Design a brute force algorithm for DDP and suggest a branch-and-bound approach to improve its performance. Design a brute force algorithm for PPDP and suggest a branch-and-bound approach to improve its performance.

Figure 4.10 Restriction map of two restriction enzymes. When the digest is performed with each restriction enzyme separately and then with both enzymes combined, you may be able to reconstruct the original restriction map.

Genome Rearrangements

The USCHANGE algorithm in Chapter 2 is an example of a greedy strategy: at each step, the cashier would only consider the largest denomination less than (or equal to) M. BETTERCHANGE actually returned incorrect results in some cases because of its short-sighted notion of "good". This is a common characteristic of greedy algorithms: they often produce suboptimal results, but take very little time to do so.

Sorting by Reversals

However, biologists believe that the architecture of the X chromosome in the human-mouse ancestor is roughly the same as the architecture of the human X chromosome. Even before biologists faced genome rearrangement problems, computer scientists studied the related Sorting by Prefix Reversals problem, also known as the Pancake Flipping problem: given an arbitrary permutation π, find dpref(π), which is the minimum number of reversals of the form ρ(1, i) sorting π.

Approximation Algorithms

William Gates, a student at Harvard in the mid-1970s, and Christos Papadimitriou, a professor at Harvard in the mid-1970s, now at Berkeley, made the first attempt to solve this problem and proved that any permutation can be sorted by at most (5/3)(n + 1) prefix reversals. Of course, an algorithm with an approximation ratio of 1 (a correct and optimal algorithm by definition) would be the pinnacle of perfection, but such algorithms can be difficult to find.

Breakpoints: A Different Face of Greed

On the other hand, it is easy to see that if all the strips in π are increasing, then there may not be a reversal that reduces the number of breakpoints. By reversing an increasing strip, ρ creates a decreasing strip, which means that IMPROVEDBREAKPOINTREVERSALSORT will be able to reduce the number of breakpoints in the next step.
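The quantity these algorithms try to reduce can be computed with a short Python sketch. It assumes the usual convention for unsigned permutations: the permutation of 1..n is extended with 0 in front and n + 1 at the end, and a breakpoint is any pair of neighbors that are not consecutive numbers. The function names and the example permutation are illustrative.

```python
def count_breakpoints(permutation):
    """Number of neighboring pairs in the extended permutation that differ by more than 1."""
    extended = [0] + list(permutation) + [len(permutation) + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if abs(a - b) != 1)


def reverse(permutation, i, j):
    """Apply the reversal rho(i, j), with 1-based inclusive positions."""
    p = list(permutation)
    p[i - 1:j] = reversed(p[i - 1:j])
    return p


pi = [2, 1, 3, 4, 5, 8, 7, 6]          # hypothetical permutation
print(count_breakpoints(pi))           # 4
print(count_breakpoints(reverse(pi, 1, 2)))  # 2
```

A single reversal changes only the two neighbor pairs at its ends, so it can remove at most two breakpoints; the identity permutation is the only one with none.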

A Greedy Approach to Motif Finding

As you can imagine, since sequences are scanned sequentially, it is possible to build up input instances where GREEDYMOTIFSEARCH will miss the optimal motif. Another important difference is that CONSENSUS stores a large number (typically at least 1000) of seed matrices at each iteration, rather than just the one stored by GREEDYMOTIFSEARCH, making CONSENSUS less likely to miss the optimal solution.

Notes

In 1969 he joined the new Centre de recherches mathématiques (CRM) of the University of Montreal and was also a professor in the Department of Mathematics and Statistics from 1984 to 2002. He is one of the founders of bioinformatics, whose fundamental contributions to this field date back to the early 1970s.

Problems

Given the three permutations π1, π2, and π3 from the previous problem, find an ancestral permutation σ that minimizes the total breakpoint distance, summed over the three permutations. Given the three permutations π1, π2, and π3 from the previous problem, find an ancestral permutation σ that minimizes the total reversal distance, summed over the three permutations.

The Power of DNA Sequence Comparison

More than 10 million Americans are unaware and asymptomatic carriers of the defective cystic fibrosis gene. In 1989, the search for the cystic fibrosis gene was narrowed down to a 1 million nucleotide region on chromosome 7, but the exact location of the gene remained unknown.

The Change Problem Revisited

Instead, we just calculate the minimum number of coins needed (this algorithm can easily be modified to also return a combination of coins that reaches this number). This works because the best number of coins for a given value only depends on values less than m.
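The bottom-up computation described here can be sketched in Python. The function name `dp_change` and the example denominations are assumptions made for illustration; the table `best` holds, for each value m, the minimum number of coins needed to make m.

```python
def dp_change(amount, denominations):
    """Minimum number of coins that add up to `amount`, or None if impossible."""
    infinity = float("inf")
    best = [0] + [infinity] * amount
    for m in range(1, amount + 1):
        for coin in denominations:
            # The best answer for m only depends on answers for smaller values.
            if coin <= m and best[m - coin] + 1 < best[m]:
                best[m] = best[m - coin] + 1
    return None if best[amount] == infinity else best[amount]


print(dp_change(40, (25, 20, 10, 5, 1)))  # 2  (20 + 20; greedy would use 25 + 10 + 5)
print(dp_change(77, (1, 3, 7)))           # 11
```

The running time is proportional to the amount times the number of denominations, in contrast to the exponential blow-up of recomputing the same subproblems recursively.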

Figure 6.1 The relationships between optimal solutions in the Change problem. The smallest number of coins for 77 cents depends on the smallest number of coins for 76, 74, and 70 cents; the smallest number of coins for 76 cents depends on the smallest numb

The Manhattan Tourist Problem

MANHATTANTOURIST calculates the length of the longest path in the grid, but does not give the path itself. In the case of the Manhattan tourist problem, this changes the optimal path (the optimal path in this new city has six attractions instead of five).

Figure 6.3 A city somewhat like Manhattan, laid out on a grid with one-way streets.

Edit Distance and Alignments

Analyzing the merits of an alignment is equivalent to analyzing the merits of the corresponding path in the edit graph. Each alignment corresponds to a path in the alignment grid from (0, 0) to (n, m), and each path from (0, 0) to (n, m) in the alignment grid corresponds to an alignment.

Figure 6.10 Alignment of ATATATAT against TATATATA and of ATATATAT against TATAAT.

Longest Common Subsequences

The following recursive program prints out the longest common subsequence using the information stored in b. The dynamic programming table in figure 6.14 (left) presents the calculation of the similarity score s(v, w) between v and w, while the table on the right presents the calculation of the edit distance between v and w under the assumption that insertions and deletions are the only permitted operations.

Figure 6.14 Dynamic programming algorithm for computing the longest common subsequence.
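The table-filling and traceback steps can be sketched in Python. This is an illustrative variant: instead of storing explicit backtracking pointers b and printing recursively, it retraces the path iteratively from the score table s, which yields a subsequence of the same length. The function name and the example strings are assumptions made for the example.

```python
def longest_common_subsequence(v, w):
    """Return one longest common subsequence of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1      # diagonal edge (match)
            else:
                s[i][j] = max(s[i - 1][j], s[i][j - 1])
    # Walk back from (n, m), collecting matched letters in reverse order.
    letters = []
    i, j = n, m
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1]:
            letters.append(v[i - 1])
            i, j = i - 1, j - 1
        elif s[i - 1][j] >= s[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


print(len(longest_common_subsequence("ATCTGAT", "TGCATA")))  # 4
```

When insertions and deletions are the only permitted operations, the edit distance follows directly from the same table as n + m − 2·s[n][m].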

Global Sequence Alignment

Scoring Alignments

Fortunately, in many cases the alignment of very similar sequences is so obvious that it can be constructed even without a scoring matrix, thus solving this predicament. Once these “obvious” alignments have been constructed, they can be used to calculate a scoring matrix δ that can be used iteratively to construct less obvious alignments.

Local Sequence Alignment

The solution to this apparently more difficult problem lies in realizing that the Global Alignment problem corresponds to finding the longest path between the vertices (0, 0) and (n, m) in the edit graph, while the Local Alignment problem corresponds to finding the longest path among the paths between arbitrary vertices (i, j) and (i′, j′) in the edit graph. The largest value of s_{i,j} over the entire edit graph represents the score of the best local alignment of v and w; recall that in the Global Alignment problem we simply looked at s_{n,m}.
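The difference from global alignment can be shown in a short Python sketch: every cell may restart at score 0 (a free jump from the source), and the answer is the largest cell value rather than s_{n,m}. The scoring values (+1 match, −1 mismatch, −1 indel), the function name, and the example strings are illustrative assumptions; the sketch returns only the score, not the alignment.

```python
def local_alignment_score(v, w, match=1, mismatch=-1, indel=-1):
    """Score of the best local alignment of v and w under a simple scoring scheme."""
    best = 0
    previous = [0] * (len(w) + 1)
    for i in range(1, len(v) + 1):
        current = [0] * (len(w) + 1)
        for j in range(1, len(w) + 1):
            diagonal = previous[j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            # The 0 option lets an alignment start at any vertex for free.
            current[j] = max(0, previous[j] + indel, current[j - 1] + indel, diagonal)
            best = max(best, current[j])      # the alignment may also end anywhere
        previous = current
    return best


print(local_alignment_score("AAAGCTCCC", "TTTGCTGGG"))  # 3
```

Only two rows of the table are kept, which is enough for the score; reconstructing the aligned substrings would need the full table or backtracking pointers.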

Figure 6.16 (a) Global and (b) local alignments of two hypothetical genes that each have a conserved domain

Alignment with Gap Penalties

Multiple Alignment

The multiple alignment matrix we constructed is a generalization of the pairwise alignment matrix to more than two sequences. Each multiple alignment of three sequences corresponds to a path in the three-dimensional Manhattan-like edit graph.

Figure 6.18 A three-level edit graph for alignment with affine gap penalties. Every vertex (i, j) in the middle level has one outgoing edge to the upper level, one outgoing edge to the lower level, and one incoming edge each from the upper and low

Gene Prediction

Thus, the difference in the size of the salamander and human genomes likely reflects larger amounts of junk DNA and repeats in the salamander genome. These four continuous segments (called exons) in the adenovirus genome are separated by three "junk" fragments called introns.

Figure 6.23 An electron microscopy experiment led to the discovery of split genes.

Statistical Approaches to Gene Prediction

Codon usage arrays for coding regions are different from codon usage arrays for non-coding regions, allowing them to be used for gene prediction. However, the accuracy of GENSCAN decreases for genes with many short exons or with unusual codon usage.

Figure 6.25 The six reading frames for the sequence ATGCTTAGTCTG. The string may be read forward or backward, and there are three frame shifts in each direction.
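The six reading frames can be enumerated with a short Python sketch. It assumes that "backward" means reading the reverse complement strand, which is the usual convention for the opposite strand of DNA; the function names are illustrative. Each frame is returned as a list of complete codons, with any leftover one or two nucleotides dropped.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Sequence of the opposite strand, read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def reading_frames(dna):
    """Three frame shifts on the given strand, then three on the reverse complement."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for shift in range(3):
            codons = [strand[i:i + 3] for i in range(shift, len(strand) - 2, 3)]
            frames.append(codons)
    return frames


for frame in reading_frames("ATGCTTAGTCTG"):
    print(" ".join(frame))
```

For the sequence in the caption, the first frame printed is ATG CTT AGT CTG and the fourth is CAG ACT AAG CAT.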

Similarity-Based Approaches to Gene Prediction