5.6 Notes
The analysis of genome rearrangements in molecular biology was pioneered by Theodosius Dobzhansky and Alfred Sturtevant who, in 1936, published a milestone paper (102) presenting a rearrangement scenario for the species of fruit fly. In 1984 Nadeau and Taylor (78) estimated that surprisingly few genomic rearrangements (about 200) had taken place since the divergence of the human and mouse genomes. This estimate, made in the pregenomic era and based on a very limited data set, comes close to the recent postgenomic estimates based on the comparison of the entire human and mouse DNA sequences (85). The computational studies of the Reversal Distance problem were pioneered by David Sankoff in the early 1990s (93). The greedy algorithm based on breakpoint elimination is from a paper (56) by John Kececioglu and David Sankoff. The best currently known algorithm for sorting by reversals has an approximation ratio of 1.375 and was introduced by Piotr Berman, Sridhar Hannenhalli and Marek Karpinski (13). The first algorithmic analysis of the Pancake Flipping problem was the work of William Gates and Christos Papadimitriou in 1979 (40).
The greedy CONSENSUS algorithm was introduced by Gerald Hertz and Gary Stormo, and further improved in a 1999 paper by the same authors (47).
David Sankoff currently holds the Canada Research Chair in Mathematical Genomics at the University of Ottawa. He studied at McGill University, doing a PhD in Probability Theory with Donald Dawson, and writing a thesis on stochastic models for historical linguistics. He joined the new Centre de recherches mathématiques (CRM) of the University of Montreal in 1969 and was also a professor in the Mathematics and Statistics Department from 1984 to 2002. He is one of the founding fathers of bioinformatics whose fundamental contributions to the area go back to the early 1970s.
Sankoff was trained in mathematics and physics; his undergraduate summers in the early 1960s, however, were spent in a microbiology lab at the University of Toronto helping out with experiments in the field of virology and whiling away evenings and weekends in the library reading biological journals. It was exciting, and did not require too much background to keep up with the molecular biology literature: the Watson-Crick model was not even ten years old, the deciphering of the genetic code was still incomplete, and mRNA was just being discovered. With this experience, Sankoff had no problems communicating some years later with Robert J. Cedergren, a biochemist with a visionary interest in applying computers to problems in molecular biology.
In 1971, Cedergren asked Sankoff to find a way to align RNA sequences. Sankoff knew little of algorithm design and nothing of discrete dynamic programming, but as an undergraduate he had effectively used the latter in working out an economics problem matching buyers and sellers. The same approach worked with alignment. Bob and David became hooked on the topic, exploring statistical tests for alignment and other problems, fortunately before they realized that Needleman and Wunsch had already published a dynamic programming technique for biological sequence comparison.
A new question that emerged early in the Sankoff and Cedergren work was that of multiple alignment and its pertinence to molecular evolution. Sankoff was already familiar with phylogeny problems from his work on language families and participation in the early numerical taxonomy meetings (before the schism between the parsimony-promoting cladists, led by Steve Farris, and the more statistically oriented systematists). Combining phylogenetics with sequence comparison led to tree-based dynamic programming for multiple alignment. Phylogenetic problems have cropped up often in Sankoff’s research projects over the following decades.
Sankoff and Cedergren also studied RNA folding, applying several passes of dynamic programming to build energy-optimal RNA structures. They did not find the loop-matching reported by Daniel Kleitman’s group (later integrated into a general, widely used algorithm by Michael Zuker), though they eventually made a number of contributions in the 1980s, in particular to the problem of multiple loops and to simultaneous alignment and folding.
Sankoff says:
My collaboration with Cedergren also ran into its share of dead ends. Applying multidimensional scaling to ribosome structure did not lead very far, efforts to trace the origin of the genetic code through the phylogenetic analyses of tRNA sequences eventually petered out, and an attempt at dynamic programming for consensus folding of proteins was a flop.
The early and mid-1970s were nevertheless a highly productive time for Sankoff; he was also working on probabilistic analysis of grammatical variation in natural languages, on game theory models for electoral processes, and on various applied mathematics projects in archaeology, geography, and physics. He got Peter Sellers interested in sequence comparison; Sellers later attracted attention by converting the longest common subsequence (LCS) formulation to the edit distance version. Sankoff collaborated with prominent mathematician Vaclav Chvatal on the expected length of the LCS of two random sequences, for which they derived upper and lower bounds. Several generations of probabilists have contributed to narrowing these bounds.
Sankoff says:
Evolutionary biologists Walter Fitch and Steve Farris spent sabbaticals with me at the CRM, as did computer scientist Bill Day, generously adding my name to a series of papers establishing the hardness of various phylogeny problems, most importantly the parsimony problem.
In 1987, Sankoff became a Fellow of the new Evolutionary Biology Program of the Canadian Institute for Advanced Research (CIAR). At the very first meeting of the CIAR program he was inspired by a talk by Monique Turmel on the comparison of chloroplast genomes from two species of algae. This led Sankoff to the comparative genomics–genome rearrangement track that has been his main research line ever since. Originally he took a probabilistic approach, but within a year or two he was trying to develop algorithms and programs for reversal distance. A phylogeny based on the reversal distances among sixteen mitochondrial genomes proved that a strong phylogenetic signal can be conserved in the gene order of even a minuscule genome across many hundreds of millions of years.
Sankoff says:
The network of fellows and scholars of the CIAR program, including Bob Cedergren, Ford Doolittle, Franz Lang, Mike Gray, Brian Golding, Mike Zuker, Claude Lemieux, and others across Canada; and a stellar group of international advisors (such as Russ Doolittle, Michael Smith, Marcus Feldman, Wally Gilbert) and associates (Mike Waterman, Joe Felsenstein, Mike Steel and many others) became my virtual “home department,” a source of intellectual support, knowledge, and experience across multiple disciplines and a sounding board for the latest ideas.
My comparative genomics research received two key boosts in the 1990s. One was the sustained collaboration of a series of outstanding students and postdocs: Guillaume Leduc, Vincent Ferretti, John Kececioglu, Mathieu Blanchette, Nadia El-Mabrouk and David Bryant. The second was my meeting Joe Nadeau; I already knew his seminal paper with Taylor on estimating the number of conserved linkage segments and realized that our interests coincided perfectly while our backgrounds were complementary.
When Nadeau showed up in Montreal for a short-lived appointment in the Human Genetics Department at McGill, it took no more than an hour for him and Sankoff to get started on a major collaborative project. They reformulated the Nadeau-Taylor approach in terms of gene content data, freeing it from physical or genetic distance measurements. The resulting simpler model allowed them to thoroughly explore the mathematical properties of the Nadeau-Taylor model and to experiment with the consequences of deviating from it.
The synergy between the algorithmic and probabilistic aspects of comparative genomics has become basic to how Sankoff understands evolution. The algorithmic is an ambitious attempt at deep inference, based on heavy assumptions and the sophisticated but inflexible mathematics they enable. The probabilistic is more descriptive and less explicitly revelatory of historical process, but the models based on statistics are easily generalized, their hypotheses weakened or strengthened, and their robustness ascertained. In Sankoff’s view, it is the playing out of this dialectic that makes the field of whole-genome comparison the most interesting topic of research today and for the near future.
My approach to research is not highly planned. Not that I don’t have a vision about the general direction in which to go, but I have no specific set of tools that I apply as a matter of course, only an intuition about what type of method or model, what database or display, might be helpful. When I am lucky I can proceed from one small epiphany to another, working out some of the details each time, until some clear story emerges. Whether this involves stochastic processes, combinatorial optimization, or differential equations is secondary; it is the biology of the problem that drives its mathematical formulation. I am rarely motivated to research well-studied problems; instead I find myself confronting new problems in relatively unstudied areas; alignment was not a burning preoccupation with biologists or computer scientists when I started working on it, neither was genome rearrangement fifteen years later. I am quite pleased, though sometimes bemused, by the veritable tidal wave of computational biologists and bioinformaticians who have inundated the field where there were only a few isolated researchers thirty or even twenty years ago.