
Genome Sequencing: An Overview

Kishor Gaikwad

National Institute for Plant Biotechnology, New Delhi-110012 [email protected]

A genome is the entire cellular DNA of an organism, organized so that it can be packaged properly into chromosomes and expressed specifically, allowing a cell to pass on its genetic information to the next generation. Genomes are present in the nucleus and mitochondria and, in plants, additionally in the chloroplast. The timing, specificity and level of expression of genes are determined by the sequence of the gene itself. It is therefore important to know the sequence of a gene and its adjoining regions, and indeed of the entire genome, to understand its complexity and structure. This is the world of structural and functional genomics, in which biologists are decoding the alphabet of life with the intent of understanding its complexity and harnessing its benefits.

Genome sequencing was considered an uphill task and an expensive proposition in the early days.

In the 1960s and 70s it was assumed that the genome size of an organism was in proportion to its body size. This idea, however, contradicted the experimental evidence, which showed that the genome of a big mammal was smaller than that of the lily plant. Even the prokaryotes seemed to have more DNA in their genome than they could handle or package within the constraints of their single cell. It looked as if most of the DNA in the genome was redundant, and it was initially referred to as junk DNA that consumed the space and energy of the organism. Then came an era of rapid advances in the understanding of the genetic code and how it worked in prokaryotes and eukaryotes, and the first sequence information was generated. In 1977 two methods for sequencing DNA were introduced. One method, referred to as Maxam-Gilbert sequencing after the two scientists at Harvard University who developed it, uses different chemicals to break radioactively labeled DNA at specific base positions. The other approach, developed by Frederick Sanger in England and called the chain termination method (or the Sanger method), uses a DNA synthesis reaction with special forms of the four nucleotides (dideoxynucleotides) that, when added to a DNA chain, terminate further chain growth. A couple of hundred reads could thus be completed in a day, which was considered a major turnaround for gene sequencing. At this point, genome sequencing was still not thought of even as a distant possibility.
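The chain termination principle can be illustrated with a toy simulation. The sketch below is only a simplified model, not a description of the laboratory protocol: it assumes that each of the four reactions yields one fragment for every position where its terminating base occurs, and that sorting the fragments by length (as a gel would) reveals the sequence. The function names and the example strand are hypothetical.

```python
def chain_termination_fragments(strand, base):
    """Return the lengths of all fragments that end at `base`.

    Toy model: in the reaction containing the terminator for `base`,
    synthesis can stop at every position where that base occurs.
    """
    return [i + 1 for i, b in enumerate(strand) if b == base]


def read_sequence(strand):
    """Reconstruct the sequence from the four per-base fragment ladders."""
    ladder = {}
    for base in "ACGT":
        for length in chain_termination_fragments(strand, base):
            ladder[length] = base  # each fragment length identifies one base
    # Sorting by fragment length mimics reading a gel from shortest to longest.
    return "".join(ladder[length] for length in sorted(ladder))


synthesized = "GATTACA"  # hypothetical newly synthesized strand
print(chain_termination_fragments(synthesized, "A"))  # fragment lengths ending in A
print(read_sequence(synthesized))
```

Each fragment length appears in exactly one of the four ladders, which is why ordering the fragments by size is enough to read off the bases one after another.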

However, all this changed in the 1990s with the advent of PCR and high-capacity cloning vectors. New vectors that could take insert DNA of up to 1000 kb (1 Mb) were developed. These were the bacterial artificial chromosomes (BACs) and the yeast artificial chromosomes (YACs).

Thus large-insert libraries were developed and sequencing genomes became a reality. With advances in sequencing chemistry, new non-isotopic fluorescent dyes were developed and the era of automated capillary electrophoresis dawned. At the same time the computer industry was going through its own revolution in the form of newer, faster machines and easy-to-use software. The synchronization, or simple coincidence, of these two technologies led to the development of new machines. It is now possible to sequence the E. coli genome of around 4.5 Mb in less time than it takes to watch a movie, and even newer and faster machines can finish sequencing a human genome in just 3 days, a task that earlier required around a decade.

Chapter-21

We have traveled a long way from the first genome sequence of H. influenzae in 1995 to the complete sequences of the human genome in 2001 and the rice genome in 2002.

(DOI: 10.5772/intechopen.69337)

Figure 1: A historical snapshot of DNA sequencing platforms

The most important landmark in genome sequencing came in the latter part of the last decade, when the next generation sequencers were launched by 454 and Solexa (Fig. 1 and 2). This revolutionized the way genome sequencing could be achieved. All the earlier projects, including human, Arabidopsis and rice, were completed by a particular approach known as the BAC by BAC approach. This took more time and was a costly, labour-intensive experiment. It also required high-density molecular maps, which are available in only a few plants. NGS technology changed all that, and whole genome sequencing (WGS) became a routine feature in which just a few micrograms of genomic DNA were enough to get a draft genome assembly. The improvements in NGS technology have been rapid and targeted more towards sequencing each base. One could now reach deep into the cell to sequence that one elusive transcript which would normally escape detection by any other method. Thus the longer reads of the 454 machines and the deeper coverage provided by the Illumina, SOLiD and Ion Torrent systems, merged together with a Sanger backbone, became a regular approach to sequencing and assembling complex genomes. Suddenly GenBank became populated with assemblies of all types of eukaryotes, and plants in particular. These included draft genomes, organelle genomes, deep transcript profiles, small RNA profiles, methylated C profiles, etc.

Just when biologists were grappling with the huge amount of genome and transcriptome data, along came the chemistry of single-molecule sequencing. Notably, two systems, Pacific Biosciences and Oxford Nanopore, were launched and now provide the luxury of longer reads averaging 10 kb or more. With the advent of newer chemistries like Hi-C and 10X Genomics, it has become relatively easier to obtain long-range sequence information.
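The idea of coverage depth used when comparing platforms can be made concrete with a small calculation. The sketch below uses the common approximation that average depth equals the number of reads times the read length divided by the genome size. The 4.5 Mb genome size is the E. coli figure quoted earlier; the 150 bp read length and the 30x target are assumed values chosen only for illustration.

```python
import math


def coverage_depth(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced (depth = N * L / G)."""
    return num_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size):
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_depth * genome_size / read_length)


genome_size = 4_500_000  # approximate E. coli genome size in bases
read_length = 150        # assumed short-read length in bases
target_depth = 30        # assumed target coverage

n = reads_needed(target_depth, read_length, genome_size)
print(n, coverage_depth(n, read_length, genome_size))
```

This is an average only: real data are unevenly distributed, so some regions end up with far less depth than the calculation suggests.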

Combined with optical mapping systems like BioNano, developing a genome sequence into a chromosome-level assembly has become relatively easier.

Figure 2: An overview of early Next Generation Sequencing technologies

Thus genome sequencing today has become easy, at least in the sense that all genomes can be decoded and analyzed completely and cost is no longer a hurdle. Over the decade the cost has come down to manageable proportions, from millions of dollars for a genome like human or rice to a few hundred dollars for a complete genome today, excluding the cost of the platform and data analysis. All the above technologies, however, come with some disadvantages. From a bioinformatics point of view, assembling each genome is challenging due to various factors.

Assembling a genome sequenced by the BAC by BAC approach and Sanger methods is much easier and more accurate. But assembling the short reads generated by Illumina and Ion Torrent systems is not that easy. Thus preparing a hybrid assembly from all the different chemistries seems to be the only way out, and mapping each base becomes an uphill task. Similarly, annotation of such transcriptomes is not easy due to the presence of spliced transcripts and miRNAs. Thus the challenges of data analysis are now a bigger worry than the actual sequencing itself. No single pipeline exists that can cater to all the complex eukaryotic genomes; each assembly will probably require newer and faster algorithms.
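To give a feel for what assembling short reads involves, here is a minimal sketch of a greedy overlap merge. It is a teaching toy under strong assumptions (error-free reads, a single strand, no repeats) and is not how production assemblers work; the function names, the reads and the minimum overlap of 3 bases are all hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


# Hypothetical error-free reads drawn from one short sequence.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

The toy works here because every overlap is unique. When a genome contains repeats longer than the reads, several merges look equally good, which is one reason short-read assemblies are harder than assemblies built from long Sanger reads.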

This requires constant interaction between biology and bioinformatics and the development of species-targeted interfaces and databases. Today, ambitious projects like the 1000 Genomes Project for humans and the 3000 Rice Genomes Project have been completed. The 1000 Genomes Project is expected to map every SNP present in the human genome and use the associations for disease or particular traits. The 3000 Rice Genomes Project aims to capture all the diversity in the rice germplasm for utilization in trait improvement.
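At its simplest, finding SNPs means comparing a resequenced sample with a reference, position by position. The sketch below assumes the two sequences are already aligned and of equal length, which real pipelines have to establish first through read mapping and variant calling; the function name and both sequences are hypothetical.

```python
def find_snps(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatch.

    Positions are 1-based. Both sequences are assumed to be pre-aligned
    and of equal length; insertions and deletions are not handled.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos + 1, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


reference = "ATGGCGTACC"  # hypothetical reference segment
sample = "ATGACGTACT"     # hypothetical resequenced individual
print(find_snps(reference, sample))
```

Repeating such a comparison across many individuals yields the variant catalogue that association studies then test against disease or trait data.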

Thus resequencing of genomes assumes greater significance, as this effort will lead to the identification of useful associations between genes and phenotypes. As evident from the graph below, NGS has contributed strongly to the growth in sequenced eukaryotic genomes in recent times, and the number keeps getting bigger every day (Figure 3).

Figure 3: Growth in genome sequencing

One can only imagine the amount of data these studies will generate and how it will help us understand the organisms and harness the knowledge generated from such projects. The huge amount of biodiversity in important organisms provides immense potential for crop improvement that has remained hidden for long. Genome sequencing of the entire germplasm of plants, fungi, insects, nematodes and viruses has the potential to unravel new genes that could assist in providing food and nutritional security to the masses.
