Statistical Approaches to Gene Prediction


6.12 Statistical Approaches to Gene Prediction

Figure 6.25 The six reading frames for the sequence ATGCTTAGTCTG. The string may be read forward or backward, and there are three frame shifts in each direction.
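The six reading frames named in the caption can be enumerated with a few lines of code. The following is a minimal sketch, not the book's own procedure; the function names `reverse_complement` and `six_reading_frames` are illustrative, and the input is assumed to contain only the letters A, C, G, and T.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_reading_frames(dna):
    """Split both strands into codons at each of the three frame shifts."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for shift in range(3):
            # Keep only complete codons; a trailing partial codon is dropped.
            codons = [strand[i:i + 3] for i in range(shift, len(strand) - 2, 3)]
            frames.append(codons)
    return frames


frames = six_reading_frames("ATGCTTAGTCTG")
for codons in frames:
    print(" ".join(codons))
```

The first three lines printed are the forward frames and the last three are the frames of the reverse complement strand.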

Many statistical gene prediction algorithms rely on statistical features in protein-coding regions, such as biases in codon usage. We can enter the frequency of occurrence of each codon within a given sequence into a 64-element codon usage array, as in table 6.1. The codon usage arrays for coding regions are different than the codon usage arrays for non-coding regions, enabling one to use them for gene prediction. For example, in human genes codons CGC and AGG code for the same amino acid (Arg) but have very different frequencies: CGC is 12 times more likely to be used in genes than AGG (table 6.1). Therefore, an ORF that “prefers” CGC over AGG while coding for Arg is a likely candidate gene. One can use a likelihood ratio approach²² to compute the conditional probabilities of the DNA sequence in a window, under the hypothesis that the window contains a coding sequence, and under the hypothesis that the window contains a noncoding sequence. If we slide this window along the genomic DNA sequence (and calculate the likelihood

22. The likelihood ratio technique allows one to test the applicability of two distinct hypotheses; when the likelihood ratio is large, the first hypothesis is more likely to be true than the second one.


Table 6.1 The genetic code and codon usage in Homo sapiens. The codon for methionine, or AUG, also acts as a start codon; all proteins begin with Met. The number next to each codon reflects the frequency of that codon's occurrence while coding for an amino acid. For example, among all lysine (Lys) residues in all the proteins in a genome, the codon AAG generates 25% of them while the codon AAA generates 75%. These frequencies differ across species.

     U              C              A              G

U    UUU Phe 57     UCU Ser 16     UAU Tyr 58     UGU Cys 45
     UUC Phe 43     UCC Ser 15     UAC Tyr 42     UGC Cys 55
     UUA Leu 13     UCA Ser 13     UAA Stp 62     UGA Stp 30
     UUG Leu 13     UCG Ser 15     UAG Stp 8      UGG Trp 100

C    CUU Leu 11     CCU Pro 17     CAU His 57     CGU Arg 37
     CUC Leu 10     CCC Pro 17     CAC His 43     CGC Arg 38
     CUA Leu 4      CCA Pro 20     CAA Gln 45     CGA Arg 7
     CUG Leu 49     CCG Pro 51     CAG Gln 66     CGG Arg 10

A    AUU Ile 50     ACU Thr 18     AAU Asn 46     AGU Ser 15
     AUC Ile 41     ACC Thr 42     AAC Asn 54     AGC Ser 26
     AUA Ile 9      ACA Thr 15     AAA Lys 75     AGA Arg 5
     AUG Met 100    ACG Thr 26     AAG Lys 25     AGG Arg 3

G    GUU Val 27     GCU Ala 17     GAU Asp 63     GGU Gly 34
     GUC Val 21     GCC Ala 27     GAC Asp 37     GGC Gly 39
     GUA Val 16     GCA Ala 22     GAA Glu 68     GGA Gly 12
     GUG Val 36     GCG Ala 34     GAG Glu 32     GGG Gly 15

ratio at each point), genes are often revealed as peaks in the likelihood ratio plots.
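The sliding-window likelihood ratio described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: the function names, the pseudocount, the window and step sizes, the uniform noncoding model, and the tiny training sequences are all made up for the example. The code uses the DNA alphabet (T in place of U) and scores a single reading frame only.

```python
import math
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # all 64 codons


def codon_usage(seqs, pseudocount=1.0):
    """Estimate a 64-element codon usage array (stored as a dict) from in-frame sequences."""
    counts = {c: pseudocount for c in CODONS}
    for s in seqs:
        for i in range(0, len(s) - 2, 3):
            codon = s[i:i + 3]
            if codon in counts:  # skip codons containing N or other symbols
                counts[codon] += 1
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def log_likelihood_ratio(window, coding, noncoding):
    """Sum of log(P(codon | coding) / P(codon | noncoding)) over the codons of one frame."""
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        if codon in coding:
            score += math.log(coding[codon] / noncoding[codon])
    return score


def sliding_scores(dna, coding, noncoding, window=30, step=3):
    """Score every window position; peaks in the scores suggest coding regions."""
    return [(start, log_likelihood_ratio(dna[start:start + window], coding, noncoding))
            for start in range(0, len(dna) - window + 1, step)]


# Toy, made-up training data: "coding" examples that favor CGC over AGG.
coding_model = codon_usage(["CGCCGCCGCCTGCTGCGC", "CGCCTGCGCCGC"])
noncoding_model = {c: 1 / 64 for c in CODONS}  # assumed uniform background

dna = "AGG" * 10 + "CGC" * 10 + "AGG" * 10
scores = sliding_scores(dna, coding_model, noncoding_model)
peak_start, peak_score = max(scores, key=lambda t: t[1])
print(peak_start, round(peak_score, 2))
```

In this toy input the highest-scoring window starts at position 30, where the CGC-rich stretch begins. Working with log ratios turns the product of per-codon probabilities into a sum, which avoids numerical underflow on longer windows.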

An even better coding sensor is the in-frame hexamer count²³ proposed by Mark Borodovsky and colleagues. Gene prediction in bacterial genomes also takes advantage of several conserved sequence motifs often found in the regions around the start of transcription. Unfortunately, such sequence motifs are more elusive in eukaryotes.
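Counting in-frame hexamers amounts to counting pairs of consecutive codons. The sketch below shows only the counting step, assuming the input is an ORF already read in the correct frame; the function name `in_frame_hexamer_counts` is illustrative, and how the counts are turned into a coding score is not specified here.

```python
from collections import Counter


def in_frame_hexamer_counts(orf):
    """Count hexamers that start on a codon boundary (pairs of consecutive codons)."""
    counts = Counter()
    for i in range(0, len(orf) - 5, 3):  # step by 3 to stay in frame
        counts[orf[i:i + 6]] += 1
    return counts


hexamers = in_frame_hexamer_counts("ATGCGCCGCCGCTAA")
print(hexamers.most_common())
```

Such counts could be normalized into frequencies and used in a likelihood ratio in the same way as single-codon frequencies, with 4096 possible hexamers in place of 64 codons.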

23. The in-frame hexamer count reflects frequencies of pairs of consecutive codons.

While the described approaches are successful in prokaryotes, their application to eukaryotes is complicated by the exon-intron structure. The average length of exons in vertebrates is 130 nucleotides, and exons of this length are too short to produce reliable peaks in the likelihood ratio plot while analyzing ORFs because they do not differ enough from random fluctuations to be detectable. Moreover, codon usage and other statistical parameters probably have nothing in common with the way the splicing machinery actually recognizes exons. Many researchers have used a more biologically oriented approach and have attempted to recognize the locations of splicing signals at exon-intron junctions. There exists a (weakly) conserved sequence of eight nucleotides at the boundary of an exon and an intron (donor splice site) and a sequence of four nucleotides at the boundary of an intron and exon (acceptor splice site). Since profiles for splice sites are weak, these approaches have had limited success and have been supplanted by hidden Markov model (HMM) approaches²⁴ that capture statistical dependencies between sites. A popular example of this latter approach is GENSCAN, which was developed in 1997 by Chris Burge and Samuel Karlin. GENSCAN combines coding region and splicing signal predictions into a single framework. For example, a splice site prediction is more believable if signs of a coding region appear on one side of the site but not on the other. Many such statistics are used in the HMM framework of GENSCAN, which merges splicing site statistics, coding region statistics, and motifs near the start of the gene, among others. However, the accuracy of GENSCAN decreases for genes with many short exons or with unusual codon usage.
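The profile-based splice site recognition mentioned above (the approach the text describes as having limited success, not the HMM used by GENSCAN) can be illustrated with a small sketch. Everything specific here is an assumption for illustration: the four eight-nucleotide "donor sites" are made-up training examples, the background is assumed uniform, and the function names are hypothetical.

```python
import math


def build_profile(sites, pseudocount=1.0):
    """Build a per-position nucleotide frequency profile from aligned sites."""
    profile = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in "ACGT"}
        for s in sites:
            counts[s[pos]] += 1
        total = sum(counts.values())
        profile.append({b: n / total for b, n in counts.items()})
    return profile


def profile_score(profile, candidate, background=0.25):
    """Log-odds score of a candidate site against a uniform background."""
    return sum(math.log(col[base] / background)
               for col, base in zip(profile, candidate))


def best_site(profile, dna):
    """Return the start position of the highest-scoring window in dna."""
    k = len(profile)
    return max(range(len(dna) - k + 1),
               key=lambda i: profile_score(profile, dna[i:i + k]))


# Made-up aligned donor-site examples, eight nucleotides each.
toy_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
donor_profile = build_profile(toy_sites)

position = best_site(donor_profile, "CCCCCAGGTAAGTCCCCC")
print(position)
```

A profile of this kind treats positions as independent, which is one reason the text notes that models capturing dependencies between sites have replaced it.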

From the document AN INTRODUCTION TO BIOINFORMATICS ALGORITHMS (pages 197-200)