3.4 Database search using csHMMs
3.4.1 Modeling RNA secondary structures
RNA sequences with secondary structures can be viewed as a kind of biological palindrome.
Palindromes are symmetric sequences that read the same forwards and backwards, such as “I prefer pi,” “step on no pets,” and so on. Similarly, the base pairing in an RNA secondary structure gives rise to symmetric (or reverse complementary, to be more precise) regions in its primary sequence that are analogous to palindromes. As we have seen in Chapter 2, HMMs can be viewed as stochastic regular grammars according to the Chomsky hierarchy of transformational grammars [16].
Regular grammars are the simplest among the four classes in the hierarchy, and it is known that they are inherently incapable of describing a palindromic language. As a result, regular grammars are not suitable for constructing RNA profiles.
In order to represent complex correlations that are frequently observed in ncRNA sequences, we need more complex models with larger descriptive power than the regular grammars. One model that has been especially popular for representing RNA families is the covariance model (CM) [25, 26].
CMs can be viewed as profile-SCFGs (profile stochastic context-free grammars), which are capable of handling nested correlations.
Another possibility is to use context-sensitive HMMs (csHMMs). In fact, csHMMs can effectively describe the long-range correlations between distant bases, and hence they provide a simple and intuitive way of modeling RNAs with conserved secondary structures.
In the following section, we show how csHMMs can be used for representing RNA secondary structures and finding homologous RNAs.
of RNA secondary structures. Note that the csHMM that represents a given structure is not unique.
The final implementation of the model depends on the specific application.
Figure 3.5 (a) illustrates a typical stem-loop (hairpin) structure, which is the simplest of all RNA secondary structures. RNA sequences with a conserved stem-loop structure can be represented by the simple context-sensitive HMM shown in Figure 3.5 (b). In this model, the pairwise-emission state P1 and the context-sensitive state C1 are associated with a stack, and together they generate the stem part of the structure. When we enter C1, it retrieves a symbol x from the stack, which was previously emitted by P1. Note that x represents a base in the RNA sequence, which takes one of the four values A, C, G, and U. After retrieving x, the context-sensitive state C1 emits the complementary base of x. In this way, the state-pair (P1, C1) can generate the stem. The single-emission state S1 is used for generating the loop, since the bases in the loop are not correlated with other bases. If we need to model a bulge, which is an unpaired base inside a stem, this can be done by adding more states to the HMM as shown in Figure 3.5 (c).
As demonstrated in this example, whenever there exists a pairwise interaction between two bases, we can represent it using a pair consisting of a pairwise-emission state and a context-sensitive state.
For unpaired bases that form a loop, a bulge, and so forth, we can use single-emission states for representing them.
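The stack mechanism described above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions that are not part of the model definition: the stem and loop lengths are fixed arguments rather than being governed by transition probabilities, emissions are drawn uniformly, and C1 always emits the strict Watson-Crick complement of the retrieved symbol. The names `generate_stem_loop`, `COMPLEMENT`, and `BASES` are hypothetical.

```python
import random

# Assumption: strict Watson-Crick pairing only (no G-U wobble pairs).
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}
BASES = "ACGU"


def generate_stem_loop(stem_len, loop_len, rng=None):
    """Generate one stem-loop sequence in the manner of states P1, S1, C1.

    Fixed lengths and uniform emissions are simplifications made for
    illustration; a real csHMM uses transition and emission probabilities.
    """
    rng = rng or random.Random()
    stack = []   # the stack shared by P1 and C1
    seq = []

    # P1: emit a base and push it onto the stack.
    for _ in range(stem_len):
        x = rng.choice(BASES)
        seq.append(x)
        stack.append(x)

    # S1: emit uncorrelated loop bases; the stack is not touched.
    for _ in range(loop_len):
        seq.append(rng.choice(BASES))

    # C1: pop a symbol x emitted earlier by P1 and emit its complement.
    while stack:
        x = stack.pop()
        seq.append(COMPLEMENT[x])

    return "".join(seq)


if __name__ == "__main__":
    print(generate_stem_loop(stem_len=5, loop_len=4, rng=random.Random(1)))
```

Because the stack is last-in, first-out, the first base emitted by P1 is paired with the last base emitted by C1, which yields the reverse complementary regions that form the stem.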
Based on this principle, it is not difficult to construct context-sensitive HMMs that can represent RNA sequences with more complex secondary structures. For example, Figure 3.6 (a) depicts the consensus secondary structure of the so-called iron response element (IRE). The iron response elements are found in the 5’ or 3’ UTRs (untranslated regions) of various messenger RNAs. It is known that the iron regulatory proteins (IRPs) bind to the IREs in order to control the iron metabolism inside the cell [54]. The IRE has a well conserved stem-loop with an interior loop as shown in Figure 3.6 (a).
These RNAs can be modeled using the csHMM in Figure 3.6 (b).
Another example of an RNA secondary structure is illustrated in Figure 3.7 (a), which shows the typical structure of a tRNA (transfer RNA). The tRNA is a short RNA molecule that usually consists of 74–93 nucleotides. It attaches a specific amino acid to the protein chain that is being synthesized during the translation procedure of mRNA into protein [11]. The tRNAs have a highly conserved secondary structure with three stem-loops, which is called the cloverleaf structure due to its shape. As shown in Figure 3.7 (b), the cloverleaf structure can be modeled using four pairs of (Pn, Cn), where a separate stack is dedicated to each state-pair. Note the similarity between the
Figure 3.5: (a) A typical stem-loop. The nodes represent the bases in the RNA, and the dotted lines indicate the interactions between bases that form complementary base pairs. (b) An example of a csHMM that generates a sequence with a stem-loop structure. (c) A csHMM that models a stem-loop with bulges.
Figure 3.6: (a) A typical structure of an iron response element. (b) An example of a csHMM that generates sequences with the given secondary structure.
Figure 3.7: (a) A typical tRNA cloverleaf structure. (b) An example of a csHMM that can generate sequences with the cloverleaf structure.
original consensus RNA structure and the constructed context-sensitive HMM. As every state in the HMM corresponds to one or more base locations in the RNA sequence, the design procedure of context-sensitive HMMs is very simple and intuitive.
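The correspondence between base locations and states can be illustrated with a small Python sketch that assigns a state label to every position of a consensus structure. Several things here are assumptions made for illustration and do not come from the text: the structure is written in dot-bracket notation ("(" and ")" for paired bases, "." for unpaired bases), the function name `assign_states` is hypothetical, and a new state-pair (Pn, Cn) is opened whenever a base pair is not stacked directly on the previous one. Since the csHMM that represents a given structure is not unique, this numbering is only one possible choice; for instance, it splits a stem at a bulge into two state-pairs rather than reusing a single pair as in Figure 3.5 (c).

```python
def assign_states(structure):
    """Assign a csHMM state label (Pn, Cn, or Sn) to each position.

    `structure` is a dot-bracket string (an assumed input format).
    Each run of directly stacked base pairs gets its own state-pair
    (Pn, Cn); each run of unpaired bases gets its own state Sn.
    """
    # First pass: find the pairing partner of every opening position.
    partner = {}
    open_positions = []
    for i, ch in enumerate(structure):
        if ch == "(":
            open_positions.append(i)
        elif ch == ")":
            if not open_positions:
                raise ValueError("unbalanced structure")
            partner[open_positions.pop()] = i
        elif ch != ".":
            raise ValueError(f"unexpected symbol {ch!r}")
    if open_positions:
        raise ValueError("unbalanced structure")

    # Second pass: label the positions from 5' to 3'.
    labels = [""] * len(structure)
    n_pairs = 0      # number of state-pairs (Pn, Cn) used so far
    n_single = 0     # number of single-emission states Sn used so far
    prev_open = None
    for i, ch in enumerate(structure):
        if ch == "(":
            j = partner[i]
            stacked = (prev_open == i - 1
                       and partner.get(prev_open) == j + 1)
            if not stacked:
                n_pairs += 1
            labels[i] = f"P{n_pairs}"
            labels[j] = f"C{n_pairs}"
            prev_open = i
        elif ch == ".":
            if i == 0 or structure[i - 1] != ".":
                n_single += 1
            labels[i] = f"S{n_single}"
    return labels


if __name__ == "__main__":
    # A stem-loop with an interior loop (an invented toy structure).
    print(assign_states("((.((...)).))"))
```

For the toy structure in the usage example, the labeling uses two state-pairs and three single-emission states, so each state corresponds to one or more base locations in the sequence.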