Figure 4.3: Various types of RNA secondary structures. (a) RNA with 2-crossing property. (b) RNA with 3-crossing property. (c) RNA with 4-crossing property.
pseudoknot family [98], which will be considered in our experiments presented in Section 4.4.4. It will be shown that profile-csHMMs can be used for modeling and predicting the secondary structure of these pseudoknots. Profile-csHMMs can also represent RNAs with even more complex secondary structures, such as those shown in Figure 4.3 (b) (RNA with 3-crossing property) and Figure 4.3 (c) (RNA with 4-crossing property), in a manner similar to that described in Section 4.2.1.
Figure 4.4: (a) An RNA sequence with no structural annotation. (b) Optimal state sequence of the observed RNA sequence in the given profile-csHMM. (c) Structural alignment obtained from the optimal state sequence.
observed RNA sequence. Furthermore, the insert state “I5” indicates that the observed sequence has an additional base right after the fifth position of the original consensus sequence. Based on these observations, we can obtain the best structural alignment between the new sequence and the given profile as illustrated in Figure 4.4 (c), and the secondary structure of the new RNA sequence can be immediately inferred from this alignment.
The optimal state sequence of a profile-csHMM can be systematically found by using the sequential component adjoining (SCA) algorithm [129]. The SCA algorithm can be viewed as a generalization of the Viterbi algorithm (used for conventional HMMs) [116] and the Cocke-Younger-Kasami (CYK) algorithm (used for context-free grammars) [60]. Like these two algorithms, the SCA algorithm finds the optimal path in an iterative manner, first finding the optimal state sequences of short subsequences and then using this information to find the optimal state sequences of longer sequences.
The major differences between the SCA algorithm and these other algorithms are as follows. Firstly, the SCA algorithm can define and use subsequences that consist of multiple non-overlapping regions of the original sequence. Instead of using a fixed number of indices to designate a subsequence, the SCA algorithm uses a variable number of closed intervals to describe a subsequence. For example, given a sequence $x = x_1 x_2 \ldots x_L$, we can define its subsequences as follows:
\begin{align*}
\mathcal{N}_1 &= \{[1,3]\} & &\rightarrow & x(\mathcal{N}_1) &= x_1 x_2 x_3 \\
\mathcal{N}_2 &= \{[1,1],[5,7]\} & &\rightarrow & x(\mathcal{N}_2) &= x_1\, x_5 x_6 x_7 \\
\mathcal{N}_3 &= \{[1,2],[5,7],[9,9]\} & &\rightarrow & x(\mathcal{N}_3) &= x_1 x_2\, x_5 x_6 x_7\, x_9
\end{align*}
In general, if we have a set of $I$ non-overlapping intervals $\mathcal{N} = \{\mathbf{n}_1, \mathbf{n}_2, \ldots, \mathbf{n}_I\}$, where the intervals $\mathbf{n}_i = [n_i^\ell, n_i^r]$ satisfy
\[
n_i^r < n_j^\ell \quad \text{for } i < j, \tag{4.1}
\]
the corresponding subsequence $x(\mathcal{N})$ is defined as
\[
x(\mathcal{N}) = \underbrace{x_{n_1^\ell} \cdots x_{n_1^r}}_{\mathbf{n}_1} \; \underbrace{x_{n_2^\ell} \cdots x_{n_2^r}}_{\mathbf{n}_2} \; \cdots \; \underbrace{x_{n_I^\ell} \cdots x_{n_I^r}}_{\mathbf{n}_I}.
\]
This generalization significantly increases the number of ways in which the intermediate subsequences can be defined and extended during the iterative process of finding the optimal path.
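As a concrete illustration of this interval-set notation, the short Python sketch below extracts $x(\mathcal{N})$ from a sequence given a list of closed intervals. It is only a minimal sketch: the function name \texttt{subsequence} and the use of 1-based, inclusive coordinates are our own conventions, not part of the original formulation.

\begin{verbatim}
def subsequence(x, intervals):
    """Return x(N) for an ordered set of non-overlapping closed intervals.

    `x` is a string (e.g., an RNA sequence); `intervals` is a list of
    (left, right) pairs in 1-based, inclusive coordinates, as in (4.1).
    """
    # Check the ordering condition n_i^r < n_j^l for i < j.
    for (_, r_prev), (l_next, _) in zip(intervals, intervals[1:]):
        assert r_prev < l_next, "intervals must be ordered and non-overlapping"
    return "".join(x[l - 1:r] for (l, r) in intervals)

x = "AUCCGAAUCUCGGUCAGA"
print(subsequence(x, [(1, 3)]))                  # x1 x2 x3       -> "AUC"
print(subsequence(x, [(1, 1), (5, 7)]))          # x1 x5 x6 x7    -> "AGAA"
print(subsequence(x, [(1, 2), (5, 7), (9, 9)]))  # x1x2 x5x6x7 x9 -> "AUGAAC"
\end{verbatim}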
Secondly, the SCA algorithm explicitly specifies how the optimal paths of shorter subsequences should be extended and adjoined together to obtain the optimal path of a longer subsequence. As there are innumerable ways of defining the intermediate subsequences, the SCA algorithm cannot simply find the optimal path either left-to-right (as the Viterbi algorithm does) or inside-to-outside (as the CYK algorithm does). In practice, we have to define a model-dependent "adjoining order" that takes all the correlations in the given profile-csHMM into consideration. Following the specified order, we iteratively apply a set of sequence adjoining and extension rules until we obtain the optimal probability of the entire sequence $x$.
Before describing the algorithm, let us first define the notation. Let $K$ be the length of the profile-csHMM, i.e., the number of match states, and let $L$ be the length of the observation sequence $x$. We denote the emission probability of a symbol $x$ at a single-emission state or a pairwise-emission state $v$ as $e(x|v)$. The emission probability of a symbol $x_c$ at a context-sensitive state $w$ is denoted as $e(x_c|w, x_p)$, where $x_p$ is the symbol that was previously emitted at the corresponding pairwise-emission state. We denote the transition probability from state $v$ to state $w$ as $t(v, w)$. As before, we let $\mathcal{N} = \{\mathbf{n}_1, \ldots, \mathbf{n}_I\}$ be an ordered set of $I$ intervals, and we define $\mathcal{S} = \{\mathbf{s}_1, \ldots, \mathbf{s}_I\}$ to be an ordered set of state-pairs $\mathbf{s}_i = (s_i^\ell, s_i^r)$, where $s_i^\ell$ and $s_i^r$ denote the hidden states $y_{n_i^\ell}$ and $y_{n_i^r}$ at each end of the $i$th interval $\mathbf{n}_i = [n_i^\ell, n_i^r]$. Finally, we define $\alpha(\mathcal{N}, \mathcal{S})$ to be the maximum log-probability of the subsequence $x(\mathcal{N})$
\[
\alpha(\mathcal{N}, \mathcal{S}) = \max_{y(\mathcal{N})} \Big[ \log P\big(x(\mathcal{N}), y(\mathcal{N})\big) \Big],
\]
where its underlying state sequence
\[
y(\mathcal{N}) = \underbrace{y_{n_1^\ell} \cdots y_{n_1^r}}_{\mathbf{n}_1} \; \underbrace{y_{n_2^\ell} \cdots y_{n_2^r}}_{\mathbf{n}_2} \; \cdots \; \underbrace{y_{n_I^\ell} \cdots y_{n_I^r}}_{\mathbf{n}_I}
\]
satisfies $y_{n_i^\ell} = s_i^\ell$ and $y_{n_i^r} = s_i^r$ for all $i = 1, \ldots, I$. We also define the variables $\lambda_a(\mathcal{N}, \mathcal{S})$ and $\lambda_b(\mathcal{N}, \mathcal{S})$, which will be used later for tracing back the optimal state sequence $y^*$.
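In an implementation, these quantities can be organized as a dynamic-programming table indexed by the pair $(\mathcal{N}, \mathcal{S})$. The following Python sketch shows one possible representation; the type names and the convention of using \texttt{None} for $(\emptyset, \emptyset)$ are our own choices, not prescribed by the algorithm.

\begin{verbatim}
from typing import Dict, Optional, Tuple

Interval = Tuple[int, int]        # closed interval [n_l, n_r]
StatePair = Tuple[str, str]       # (s_l, s_r), e.g. ("M1", "M1")
Key = Tuple[Tuple[Interval, ...], Tuple[StatePair, ...]]   # the pair (N, S)

# Each entry stores (alpha(N,S), lambda_a(N,S), lambda_b(N,S));
# None plays the role of the empty pointer (emptyset, emptyset).
Entry = Tuple[float, Optional[Key], Optional[Key]]

table: Dict[Key, Entry] = {}

# Illustrative entry: N = {[2, 4]}, S = {(M1, M3)}, with a made-up log-probability.
key: Key = (((2, 4),), (("M1", "M3"),))
table[key] = (-3.7, None, None)
\end{verbatim}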
4.3.1 Initialization
Initially, we compute $\alpha(\mathcal{N}, \mathcal{S})$ for single bases and single base pairs.

(i) For a position $k$ $(1 \leq k \leq K)$ in the profile-csHMM where $M_k$ is a single-emission match state, we let
\[
\alpha\big(\{[n, n]\}, \{(M_k, M_k)\}\big) = \log e(x_n|M_k), \quad
\lambda_a\big(\{[n, n]\}, \{(M_k, M_k)\}\big) = (\emptyset, \emptyset), \quad
\lambda_b\big(\{[n, n]\}, \{(M_k, M_k)\}\big) = (\emptyset, \emptyset),
\]
for all positions $1 \leq n \leq L$. Similarly, we let
\[
\alpha\big(\{[n, n-1]\}, \{(D_k, D_k)\}\big) = 0, \quad
\lambda_a\big(\{[n, n-1]\}, \{(D_k, D_k)\}\big) = (\emptyset, \emptyset), \quad
\lambda_b\big(\{[n, n-1]\}, \{(D_k, D_k)\}\big) = (\emptyset, \emptyset),
\]
for all $1 \leq n \leq L + 1$.
(ii) For positions $j$ and $k$ $(1 \leq j < k \leq K)$ where $M_j$ is a pairwise-emission state and $M_k$ is the corresponding context-sensitive state, we let $\mathcal{N}_1 = \{[n, n], [m, m]\}$, $\mathcal{S}_1 = \{(M_j, M_j), (M_k, M_k)\}$ and compute
\[
\alpha(\mathcal{N}_1, \mathcal{S}_1) = \log e(x_n|M_j) + \log e(x_m|M_k, x_n), \quad
\lambda_a(\mathcal{N}_1, \mathcal{S}_1) = (\emptyset, \emptyset), \quad
\lambda_b(\mathcal{N}_1, \mathcal{S}_1) = (\emptyset, \emptyset),
\]
for all positions $1 \leq n < m \leq L$. Furthermore, we initialize the log-probability $\alpha(\mathcal{N}_2, \mathcal{S}_2)$ for $\mathcal{N}_2 = \{[n, n-1], [m, m-1]\}$ and $\mathcal{S}_2 = \{(D_j, D_j), (D_k, D_k)\}$ as follows
\[
\alpha(\mathcal{N}_2, \mathcal{S}_2) = 0, \quad
\lambda_a(\mathcal{N}_2, \mathcal{S}_2) = (\emptyset, \emptyset), \quad
\lambda_b(\mathcal{N}_2, \mathcal{S}_2) = (\emptyset, \emptyset),
\]
for all $n$ and $m$ $(1 \leq n \leq m \leq L + 1)$.
(iii) For single bases emitted at insert states, we initialize the log-probabilities as follows
\[
\alpha\big(\{[n, n]\}, \{(I_k, I_k)\}\big) = \log e(x_n|I_k), \quad
\lambda_a\big(\{[n, n]\}, \{(I_k, I_k)\}\big) = (\emptyset, \emptyset), \quad
\lambda_b\big(\{[n, n]\}, \{(I_k, I_k)\}\big) = (\emptyset, \emptyset),
\]
for $0 \leq k \leq K$ and $1 \leq n \leq L$.
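As an illustration, initialization rule (i) translates directly into table entries under the hypothetical $(\mathcal{N}, \mathcal{S})$-keyed dictionary sketched in the previous section. The emission table \texttt{emit[(state, symbol)]}, standing in for $e(x|v)$, is our own assumption about how the trained model parameters might be stored.

\begin{verbatim}
import math

def initialize_rule_i(table, x, single_emission_positions, emit):
    """Rule (i): single bases at single-emission match states M_k and the
    corresponding zero-length entries for the delete states D_k.

    `table` maps ((intervals), (state_pairs)) keys to (alpha, lam_a, lam_b);
    `single_emission_positions` lists the k for which M_k is single-emission;
    `emit[(state, symbol)]` stands in for e(symbol | state).
    """
    L = len(x)
    for k in single_emission_positions:
        Mk, Dk = f"M{k}", f"D{k}"
        # alpha({[n, n]}, {(M_k, M_k)}) = log e(x_n | M_k)
        for n in range(1, L + 1):
            key = (((n, n),), ((Mk, Mk),))
            table[key] = (math.log(emit[(Mk, x[n - 1])]), None, None)
        # alpha({[n, n-1]}, {(D_k, D_k)}) = 0 (empty subsequence)
        for n in range(1, L + 2):
            key = (((n, n - 1),), ((Dk, Dk),))
            table[key] = (0.0, None, None)
\end{verbatim}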
4.3.2 Adjoining subsequences
During the initialization process, we computed the log-probability for all subsequences that consist of a single base or a single base pair. Now, these subsequences can be recursively adjoined to obtain the probability of longer subsequences by applying the following adjoining rules.
Rule 1 Consider the log-probabilities $\alpha(\mathcal{N}_a, \mathcal{S}_a)$ and $\alpha(\mathcal{N}_b, \mathcal{S}_b)$ of the two subsequences $x(\mathcal{N}_a)$ and $x(\mathcal{N}_b)$, where
\[
\mathcal{N}_a = \{\mathbf{n}_1^a, \ldots, \mathbf{n}_{I_a}^a\}, \quad \mathcal{S}_a = \{\mathbf{s}_1^a, \ldots, \mathbf{s}_{I_a}^a\}, \quad
\mathcal{N}_b = \{\mathbf{n}_1^b, \ldots, \mathbf{n}_{I_b}^b\}, \quad \mathcal{S}_b = \{\mathbf{s}_1^b, \ldots, \mathbf{s}_{I_b}^b\}.
\]
We assume that there is no overlap between the symbol sequences $x(\mathcal{N}_a)$ and $x(\mathcal{N}_b)$, nor between the underlying state sequences $y(\mathcal{N}_a)$ and $y(\mathcal{N}_b)$. In this case, we can apply the following adjoining rule to compute the optimal log-probability of a longer sequence $x(\mathcal{N})$
\begin{align*}
\alpha(\mathcal{N}, \mathcal{S}) &= \alpha(\mathcal{N}_a, \mathcal{S}_a) + \alpha(\mathcal{N}_b, \mathcal{S}_b), \\
\lambda_a(\mathcal{N}, \mathcal{S}) &= (\mathcal{N}_a, \mathcal{S}_a), \\
\lambda_b(\mathcal{N}, \mathcal{S}) &= (\mathcal{N}_b, \mathcal{S}_b),
\end{align*}
where $\mathcal{N}$ and $\mathcal{S}$ are unions of the smaller sets
\[
\mathcal{N} = \mathcal{N}_a \cup \mathcal{N}_b = \{\mathbf{n}_1, \ldots, \mathbf{n}_I\}, \quad
\mathcal{S} = \mathcal{S}_a \cup \mathcal{S}_b = \{\mathbf{s}_1, \ldots, \mathbf{s}_I\},
\]
where $I = I_a + I_b$ and the intervals $\mathbf{n}_i$ are relabeled such that they satisfy (4.1) and $\mathbf{s}_i \in \mathcal{S}$ corresponds to $\mathbf{n}_i \in \mathcal{N}$.
Rule 2 Assume that there exist two intervals $\mathbf{n}_i, \mathbf{n}_{i+1} \in \mathcal{N}$ that satisfy $n_i^r + 1 = n_{i+1}^\ell$, which implies that the two intervals $[n_i^\ell, n_i^r]$ and $[n_{i+1}^\ell, n_{i+1}^r]$ are adjacent to each other. For simplicity, let us assume that $i = I - 1$. In this case, we can combine the two intervals $\mathbf{n}_{I-1}$ and $\mathbf{n}_I$ to obtain a larger interval
\[
\mathbf{n}_{I-1}' = [n_{I-1}^\ell, n_I^r] = \{n \mid n_{I-1}^\ell \leq n \leq n_I^r\},
\]
where the corresponding state-pair is $\mathbf{s}_{I-1}' = (s_{I-1}^\ell, s_I^r)$. Now, the log-probability $\alpha(\mathcal{N}', \mathcal{S}')$ for
\[
\mathcal{N}' = \{\mathbf{n}_1, \ldots, \mathbf{n}_{I-2}, \mathbf{n}_{I-1}'\}, \quad
\mathcal{S}' = \{\mathbf{s}_1, \ldots, \mathbf{s}_{I-2}, \mathbf{s}_{I-1}'\}
\]
can be computed as follows
\begin{align*}
\alpha(\mathcal{N}', \mathcal{S}') &= \max_{n_{I-1}^r} \left( \max_{s_{I-1}^r,\, s_I^\ell} \Big[ \alpha(\mathcal{N}, \mathcal{S}) + \log t(s_{I-1}^r, s_I^\ell) \Big] \right), \\
(n^*, s_r^*, s_\ell^*) &= \arg\max_{(n_{I-1}^r,\, s_{I-1}^r,\, s_I^\ell)} \Big[ \alpha(\mathcal{N}, \mathcal{S}) + \log t(s_{I-1}^r, s_I^\ell) \Big], \\
\mathcal{N}^* &= \big\{\mathbf{n}_1, \ldots, \mathbf{n}_{I-2}, [n_{I-1}^\ell, n^*], [n^* + 1, n_I^r]\big\}, \\
\mathcal{S}^* &= \big\{\mathbf{s}_1, \ldots, \mathbf{s}_{I-2}, (s_{I-1}^\ell, s_r^*), (s_\ell^*, s_I^r)\big\}, \\
\lambda_a(\mathcal{N}', \mathcal{S}') &= (\mathcal{N}^*, \mathcal{S}^*), \\
\lambda_b(\mathcal{N}', \mathcal{S}') &= (\emptyset, \emptyset).
\end{align*}
For $i < I - 1$, we can similarly combine the two adjacent intervals $\mathbf{n}_i$ and $\mathbf{n}_{i+1}$ to obtain the optimal log-probability $\alpha(\mathcal{N}', \mathcal{S}')$ of the updated sets $\mathcal{N}'$ and $\mathcal{S}'$.
For simplicity, we have described the adjoining process in two distinct steps: (i) adjoining non-overlapping subsequences and (ii) combining adjacent intervals in a single subsequence. However, in many cases, it is possible (and usually more convenient) to apply the two rules at the same time. For example, if we know the optimal probabilities of two adjacent subsequences, where each subsequence consists of a single interval, we can adjoin the two sequences and combine the two intervals to compute the optimal probability of a longer subsequence that also consists of a single interval.
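To make Rule 1 concrete, here is a minimal Python sketch under the same hypothetical table representation used earlier (entries are \texttt{(alpha, lambda\_a, lambda\_b)} triples keyed by the interval and state-pair tuples); the helper name \texttt{adjoin\_rule1} is ours.

\begin{verbatim}
def adjoin_rule1(table, key_a, key_b):
    """Rule 1: adjoin two non-overlapping subsequences.

    alpha(N, S)    = alpha(Na, Sa) + alpha(Nb, Sb),
    lambda_a(N, S) = (Na, Sa),  lambda_b(N, S) = (Nb, Sb),
    with the intervals of N relabeled so that they satisfy (4.1).
    """
    (intervals_a, states_a), (intervals_b, states_b) = key_a, key_b
    # Merge the two interval sets into left-to-right order, keeping each
    # state-pair attached to its interval.
    merged = sorted(zip(intervals_a + intervals_b, states_a + states_b),
                    key=lambda pair: pair[0][0])
    intervals, states = zip(*merged)
    key = (tuple(intervals), tuple(states))

    alpha = table[key_a][0] + table[key_b][0]
    # A full implementation keeps the maximum if the same (N, S)
    # can be produced in more than one way.
    if key not in table or alpha > table[key][0]:
        table[key] = (alpha, key_a, key_b)
    return key
\end{verbatim}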
4.3.3 Adjoining order
As mentioned earlier, when using the SCA algorithm, we have to specify the order in which the adjoining rules should be applied. This adjoining order can be obtained from the consensus RNA sequence that was used to construct the profile-csHMM. Based on the consensus sequence, we first find out how the bases and the base pairs in the given sequence can be adjoined one by one to obtain the entire sequence. During this procedure, we try to minimize the number of intervals needed to describe the intermediate subsequences, as a larger number of intervals leads to a higher computational cost for adjoining the subsequences. An example is shown in Figure 4.5, which illustrates how we can obtain the consensus sequence in Figure 4.2 (a) by sequentially adjoining its bases and base pairs. Note that the numbers inside the squares in Figure 4.5 indicate the original base positions.
Figure 4.5: The adjoining order of the profile-csHMM shown in Figure 4.2. This illustrates how we can adjoin the bases and the base pairs in the consensus RNA sequence to obtain the entire sequence.

We follow this order to compute the optimal log-probability of the subsequences (of the observed RNA sequence) that correspond to the subsequence (of the consensus sequence) at each step.
At each step, we first compute the log-probability $\alpha(\mathcal{N}, \mathcal{S})$ of those subsequences whose terminal states (i.e., $s_i^\ell$, $s_i^r$) do not contain insert states. For example, at STEP 1, we compute $\alpha(\mathcal{N}_1, \mathcal{S}_1)$ for $\mathcal{N}_1 = \{[n, n]\}$, $\mathcal{S}_1 = \{(M_3, M_3)\}$ and for $\mathcal{N}_1 = \{[n, n-1]\}$, $\mathcal{S}_1 = \{(D_3, D_3)\}$ using the initialization rules described in Section 4.3.1. Note that the states in $\mathcal{S}_1$ correspond to the third base position in the original consensus sequence. Similarly, we compute the log-probability $\alpha(\mathcal{N}_2, \mathcal{S}_2)$ at STEP 2 (and also $\alpha(\mathcal{N}_4, \mathcal{S}_4)$ at STEP 4) based on the base-pair initialization rules in Section 4.3.1. At certain steps, the optimal log-probability is obtained by combining the log-probabilities computed in the previous steps. For example, at STEP 3, we can compute $\alpha(\mathcal{N}_3, \mathcal{S}_3)$ for $\mathcal{N}_3 = \{[n_1, n_1], [n_2^\ell, n_2^r]\}$ and $\mathcal{S}_3 = \{(M_1, M_1), (M_3, M_4)\}$, by combining $\alpha(\mathcal{N}_1, \mathcal{S}_1)$ and $\alpha(\mathcal{N}_2, \mathcal{S}_2)$ as follows
\[
\alpha(\mathcal{N}_3, \mathcal{S}_3) = \max_{v} \Big[ \alpha(\mathcal{N}_1, \mathcal{S}_1) + \log t(v, M_4) + \alpha(\mathcal{N}_2, \mathcal{S}_2) \Big],
\]
where $\mathcal{N}_1 = \{[n_2^\ell, n_2^r - 1]\}$, $\mathcal{S}_1 = \{(M_3, v)\}$ and $\mathcal{N}_2 = \{[n_1, n_1], [n_2^r, n_2^r]\}$, $\mathcal{S}_2 = \{(M_1, M_1), (M_4, M_4)\}$. In the example in Figure 4.5, the log-probability $\alpha(\mathcal{N}_1, \mathcal{S}_1)$ is computed at STEP 1 and $\alpha(\mathcal{N}_2, \mathcal{S}_2)$ is computed at STEP 2.
After computing these log-probabilities, we move on to compute the log-probabilities of those subsequences that have one or more insertions at the beginning and/or end of some intervals. For example, at STEP 1, we can compute $\alpha(\mathcal{N}_1, \mathcal{S}_1)$ for $\mathcal{N}_1 = \{[n, n+1]\}$ and $\mathcal{S}_1 = \{(M_3, I_3)\}$ from
\[
\alpha(\mathcal{N}_1, \mathcal{S}_1) = \alpha(\mathcal{N}_{1a}, \mathcal{S}_{1a}) + \log t(M_3, I_3) + \alpha(\mathcal{N}_{1b}, \mathcal{S}_{1b}),
\]
where $\mathcal{N}_{1a} = \{[n, n]\}$, $\mathcal{S}_{1a} = \{(M_3, M_3)\}$, $\mathcal{N}_{1b} = \{[n+1, n+1]\}$, $\mathcal{S}_{1b} = \{(I_3, I_3)\}$. In a similar manner, we can also deal with left insertions as well as multiple insertions.
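For instance, the right-insertion extension above amounts to a single table update. Below is a minimal sketch under the same hypothetical representation, with \texttt{trans[(v, w)]} standing in for $t(v, w)$; recording the two constituent entries as back-pointers (mirroring Rule 1) is our own choice.

\begin{verbatim}
import math

def extend_with_right_insertion(table, trans, n):
    """alpha({[n, n+1]}, {(M3, I3)}) = alpha({[n, n]},     {(M3, M3)})
                                     + log t(M3, I3)
                                     + alpha({[n+1, n+1]}, {(I3, I3)})."""
    key_a = (((n, n),), (("M3", "M3"),))
    key_b = (((n + 1, n + 1),), (("I3", "I3"),))
    key   = (((n, n + 1),), (("M3", "I3"),))
    alpha = (table[key_a][0]
             + math.log(trans[("M3", "I3")])
             + table[key_b][0])
    table[key] = (alpha, key_a, key_b)
    return key
\end{verbatim}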
4.3.4 Termination
By iteratively applying the adjoining rules as described in Section 4.3.2, we can ultimately obtain the log-probability $\alpha(\mathcal{N}, \mathcal{S})$ for $\mathcal{N} = \{[1, L]\}$ and $\mathcal{S} = \{(s_\ell, s_r)\}$, for all $s_\ell \in \{I_0, M_1, D_1\}$ and $s_r \in \{I_K, M_K, D_K\}$. Based on this result, the log-probability of the optimal state sequence $y^*$ can be computed from
\begin{align*}
\log P(x, y^*) &= \max_{y} \Big[ \log P(x, y) \Big] \\
&= \max_{s_\ell, s_r} \Big[ \log t(\text{START}, s_\ell) + \alpha(\mathcal{N}, \mathcal{S}) + \log t(s_r, \text{END}) \Big], \\
(s_\ell^*, s_r^*) &= \arg\max_{(s_\ell, s_r)} \Big[ \log t(\text{START}, s_\ell) + \alpha(\mathcal{N}, \mathcal{S}) + \log t(s_r, \text{END}) \Big], \\
\lambda^* &= \big(\{[1, L]\}, \{(s_\ell^*, s_r^*)\}\big).
\end{align*}
Note that $t(\text{START}, s_\ell)$ is the probability that the profile-csHMM begins at the state $s_\ell$, and $t(s_r, \text{END})$ is the probability that the model terminates after $s_r$.
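Under the same hypothetical table representation, the termination step is a small maximization over the allowed boundary states. In the sketch below, the state labels and the \texttt{trans} dictionary (standing in for $t(\cdot,\cdot)$) are assumptions about how the model might be stored.

\begin{verbatim}
import math

def terminate(table, trans, L, K):
    """Return (log P(x, y*), lambda*) by maximizing over the boundary states
    s_l in {I0, M1, D1} and s_r in {IK, MK, DK} for N = {[1, L]}."""
    best_score, best_key = float("-inf"), None
    for s_l in ("I0", "M1", "D1"):
        for s_r in (f"I{K}", f"M{K}", f"D{K}"):
            key = (((1, L),), ((s_l, s_r),))
            if key not in table:
                continue
            score = (math.log(trans[("START", s_l)])
                     + table[key][0]
                     + math.log(trans[(s_r, "END")]))
            if score > best_score:
                best_score, best_key = score, key
    return best_score, best_key   # best_key plays the role of lambda*
\end{verbatim}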
4.3.5 Trace-back
Now that we have obtained the maximum probability of the sequence $x$, we can easily trace back the algorithm to find the optimal path that gave rise to this probability. For notational convenience, let us define $\lambda_t = (\mathcal{N}, \mathcal{S})$. We also need a stack $T$ during the process. The trace-back procedure can be described as follows.
STEP 1 Let $y_i = 0$ $(i = 1, 2, \ldots, L)$.

STEP 2 Push $\lambda^*$ onto the stack $T$.

STEP 3 Pop $\lambda_t = (\mathcal{N}, \mathcal{S})$ from $T$. If $\lambda_t = (\emptyset, \emptyset)$, go to STEP 6. Otherwise, proceed to STEP 4.

STEP 4 If $\lambda_a(\lambda_t) \neq (\emptyset, \emptyset)$, push $\lambda_a(\lambda_t)$ onto $T$. Otherwise, let $y_{n_i^\ell} = s_i^\ell$ for all $\mathbf{n}_i = [n_i^\ell, n_i^r] \in \mathcal{N}$ and the corresponding $\mathbf{s}_i = (s_i^\ell, s_i^r) \in \mathcal{S}$. (Note that when $\lambda_a(\lambda_t) = (\emptyset, \emptyset)$, we have $n_i^\ell = n_i^r$ and $s_i^\ell = s_i^r$.)

STEP 5 If $\lambda_b(\lambda_t) \neq (\emptyset, \emptyset)$, push $\lambda_b(\lambda_t)$ onto $T$.

STEP 6 If $T$ is empty, proceed to STEP 7. Otherwise, go to STEP 3.

STEP 7 Let $y^* = y_1 y_2 \ldots y_L$ and terminate.
At the end of the trace-back procedure, we obtain the optimal state sequence $y^*$ that maximizes the probability of observing $x$ based on the profile-csHMM at hand.
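The trace-back itself is a short stack-driven loop over the $\lambda$ pointers. The sketch below follows STEPs 1-7 under the hypothetical table representation used earlier (entries are \texttt{(alpha, lambda\_a, lambda\_b)} triples and \texttt{None} plays the role of $(\emptyset, \emptyset)$).

\begin{verbatim}
def trace_back(table, lam_star, L):
    """Recover the optimal state sequence y* from the lambda pointers."""
    y = [None] * (L + 1)            # STEP 1 (1-based positions; y[0] unused)
    stack = [lam_star]              # STEP 2: push lambda*
    while stack:                    # STEPs 3 and 6
        key = stack.pop()
        if key is None:             # lambda_t = (emptyset, emptyset)
            continue
        _, lam_a, lam_b = table[key]
        if lam_a is not None:       # STEP 4
            stack.append(lam_a)
        else:
            # lambda_a = (emptyset, emptyset): every interval covers a single
            # position (or is empty, for delete states), so read the states
            # directly off the key.
            intervals, states = key
            for (n_l, n_r), (s_l, _) in zip(intervals, states):
                if n_l <= n_r:
                    y[n_l] = s_l
        if lam_b is not None:       # STEP 5
            stack.append(lam_b)
    return y[1:]                    # STEP 7: y* = y_1 ... y_L
\end{verbatim}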
4.3.6 Computational complexity
The computational complexity of the SCA algorithm is not fixed, and it depends on the structure of the RNA that is represented by the given profile-csHMM. For example, for a simple RNA with only one stem-loop, the computational complexity will be $O(L^2K)$, where $L$ is the length of the unfolded RNA sequence (the "target" sequence) and $K$ is the length of the consensus RNA sequence (the "reference" sequence) with a known structure. Note that $K$ is proportional to the number of states in the profile-csHMM. The complexity for aligning RNAs with multiple stem-loops will be $O(L^3K)$, and the complexity for aligning most known pseudoknots will be $O(L^4K)$.² In general, the complexity of the SCA algorithm for analyzing a given RNA secondary structure is identical to or less than that of an existing algorithm (such as the Viterbi, CYK, and PSTAG algorithms) that can be used to analyze the same structure.