Figure 2.19: The csHMM in the Chomsky hierarchy. (Figure labels: SRG, SCFG, SCSG, csHMM.)
interactions, they cannot directly generate the crossing interactions in the symbol sequence. For example, when modeling the copy language, the crossing dependencies between symbol pairs cannot be directly generated [25]. Instead, the grammar generates the two related non-terminals in a non-crossing manner, and applies the context-sensitive re-ordering rules later on, in order to obtain the final sequence that has crossing correlations. For this reason, context-sensitive grammars can be quite complex even for simple languages.
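The difference between crossing and non-crossing dependencies can be made concrete with a small sketch. The Python snippet below is only an illustration and is not taken from the original text: the function names are hypothetical, and the palindrome-style string (w followed by its reverse) is introduced here as an assumed example of nested dependencies, for contrast with the copy language (w followed by w).

```python
def copy_pairs(n):
    """Dependent position pairs in a copy-language string w + w, where len(w) == n."""
    # Symbol i in the first half must match symbol i in the second half.
    return [(i, i + n) for i in range(n)]


def palindrome_pairs(n):
    """Dependent position pairs in w + reversed(w), where len(w) == n (assumed contrast case)."""
    # Symbol i must match its mirror image counted from the end.
    return [(i, 2 * n - 1 - i) for i in range(n)]


def has_crossing(pairs):
    """Return True if any two pairs (i, j) and (k, l) interleave as i < k < j < l."""
    return any(i < k < j < l for (i, j) in pairs for (k, l) in pairs)


n = 3
print(copy_pairs(n), has_crossing(copy_pairs(n)))              # crossing dependencies
print(palindrome_pairs(n), has_crossing(palindrome_pairs(n)))  # nested dependencies
```

For n = 3, the copy-language pairs (0, 3), (1, 4), (2, 5) interleave, whereas the palindrome pairs (0, 5), (1, 4), (2, 3) are strictly nested. The interleaved pattern is the one that the grammar cannot generate directly.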
Chapter 3
RNA Sequence Analysis Using Context-Sensitive HMMs
The central dogma of molecular biology states that genetic information flows from DNA to RNA to protein. This dogma has exerted a substantial influence on our understanding of the genetic activities in cells. Under this influence, the prevailing assumption until the recent past was that genes are basically repositories for protein-coding information, and that proteins are responsible for most of the important biological functions in all cells. Meanwhile, the importance of RNAs remained rather obscure, and RNA was mainly viewed as a passive intermediary that bridges the gap between DNA and protein. Except for classic examples such as tRNAs (transfer RNAs) and rRNAs (ribosomal RNAs), functional noncoding RNAs were considered to be rare.
However, this view has experienced a dramatic change during the last decade, as systematic screening of various genomes identified myriads of noncoding RNAs (ncRNAs), which are RNA molecules that function without being translated into proteins [30, 44]. It has been realized that many ncRNAs play important roles in various biological processes. As RNAs can interact with other RNAs and DNAs in a sequence-specific manner, they are especially useful in tasks that require highly specific nucleotide recognition [30]. Good examples are the miRNAs (microRNAs) that regulate gene expression by targeting mRNAs (messenger RNAs) [5, 52], and the siRNAs (small interfering RNAs) that take part in the RNAi (RNA interference) pathways for gene silencing [39, 68, 69]. Recent developments show that ncRNAs are extensively involved in many gene regulatory mechanisms [34, 46].
The roles of ncRNAs known to this day are truly diverse. These include transcription and translation control, chromosome replication, RNA processing and modification, and protein degradation and translocation [44], just to name a few. These days, it is even claimed that ncRNAs dominate the genomic output of the higher organisms such as mammals, and it is being suggested that the greater portion of their genome (which does not encode proteins) is dedicated to the control and regulation of cell development [67].
As more and more evidence piles up, greater attention is being paid to ncRNAs, which have been neglected for a long time.1 Researchers have begun to realize that the vast majority of the genome that was regarded as “junk,” mainly because it was not well understood, may indeed hold the key to the best-kept secrets of life, such as the mechanism of alternative splicing and the control of epigenetic variations [67]. The complete range and extent of the roles of ncRNAs are not yet clear, but it is certain that a comprehensive understanding of cellular processes is not possible without understanding the functions of ncRNAs [119].
Although several systematic searches for ncRNAs in recent years have unveiled a large number of novel ncRNAs, it is believed that there are still numerous ncRNAs waiting to be discovered [30, 44, 67]. Typical estimates of the number of ncRNAs in the human genome are on the order of tens of thousands [67, 120], but the present genome annotation of ncRNAs is too incomplete to derive a more accurate estimate. As a result of several genome sequencing projects, including the human genome project that has been completed very recently [105], a huge amount of genomic data is publicly available these days. Given the vast amount of genomic data, it is practically impossible to identify all ncRNAs solely by experimental means. In order to expedite the annotation process, we need computational methods that can be used for identifying novel ncRNAs.
In this chapter, we describe how the context-sensitive HMMs, which were proposed in Chapter 2, can be used for the computational identification and analysis of ncRNAs. We focus on how to build probabilistic representations of RNA families based on csHMMs, and show how they can be used to identify new ncRNA genes, which are portions of DNA that give rise to ncRNA transcripts.
The main emphasis of the discussion lies on the method that can be used for finding new members (or homologues) of known ncRNA families.
The content of this chapter is mainly drawn from [130, 126, 132] and portions of it have been presented in [123, 125].
1 In 2006, the Nobel Prize in Physiology or Medicine was awarded to A. Z. Fire and C. C. Mello for their discovery of “RNA interference (RNAi)–the gene silencing mechanism by double-stranded RNA (dsRNA)” [39].
3.1 Outline
This chapter is organized as follows. In Section 3.2, we show that RNA secondary structures play important roles in carrying out the functions of ncRNAs, and hence that many ncRNAs have well-conserved secondary structures.
In Section 3.3, we consider the problem of searching for homologous RNAs. We give an overview of sequence-based homology search methods in Section 3.3.1. In Section 3.3.2, we show that building an RNA homology search tool with good prediction accuracy requires more powerful statistical models that can reasonably combine the contributions of structural similarity and sequence similarity.
In Section 3.4, we show how the context-sensitive HMMs (csHMMs) introduced in Chapter 2 can be utilized in an RNA homology search. In Section 3.4.1, we first give several examples of csHMMs that represent various RNA secondary structures. A database search algorithm that can be used with csHMMs is introduced in Section 3.4.2. Experimental results are given in Section 3.4.3, which demonstrate the effectiveness of the csHMM-based search method.
RNAs with alternative secondary structures are considered in Section 3.5. We show in Section 3.5.1 how csHMMs can be used to represent the base correlations in RNAs with alternative folding. In Section 3.5.2, we show experimental results which indicate that the proposed approach can effectively discriminate, at a low computational cost, between the RNAs that can alternatively fold and the RNAs that cannot.
In Section 3.6, we briefly mention the problem of identifying novel ncRNAs, and we conclude the chapter in Section 3.7.