1.4 Review of some fundamentals in genomics
1.4.2 Protein synthesis
A protein is a complex biomolecule that consists of a long chain of amino acids. The amino acids are linked to each other by strong covalent bonds called peptide bonds, and the amino acid chain is also known as a polypeptide. There are 20 different kinds of amino acids in proteins, each with a different side chain. Therefore, a protein can be conveniently represented as a sequence of amino acids, where each of the 20 distinct amino acids is denoted by a 3-letter code or a 1-letter code. For example, the amino acid alanine is denoted by ‘Ala’ or ‘A,’ and cysteine is denoted by ‘Cys’ or ‘C.’
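The sequence representation described above can be sketched in a few lines of Python. This is only an illustration: the names `ONE_TO_THREE` and `to_three_letter` are hypothetical helpers introduced here, and the table lists the standard 1-letter and 3-letter codes of the 20 amino acids.

```python
# Standard 1-letter -> 3-letter amino acid codes (20 entries).
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(protein: str) -> str:
    """Convert a protein written in 1-letter codes to hyphen-separated 3-letter codes.

    Raises KeyError if the string contains a letter that is not one of
    the 20 standard amino acid codes.
    """
    return "-".join(ONE_TO_THREE[residue] for residue in protein.upper())


if __name__ == "__main__":
    # A short, made-up peptide: alanine followed by cysteine.
    print(to_three_letter("AC"))
```

In practice the 1-letter form is the one typically stored and compared, since a protein then becomes an ordinary string over a 20-letter alphabet.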
Proteins are involved in virtually every biological process in the cell, and they therefore play a crucial role in all living organisms. The information that is needed for encoding proteins is stored in the DNA.
Portions of the DNA that contain the information for producing proteins are called protein-coding genes, or often simply genes.² Each gene in the DNA is first copied into an RNA molecule (transcription), which is then used to produce proteins (translation). Therefore, it can be said that genetic information flows from DNA to RNA to protein. This basic principle is typically called the central dogma of molecular biology [1], and it explains how the genetic instructions contained in the DNA are used to synthesize RNAs and proteins. Figure 1.9 illustrates this principle in a simple diagram.
The main steps in a typical protein synthesis process are shown in Figure 1.10. Each step in the process is discussed in the following subsections.
² Note that there also exist ncRNA (noncoding RNA) genes, which are portions of DNA that give rise to functional RNAs that are not translated into proteins.
[Diagram: DNA -> RNA by RNA synthesis (transcription); RNA -> Protein by protein synthesis (translation).]
Figure 1.9: The central dogma of molecular biology states that the genetic information flows from DNA to RNA to protein.
1.4.2.1 Transcription
The process of copying the content of a gene into an RNA is called transcription. The transcription process is carried out by an enzyme called RNA polymerase, where an enzyme is a protein that catalyzes a specific chemical reaction. Initially, the RNA polymerase binds to a special region in the DNA called the promoter, which is located upstream of a gene and designates the starting point of the transcription process. During transcription, the RNA polymerase uses one strand of the DNA (called the template strand) to copy the content into an RNA molecule. While copying the content from DNA to RNA, each thymine (T) in the original DNA sequence is replaced by a uracil (U) in the RNA that is being synthesized. The resulting transcript of a protein-coding gene is called a pre-mRNA (pre-messenger RNA).
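At the level of sequences, transcription can be sketched as a simple string operation. The following Python fragment is a minimal illustration, not a model of the enzymatic process. It assumes that all sequences are written 5’ to 3’, and it uses the term coding strand (not introduced above) for the DNA strand complementary to the template strand; the function names are hypothetical.

```python
# Base-pairing rules used when an RNA is synthesized from a DNA template:
# each DNA base is paired with its complementary RNA base.
DNA_TO_RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe_coding(coding_strand: str) -> str:
    """Return the RNA for a coding (non-template) DNA strand written 5'->3'.

    The RNA has the same sequence as the coding strand, with T replaced by U.
    """
    return coding_strand.upper().replace("T", "U")


def transcribe_template(template_strand: str) -> str:
    """Return the RNA for a template DNA strand written 5'->3'.

    The RNA is complementary and antiparallel to the template, so the
    template is read in reverse and each base is complemented.
    """
    return "".join(DNA_TO_RNA_COMPLEMENT[base]
                   for base in reversed(template_strand.upper()))


if __name__ == "__main__":
    coding = "ATGGCC"      # made-up example sequence
    template = "GGCCAT"    # reverse complement of the coding strand
    print(transcribe_coding(coding))
    print(transcribe_template(template))
```

Both calls describe the same transcript from two points of view: substituting U for T in the coding strand gives the same RNA as complementing the template strand.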
Living organisms can be categorized into two types, namely, prokaryotes and eukaryotes. Prokaryotes are simple organisms (mostly unicellular) that do not have a cell nucleus. Bacteria are common examples of prokaryotes. On the other hand, eukaryotes are organisms that have complex cells with membrane-bound nuclei. Many of them are multicellular, and higher organisms such as worms, plants, insects, and mammals are eukaryotes. Most protein-coding genes in eukaryotes consist of two types of regions called exons and introns (see Figure 1.10).³ The introns are removed from the pre-mRNA and the remaining exons are concatenated to form an mRNA (messenger RNA). This process is called splicing. Sometimes, one pre-mRNA gives rise to multiple mRNAs
³ The protein-coding genes of prokaryotes do not have introns.
[Diagram panels: (a) DNA containing Gene A, Gene B, and Gene C. (b) Pre-mRNA obtained by transcription: 5’ UTR, Exon 1, Intron, Exon 2, Intron, Exon 3, 3’ UTR. (c) mRNAs obtained by splicing: mRNA 1 (Exon 1, Exon 2) and mRNA 2 (Exon 1, Exon 3). (d) Proteins obtained by translation: Protein 1 and Protein 2.]
Figure 1.10: Illustration of a typical protein synthesis process.
UUU : Phenylalanine   UUC : Phenylalanine   UUA : Leucine         UUG : Leucine
CUU : Leucine         CUC : Leucine         CUA : Leucine         CUG : Leucine
AUU : Isoleucine      AUC : Isoleucine      AUA : Isoleucine      AUG : Methionine, Start
GUU : Valine          GUC : Valine          GUA : Valine          GUG : Valine
UCU : Serine          UCC : Serine          UCA : Serine          UCG : Serine
CCU : Proline         CCC : Proline         CCA : Proline         CCG : Proline
ACU : Threonine       ACC : Threonine       ACA : Threonine       ACG : Threonine
GCU : Alanine         GCC : Alanine         GCA : Alanine         GCG : Alanine
UAU : Tyrosine        UAC : Tyrosine        UAA : Stop            UAG : Stop
CAU : Histidine       CAC : Histidine       CAA : Glutamine       CAG : Glutamine
AAU : Asparagine      AAC : Asparagine      AAA : Lysine          AAG : Lysine
GAU : Aspartic acid   GAC : Aspartic acid   GAA : Glutamic acid   GAG : Glutamic acid
UGU : Cysteine        UGC : Cysteine        UGA : Stop            UGG : Tryptophan
CGU : Arginine        CGC : Arginine        CGA : Arginine        CGG : Arginine
AGU : Serine          AGC : Serine          AGA : Arginine        AGG : Arginine
GGU : Glycine         GGC : Glycine         GGA : Glycine         GGG : Glycine
Figure 1.11: The genetic code.
by combining different exons. This phenomenon is called alternative splicing, and it is widely observed in eukaryotes.
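Splicing and alternative splicing can be sketched as the selection and concatenation of substrings. The Python fragment below is a simplified illustration under stated assumptions: the pre-mRNA sequence and the exon coordinates are made up for the example, exon boundaries are given as known inputs rather than detected from the sequence, and the function name `splice` is hypothetical. The two isoforms mirror the exon combinations of Figure 1.10 (Exons 1 and 2, and Exons 1 and 3).

```python
def splice(pre_mrna, exons, keep=None):
    """Concatenate selected exons of a pre-mRNA into an mRNA.

    pre_mrna : RNA sequence as a string
    exons    : list of (start, end) pairs, 0-based, end exclusive,
               listed in 5'->3' order
    keep     : indices of the exons to retain; all exons if None
    """
    if keep is None:
        keep = range(len(exons))
    return "".join(pre_mrna[exons[i][0]:exons[i][1]] for i in keep)


# Made-up pre-mRNA: exon 1, intron, exon 2, intron, exon 3 (6 bases each).
pre_mrna = "AUGGCC" + "GUAAGU" + "AAAUUU" + "GUCCAG" + "GGGUAA"
exons = [(0, 6), (12, 18), (24, 30)]

if __name__ == "__main__":
    print(splice(pre_mrna, exons))               # all three exons
    print(splice(pre_mrna, exons, keep=[0, 1]))  # exons 1 and 2
    print(splice(pre_mrna, exons, keep=[0, 2]))  # exons 1 and 3
```

Representing exons as coordinate pairs keeps the pre-mRNA itself unchanged, so any number of alternative isoforms can be derived from the same transcript.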
1.4.2.2 Translation
During the translation process, the mRNA that was transcribed from DNA is decoded by the ribosome and tRNAs (transfer RNAs) to generate a polypeptide (or a protein). A polypeptide is a long sequence of amino acids that are interconnected via peptide bonds. The translation of mRNAs into proteins is governed by the genetic code, which maps each of the 64 codons (triplets of nucleotides) to one of the 20 different amino acids or to a stop signal. Figure 1.11 shows the genetic code that holds true for most genes in the vast majority of organisms. However, deviations from the standard code shown in Figure 1.11 are also widespread. For example, in several human mitochondrial mRNAs, the triplet ‘UGA’ was observed to code for tryptophan instead of serving as a stop codon [11].
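The decoding step can be sketched as a table lookup over consecutive codons. The Python fragment below is a minimal illustration, assuming the standard code of Figure 1.11 written in 1-letter amino acid codes with ‘*’ for a stop codon. It makes two simplifying assumptions that are not part of the text: translation starts at the first ‘AUG’ found in the sequence, and the mitochondrial variant changes only the ‘UGA’ codon mentioned above. The names `CODON_TABLE`, `MITO_UGA_TABLE`, and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G
# (first base varies slowest, third base fastest). '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # first base U
               "LLLLPPPPHHQQRRRR"   # first base C
               "IIIMTTTTNNKKSSRR"   # first base A
               "VVVVAAAADDEEGGGG")  # first base G
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Illustrative variant: only the UGA -> tryptophan deviation is modeled.
MITO_UGA_TABLE = {**CODON_TABLE, "UGA": "W"}


def translate(mrna, table=CODON_TABLE):
    """Translate an mRNA into a protein string of 1-letter codes.

    Simplification: reading starts at the first AUG and proceeds codon by
    codon until a stop codon or the end of the sequence is reached.
    """
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = table[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AUGGCCAAAUUUGGGUAA"))
    print(translate("AUGUGAUAA"))
    print(translate("AUGUGAUAA", table=MITO_UGA_TABLE))
```

Passing the code table as a parameter makes the dependence on the genetic code explicit: the same mRNA can yield different polypeptides under the standard table and under a variant table.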
For a comprehensive introduction to genomics and cell biology, see [1, 11].
[Structure drawings omitted. Primary sequences shown in the figure: (a) 5’ ACGAAACGUCCAAAGCUUG 3’; (b) 5’ UUCGAAAGCUCGAAAAGGCU 3’. Labels in the drawings: stem, loop, stem-loops, pseudoknot.]
Figure 1.12: Two examples of RNAs with secondary structures. The primary sequence of each RNA is shown along with its structure after folding. The dashed lines indicate interactions between bases. (a) RNA with two stem-loops. (b) RNA with a pseudoknot.