8. Introduction to genomics
8.5. Some results of the HGP
The sequencing of the human genome has brought several interesting, sometimes unexpected results. Our knowledge about it has been continuously expanding since then, and will keep expanding for decades. Perhaps one of the most unexpected results was that the human genome contains hardly more than 20 thousand genes. Originally, most experts estimated the number of genes to be around 100 thousand. The following story shows how unexpected this result was for the experts: “Between 2000 and 2003, a light-hearted betting pool known as “GeneSweep” was run in which genome researchers could guess at the number of genes in the human genome. A bet placed in 2000 cost $1, but this rose to $5 in 2001 and $20 in 2002 as information about the human genome sequence increased. One had to physically enter the bet in a ledger at Cold Spring Harbor, and all told 165 bets were registered.
Bets ranged from 25,497 to 153,438 genes, with a mean of 61,710...” (Source: http://www.genomicron.evolverzone.com/2007/05/human-gene-number-surprising-at-first/). Thus, the lowest bet, 25,497 genes, ultimately won, although it was still higher than the actual gene number, which is around 21 thousand. Here are some statistical data about the human genome, updated in July 2012:
Base Pairs: 3,300,551,249
Golden Path Length: 3,101,804,739
Genebuild last updated/patched: Oct 2012
Gene counts
Coding genes (a known gene is an Ensembl gene for which at least one known transcript has been annotated): 20,476
Non-coding genes: 22,170
Pseudogenes (a noncoding sequence similar to an active protein): 13,322
Gene exons (the part of the genomic sequence that remains in the transcript (mRNA) after introns have been spliced out): 700,947
Gene transcripts (nucleotide sequence resulting from the transcription of the genomic DNA to mRNA; one gene can have different transcripts or splice variants resulting from the alternative splicing of different exons): 194,015
Other
Short Variants (SNPs, indels, somatic mutations): 54,418,495
Structural variants: 9,235,137
Table 8.1. Some statistical data about the human genome.
Source: http://www.ensembl.org/Homo_sapiens/Info/StatsTable
Some interesting results on the human genome, corrected with newer data, e.g. from the 1000 Genomes Project:
Largest gene: DMD, which codes for dystrophin; size: 2,224,919 bases; location: Xp21.2
Longest coding sequence: TTN, which codes for titin; coding sequence: 104,076 bp; 34,692 amino acids
Longest exon: TTN; 17,106 bp
Most exons: TTN; 351
20% of the genome is gene desert (a region >500 kbp without a gene)
Gene-rich chromosomes: 17, 19, 22 (the richest is chromosome 19, with 1,484 genes and 25.10 genes/Mb)
Gene-poor chromosomes: Y, 4, 13, 18, and X; the poorest is the Y chromosome, with 72 genes and ~1.2 genes/Mb
98.12% of the introns start with GT at the 5’ end and end with AG at the 3’ end; 0.76% are GC-AG
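Some of the figures above can be checked with simple arithmetic, and the intron boundary rule is easy to express in code. The following is a minimal sketch only: the function names are hypothetical, the example intron strings are made up, and the chromosome lengths it derives are merely what the quoted gene counts and densities imply, not independently sourced values.

```python
def residues_from_cds(cds_bp: int) -> int:
    """Number of codons in a coding sequence of the given length (bp)."""
    if cds_bp % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    return cds_bp // 3


def implied_span_mb(gene_count: int, genes_per_mb: float) -> float:
    """Sequence length (Mb) implied by a gene count and a gene density."""
    return gene_count / genes_per_mb


def classify_intron(intron: str) -> str:
    """Label an intron by its first two (5' end) and last two (3' end) bases."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to have distinct boundary dinucleotides")
    donor, acceptor = seq[:2], seq[-2:]
    if (donor, acceptor) == ("GT", "AG"):
        return "GT-AG"
    if (donor, acceptor) == ("GC", "AG"):
        return "GC-AG"
    return "other"


if __name__ == "__main__":
    # TTN: 104,076 bp of coding sequence, three bases per codon.
    print("TTN codons:", residues_from_cds(104_076))
    # Chromosome 19: 1,484 genes at 25.10 genes/Mb.
    print("chr19 implied span (Mb):", round(implied_span_mb(1484, 25.10), 2))
    # Made-up intron sequences, used only to exercise the classifier.
    for intron in ("GTAAGTCCAG", "GCAAGTTTAG", "ATATCCAC"):
        print(intron, "->", classify_intron(intron))
```

The codon count for TTN equals the 34,692 given in the list, which confirms that the two TTN figures are consistent with each other.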
Recombination is more frequent in females than in males, but the number of mutations is higher in male meiosis, which means that the majority of mutations originate from males.
Every newborn receives about 60 new mutations from the parents.
Every individual has 250-300 loss-of-function mutations in annotated genes, 50-100 of which are in genes involved in Mendelian diseases. This shows, among other things, why it is so dangerous when the parents are relatives. The closer the kinship, the higher the probability that the child receives two mutated copies of the same gene, resulting in recessive diseases, or even multigenic syndromes.
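The effect of kinship can be made quantitative with the inbreeding coefficient F, the probability that the two copies of a gene in a child are identical by descent. The sketch below uses Wright's path-counting formula, which is not part of the text above and is brought in here as an assumption: each common ancestor of the parents contributes (1/2)^(n1+n2+1), where n1 and n2 are the numbers of generations from each parent back to that ancestor, and the ancestors themselves are assumed not to be inbred. The function names and the allele frequency of 1% are purely illustrative.

```python
from fractions import Fraction


def inbreeding_coefficient(paths):
    """Wright's path formula, assuming non-inbred common ancestors.

    paths: one (n1, n2) tuple per common ancestor of the two parents,
    giving the generations from each parent back to that ancestor.
    """
    return sum(
        (Fraction(1, 2) ** (n1 + n2 + 1) for n1, n2 in paths),
        Fraction(0),
    )


def homozygous_probability(q, f):
    """Probability that a child carries two copies of a recessive allele
    with population frequency q, given inbreeding coefficient f."""
    return q * q * (1 - f) + q * f


# Parental relationships, each described by its common ancestors.
RELATIONSHIPS = {
    "unrelated": [],
    "second cousins": [(3, 3), (3, 3)],
    "first cousins": [(2, 2), (2, 2)],
    "uncle-niece": [(1, 2), (1, 2)],
    "full siblings": [(1, 1), (1, 1)],
}

if __name__ == "__main__":
    q = Fraction(1, 100)  # illustrative allele frequency, not from the text
    for name, paths in RELATIONSHIPS.items():
        f = inbreeding_coefficient(paths)
        p = homozygous_probability(q, f)
        print(f"{name:15s} F = {str(f):5s} P(affected) = {float(p):.6f}")
```

Under these assumptions, a child of first cousins has F = 1/16, and for an allele with a frequency of 1% the probability of being homozygous is about seven times higher than for a child of unrelated parents.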
46% of the human genome consists of repeats. Many of them are transposons, i.e. jumping genes, inactivated about 40 million years ago. The most frequent repeats are called Alu elements, which occupy 10.6% of the genome.
Several hundred human genes originate from bacteria, through horizontal gene transfer.
There are long repeated regions in the pericentromeric and subtelomeric regions.
At present 156 imprinted genes are known. Imprinting is a genetic phenomenon by which certain genes are expressed in a parent-of-origin-specific manner.
Appropriate expression of imprinted genes is important for normal development, with numerous genetic diseases associated with imprinting defects including Beckwith–Wiedemann syndrome, Silver–Russell syndrome, Angelman syndrome and Prader–Willi syndrome. 56% of these genes are maternally, 44% are paternally imprinted (http://www.geneimprint.com/site/genes-by-species;
http://en.wikipedia.org/wiki/Genomic_imprinting).
There are 27,000-29,000 CpG islands.
CpG islands or CG islands are genomic regions that contain a high frequency of CpG sites (http://en.wikipedia.org/wiki/CpG_island). The "p" in CpG refers to the phosphodiester bond between the cytosine and the guanine, which indicates that the C and the G are next to each other in sequence. 99% of the methylation occurs at CG dinucleotides, which influence the transcription of the nearby genes, and play important roles in genetic regulation, imprinting and cell differentiation.
The methylation occurs on the cytosine. About 70% of human promoters have a
high CpG content. In the ENCODE project it was found that 96% of CpGs exhibited differential methylation in at least one cell type or tissue assayed, and levels of DNA methylation correlated with chromatin accessibility. Methylation in the promoter reduces the expression of a gene, whereas methylation in the gene body increases it.
In stem cells 25% of the methylation occurs in CA, instead of CG.
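How a region can be recognised as CpG-rich is illustrated by the following minimal sketch. It computes the GC content and the ratio of observed to expected CpG dinucleotides for a sequence, and then applies thresholds that are commonly used for CpG islands (length of at least 200 bp, GC content above 50%, observed/expected ratio above 0.6). These thresholds are not given in the text above and are an assumption here; the function names are hypothetical, and real island-finding programs scan the genome with sliding windows rather than testing a single sequence.

```python
from typing import Tuple


def cpg_stats(seq: str) -> Tuple[float, float]:
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence."""
    s = seq.upper()
    n = len(s)
    if n == 0:
        raise ValueError("empty sequence")
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # C immediately followed by G on the same strand
    gc_content = (c + g) / n
    # Expected CpG count if C and G were placed independently: c * g / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """Apply the assumed length, GC-content and obs/exp thresholds."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_ratio


if __name__ == "__main__":
    # Artificial sequences, used only to exercise the functions.
    for name, seq in (("CG repeat", "CG" * 100), ("AT repeat", "AT" * 100)):
        print(name, cpg_stats(seq), looks_like_cpg_island(seq))
```

The observed/expected ratio is used because CpG dinucleotides are rarer in most of the genome than the C and G frequencies alone would predict, so a ratio close to or above the expectation marks a region as unusual.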
Besides methylation of the CpG islands, modifications (methylation, acetylation, etc.) of the histone proteins around which the chromosomal DNA is wound also play an important role in the regulation of gene expression. To study these phenomena the Human Epigenome Consortium was founded and the Human Epigenome Project was launched (http://www.epigenome.org/). From these a new scientific field has emerged, called epigenomics, or epigenetics.
Two different types of genomic region participate in the regulation of gene expression: promoter regions, which are located near the genes whose transcription they initiate, on the same strand and upstream, towards the 5' region of the sense strand; and enhancer regions, which regulate the expression of distant genes. Beyond the linear organization of genes and transcripts on chromosomes lies a more complex (and still poorly understood) network of chromosome loops and twists through which promoters and more distal elements, such as enhancers, can communicate their regulatory information to each other. In the ENCODE project more than 70,000 promoter and nearly 400,000 enhancer regions were detected.
Enhancers are often cell-type specific.
Several paralogous genes have been detected. According to the definition, paralogs are two genes or clusters of genes at different chromosomal locations in the same organism that have structural similarities indicating that they derived from a common ancestral gene, and have since diverged from the parent copy by mutation and selection or drift. By contrast, orthologous genes are ones which code for proteins with similar functions, but exist in different species, and are created from a speciation event.
By October 2012, 13,322 pseudogenes had been detected. In contrast to paralogs, pseudogenes are dysfunctional relatives of genes that have lost their protein-coding ability or are otherwise no longer expressed in the cell.
Duplicated pseudogenes have intron-exon-like genomic structures and may still maintain the upstream regulatory sequences of their parents. In contrast, processed pseudogenes, having lost their introns, contain only exonic sequence and do not retain the upstream regulatory regions. In the human genome, processed pseudogenes are the most abundant type due to a burst of retrotranspositional activity in the ancestral primates 40 million years ago.
Originally thought to be functionless, pseudogenes have been suggested to exhibit different types of activity. Firstly, they can regulate the expression of their parent gene by decreasing the mRNA stability of the functional gene through their over-expression. A good example is the MYLKP1 pseudogene, which is up-regulated in cancer cells. The transcription of MYLKP1 creates a non-coding RNA (ncRNA) that inhibits the mRNA expression of its functional parent, MYLK.
Moreover, studies in Drosophila and mouse have shown that small interfering RNA (siRNA) derived from processed pseudogenes can regulate gene expression by means of the RNA-interference pathway, thus acting as endogenous siRNAs.
In addition, it has also been hypothesized that pseudogenes with high sequence homology to their parent genes can regulate their expression through the generation of anti-sense transcripts. Finally, pseudogenes can compete with
their parent genes for microRNA (miRNA) binding, thereby modulating the repression of the functional gene by its cognate miRNA. According to predictions, at least 9% of the pseudogenes present in the human genome are actively transcribed.
There are several web pages containing information about the genomes of human and other organisms (e.g.: http://genome.ucsc.edu/; http://www.ensembl.org/; http://www.ncbi.nlm.nih.gov/). One important topic has not been detailed above: variation in the genome. We consider it so important that a separate subchapter is devoted to it (see below).
The mapping of the human genome was not finished with the completion of the HGP. The Genome Reference Consortium was founded with the main task of closing the remaining gaps. These are located in difficult-to-sequence regions, usually in repeat-rich regions. At the completion of the HGP the genome contained about 350 gaps.
These regions are not small; they represent about 5% of the genome. Filling these gaps is far from easy, which is shown by the fact that by 2009, 6 years after the initiation of this project, only 50 such gaps had been closed.