Chapter 4. COELACANTH-SPECIFIC ADAPTIVE GENES GIVE
4.3 Materials and Methods
Table 4.1. Versions of reference sequences of species.
Common name               Scientific name         Class          Infraclass  Order               Reference version
Amazon molly              Poecilia formosa        Actinopteri    Teleostei   Cyprinodontiformes  Poecilia_formosa-5.1.2
Cave fish                 Astyanax mexicanus      Actinopteri    Teleostei   Characiformes       AstMex102
Cod                       Gadus morhua            Actinopteri    Teleostei   Gadiformes          gadMor1
Fugu                      Takifugu rubripes       Actinopteri    Teleostei   Tetraodontiformes   FUGU 4.0
Medaka                    Oryzias latipes         Actinopteri    Teleostei   Beloniformes        HdrR
Platyfish                 Xiphophorus maculatus   Actinopteri    Teleostei   Cyprinodontiformes  Xipmac4.4.2
Stickleback               Gasterosteus aculeatus  Actinopteri    Teleostei   Gasterosteiformes   BROAD S1
Tetraodon                 Tetraodon nigroviridis  Actinopteri    Teleostei   Tetraodontiformes   TETRAODON 8.0
Tilapia                   Oreochromis niloticus   Actinopteri    Teleostei   Perciformes         Orenil1.0
Zebrafish                 Danio rerio             Actinopteri    Teleostei   Cypriniformes       GRCz10
Spotted gar               Lepisosteus oculatus    Actinopteri    Holostei    Lepisosteiformes    LepOcu1
Coelacanth                Latimeria chalumnae     Sarcopterygii  -           Coelacanthiformes   LatCha1
Anole lizard              Anolis carolinensis     Reptilia       -           Squamata            AnoCar2.0
Chinese softshell turtle  Pelodiscus sinensis     Reptilia       -           Testudines          PelSin_1.0
Human                     Homo sapiens            Mammalia       -           Primates            GRCh38.p7
Xenopus                   Xenopus tropicalis      Amphibia       -           Anura               JGI 4.2
Figure 4.1. Cladogram of Osteichthyes. Bold lines in the tree indicate the most recent ancestral branch of each lineage. Blue, sky blue, and red indicate the Teleostei, Holostei, and coelacanth lineages, respectively.
Orthologous gene set alignments
Multiple sequence alignments of suitable coding gene sets were prepared for the detection of positive selection in the following steps. First, to exclude the possibility of functional changes caused by gene expansion (gain and loss of genes), I focused on genes with one-to-one orthologues in the 12 fishes. Using the coelacanth genome as the representative dataset, I found 4160 coding gene sets in Ensembl BioMart (Kinsella et al., 2011). Second, I filtered out 28 genes whose sequence lengths were not a multiple of 3. Third, I aligned the remaining 4132 gene sets using PRANK (Löytynoja and Goldman, 2008) with two options: ‘-codon’ for codon-wise alignment and ‘-F’ for the most accurate alignment when identifying homologous sites in each species. Finally, to exclude poorly aligned regions caused by indels and mismatches, I trimmed the 4132 alignments using Gblocks (Talavera and Castresana, 2007) with the option ‘-t=c’ for codon-wise adjustment. This yielded conserved coding sequence alignments of 3538 genes.
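The length filter and the two command lines described above can be sketched in Python as follows. This is a minimal illustration rather than the pipeline actually used: the function names, the file names, and the rule that a gene set is dropped when any one of its sequences fails the check are assumptions. The PRANK and Gblocks commands are only assembled as argument lists and are not executed here.

```python
def split_by_codon_length(cds_by_gene):
    """Separate gene sets whose CDS lengths are all multiples of 3.

    cds_by_gene maps a gene identifier to {species: coding sequence}.
    Assumption: a gene set is dropped if ANY of its sequences has a
    length that is not a multiple of 3.
    """
    kept, dropped = {}, []
    for gene, seqs in cds_by_gene.items():
        if all(len(seq) % 3 == 0 for seq in seqs.values()):
            kept[gene] = seqs
        else:
            dropped.append(gene)
    return kept, dropped


def prank_command(fasta_in, out_prefix):
    # '-codon' and '-F' are the two options named in the text;
    # the input and output names are hypothetical.
    return ["prank", "-d=" + fasta_in, "-o=" + out_prefix, "-codon", "-F"]


def gblocks_command(alignment_fasta):
    # '-t=c' tells Gblocks to treat the input as codons.
    return ["Gblocks", alignment_fasta, "-t=c"]


if __name__ == "__main__":
    toy = {
        "geneA": {"coelacanth": "ATGAAATGA", "zebrafish": "ATGAAGTGA"},
        "geneB": {"coelacanth": "ATGAAATG", "zebrafish": "ATGAAGTGA"},
    }
    kept, dropped = split_by_codon_length(toy)
    print(sorted(kept), dropped)
    print(" ".join(prank_command("geneA.fasta", "geneA_aln")))
    print(" ".join(gblocks_command("geneA_aln.fasta")))
```

Keeping the filter separate from the command builders makes it possible to check the 4160 to 4132 reduction before any alignment is run.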
PSGs specific to coelacanth
To identify genes responsible for the evolution of coelacanth, I screened for molecular signatures of episodic adaptive evolution. This was done by calculating, for each gene, dN (the number of non-synonymous substitutions per non-synonymous site), dS (the number of synonymous substitutions per synonymous site), and their ratio dN/dS for the 3538 orthologous genes from the 12 fishes, excluding the 4 tetrapods as an outgroup. To detect selection signatures accurately and to estimate site-wise selection on the most recent ancestral branch of each of the coelacanth, spotted gar, and Teleostei lineages in the species tree (Fig. 4.1), the ‘branch-site model’ of ‘CodeML’ in the PAML package (version 4.8) (Yang, 2007) was run with 3 options: ‘model = 2’ to allow 2 or more dN/dS ratios among branches, ‘NSsites = 2’ to detect sites under positive selection on a foreground branch, and ‘CodonFreq = 2’ to calculate codon frequencies based on ‘F3X4’. Based on the parameters estimated by the test, I compared the maximum likelihoods of the null and alternative models using a likelihood ratio test (LRT, D = 2Δl, where Δl is the difference between the log-likelihoods of the alternative and null models). Statistical significance was calculated with a chi-square test, and the false discovery rate (FDR) was used for multiple-test correction in R (version 3.2.3) (R Core Team, 2013). Consequently, I identified sites under positive selection on each lineage together with their posterior probabilities. PSGs were detected with strict filtering criteria (dN/dS of site class 2 on the foreground branch > 1, D > 0, and adjusted p < 0.05). After identifying significant PSGs, I checked the posterior probabilities within each gene (> 0.95) to find the specific sites under positive selection (site class 2) based on Bayes empirical Bayes (BEB) inference. Finally, PSGs specific to coelacanth were identified by comparing the PSGs of the coelacanth, Holostei, and Teleostei lineages.
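The likelihood ratio test and the filtering criteria can be illustrated with a short Python sketch. It is not the R code used in this study. Two points are assumptions that the text does not specify: the chi-square test is taken to have 1 degree of freedom, and the FDR correction is taken to be the Benjamini-Hochberg procedure. All function names and log-likelihood values are hypothetical.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """D = 2 * (lnL_alternative - lnL_null)."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_pvalue_df1(d):
    """Upper-tail chi-square p-value, ASSUMING 1 degree of freedom."""
    if d <= 0:
        return 1.0
    return math.erfc(math.sqrt(d / 2.0))


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR method)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for offset, idx in enumerate(order):
        rank = n - offset  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def is_psg(omega_class2, d, adjusted_p):
    """Filtering criteria stated in the text."""
    return omega_class2 > 1 and d > 0 and adjusted_p < 0.05


if __name__ == "__main__":
    # Hypothetical (lnL_null, lnL_alt, foreground omega of class 2).
    genes = {
        "geneA": (-1050.0, -1040.0, 12.5),
        "geneB": (-980.0, -979.9, 1.8),
    }
    stats = {g: lrt_statistic(n, a) for g, (n, a, _) in genes.items()}
    names = list(genes)
    adj = bh_adjust([chi2_pvalue_df1(stats[g]) for g in names])
    for g, q in zip(names, adj):
        print(g, stats[g], q, is_psg(genes[g][2], stats[g], q))
```

The three conditions are kept in a single predicate so that the strict filter is applied identically to every foreground lineage.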
Conserved domain search
To determine whether the sites under positive selection are located in functional domains of each gene, I performed domain analysis using the Batch Web CD-Search tool at NCBI (Marchler-Bauer et al., 2011). Peptide sequences of the PSGs unique to coelacanth were used as the query set, and the following options were applied: Data source: CDSEARCH/cdd v3.15; Expected value: 0.01; Composition-corrected scoring: Applied; Low-complexity regions: Not filtered.
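Once domain hits are available, checking whether a positively selected site lies inside a domain is an interval lookup. The sketch below assumes that each hit has been reduced to a (name, from, to) tuple with 1-based inclusive peptide coordinates. The domain names and positions are hypothetical, and the function is an illustration rather than part of the CD-Search tool.

```python
def sites_in_domains(sites, domains):
    """Map each site to the names of the domains that contain it.

    sites   : iterable of 1-based amino acid positions
    domains : iterable of (name, start, end), 1-based and inclusive
              (assumed layout of the parsed domain hits)
    """
    hits = {}
    for site in sites:
        hits[site] = [name for name, start, end in domains
                      if start <= site <= end]
    return hits


if __name__ == "__main__":
    # Hypothetical domain hits and selected sites for one protein.
    domains = [("domain_A", 20, 80), ("domain_B", 60, 150)]
    print(sites_in_domains([10, 70, 120], domains))
```

A site that falls in overlapping hits is reported under every domain that covers it, so no hit is silently discarded.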
Gene ontology analysis
To examine the functions of the coelacanth-specific PSGs as a group, I applied gene ontology analysis with gene set enrichment tests using DAVID functional annotation (Huang et al., 2009). To allow comparison with other fishes, zebrafish was used as the representative background model. Because of the small number of coelacanth-specific PSGs, the default p-value < 0.1 was applied as the significance cutoff for the enrichment test. I summarized the biological process gene ontology terms by hierarchical clustering with the ‘hclust’ function in R (version 3.2.3) (R Core Team, 2013).
Protein-protein interaction network analysis
To investigate interactions among genes, the Search Tool for the Retrieval of Interacting Genes (STRING) online database (http://string-db.org/) was used (Szklarczyk et al., 2014). STRING provides direct (physical) and indirect (functional) associations among genes based on multiple resources (Szklarczyk et al., 2014). I searched for interactions between 5 urea cycle genes and 14 coelacanth-specific PSGs involved in nitrogen compound metabolic process to generate a network with the following options: Organism: Danio rerio; Active interaction sources: Text-mining, Experiments, Databases, Co-expression, Neighborhood, Gene fusion, and Co-occurrence; Minimum required interaction score: medium confidence (0.4). The network was visualized using Cytoscape 3.3.0 (Shannon et al., 2003).
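The selection of network edges can be pictured as a filter over scored gene pairs. The sketch below assumes that the interactions have already been exported as (gene, gene, score) tuples with scores on a 0 to 1 scale, and it keeps only pairs that link the two gene sets at or above the medium-confidence cutoff. The gene labels are placeholders, and the function is not part of STRING or Cytoscape.

```python
def filter_interactions(edges, set_a, set_b, min_score=0.4):
    """Keep edges that connect set_a with set_b at or above min_score.

    edges : iterable of (gene_1, gene_2, combined_score), with the
            score ASSUMED to be on a 0-1 scale
    """
    kept = []
    for gene_1, gene_2, score in edges:
        crosses = ((gene_1 in set_a and gene_2 in set_b) or
                   (gene_1 in set_b and gene_2 in set_a))
        if crosses and score >= min_score:
            kept.append((gene_1, gene_2, score))
    return kept


if __name__ == "__main__":
    # Placeholder labels; the real gene lists are not reproduced here.
    urea_cycle = {"urea_1", "urea_2"}
    psgs = {"psg_1", "psg_2"}
    edges = [
        ("urea_1", "psg_1", 0.72),
        ("psg_2", "urea_2", 0.41),
        ("urea_1", "psg_2", 0.15),   # below the 0.4 cutoff
        ("urea_1", "urea_2", 0.99),  # within one set, not across
    ]
    for edge in filter_interactions(edges, urea_cycle, psgs):
        print(edge)
```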
Amino acid changes specific to coelacanth
Target-specific amino acid substitution (TAAS) analysis (Zhang et al., 2014) was conducted to find mutually exclusive amino acid substitutions between coelacanth and the other fishes. The TAAS module and a codon translator were written and run in Python (version 2.7.9, http://www.python.org). For one of the homeobox genes, SHOX, I conducted an additional TAAS analysis with the 100-way Multiz alignment of 100 vertebrates (Blanchette et al., 2004) from the UCSC Genome Browser (Meyer et al., 2013).
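The idea behind the codon translator and the TAAS scan can be sketched as follows. This is not the original module: it is written for Python 3, it assumes the standard genetic code, and it treats a site as target-specific when the target residue is absent from every other species in that alignment column, ignoring gaps and unknown residues. All names and sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate an aligned CDS; '---' becomes '-', unknown codons 'X'."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3].upper()
        if codon == "---":
            protein.append("-")
        else:
            protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


def taas_sites(alignment, target):
    """Return (position, target residue, residues seen in other species).

    alignment maps species name to an aligned protein sequence of equal
    length. Positions are 1-based.
    """
    target_seq = alignment[target]
    others = [seq for name, seq in alignment.items() if name != target]
    sites = []
    for pos, residue in enumerate(target_seq):
        if residue in "-X":
            continue
        column = {seq[pos] for seq in others} - {"-", "X"}
        if column and residue not in column:
            sites.append((pos + 1, residue, "".join(sorted(column))))
    return sites


if __name__ == "__main__":
    toy = {
        "coelacanth": "MKTA",
        "zebrafish": "MRTA",
        "medaka": "MRSA",
        "spotted_gar": "MR-A",
    }
    print(translate("ATGAAA---TGA"))
    print(taas_sites(toy, "coelacanth"))
```

In the toy alignment only position 2 qualifies, because the coelacanth residue at position 3 is shared with zebrafish.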