Analysis of attributes contributing to extreme-stability of proteins
4.3 Results and discussion
4.3.1 Data collection and statistically significant feature generation
The homologous protein sequences and structures collected in Chapter 3 were used to assemble data for six classes of extremophilic proteins and their non-extremophilic counterparts. A homology search yielded clusters of similar extremophilic and non-extremophilic proteins, and these pairs were used to create six datasets: T-M (thermophiles-mesophiles), P-M (psychrophiles-mesophiles), T-P (thermophiles-psychrophiles), B-Nb (barophiles-non-barophiles), H-Nh (halophiles-non-halophiles) and A-B (acidophiles-alkaliphiles) (details of the datasets are shown in Appendix Tables A1.1 – A1.6). Our group previously reported the collection of the homologous T-M proteins by BLAST analysis on the basis of homology (above 70%), and the collected homologues were further confirmed by CLUSS21. The amino acid sequences and protein structures were retrieved from UniProtKB and RCSB PDB, respectively. For each extremophile/non-extremophile dataset (T-M, P-M, T-P, A-B, H-Nh and B-Nb), an amino acid (AA) dataset of 29 features and a protein structure (ST) dataset of 21 features were created. The ST of the T-M dataset was studied previously by our group (Chakravorty et al. 2017a) and is therefore not included in the present thesis. Where the ST dataset contained too few attributes on its own, the AA and ST attributes were combined into a single dataset, designated AA+ST. Table 4.1 lists the collected features for the AA and ST datasets of the different extremophile/non-extremophile pairs. The numerical features were generated using various software and tools, including the PEPSTATS server (amino acid percentages) and the PIC webserver, VADAR, ESBRI and Promotif (protein structural feature percentages); the tools and software employed in the study are listed in Table 4.2. All amino acid and structure features were normalized with respect to the length of the protein sequence and the number of atoms, respectively.
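As a minimal sketch, the length-normalization applied to the amino acid features can be expressed in Python as a per-residue composition (the function name and example sequence below are illustrative only; the actual analysis used the tools listed in Table 4.2):

```python
from collections import Counter

def aa_composition(sequence):
    """Amino acid composition as a fraction of sequence length,
    i.e. raw residue counts normalized by protein length."""
    counts = Counter(sequence.upper())
    length = len(sequence)
    return {aa: counts.get(aa, 0) / length for aa in "ACDEFGHIKLMNPQRSTVWY"}

# Hypothetical 10-residue sequence: two alanines -> Ala fraction of 0.2
comp = aa_composition("MKTAYIAKQR")
print(round(comp["A"], 2))  # -> 0.2
```

Normalizing by length in this way makes compositions comparable across proteins of different sizes, which is what allows homologous extremophile/non-extremophile pairs to be contrasted feature by feature.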
Table 4.1: Attributes used for prediction of protein extreme-stability (29 AA and 21 ST).
Protein dataset | Attribute type | Collected attributes

Amino acid (AA) dataset | Standard amino acids (20) | Alanine (Ala or A), Arginine (Arg or R), Asparagine (Asn or N), Aspartic acid (Asp or D), Cysteine (Cys or C), Glutamine (Gln or Q), Glutamic acid (Glu or E), Glycine (Gly or G), Histidine (His or H), Isoleucine (Ile or I), Leucine (Leu or L), Lysine (Lys or K), Methionine (Met or M), Phenylalanine (Phe or F), Proline (Pro or P), Serine (Ser or S), Threonine (Thr or T), Tryptophan (Trp or W), Tyrosine (Tyr or Y) and Valine (Val or V)

Amino acid (AA) dataset | Amino acid classes (9) | Polar (Pol), non-polar (NPol), small (Sml), tiny (Tiny), aromatic (Aro), aliphatic (Ali), charged (Chrg), basic (Bsc) and acidic (Acd) amino acids

Structure (ST) dataset | Intra-protein interactions (11) | Total hydrogen bonds (HB), main-main chain hydrogen bonds (MMH), main-side chain hydrogen bonds (MSH), side-side chain hydrogen bonds (SSH), hydrophobic interactions (HI), disulfide bonds (DS), ionic interactions (II), salt bridges (SB), aromatic-aromatic interactions (AAI), aromatic-sulfur interactions (ASI) and cation-π interactions (CPI)

Structure (ST) dataset | Secondary and tertiary structure attributes (10) | α-helix (AH), β-strands (BST), β-hairpins (BH), β-sheets (BSH), β-turns (BT), γ-turns (GT), packing volume (PV), charged accessible surface area (CASA), polar accessible surface area (PASA) and non-polar accessible surface area (NPASA)
Table 4.2: Software, tools and algorithms used for prediction studies.
Stepwise methods | Tools and their applications

Data collection | UniProtKB: amino acid (AA) data; RCSB PDB: structural (ST) data

Homology search | NCBI BLASTp; CLUSS version 1.2: clustering of homologous proteins

Numerical feature generation | Python script: codon composition analysis; EMBOSS Pepstats: amino acid composition analysis; PIC, ESBRI, PDBsum and VADAR webservers: interaction and structural attribute calculations

Feature selection | MATLAB: Kolmogorov–Smirnov test of significance

Data mining and classification model generation | RapidMiner: attribute weighting [Correlation, Information gain, Information gain ratio, Rule, Deviation, Chi-squared statistic, Gini index, Uncertainty, Relief, Support Vector Machine (SVM) and Principal Component Analysis (PCA)]; unsupervised learning [k-Means, k-Means (kernel), k-Medoids, Support Vector Clustering (SVC), Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Expectation Maximization Clustering (EMC)]; supervised machine learning [k-nearest neighbor (k-NN), Naïve Bayes, logistic regression, SVM, decision trees and artificial neural networks (ANN)]

Ranking model generation | Multi-criteria decision making: analytic hierarchy process with 1-9 scale ranking (Python script)
The collected amino acid and protein structure features were filtered to identify the statistically significant ones by a non-parametric, two-tailed, two-sample Kolmogorov-Smirnov (KS) test of significance at the 95% confidence level; features with p-value <0.05 were retained and considered significant. The KS test measures the discrepancy between the distributions of two samples and is used to assess whether the two samples come from the same distribution27. In contrast to t-tests and the usual Wilcoxon rank-sum tests, the KS test is a preferred method for determining whether differences in distribution between two populations are statistically meaningful28, and the KS statistic is commonly referred to for obtaining an appropriate estimate of the overall variability in bipartite data27. Here, the KS statistical analysis was performed on the MATLAB platform. Table 4.3 summarizes the statistically significant protein features obtained from the KS-test analysis. The significant features generated by the two-tailed KS test were then analyzed through three approaches: (a) relative abundance analysis of attributes in extremophilic versus non-extremophilic proteins; (b) machine learning for classification model prediction; and (c) a multi-criteria decision making approach (analytic hierarchy process, AHP) for ranking model prediction, so as to finally devise ranking models (one for each extremophilic/non-extremophilic protein dataset) for categorizing future proteins and prioritizing the attributes.
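The feature-filtering step above can be sketched in Python using SciPy's two-sample KS test in place of the MATLAB implementation actually used (the feature names and percentage values below are toy data for demonstration only):

```python
from scipy.stats import ks_2samp

def significant_features(group_a, group_b, alpha=0.05):
    """Retain features whose distributions differ between the two groups
    by a two-sample, two-tailed KS test (p < alpha, i.e. 95% confidence)."""
    kept = []
    for feature in group_a:
        stat, p_value = ks_2samp(group_a[feature], group_b[feature])
        if p_value < alpha:
            kept.append(feature)
    return kept

# Toy example: 'Glu' percentages separate the groups cleanly; 'Gly' do not.
thermo = {"Glu": [9.1, 8.8, 9.4, 9.0, 8.7, 9.2],
          "Gly": [7.0, 7.2, 6.9, 7.1, 7.0, 7.3]}
meso   = {"Glu": [5.1, 5.4, 5.0, 5.3, 5.2, 4.9],
          "Gly": [7.1, 6.9, 7.2, 7.0, 7.3, 7.1]}
print(significant_features(thermo, meso))  # -> ['Glu']
```

Because the KS test compares whole empirical distribution functions rather than means, it flags features whose distributions differ in shape or spread even when the averages are similar.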
Table 4.3: Enumerating statistically significant protein features by KS test.
Dataset | Feature set (total features) | Statistically significant protein features (count)

T-M | AA (29) | 19: Ala, Cys, Asp, Glu, His, Ile, Lys, Asn, Gln, Arg, Ser, Thr, Val, Tiny, Sml, Pol, Chrg, Bsc, Acd

P-M | AA+ST (49) | 15: Ala, His, Met, Asn, Ser, Thr, Trp, Tiny, Ali, Aro, Bsc, Acd, NPASA, CASA, GT

T-P | AA (29) | 21: Ala, Cys, Asp, Glu, His, Lys, Met, Asn, Gln, Arg, Ser, Thr, Val, Trp, Tiny, Sml, Aro, Pol, Chrg, Bsc, Acd

T-P | ST (20) | 11: GT, HI, MMH, MSH, II, CPI, SB, NPASA, PASA, CASA, PV

A-B | AA (29) | 27: Ala, Phe, Asp, Glu, His, Ile, Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr, Val, Trp, Tyr, Tiny, Sml, NPol, Pol, Ali, Aro, Chrg, Bsc, Acd

A-B | ST (20) | 16: II, MMH, MSH, SSH, HB, SB, PASA, CASA, PV, AAI, ASI, CPI, HI, DS, AH, BSH

H-Nh | AA+ST (49) | 17: Asp, Ile, Asn, Arg, Thr, Tiny, Sml, NPol, Pol, Chrg, Bsc, Acd, HI, CPI, NPASA, SB

B-Nb | AA+ST (49) | 13: Gln, Arg, Ser, Tiny, Chrg, Bsc, Acd, HI, II, CPI, SB, BSH, BST
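The AHP ranking step noted in Table 4.2 derives attribute priorities from a pairwise comparison matrix on Saaty's 1-9 scale. A minimal sketch of that calculation follows, with an illustrative 3-attribute matrix whose judgement values are invented for demonstration and do not come from the present analysis:

```python
import numpy as np

# Illustrative pairwise comparison matrix on Saaty's 1-9 scale:
# A[i, j] > 1 means attribute i is judged more important than attribute j,
# and A[j, i] = 1 / A[i, j] (reciprocal property).
A = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 3.0],
    [1/5, 1/3, 1.0],
])

# The priority vector is the principal eigenvector of A, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(A)
principal = eigvecs[:, np.argmax(eigvals.real)].real
weights = principal / principal.sum()

# Consistency ratio (CR): judgements are conventionally acceptable if CR < 0.1.
n = A.shape[0]
lambda_max = eigvals.real.max()
ci = (lambda_max - n) / (n - 1)       # consistency index
cr = ci / 0.58                        # 0.58 = Saaty's random index for n = 3
print(np.round(weights, 3), round(cr, 3))
```

The resulting weights give a ranked ordering of attributes, which is the form in which the ranking models of this chapter prioritize the significant protein features.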