Prologue
1.7. Theoretical prediction models of thermostability
To overcome the drawbacks of directed evolution approaches, numerous in silico algorithms have been proposed that predict whether proposed mutations will be thermostabilizing. These models were developed by comparing features of thermostable and mesostable proteins at different levels of protein organization: from the nucleotide codons in their genes and the amino acid preferences in their sequences to their tertiary structures. The algorithms available to date that can distinguish thermostabilizing mutants are mostly knowledge based (Rohl et al. 2004). A few are support vector machine (SVM) based (Capriotti et al. 2005), and fewer still are based on molecular dynamics (Benedix et al. 2009). Table 1.5 presents the existing methods that have been used to predict protein thermostability.
TH-1690_10610619
Attaining Protein Thermostability – A Rationalised Approach 2016
Chapter I
Table 1.5. Existing popular software tools that predict the stability of mutations
Tool | Salient features | Reference
I-Mutant | Support vector machine based; sequence or structure as input; single mutation | Capriotti et al. 2005
CUPSAT | Sequence as input; single amino acid mutations | Parthiban et al. 2006
MUPRO | Support vector machine based; sequence as input; single mutation | Cheng et al. 2006
ERIS | Structure as input; multiple mutations | Yin et al. 2007
iPRESTAB | Machine learning based; single mutation | Huang et al. 2007
PoPMuSiC | Single mutation | Dehouck et al. 2009
WET-STAB | Machine learning based; multiple mutations | Huang et al. 2009
MUSTAB | Support vector machine based; sequence as input; multiple mutations | Teng et al. 2010
AUTO-MUTE | Machine learning based; structure as input; single mutation | Masso et al. 2011
SDM | Sequence or structure as input; single mutation | Worth et al. 2011
iSTABLE | Support vector machine based; structure or sequence as input; single mutation | Chen et al. 2013
NeEMO | Machine learning based; structure as input | Giollo et al. 2014
ENCoM | Neural network based; single mutation | Frappier et al. 2014
iRDP | Ensemble of servers | Panigrahi et al. 2015
Most of the stability prediction methods presented in Table 1.5 apply machine learning to protein datasets to classify thermostable proteins and to discriminate between stabilizing and destabilizing mutations. They perform with higher accuracies than most statistical and molecular dynamics simulation methods. The latter also have the disadvantage of requiring substantial computational power and expertise. Machine learning approaches based on support vector machines, neural networks and decision trees have been used to predict the effects of mutations on thermostability (Bava et al. 2004; Capriotti et al. 2005; Kumar et al. 2000). Large datasets of known primary, secondary, and tertiary structures of proteins were used to train these algorithms. Gromiha et al. analyzed the amino acid compositions of 3075 mesophilic and 1609 thermophilic proteins using logistic functions, neural networks, support vector machines and decision trees, and found that charged residues as well as hydrophobic residues occur more frequently in thermophiles (Gromiha et al. 2008). In 2010, the Prethermut software was developed, based on machine learning methods, to predict the effect of single- or multi-site mutations on protein thermostability (Tian et al. 2010). Ebrahimi et al. employed various supervised and unsupervised machine learning algorithms to find amino acid composition features that contribute to enzyme thermostability (Ebrahimi et al. 2011).
They reported Gln content and the frequency of hydrophilic residues as the most important protein features for thermostability. They also reported that, although the amino acid sequence is the main indicator of protein function, direct prediction of protein characteristics such as thermostability from the primary amino acid sequence is not possible (Ebrahimi et al. 2011). Consequently, methods to predict thermostability have focused on the three-dimensional structures of proteins. From the aforementioned examples it is clear that the bulk of the work on predicting protein thermostability concerns the primary sequences and tertiary structures of proteins.
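The composition features discussed above (charged, hydrophobic and Gln content) can be illustrated with a minimal sketch. It is not the feature set of Gromiha et al. or Ebrahimi et al.; the residue groupings and the function name `composition_features` are assumptions made for illustration only.

```python
from collections import Counter

# Residue groupings are assumptions for this sketch; published studies
# use differing definitions (for example, His is sometimes counted as charged).
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")
CHARGED = set("DEKR")
HYDROPHOBIC = set("AVLIMFW")


def composition_features(sequence):
    """Return fractional composition features for one protein sequence."""
    # Ignore anything that is not one of the 20 standard residues.
    residues = [aa for aa in sequence.upper() if aa in STANDARD]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    total = len(residues)
    return {
        "charged": sum(counts[aa] for aa in CHARGED) / total,
        "hydrophobic": sum(counts[aa] for aa in HYDROPHOBIC) / total,
        "gln": counts["Q"] / total,
    }


# Hypothetical ten-residue sequence, invented for illustration.
features = composition_features("DEKRAVLIQQ")
```

Feature vectors of this kind, computed for labelled thermophilic and mesophilic proteins, are the sort of input a classifier such as an SVM or decision tree would be trained on.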
Moreover, although it has been reported that thermophiles can be distinguished by their pattern of synonymous codon usage for several amino acids (Lynn et al. 2002; Lobry et al. 2003), very little work on model generation at the nucleotide and codon usage levels of thermophiles has been performed. It was also reported that, at elevated temperatures, selective constraints at all three molecular levels (nucleotide content, codon usage and amino acid composition) are important for stabilizing thermophilic proteins (Lynn et al. 2002). Only recently, Lu et al. developed a hybrid fractal algorithm to predict thermophilic nucleotide sequences with an average accuracy of 0.945 (Lu et al. 2012).
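To make the idea of a synonymous codon usage pattern concrete, the sketch below computes relative synonymous codon usage (RSCU), one widely used summary of codon bias. It is offered only as an illustration of a codon-level feature; it is not the hybrid fractal algorithm of Lu et al., and the function name `rscu` is hypothetical. The standard genetic code is assumed.

```python
from collections import Counter, defaultdict

# Standard genetic code, built in TCAG order; "*" marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)


def rscu(cds):
    """Relative synonymous codon usage for each codon of each observed amino acid.

    RSCU = observed count * (number of synonymous codons) / (total count
    for that amino acid). A value of 1.0 means no bias for that codon.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # drop any trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")

    # Group synonymous codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in family:
            values[c] = counts[c] * len(family) / total
    return values


# Hypothetical coding fragment with four Arg codons (three AGA, one CGT).
usage = rscu("AGAAGAAGACGT")
```

Comparing such per-codon values between thermophilic and mesophilic genes is one simple way to look for the usage differences described above.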
Although a lot of work has been done on identifying stabilizing mutations, the protein engineering methods used to achieve them are still random and their success is a matter of chance. Accurate in silico prediction of the thermodynamic consequences of mutations remains challenging (Seeliger et al. 2010). Khan and Vihinen recently evaluated and compared 11 online stability predictors and found that the predictions were only moderately accurate (Khan et al. 2010). One limitation is that the majority of these predictors require substantial computational power and expertise. Another drawback is that they are based on features calculated from protein sequences, can consider only a single point mutation at a time, and require several empirical parameters or heuristics, such as the patterning of residues, for their calculations. Moreover, statistical analysis based on Tm values (the midpoint of the thermal transition) suffers from the fact that Tm is available for only a few proteins in a high-resolution protein structural dataset. This limits the ability to examine correlations in a significant way (Kumar et al. 2000). Molecular dynamics simulations of mutations are several orders of magnitude more complicated than calculations with a knowledge-based scoring function (Seeliger et al. 2010). A further concern is that only a few algorithms can predict the effect of multiple mutations.
Multi-site mutations are expected to have a more complex effect on protein thermostability than single point mutations (Tian et al. 2010). For example, WET-STAB is a weighted decision table method for predicting the change in protein thermostability upon double mutation from amino acid sequences (Huang et al. 2009). However, its accuracy drops to 0.57 when it is tested on hypothetical reverse mutations (Li et al. 2012). Another model, the Protein Thermostability Random Forest model (PROTS-RF), is based on the Random Forest algorithm and achieves an accuracy of 78.7% for multiple mutations (Li et al. 2012). The accuracy achieved to date is a limitation when more than two mutations are to be introduced. Additionally, the cumulative effect of all the mutations on the physicochemical features, or the structural changes associated with them, cannot be predicted using the aforementioned algorithms. Another lacuna is that all these methods give multiple choices of possible stabilizing mutations but do not conclude whether they will actually lead to thermostability. In doing so, they also fail to indicate which point mutation (single or multiple) or which combination of mutations will actually make a protein thermostable. In short, they are unable to rank or prioritize the plausible mutations based on their effect on protein stability.
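Two ideas from the paragraph above, ranking candidate mutations and testing a predictor on hypothetical reverse mutations, can be sketched as follows. This is an illustrative outline, not the evaluation protocol of Li et al. or any listed tool: the function names, the mutation labels, the numeric values and the 0.5 tolerance are all assumptions, as is the sign convention that a larger positive predicted change means a more stabilizing mutation.

```python
def rank_mutations(predicted_change):
    """Sort mutation labels from most to least stabilizing.

    Assumes, for this sketch, that larger positive values are more
    stabilizing; sign conventions differ between tools.
    """
    return sorted(predicted_change, key=predicted_change.get, reverse=True)


def reverse_consistency(forward, reverse, tolerance=0.5):
    """Fraction of mutations whose forward and reverse predictions cancel.

    For a thermodynamically consistent predictor, the predicted change for
    the reverse mutation should be the negative of the forward prediction,
    so the two values should sum to within `tolerance` of zero.
    """
    shared = [m for m in forward if m in reverse]
    if not shared:
        raise ValueError("no mutations in common")
    consistent = sum(
        1 for m in shared if abs(forward[m] + reverse[m]) <= tolerance
    )
    return consistent / len(shared)


# Hypothetical predictions, invented purely for illustration.
# `reverse` is keyed by the forward label and holds the prediction
# for the corresponding reverse mutation (for example V12A for A12V).
forward = {"A12V": 1.2, "G45P": -0.8, "K77E": 0.3}
reverse = {"A12V": -1.0, "G45P": 0.9, "K77E": 1.5}

ranking = rank_mutations(forward)
fraction = reverse_consistency(forward, reverse)
```

A predictor that ranks well on forward mutations but scores poorly on this consistency check is likely to be biased toward the direction it was trained on, which is the kind of weakness the reverse-mutation test is meant to expose.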
Therefore, a new method is needed that can prioritize features according to their importance in rendering proteins thermostable at a desired temperature. Such a method would enable a guided approach to thermostabilizing proteins.