
Derivation of feature weights and the pairwise comparison matrix

As the features are numeric, pro-rata weights had to be derived for the thermostability datasets. According to Saaty (2008), a scale of numbers that indicates how many times more important or dominant one feature is over another is a prerequisite for making comparisons. The weights of the criteria were therefore derived in order to prioritize them, which led to the formation of a positive reciprocal pairwise comparison matrix. As the thermostable protein dataset consisted of structural homologues, the pro-rata weights were derived through the following formula:

$$\Delta_{\iota} = \Phi_{\nu}^{TP} - \Phi_{\nu}^{MP} \qquad (1)$$

Where the symbols have the following meanings: ι = 1, ..., n, where n is the number of features; $\Delta_{\iota}$ is the difference in the normalized feature; $\Phi_{\nu}^{TP}$ stands for the normalized feature of the thermostable protein dataset and $\Phi_{\nu}^{MP}$ stands for the normalized feature of the mesostable protein dataset. The differences in the features were then represented by vectors, in which each difference $\Delta_{\iota}$ takes the value 1 if the difference for feature ι is positive, and 0 if it is negative or there is no difference. This formed the difference matrix (Appendix III Table A3.1). Next, for each of the 17 features, the number of proteins with the value 1 in the difference matrix was summed and converted to a percentage of the total number of proteins. This gave the percentage weight, i.e. the percentage of proteins showing an increase in a feature with respect to the mesostable protein dataset.
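The construction of the difference matrix and the percentage weights can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used in this work: the function names and the toy data (four homologous pairs, three features) are hypothetical, and each row of `thermo` is assumed to be paired with the same row of `meso`.

```python
def difference_matrix(thermo, meso):
    """Return 1 where the thermostable feature exceeds its mesostable
    counterpart, and 0 where it is lower or equal (no difference)."""
    return [
        [1 if t > m else 0 for t, m in zip(t_row, m_row)]
        for t_row, m_row in zip(thermo, meso)
    ]


def percentage_weights(diff):
    """Percentage of proteins scoring 1 for each feature (column)."""
    n_proteins = len(diff)
    n_features = len(diff[0])
    return [
        100.0 * sum(row[j] for row in diff) / n_proteins
        for j in range(n_features)
    ]


# Hypothetical normalized features: rows are proteins, columns are features.
thermo = [[0.8, 0.2, 0.5], [0.6, 0.4, 0.5], [0.9, 0.1, 0.7], [0.3, 0.5, 0.2]]
meso = [[0.5, 0.3, 0.5], [0.4, 0.4, 0.6], [0.7, 0.2, 0.6], [0.4, 0.3, 0.1]]

diff = difference_matrix(thermo, meso)
weights = percentage_weights(diff)
print(weights)  # [75.0, 25.0, 50.0]
```

In this toy example, the first feature increases in three of the four thermostable proteins and therefore receives a percentage weight of 75.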

In the next step, all the percentage weights were scaled down to a 1-9 interval scale by a Python script that uses the equation:

$$\text{Weight } (W_i) = \frac{(\xi_i - \alpha)}{(\beta - \alpha)} \times 8 + 1 \qquad (2)$$

Where $W_i$ is the derived weight on the 1-9 scale; i = 1, ..., n, where n is the number of features; $\xi_i$ is the value of the weight for feature i; α is the minimum value among the feature weights and β is the maximum value among the feature weights. This gave the relative importance of each feature with respect to the others. This ratio was supplied to the 17 × 17 pairwise comparison matrices (Appendix III Table A3.2).
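The scaling of equation 2 and the filling of the pairwise comparison matrix can be sketched as follows. The sketch assumes that each matrix entry is the ratio of the scaled weights of two features, W_i / W_j, which yields a positive reciprocal matrix; the text does not spell out this detail, and the function names and input values are hypothetical.

```python
def scale_to_1_9(values):
    """Min-max scale the percentage weights onto the 1-9 interval (equation 2)."""
    alpha, beta = min(values), max(values)
    if beta == alpha:
        raise ValueError("all weights are equal; the scale is undefined")
    return [(x - alpha) / (beta - alpha) * 8 + 1 for x in values]


def pairwise_matrix(scaled):
    """Assumed construction: entry (i, j) is the ratio W_i / W_j."""
    return [[wi / wj for wj in scaled] for wi in scaled]


# Hypothetical percentage weights for three features.
scaled = scale_to_1_9([75.0, 25.0, 50.0])
matrix = pairwise_matrix(scaled)
print(scaled)        # [9.0, 1.0, 5.0]
print(matrix[0][1])  # 9.0
```

The feature with the lowest percentage weight maps to 1 and the one with the highest maps to 9, so the first feature here is treated as nine times as important as the second.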

The next step was column-wise normalization of the matrices so that each column summed to 1, followed by calculation of the row sums. This gave the priority vectors, or eigenvectors, of the matrix. The priority vectors indicate the relative importance of each feature over the others in terms of its positive contribution to protein stability at elevated temperatures. The higher the value of a feature's priority, the greater its impact in rendering proteins stable under such extreme conditions.
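The normalization step can be sketched as follows on a small hypothetical 3 × 3 matrix. The row sums follow the description above; dividing them by the number of features, so that the priorities sum to 1, is an additional assumption of this sketch that is common in AHP practice. The function names are illustrative.

```python
def normalize_columns(matrix):
    """Divide every entry by its column total so that each column sums to 1."""
    n = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def priority_vector(matrix):
    """Row sums of the column-normalized matrix, rescaled to sum to 1."""
    n = len(matrix)
    row_sums = [sum(row) for row in normalize_columns(matrix)]
    return [s / n for s in row_sums]  # rescaling by n is an assumption


# Hypothetical positive reciprocal matrix for three features.
matrix = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
priorities = priority_vector(matrix)
print(priorities)  # approximately [4/7, 2/7, 1/7]
```

For a perfectly consistent matrix such as this one, the result coincides with the normalized principal eigenvector; for an inconsistent matrix it is only an approximation of it.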

The next step was to calculate the consistency of the matrix using formula 3. The purpose of this was to ensure that the original weights given to the features were consistent.

$$\text{Consistency index (CI)} = \frac{\lambda_{max} - N}{N - 1} \qquad (3)$$

Where $\lambda_{max}$ is the consistency measure of each row in the second matrix, calculated as the dot product of the initial matrix with the eigenvectors and then divided by N, where N is the total number of features. The consistency ratio is then derived according to formula 4.

$$\text{Consistency ratio (CR)} = \frac{CI}{RI} \qquad (4)$$

RI is the random index, which is equal to 1.6086 for N = 17 and 1.6181 for N = 18 (Alonso et al. 2006). A matrix is accepted as consistent if and only if CR < 0.1 (Alonso et al. 2006).
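The consistency check of formulas 3 and 4 can be sketched as follows. The sketch assumes the commonly used formulation in which the consistency measure of row i is the dot product of that row with the priority vector divided by the priority of feature i, and λmax is the mean of these measures over the N rows; this is one reading of the description above. The 17 × 17 test matrix is hypothetical and perfectly consistent by construction; only the RI value is taken from the text.

```python
def priority_vector(matrix):
    """Column-normalize the matrix, then average across each row."""
    n = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [sum(matrix[i][j] / col_sums[j] for j in range(n)) / n for i in range(n)]


def consistency(matrix, random_index):
    """Return (lambda_max, CI, CR) for a pairwise comparison matrix."""
    n = len(matrix)
    w = priority_vector(matrix)
    # Assumed consistency measure for row i: (matrix . w)_i / w_i
    measures = [sum(matrix[i][j] * w[j] for j in range(n)) / w[i] for i in range(n)]
    lambda_max = sum(measures) / n
    ci = (lambda_max - n) / (n - 1)   # formula 3
    cr = ci / random_index            # formula 4
    return lambda_max, ci, cr


RI_17 = 1.6086  # random index for N = 17 quoted in the text

# Hypothetical weights 1.0, 1.5, ..., 9.0 for 17 features.
weights = [1 + 0.5 * k for k in range(17)]
matrix = [[wi / wj for wj in weights] for wi in weights]

lambda_max, ci, cr = consistency(matrix, RI_17)
is_consistent = cr < 0.1
print(is_consistent)  # True
```

Because every entry of this matrix is an exact ratio of two weights, λmax equals N up to rounding error and CI is effectively zero; a matrix filled from independent judgements would generally give a positive CI.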

4.2.4. Development of RankProt

The aforementioned steps were simplified and automated through the development of robust software written in Python. To this end, the Python program developed for calculating intra-protein interactions, together with VADAR and Promotif, and the eigenvectors for thermostability calculated through AHP were embedded into an algorithm to rank the mutations. Ranks were derived by matrix multiplication of the features in the test set by the priorities (eigenvectors) of the features: a normalized feature matrix was generated for the test set and multiplied by the previously calculated eigenvectors for thermostability. The resulting dot products are called the ranks of the proteins. This aided in predicting the rank at which any protein undergoing point or multiple mutations can be considered thermostable. Figure 4.3 illustrates the algorithm used for developing RankProt.

Fig. 4.3: The algorithm of RankProt.

To run RankProt, the user-submitted test set should consist of two protein structures in .pdb format: the wild type and the mutated structure.

To obtain the mutated structure, in silico mutations can be performed through Chimera, or the FASTA file of the required mutated protein sequence can be subjected to homology modeling. The mutations should be chosen carefully so that they are stable and do not disrupt the overall protein structure and activity. Following the norms below will lead to successful mutations.

(1) Only residues that show a high mutability propensity should be chosen, and mutations should not be made on the stabilizing centre residues of a protein. These conditions can be checked through the web servers HotSpot Wizard and Stride, respectively.

(2) The mutated residues should not belong to the active site pocket of the protein.

(3) The mutations should be made on the surface-exposed areas of a protein or in its loop regions. Such areas do not hamper the overall stability of a protein.

Thus, if the wild-type and mutated structures are available, whether the mutation will lead to thermostability can be predicted from the ranks given by RankProt to the two structures. Such mutations will lead to thermostabilization of proteins if and only if they increase the number of the highest prioritized features.

Thus, if the rank of the mutated structure is higher than that of the reference structure, the mutations qualify as thermostabilizing. The instructions to run RankProt are provided with the software package on the DVD attached to this dissertation.
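The ranking principle and the comparison rule described above can be sketched as follows. This is an illustration only, not the RankProt code: the priorities and the normalized feature values are hypothetical, and in practice the features would be extracted from the two .pdb structures.

```python
def rank_scores(features, priorities):
    """Dot product of each normalized feature row with the priority vector."""
    return [sum(f * p for f, p in zip(row, priorities)) for row in features]


# Hypothetical priority vector for three features (sums to 1).
priorities = [0.6, 0.3, 0.1]

# Hypothetical normalized features of the two submitted structures.
structures = {
    "wild_type": [0.5, 0.4, 0.9],
    "mutant": [0.7, 0.4, 0.6],
}

scores = dict(zip(structures, rank_scores(structures.values(), priorities)))

# The mutation qualifies as thermostabilizing if the mutant ranks higher.
is_thermostabilizing = scores["mutant"] > scores["wild_type"]
print(is_thermostabilizing)  # True
```

In this toy example the mutant loses some of the lowest-priority feature but gains in the highest-priority one, so its rank rises from about 0.51 to about 0.60.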

4.2.5. Performance and validation