CHAPTER 1 INTRODUCTION
2.1 Bioinformatic methods
unknown protein sequences, as the profile generated can often mislead the alignments and provide false homologous hits (Baxevanis, 1998).
2.1.2 Prediction of T cell epitopes in Pc96
To evaluate the presence of any putative T cell epitopes on Pc96, online epitope prediction software was used. Two programs available on the World Wide Web allow unrestricted predictions: BIMAS (http://www-bimas.dcrt.nih.gov/molbio/hla_bind/) (Parker et al., 1994) and SYFPEITHI (http://www.syfpeithi.de) (Rammensee et al., 1999). These programs recognise T cell epitopes on the basis of chemically related amino acids at specific positions, corresponding to those of known MHC ligands. This allows a peptide 'motif' to be defined for every MHC allele (Falk et al., 1991). SYFPEITHI uses motif matrices deduced from refined motifs that are based exclusively on single peptide analysis of natural ligands. Binders for various mouse, human and rat MHC class I molecules are compared according to the presence of primary and secondary anchor amino acids and other chemically similar residues. The predicted Pc96 amino acid sequence was submitted to the epitope prediction program SYFPEITHI to search for naturally processed MHC class II epitopes, and for a variety of HLA types (Rammensee et al., 1999).
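The motif-based scoring described above can be illustrated with a minimal sketch. The score table below is hypothetical and is not one of the real SYFPEITHI matrices; the anchor positions (P2, P6 and P9 of a nonamer), the score values and the function names are assumptions chosen only to show how anchor residues at fixed positions drive the ranking of candidate peptides.

```python
# Hypothetical position-specific scores for a 9-residue motif.
# These are NOT the real SYFPEITHI matrices; positions are 0-based.
MOTIF_SCORES = {
    1: {"L": 10, "M": 10, "I": 6, "V": 6},  # assumed primary anchor (P2)
    5: {"V": 4, "I": 4, "L": 4},            # assumed secondary anchor (P6)
    8: {"V": 10, "L": 10, "I": 6},          # assumed primary anchor (P9)
}
MOTIF_LENGTH = 9


def score_peptide(peptide):
    """Sum the scores of residues found at scored positions; others count 0."""
    return sum(MOTIF_SCORES.get(i, {}).get(aa, 0) for i, aa in enumerate(peptide))


def rank_epitopes(sequence, top_n=5):
    """Score every overlapping nonamer and return the best candidates.

    Each result is (score, 1-based start position, peptide), sorted by
    decreasing score and then by position.
    """
    scored = []
    for i in range(len(sequence) - MOTIF_LENGTH + 1):
        peptide = sequence[i:i + MOTIF_LENGTH]
        scored.append((score_peptide(peptide), i + 1, peptide))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_n]
```

Calling `rank_epitopes` on a protein sequence returns the highest-scoring windows first; a real prediction would substitute the published matrix for the relevant MHC allele.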
2.1.3 Multiple sequence alignments of Pc96 and similar proteins using CLUSTALW
The CLUSTALW algorithm, provided by the European Bioinformatics Institute (EBI) on the WWW, has been widely used to compare conserved amino acid sequences, either between homologous proteins or domains, or between different species (Thompson et al., 1996). The algorithm is based on the idea of progressive alignment, creating a series of pairwise alignments using the query sequence. A distance matrix is calculated from these initial alignments, reflecting the relatedness of the sequences. Pc96 and the proteins matched using BLAST were compared in multiple and pairwise sequence alignments to investigate homology and to identify regions of similarity within the sequences.
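The first stage of progressive alignment, pairwise alignment followed by a distance matrix, can be sketched as follows. This is a simplified illustration and not the CLUSTALW implementation: it assumes a plain Needleman-Wunsch global alignment with fixed match, mismatch and gap scores (instead of a substitution matrix and affine gap penalties), and it takes the distance to be one minus the fraction of identical residues over the aligned, non-gap positions.

```python
from itertools import combinations


def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def pairwise_distance(a, b):
    """1 - fraction of identical residues over aligned (non-gap) columns."""
    aligned_a, aligned_b = global_align(a, b)
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 1.0
    identical = sum(1 for x, y in columns if x == y)
    return 1.0 - identical / len(columns)


def distance_matrix(sequences):
    """Symmetric distance matrix for a dict mapping name -> sequence."""
    matrix = {name: {name: 0.0} for name in sequences}
    for name_a, name_b in combinations(sequences, 2):
        d = pairwise_distance(sequences[name_a], sequences[name_b])
        matrix[name_a][name_b] = d
        matrix[name_b][name_a] = d
    return matrix
```

In a progressive aligner, a matrix of this kind is used to build a guide tree that fixes the order in which sequences are added to the multiple alignment.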
2.1.4 Comparison of structural features of Pc96, Pyy178 and Pf403
The program PREDICT7 version 1.2 (Carmenes et al., 1989) was used to determine structural properties of Pc96 and related proteins, such as hydropathy, flexibility, surface probability and antigenicity. Because hydrophobic residues tend to be located in the interior of a globular protein, and hydrophilic residues on the outside, exposed to and interacting with water, Hopp and Woods (1981) designed an algorithm to represent this feature. Each amino acid is assigned a numerical hydrophilicity value, and these values are averaged along the sequence of the protein to produce a plot of the trends. Flexibility, or sequence mobility, provides structural clues as to the nature of movement within the protein. Functional activity can often be assigned to these conformational variations. Regions of highest mobility often lie on the most accessible segments, usually on the surface of the molecule (Karplus and Schulz, 1985). Surface probability measures the protein's contact with solvent. Features of the protein such as buried residues can determine the surface regions of globular proteins (Janin, 1979). Antigenicity plots are based on the amino acid composition of regions of the protein and show the probability of a region being antigenic, based on previous characterisation of known antigens (Welling et al., 1985). Because these plots generate data specific to structural aspects of the protein, they were used primarily to compare homologous regions of interest in the identified protein homologues.
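The sliding-window averaging behind a hydrophilicity plot can be sketched as follows. The per-residue values are the Hopp and Woods (1981) scale; the window length of six residues and the function names are assumptions made for illustration, and PREDICT7 itself may use a different window and smoothing.

```python
# Hopp and Woods (1981) hydrophilicity values for the 20 standard residues.
HOPP_WOODS = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3,
    "N": 0.2, "Q": 0.2, "G": 0.0, "P": 0.0, "T": -0.4,
    "A": -0.5, "H": -0.5, "C": -1.0, "M": -1.3, "V": -1.5,
    "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5, "W": -3.4,
}


def hydrophilicity_profile(sequence, window=6):
    """Mean hydrophilicity of each overlapping window along the sequence.

    Element i of the result covers residues i .. i + window - 1 (0-based).
    Non-standard residue codes raise KeyError.
    """
    values = [HOPP_WOODS[aa] for aa in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def most_hydrophilic_window(sequence, window=6):
    """Return (0-based start, mean value) of the most hydrophilic window."""
    profile = hydrophilicity_profile(sequence, window)
    best = max(range(len(profile)), key=lambda i: profile[i])
    return best, profile[best]
```

Plotting the returned list against window position gives the kind of trend plot described above; peaks mark stretches that are likely to be exposed on the surface.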
2.1.5 Identification of putative motifs and patterns from the PROSITE profile library and generation of similar 3D structures using 3D-PSSM
The PROSITE analysis program was used with the query protein sequence to scan a profile library (Sigrist et al., 2002). The PROSITE profile library used for searching is an ExPASy database that is a collection of biologically significant motifs or sequence patterns. Because some proteins are large and contain many domains and functions, protein functional classifications are based mostly on domains rather than complete proteins. Pc96 and other similar proteins were used to screen the PROSITE database. Certain functional domains contain sets of conserved regions of amino acids, and the occurrence and positions of these regions along the length of a protein sequence, in relation to certain alignment features, can be used to create a signature of the domain, allowing these patterns to be identified within query sequences.
These signatures are a reflection of the 3-dimensional conformation of the domain, and can be used to assign function to regions of uncharacterised proteins, provided that the profile of that region is a significant match.
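The simplest kind of signature, a PROSITE pattern, can be matched against a query sequence by translating it into a regular expression. The sketch below assumes the standard pattern syntax (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats); the function names are hypothetical, rarer syntax such as anchors inside brackets is not handled, and weighted PROSITE profiles are outside its scope.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regular expression."""
    regex = pattern.rstrip(".")                           # trailing period ends a pattern
    regex = regex.replace("-", "")                        # element separators
    regex = regex.replace("{", "[^").replace("}", "]")    # excluded residues
    regex = regex.replace("(", "{").replace(")", "}")     # repeat counts
    regex = regex.replace("x", ".")                       # any residue
    regex = regex.replace("<", "^").replace(">", "$")     # N- / C-terminal anchors
    return regex


def find_matches(sequence, pattern):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    # A lookahead lets overlapping occurrences be reported as separate hits.
    lookahead = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]
```

For example, the N-glycosylation site pattern `N-{P}-[ST]-{P}` (PROSITE entry PS00001) becomes `N[^P][ST][^P]`, and `find_matches("MNGSANPSNKTA", "N-{P}-[ST]-{P}.")` reports hits at positions 2 and 9 of that made-up sequence.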
The 3-dimensional structure and function of protein sequences were also predicted by analysis with the 3D-PSSM (Three Dimensional - Position-Specific Scoring Matrix) server. The server contains a database of known protein structures, which are compared to the query sequence and scored on the basis of compatibility. Factors such as secondary structure elements and the probability of occupying various states of hydrophilicity, in relation to overall shape, are analysed (Kelley et al., 2000).
Entries in the PROSITE database make up the so-called BLOCKS database, used to identify families of proteins. This is not a comparison of the sequence itself, because homologous proteins may not share sequence similarity. A block relates to motifs, or conserved stretches of amino acids that confer specific function on the structure of a protein. Individual proteins may contain several blocks in similar combination, corresponding to specific structure or function. The query sequence is aligned against all blocks in the database at all available positions. A score is derived from the alignments using the position-specific scoring matrix (PSSM), taking into account matches at given positions and the probabilities of amino acids occupying specific positions in the block (Baxevanis, 1998).
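Scoring a query against a block with a PSSM can be sketched as follows. This is an illustrative simplification and not the scoring scheme of the BLOCKS or 3D-PSSM servers: it assumes an ungapped block of equal-length sequences, a uniform background frequency for all 20 residues, a simple additive pseudocount, and log-odds scores in bits. The function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(block, pseudocount=1.0):
    """Build a log-odds PSSM (one dict per column) from an ungapped block."""
    background = 1.0 / len(AMINO_ACIDS)      # assumed uniform background
    pssm = []
    for col in range(len(block[0])):
        column = [seq[col] for seq in block]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def scan(query, pssm):
    """Score the block at every position of the query.

    Returns a list of (score, 1-based start position), one per window.
    """
    width = len(pssm)
    return [
        (sum(pssm[k][query[i + k]] for k in range(width)), i + 1)
        for i in range(len(query) - width + 1)
    ]


def best_hit(query, pssm):
    """Return the (score, 1-based start) of the highest-scoring window."""
    return max(scan(query, pssm), key=lambda hit: hit[0])
```

Residues that are frequent in a column receive positive scores and rare ones negative scores, so the best window is the stretch of the query that most resembles the conserved block.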