1.4 Non-ribosomal peptide synthetases (NRPSs)
1.4.2 NRPS substrate specificity prediction and the non-ribosomal code
simultaneous removal of the α-amino group of glutamine to form α-keto-glutamine in a variation of the classic transamination reaction. It was therefore proposed to be responsible for the transfer of an amine group to the β-position of the growing acyl chain. The ability to incorporate amine functionality directly into NRPs creates a functional handle for macrocyclization strategies and acts as a potent hydrogen bond donor, which can significantly alter the biology of a given compound (Fischbach & Walsh, 2006).
All of the modifications that have been introduced thus far are catalysed by domains that are embedded into their respective NRPS modules and almost entirely affect only the peptide backbone. Additionally, certain enzymes such as glycosyl transferases, halogenases and hydroxylases are able to modify the peptide’s side chains and are generally encoded in the same biosynthetic gene cluster.
pocket and the A domain residues making contact with L-Phe provided a structural basis for understanding the specificity of peptide synthetases such as NRPSs (Rausch et al., 2005; Lautru & Challis, 2004; Challis et al., 2000; Conti et al., 1997).
Expanding on the approach pursued by Conti and colleagues, Stachelhaus et al. (1999) and Challis et al. (2000) compared the L-Phe binding pocket of GrsA with the corresponding sequences of over 150 aminoacyl and iminoacyl adenylate-forming domains of NRPSs to pinpoint the essential amino acid residues involved in substrate specificity and binding. Both studies examined a ~100 amino acid stretch between the core motifs A4 and A5 and identified 10 amino acid residues that were within approximately 5.5 Å of the active site-bound phenylalanine and in contact with the substrate (Rausch et al., 2005; Lautru & Challis, 2004; Conti et al., 1997).
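The distance criterion described above can be illustrated with a short sketch. The coordinates and the residue label Gly100 below are invented placeholders and are not taken from the GrsA structure, and the function name residues_near_ligand is hypothetical; only the 5.5 Å cutoff comes from the text.

```python
import math

CUTOFF_ANGSTROM = 5.5  # cutoff quoted in the text


def residues_near_ligand(protein_atoms, ligand_atoms, cutoff=CUTOFF_ANGSTROM):
    """Return sorted residue labels with at least one atom within `cutoff` of a ligand atom.

    protein_atoms: iterable of (residue_label, (x, y, z)) tuples
    ligand_atoms:  iterable of (x, y, z) tuples
    """
    ligand_atoms = list(ligand_atoms)
    near = set()
    for label, coord in protein_atoms:
        if any(math.dist(coord, lig) <= cutoff for lig in ligand_atoms):
            near.add(label)
    return sorted(near)


# Hypothetical toy coordinates (NOT from the GrsA crystal structure).
toy_protein = [
    ("Asp235", (1.0, 0.0, 0.0)),  # 1.0 Å from the ligand atom
    ("Trp239", (0.0, 4.0, 3.0)),  # 5.0 Å from the ligand atom
    ("Gly100", (9.0, 9.0, 9.0)),  # well outside the cutoff
]
toy_ligand = [(0.0, 0.0, 0.0)]

print(residues_near_ligand(toy_protein, toy_ligand))
```

A real analysis would read the atom coordinates from the deposited structure file rather than from a hand-written list.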
These 10 amino acid residues were determined to be crucial for substrate binding and catalysis, as alignments of consensus A domain residues to the PheA domain revealed that all A domains display the same type of binding pocket, with different key residues that interact with different amino acid substrates. It was postulated that two residues, Asp235 and Lys517, stabilize the α-amino group of the amino acid substrate via two hydrogen bonds and are critical for the correct positioning of the substrate within the active site for ATP-dependent activation. These two residues are located in the conserved core motifs A4 (Asp235) and A10 (Lys517). The other residues bordering the PheA specificity binding pocket were determined to be Ala236, Ile330 and Cys331 on one side and Ala322, Ala301, Ile299 and Thr278 on the other side of the pocket. The two sides are separated by the indole ring of tryptophan at position 239 (Trp239), which is located at the bottom of the pocket (Stachelhaus et al., 1999).
More importantly, it was determined that, owing to the high degree of sequence identity shared between NRPS A domains, the amino acid residues that correspond to those lining the PheA binding pocket could be used to reveal substrate specificity in other A domains. The consecutive order of the 10 residues was determined to constitute the signature sequence of the A domain binding pocket and can be interpreted as the “specificity-conferring code” (also referred to as the non-ribosomal code), as it allows A domain selectivity to be predicted on the basis of the A domain primary sequence (Figure 1.17; Table 1.1) (Rausch et al., 2005; von Döhren et al., 1999).
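As a minimal sketch of how the signature sequence is read off, the following example concatenates the residues found at the ten pocket positions in GrsA PheA numbering. It assumes the query A domain has already been aligned to PheA so that each position can be looked up directly; the function name extract_signature and the dictionary input format are illustrative choices, not part of any published tool.

```python
# The ten pocket positions in GrsA PheA numbering, as listed in the text.
CODE_POSITIONS = (235, 236, 239, 278, 299, 301, 322, 330, 331, 517)


def extract_signature(residue_at):
    """Build the 10-letter code from a mapping {PheA position: one-letter residue}.

    Assumes the query domain was aligned to PheA beforehand (not done here).
    """
    missing = [pos for pos in CODE_POSITIONS if pos not in residue_at]
    if missing:
        raise KeyError(f"no aligned residue for positions {missing}")
    return "".join(residue_at[pos] for pos in CODE_POSITIONS)


# PheA pocket residues as named in the text (Asp235, Ala236, Trp239, ...).
phe_a_pocket = {
    235: "D", 236: "A", 239: "W", 278: "T", 299: "I",
    301: "A", 322: "A", 330: "I", 331: "C", 517: "K",
}

print(extract_signature(phe_a_pocket))  # DAWTIAAICK
```

The output matches the GrsA signature DAWTIAAICK quoted later in this section.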
Figure 1.17 Multiple sequence alignment of the primary amino acid sequences of known A domains, used to determine the selectivity-conferring residues. (a) The ~100 aa sequence between the core motifs A4 and A5 of PheA from GrsA was aligned with the corresponding sequences of AspA from the surfactin synthetase (SrfA), OrnA from the gramicidin synthetase (GrsB3) and ValA from the cyclosporine synthetase. Yellow residues indicate binding pocket positions and brown residues indicate conserved motifs that anchor the alignment. (b) Ten highly conserved residues were extracted from the sequence alignment and the consecutive order of the amino acids was determined to constitute the signature sequence of the binding pockets of the aligned A domains. The missing residue, Lys517, is highly conserved within motif A10, which is not shown in the protein sequence. The alignment was extended to 160 different A domains to confirm accuracy in determining the signature sequence (adapted from Stachelhaus et al., 1999).
This discovery confirmed previous findings from site-directed mutagenesis and photoaffinity labelling experiments that indicated that these amino acid residues were involved in ATP binding and hydrolysis (Gocht & Marahiel, 1994; Pavela-Vrancic et al., 1994). The non-ribosomal code was initially restricted to amino acid-activating A domains but was extended to carboxy acid-activating A domains when the crystal structure of the stand-alone 2,3-dihydroxybenzoic acid-activating domain, DhbE, from Bacillus subtilis was solved (May et al., 2002).
Table 1.1 Consensus specificity code for substrates from several adenylation domains
Clusters of signature sequences extracted from A domains activating the same substrates were used to determine the consensus sequences for the recognition of several amino acid substrates. The biosynthetic template from which each A domain specificity code was derived is included, along with the overall similarity of each signature sequence. Variable constituents within each codon are represented by red residues and ‘wobble’-like positions, which reveal a large degree of variability throughout all codons, are indicated in blue. Aad, δ(L-α-aminoadipic acid); Dab, 2,3-diamino butyric acid; Dhb, 2,3-dihydroxy benzoic acid; Sal, salicylate; Phg, L-phenylglycine; hPhg, 4-hydroxy-L-phenylglycine; Pip, L-pipecolinic acid; Dht, dehydrothreonine; ‘@’ indicates a modification of the residue (Stachelhaus et al., 1999).
The successful mutation of all the key residues of the PheA binding pocket, which resulted in relaxation or alteration of its substrate specificity, demonstrated the reliability of the non-ribosomal code. This has resulted in the development of web-based NRPS prediction services such as the NRPS-PKS knowledge base (Ansari et al., 2004), NP.searcher (Li et al., 2009), PKS/NRPS analysis (Bachmann & Ravel, 2009) and antiSMASH 3.0 (Weber et al., 2015), which make use of the “specificity-conferring code” to predict putative A domain substrates in NRPS genes. Although these tools have been relatively successful at predicting the specificities of A domains in new NRPS genes, they do have a number of weaknesses (Challis et al., 2000).
A clear shortcoming is that predictions of substrate specificities are based on known A domain sequences. Since not all A domain sequences in nature are known and, in particular, since there are relatively few sequences for A domains that bind more unusual substrates, the accuracy of substrate specificity prediction is limited. The specificity of uncharacterised A domains must therefore be deduced from the available code for domains with known specificity. Deviations from the code have also been observed. For example, not all A domains specific for phenylalanine have exactly the same specificity binding pocket sequence as GrsA: BarG of the barbamide synthesis gene cluster from Lyngbya majuscula (GenBank accession number: AAN32981) has DAWTVAAVCK instead of DAWTIAAICK. In addition, there are cases in which the substrate predicted from the code does not correspond to the amino acid that is actually activated, such as the alanine-activating domain Sare0718 from the marine actinomycete Salinispora arenicola CNS-205, which has the code for valine. Similarly, the biosynthetic cluster for fusaricidin, a mixture of 12 depsipeptides from Paenibacillus polymyxa PKB1, displays relaxed substrate specificity and allows for the incorporation of D-amino acids instead of their L-isomers (Xia et al., 2012; Li & Jensen, 2008).
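The deviation between the BarG and GrsA codes can be made explicit with a position-by-position comparison. This is a minimal sketch: the two signatures are the ones quoted in the text, whereas the function name compare_codes and the simple identity fraction are illustrative assumptions and do not reproduce the scoring used by any particular prediction service.

```python
# The ten pocket positions in GrsA PheA numbering, in signature order.
CODE_POSITIONS = (235, 236, 239, 278, 299, 301, 322, 330, 331, 517)


def compare_codes(reference, query):
    """Return (identity fraction, [(position, reference residue, query residue), ...])."""
    if len(reference) != len(CODE_POSITIONS) or len(query) != len(CODE_POSITIONS):
        raise ValueError("both signatures must contain exactly 10 residues")
    diffs = [
        (pos, ref, qry)
        for pos, ref, qry in zip(CODE_POSITIONS, reference, query)
        if ref != qry
    ]
    matches = len(CODE_POSITIONS) - len(diffs)
    return matches / len(CODE_POSITIONS), diffs


identity, diffs = compare_codes("DAWTIAAICK", "DAWTVAAVCK")  # GrsA vs BarG
print(identity)  # 0.8
print(diffs)     # [(299, 'I', 'V'), (330, 'I', 'V')]
```

Both differences are conservative Ile-to-Val substitutions, which is why an exact-match lookup against the GrsA code fails here even though the two domains share 8 of 10 code residues.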
A further shortcoming has been observed in the analysis of the active sites within the A domains of certain types of NRPSs, particularly those belonging to fungi, either because the GrsA crystal structure (from a bacterium) seems to be an inadequate model for fungal NRPSs or because the large number of sequence variants in the active site of fungal NRPSs does not allow for the identification of the key residues required for substrate specificity prediction (Prieto et al., 2012). Certain positions are considered more variable or ‘wobble’-like than others, particularly 239, 278, 299, 322 and 331, which are highly variant. By contrast, positions 235 and 517 are considered invariant and positions 236, 301 and 330 are moderately variant (Table 1.1). The variability reflects each position’s importance in contributing to substrate specificity, but also causes dissimilarity between the signature sequences for A domains specific for identical substrate amino acids (Stachelhaus et al., 1999).
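The variability classes described above can be used to show where two signatures differ. The sketch below is illustrative only: the class assignments follow the text, while the function name mismatches_by_class and the idea of grouping mismatches this way are assumptions made for the example and are not a published scoring scheme.

```python
# The ten pocket positions in GrsA PheA numbering, in signature order.
CODE_POSITIONS = (235, 236, 239, 278, 299, 301, 322, 330, 331, 517)

# Variability class of each position, as stated in the text.
VARIABILITY = {
    235: "invariant", 517: "invariant",
    236: "moderate", 301: "moderate", 330: "moderate",
    239: "high", 278: "high", 299: "high", 322: "high", 331: "high",
}


def mismatches_by_class(reference, query):
    """Group mismatching positions of two 10-residue signatures by variability class."""
    if len(reference) != len(CODE_POSITIONS) or len(query) != len(CODE_POSITIONS):
        raise ValueError("both signatures must contain exactly 10 residues")
    grouped = {"invariant": [], "moderate": [], "high": []}
    for pos, ref, qry in zip(CODE_POSITIONS, reference, query):
        if ref != qry:
            grouped[VARIABILITY[pos]].append(pos)
    return grouped


# GrsA (PheA) versus BarG, both codes quoted in this section.
print(mismatches_by_class("DAWTIAAICK", "DAWTVAAVCK"))
# {'invariant': [], 'moderate': [330], 'high': [299]}
```

For this pair the two phenylalanine-specific codes differ at one highly variant and one moderately variant position, while the invariant anchors Asp235 and Lys517 are unchanged.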
Because of these limitations, Rausch et al. (2005) expanded on the work of Stachelhaus et al. (1999) and Challis et al. (2000) by publishing a machine learning algorithm in which transductive support vector machines (TSVMs) were used to predict NRPS A domain specificity from the physico-chemical fingerprint of the residues within 8 Å of the active site of the A domain (a total of 34 residues). The residues are encoded on the basis of their physico-chemical properties, such as the number of hydrogen bond donors, polarity, volume, secondary structure preferences, hydrophobicity and isoelectric point, and a continuously updated dataset of A domains with known specificity is used to predict substrate specificity. Owing to the occurrence of relaxed or promiscuous specificity in certain A domains, such as in the NRPS responsible for xenematide biosynthesis in Xenorhabdus nematophila, the specificities for substrates with similar physico-chemical properties are clustered together (Crawford et al., 2011; Rausch et al., 2005).
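The idea of encoding residues by their physico-chemical properties can be sketched as follows. This example is a simplification under stated assumptions: it uses a single descriptor, the Kyte-Doolittle hydropathy scale, as a stand-in for the set of descriptors used by Rausch et al. (2005), it encodes the 10-residue code rather than the 34 residues, and it trains no support vector machine. The function names encode and fingerprint_distance are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy index; used here as a single stand-in descriptor.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(residues):
    """Map a string of one-letter residues to a numeric feature vector."""
    return [KYTE_DOOLITTLE[aa] for aa in residues]


def fingerprint_distance(a, b):
    """Euclidean distance between two encoded residue strings of equal length."""
    return math.dist(encode(a), encode(b))


# GrsA (PheA) versus BarG: two Ile-to-Val substitutions.
print(fingerprint_distance("DAWTIAAICK", "DAWTVAAVCK"))  # about 0.42
```

Because Ile and Val have similar hydropathy values, the two phenylalanine-specific codes lie close together in this feature space even though they are not identical as strings, which is the property that a classifier trained on physico-chemical features can exploit.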
An open source web-based predictor, NRPSpredictor, was built on the TSVMs and the 34 active site residues in order to predict A domain specificity. It was later refined and replaced by NRPSpredictor2, which offers improved prediction performance, two new prediction levels and a larger database (Rottig et al., 2011).
Although the TSVM-based method drew on a more recent database and was able to provide a substrate specificity prediction in an additional 18% of cases, it is still beneficial to use it in combination with the empirical predictive method developed by Stachelhaus et al. (1999) and Challis et al. (2000) in order to create an even more accurate prediction tool (Rausch et al., 2005). This is largely because most of the weaknesses of the older method remain and, by expanding the number of residues considered, the newer method may have amplified the problems associated with including data that have little or no influence on the specificity of the A domains. In addition, the clustering of specificities reduces the accuracy of the predictions and, although this is acceptable for A domains that possess relaxed specificity, other A domains are much more specific. There is currently no way to distinguish a highly specific A domain from a relaxed one (Rausch et al., 2005; Mootz et al., 2002; Challis et al., 2000).
This situation prompted interest in developing new prediction methods supported by other approaches, such as the use of hidden Markov models (HMMs). Khurana et al. (2010) applied HMMs to functionally classify the members of the acyl-CoA synthetase superfamily. The results of this work suggest that HMM-based classification of this superfamily outperforms predictions based on a restricted number of active site residues (Khurana et al., 2010).
Furthermore, a novel two-mode factor analysis model based on latent semantic indexing (LSI) has recently been published by Baranasic et al. (2013). This model is able to predict the specific amino acid that is activated by the A domain, rather than a cluster of similar amino acids. The authors report that, in a detailed comparison of prediction quality against NRPSpredictor, the LSI model performed slightly better, and they therefore suggest that it is the most accurate method currently available for the prediction of A domain substrate specificities (Baranasic et al., 2013).