Chapter 5. Development of Deep Learning Models for Predicting Antibiotic-Resistant Genes (ARGs)
5.2. Material and methods
5.2.3. Data-driven modeling
[Data analysis]
The bulk datasets, including ARGs, were subdivided into three groups according to the amount of rainfall described in Table 5.1: before rainfall, during rainfall, and after rainfall. The spatial distributions and relative abundances of ARGs and OTUs before, during, and after rainfall were plotted using the ggplot2 and ggmap packages and the boxplot function in R, respectively. Shannon's index (H') was calculated for the diversities of ARGs and OTUs using the algorithm originally described by Shannon (1948). H' is a combination of the richness and evenness of species (Washington, 1984) and is widely applied in microbial community studies in aquatic environments. The H' indices before and after rainfall were calculated using the vegan package in R. Statistical significance was assessed at a probability value of 5% (p < 0.05) using the t-test for comparisons among the before-, during-, and after-rainfall groups. Pearson's correlation, calculated using the Hmisc package in R, was used to investigate the relationship between the intI1 gene and ARGs.
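For reference, with p_i denoting the relative abundance of the i-th of S observed taxa (or ARG subtypes), Shannon's index takes the standard form H' = −Σ_{i=1}^{S} p_i ln(p_i).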
Table 5.2. List of abbreviations and qPCR annealing temperature.
Fig. 5.3. Summary diagram for predicting ARGs at a recreational beach.
[Conventional LSTM]
LSTM was developed by Hochreiter and Schmidhuber (1997) to process long sequential data and extract its temporal features. In LSTM, the flow of information is controlled by a "gate" mechanism (Figs. 5.4a and b). These gates control the incoming and outgoing information in an LSTM unit. LSTM employs three gate types, named after the function they perform: the input gate (Γ_i), forget gate (Γ_f), and output gate (Γ_o). In addition to gates, LSTM also has a "memory", which is responsible for the flow of information across time. The gates interact with the memory to either remove information from it or add information to it. The memory is also called the "state." There are two types of states within an LSTM unit: the cell state (c_t) and the hidden state (h_t). At each time step, an LSTM unit generates two outputs, corresponding to the cell and hidden states. When LSTM is used in a "many-to-many" scenario, the output from each time step is calculated, which is equivalent to the hidden state at that time step. The time step in this case represents the number of observations processed by the LSTM at a time; this number of observations is also called the "lookback". The equations used to calculate the states and gates in LSTM are given below.
C̃_t = tanh(W_c [h_{t-1}, x_t] + b_c),  (1)
Γ_i = σ(W_i [h_{t-1}, x_t] + b_i),  (2)
Γ_f = σ(W_f [h_{t-1}, x_t] + b_f),  (3)
Γ_o = σ(W_o [h_{t-1}, x_t] + b_o),  (4)
c_t = Γ_i ∗ C̃_t + Γ_f ∗ c_{t-1},  (5)
h_t = Γ_o ∗ tanh(c_t),  (6)
where W and b are the weights and biases, respectively, which are learnable parameters; tanh and σ are the hyperbolic tangent and sigmoid functions, respectively; and ∗ denotes element-wise multiplication. At each time step, an LSTM unit receives the input x_t and produces the outputs c_t and h_t. The hidden state h_t is then used in a fully connected network to obtain the final model predictions.
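For illustration only (this is not the code used in the study), a minimal NumPy sketch of a single LSTM step implementing Eqs. (1)-(6) is given below; the parameter names and shapes are assumptions chosen to keep the example self-contained.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, params):
        """One LSTM time step following Eqs. (1)-(6).
        params holds weight matrices W_* of shape (units, units + n_features)
        and bias vectors b_* of shape (units,); these shapes are assumptions."""
        z = np.concatenate([h_prev, x_t])                    # [h_{t-1}, x_t]
        c_cand = np.tanh(params["W_c"] @ z + params["b_c"])  # Eq. (1): candidate memory
        g_i = sigmoid(params["W_i"] @ z + params["b_i"])     # Eq. (2): input gate
        g_f = sigmoid(params["W_f"] @ z + params["b_f"])     # Eq. (3): forget gate
        g_o = sigmoid(params["W_o"] @ z + params["b_o"])     # Eq. (4): output gate
        c_t = g_i * c_cand + g_f * c_prev                    # Eq. (5): new cell state
        h_t = g_o * np.tanh(c_t)                             # Eq. (6): new hidden state
        return h_t, c_t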
[LSTM-CNN hybrid model]
We applied a one-dimensional (1D) CNN to the output sequence of the LSTM model to enhance the prediction performance by further extracting features from the LSTM output (Fig. 5.4). The input to the 1D CNN was the LSTM output at each time step. A 1D CNN extracts features by convolving a weight matrix with its inputs (Eq. 7). The size of the weight matrix determines the total number of parameters in the CNN, while the extent of the convolution operation is set by the kernel size. The convolution output was further transformed by applying a non-linear activation function. The values of these hyperparameters were obtained using Bayesian optimization and are provided in Table 5.3. The CNN layer was followed by a maximum pooling layer (Eq. 8) to reduce the output size.
Finally, we applied a fully connected layer to obtain the final prediction from the model.
y = f(Σ (x ∗ w) + b),  (7)
z = max(0, x),  (8)
where x and y indicate the input and output of the CNN, respectively. w is the weight matrix, and b is the bias. The activation function (f) used in each model is indicated in Table 5.3.
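A minimal sketch of how such an LSTM-CNN stack could be assembled with the Keras API bundled with TensorFlow 1.15 is shown below; the unit counts, kernel size, and activation are placeholders standing in for the optimized values in Table 5.3, and the single-unit output layer assumes a single-ARG prediction.

    import tensorflow as tf

    def build_lstm_cnn(lookback, n_features, lstm_units=64, filters=128,
                       kernel_size=2, activation="relu"):
        """Sketch of the hybrid model: LSTM -> 1D CNN -> max pooling -> dense output."""
        model = tf.keras.Sequential([
            # LSTM layer returning the hidden state at every time step (many-to-many)
            tf.keras.layers.LSTM(lstm_units, return_sequences=True,
                                 input_shape=(lookback, n_features)),
            # 1D convolution over the sequence of LSTM hidden states (Eq. 7)
            tf.keras.layers.Conv1D(filters, kernel_size, activation=activation),
            # Max pooling reduces the size of the convolution output (Eq. 8)
            tf.keras.layers.MaxPooling1D(pool_size=2),
            tf.keras.layers.Flatten(),
            # Fully connected layer producing the final ARG prediction
            tf.keras.layers.Dense(1),
        ])
        return model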
Fig. 5.4. LSTM-CNN hybrid model structure for time steps 1 to t. (a) and (b) indicate the input and LSTM layers of the conventional LSTM, respectively. In the LSTM layer, Xt and Yt are the input and output at time t, respectively. The cell state (Ct) records past information through the three gates (f, i, and o), the sigmoid function (σ), and the activation function (FAct). (c) The LSTM output is fed into the 1D CNN and convolved using the optimized filter size and activation function type. (d) ARG prediction values are derived from the fully connected layer.
Fig. 5.5. IA-LSTM model structure. (a) The IA layer was adopted from the DA-RNN proposed by Qin et al. (2017). x^k denotes the k-th of the n original input features, and α_t^k represents the attention weight for the k-th input at time t. (b) The driving inputs (x̃_t) calculated by the IA layer are used in place of the original LSTM inputs for prediction.
[IA-LSTM model]
The LSTM model can extract temporal features from its inputs. However, not all of the inputs into an LSTM are equally important, and model performance can be enhanced if the model focuses on the more relevant inputs. To account for this, many researchers have proposed "attention" mechanisms that force the NN to focus on certain parts of the input (Luong et al., 2015). One example is the dual-stage attention-based recurrent neural network (DA-RNN), which was designed for time series prediction problems (Qin et al., 2017). In DA-RNN, an "input attention (IA)" stage is applied first, which allows the model to focus on the relevant inputs.
In the second stage, "temporal attention" is applied, which assists the model in selecting relevant data from the history. In this study, we used the IA mechanism in combination with LSTM, as illustrated in Fig. 5.5. The purpose of IA is to adaptively select the relevant driving input data. Unlike other deep learning models, the IA-LSTM is therefore not a complete black box, because it quantifies the importance of each input variable to the model.
e_t^k = v_e^T tanh(W_e [h_{t-1}; s_{t-1}] + U_e x^k),  (9)
α_t^k = exp(e_t^k) / Σ_{i=1}^{n} exp(e_t^i),  (10)
x̃_t = (α_t^1 x_t^1, α_t^2 x_t^2, …, α_t^n x_t^n)^T,  (11)
Given n input features, the attention weight for the k-th input, α_t^k, is calculated from the hidden and cell states of the LSTM according to Eqs. 9 and 10. The parameters v_e, W_e, and U_e are calibrated during the training process. Eq. 10 is a softmax function that returns an array of attention weights that sum to one (Chollet, 2018). The attention weights for each input are then used to construct the driving inputs x̃_t according to Eq. 11. More relevant driving inputs are amplified by their larger weights, while less important inputs are suppressed by their smaller weights. These driving inputs are then fed into the LSTM to extract the temporal features. Thus, instead of using the original inputs x_t (Eqs. 1–4), the IA mechanism uses x̃_t within the LSTM. The remainder of the model is identical to the conventional LSTM described in the Conventional LSTM section and produces the final model prediction.
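For illustration, a NumPy sketch of the input attention computation in Eqs. (9)-(11) is given below; the array shapes, and the choice of treating x^k as the k-th driving series over the lookback window as in Qin et al. (2017), are assumptions made to keep the example self-contained.

    import numpy as np

    def input_attention(X, x_t, h_prev, s_prev, v_e, W_e, U_e):
        """Input attention for one time step (Eqs. 9-11).
        X      : (n, T) the n driving input series over a lookback window of length T
        x_t    : (n,)   the original inputs at the current time step t
        h_prev : (m,)   previous LSTM hidden state
        s_prev : (m,)   previous LSTM cell state
        v_e, W_e, U_e : learnable parameters with assumed shapes (T,), (T, 2m), (T, T)."""
        hs = np.concatenate([h_prev, s_prev])                 # [h_{t-1}; s_{t-1}]
        # Eq. (9): unnormalized attention score e_t^k for each input feature k
        e = np.array([v_e @ np.tanh(W_e @ hs + U_e @ X[k]) for k in range(X.shape[0])])
        # Eq. (10): softmax so that the attention weights sum to one
        exp_e = np.exp(e - e.max())
        alpha = exp_e / exp_e.sum()
        # Eq. (11): driving inputs x~_t that replace x_t in the LSTM update
        x_tilde_t = alpha * x_t
        return x_tilde_t, alpha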
[Model training]
The predictions from each model were used to calculate the model "loss" values by comparing them with the observed ARG values. To reduce this value, a gradient-based learning mechanism called "backpropagation" was used (Rumelhart et al., 1986). During backpropagation, the model weights are updated to reduce the loss value. The weight changes are controlled by the learning rate parameter. We used the Adam (adaptive moment estimation) optimizer (Kingma and Ba, 2014), an adaptive learning rate algorithm. This algorithm adjusts the learning rate during the training process and has been used in many deep learning benchmarks (Gregor et al., 2015; Xu et al., 2015).

Table 5.3. Optimized hyperparameters for ARG prediction with conventional LSTM and LSTM-CNN.

ARGs | Batch size | Lookback | Learning rate | LSTM units | LSTM activation function | CNN filter size | CNN activation function
(a) Single-ARG prediction with conventional LSTM
blaTEM | 24 | 8 | 1.00E-07 | 128 | relu | N/A | N/A
(b) Single-ARG prediction with LSTM-CNN
aac(6’)-Ib-cr | 12 | 16 | 9.59E-07 | 256 | relu | 128 | relu
blaTEM | 4 | 16 | 3.04E-07 | 64 | relu | 256 | relu
sul1 | 4 | 14 | 9.30E-07 | 64 | tanh | 128 | tanh
tetX | 24 | 16 | 9.70E-07 | 256 | relu | 128 | tanh
(c) Multi-ARG prediction with LSTM-CNN
aac(6’)-Ib-cr + blaTEM + sul1 + tetX | 4 | 16 | 3.82E-07 | 64 | relu | 128 | relu
* N/A: not applicable

The NN was designed to produce continuous predictions even though the observed ARG values were sparsely distributed. To account for the missing observations, we employed a masking scheme that allowed the model to calculate the loss value only for predictions with corresponding observed values. This masking scheme is similar to that of Che et al. (2018), which informs the NN which values to ignore during the loss calculation. However, they employed the masking operation to account for missing input values, whereas we used it to account for missing output values.
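As a sketch only (assuming missing ARG observations are encoded as NaN in the target series, which is an assumption rather than a detail reported above), the masked loss could be written as follows:

    import tensorflow as tf

    def masked_mse(y_true, y_pred):
        """Mean squared error computed only where observed ARG values exist.
        Missing observations are assumed to be encoded as NaN in y_true."""
        mask = tf.math.logical_not(tf.math.is_nan(y_true))   # True where an observation exists
        y_true_obs = tf.boolean_mask(y_true, mask)
        y_pred_obs = tf.boolean_mask(y_pred, mask)
        return tf.reduce_mean(tf.square(y_true_obs - y_pred_obs))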
We used the TensorFlow API version 1.15 (Abadi et al., 2016) in the Python programming language to build the models. The models were trained on an Intel® Core™ i7-8700 processor with an NVIDIA GeForce GTX 1060 graphics card, 6 gigabytes of dedicated GPU memory, and 32 gigabytes of random access memory.
[Hyperoptimization]
We used a Bayesian optimization algorithm with Gaussian processes as the surrogate function to optimize the hyperparameters. The performance of an NN is strongly influenced by the hyperparameters used to build it (Hutter et al., 2015). In this study, these included the batch size, lookback, learning rate, the weight matrix sizes of the LSTM and CNN, and the activation functions of the LSTM and CNN. The batch size refers to the number of samples fed to the NN before each loss calculation and parameter update. The learning rate determines how much the model parameters change at each update step; a higher learning rate results in larger changes in the model parameters and vice versa. The size of the weight matrix is directly related to the learning capacity of the LSTM or CNN; however, larger values can result in overfitting.
The activation function determines the type of non-linearity applied to the LSTM or CNN output. The most widely used activation functions include the rectified linear unit (ReLU) (Nair and Hinton, 2010) and hyperbolic tangent.
Bayesian optimization is a derivative-free algorithm that optimizes parameters by treating the objective function as a black box (Lakhmiri et al., 2019). The surrogate function is a probabilistic model of the objective function; it estimates the probable loss values for candidate inputs. In the Bayesian optimization method, this surrogate function is optimized instead of the actual objective function. The algorithm used to select new parameters from the parameter space was the "expected improvement" acquisition function
(Jones et al., 1998; Mockus et al., 1978). We used Bayesian optimization because it is a common algorithm for optimizing NN hyperparameters (Frazier, 2018; Shahriari et al., 2015; Snoek et al., 2015).
During each optimization iteration, the model was trained, and its validation loss was calculated. The conventional LSTM, LSTM-CNN, and IA-LSTM models were trained for 25,000, 10,000, and 500 epochs per optimization iteration, respectively. We used the minimum validation loss as the objective function in the optimization. The optimization algorithm was implemented using the Scikit-Optimize library in the Python programming language (Kumar and Head, 2017). The hyperparameters were optimized separately for each individual-ARG and the multi-ARG NN prediction.
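A minimal sketch of how such a search could be set up with Scikit-Optimize is shown below; the search ranges, the number of calls, and the train_and_validate() helper are placeholders rather than the exact configuration used in this study.

    from skopt import gp_minimize
    from skopt.space import Categorical, Integer, Real

    # Hypothetical search space covering the hyperparameters discussed above
    space = [
        Integer(4, 32, name="batch_size"),
        Integer(2, 16, name="lookback"),
        Real(1e-7, 1e-3, prior="log-uniform", name="learning_rate"),
        Integer(32, 256, name="lstm_units"),
        Categorical(["relu", "tanh"], name="activation"),
    ]

    def objective(params):
        # train_and_validate() is a placeholder: it would build the model with these
        # hyperparameters, train it, and return the minimum validation MSE
        return train_and_validate(*params)

    result = gp_minimize(objective,       # objective evaluated at points proposed by the surrogate
                         space,
                         acq_func="EI",   # expected improvement acquisition function
                         n_calls=50,      # placeholder number of optimization iterations
                         random_state=0)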
The objective plots allow us to visualize the 1D partial dependence of the surrogate model on each optimized parameter (Figs. 5.6 and 5.7). The contour plots display the 2D partial dependence.
Partial dependence plots were proposed by Friedman (2001) to interpret the importance of input features used in gradient boosting machines. They help us to visualize the effect of one variable on the objective function if the influences of all other parameters are averaged. A flat partial dependence curve for a variable demonstrates that the variable was less critical to the optimizing algorithm. The evaluation plots were a grid of size n_dims by n_dims, where n_dims is the number of parameters being optimized (Figs. 5.8 and 5.9). The grid diagonal displays the histograms for each dimension. The diagrams below the histogram illustrate a two-dimensional (2D) scatter plot of all the points. A red point represents the location of the minimum found through optimization. The Bayesian algorithm explored the values for each parameter that minimized the objective function, i.e., values closer to the red circle. The convergence plots during hyperparameter optimization for ARG prediction with LSTM-CNN and IA- LSTM are displayed in Figs. 5.10 and 5.11, respectively. These figures illustrate how the validation MSE of the models decreased during the optimization process. Additionally, when optimizing the conventional LSTM model, the convergence, evaluation, and objective plots were not generated; thus, only the optimized hyperparameter values are displayed in Table 5.3.
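For reference, Scikit-Optimize provides helper functions that generate these diagnostic plots directly from an optimization result; a brief sketch (reusing the result object from the gp_minimize example above) is:

    from skopt.plots import plot_objective, plot_evaluations, plot_convergence

    plot_objective(result)      # 1D/2D partial dependence of the surrogate (objective plot)
    plot_evaluations(result)    # sampled points, with histograms on the diagonal (evaluation plot)
    plot_convergence(result)    # best objective value found vs. iteration (convergence plot)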
[Performance evaluation]
Out of the 218 ARG samples, we used 150 to train the NN, and the remaining 68 were used to evaluate the model performance. We used the mean squared error (MSE) as the training loss. We also evaluated the model performance using the coefficient of determination (R2), the root mean squared error (RMSE), and the percent bias (PBIAS). The equations used to calculate these metrics are as follows:
RMSE = √( Σ_{i=1}^{n} (o_i − p_i)^2 / n ),  (12)
PBIAS = [ Σ_{i=1}^{n} (o_i − p_i) / Σ_{i=1}^{n} o_i ] × 100,  (13)
where o_i and p_i are the i-th observed and predicted values, respectively, and n is the number of data points.
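For illustration, the evaluation metrics in Eqs. (12) and (13), together with R2, can be computed as in the sketch below; note that R2 is taken here as the squared Pearson correlation, which is one common convention and an assumption on our part.

    import numpy as np

    def rmse(o, p):
        """Root mean squared error, Eq. (12)."""
        o, p = np.asarray(o, float), np.asarray(p, float)
        return np.sqrt(np.mean((o - p) ** 2))

    def pbias(o, p):
        """Percent bias, Eq. (13)."""
        o, p = np.asarray(o, float), np.asarray(p, float)
        return np.sum(o - p) / np.sum(o) * 100.0

    def r_squared(o, p):
        """Coefficient of determination, taken as the squared Pearson correlation (an assumption)."""
        o, p = np.asarray(o, float), np.asarray(p, float)
        return np.corrcoef(o, p)[0, 1] ** 2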
Fig. 5.6. Objective plot (panels a–e) of Bayesian optimization for six NN hyperparameters with the LSTM-CNN hybrid model.
Fig. 5.7. Objective plot (panels a–d) of Bayesian optimization for six NN hyperparameters with the IA-LSTM model.
Fig. 5.8. Evaluation plot (panels a–e) of Bayesian optimization for six NN hyperparameters with the LSTM-CNN hybrid model.
Fig. 5.9. Evaluation plot (panels a–d) of Bayesian optimization for six NN hyperparameters with the IA-LSTM model.
Fig. 5.10. Convergence plot (panels a–e) of Bayesian optimization for six NN hyperparameters with the LSTM-CNN hybrid model.
Fig. 5.11. Convergence plot (panels a–d) of Bayesian optimization for six NN hyperparameters with the IA-LSTM model.