Boosting Adversarial Training Using Robust Selective Data Augmentation

(1)

ZU Scholars ZU Scholars

All Works

5-20-2023

Boosting Adversarial Training Using Robust Selective Data Boosting Adversarial Training Using Robust Selective Data Augmentation

Augmentation

Bader Rasheed Innopolis University Asad Masood Khattak Zayed University

Adil Khan

Innopolis University; University of Hull Stanislav Protasov

Innopolis University Muhammad Ahmad

National University of Computer and Emerging Sciences

Follow this and additional works at: https://zuscholars.zu.ac.ae/works Part of the Computer Sciences Commons

Recommended Citation Recommended Citation

Rasheed, Bader; Masood Khattak, Asad; Khan, Adil; Protasov, Stanislav; and Ahmad, Muhammad,

"Boosting Adversarial Training Using Robust Selective Data Augmentation" (2023). All Works. 5831.

https://zuscholars.zu.ac.ae/works/5831

This Article is brought to you for free and open access by ZU Scholars. It has been accepted for inclusion in All Works by an authorized administrator of ZU Scholars. For more information, please contact scholars@zu.ac.ae.

(2)

https://doi.org/10.1007/s44196-023-00266-x RESEARCH ARTICLE

Boosting Adversarial Training Using Robust Selective Data Augmentation

Bader Rasheed¹ · Asad Masood Khattak³ · Adil Khan^1,2 · Stanislav Protasov¹ · Muhammad Ahmad⁴

Received: 7 February 2023 / Accepted: 2 May 2023

Abstract

Artificial neural networks are currently applied in a wide variety of fields, and they are near to achieving performance similar to humans in many tasks. Nevertheless, they are vulnerable to adversarial attacks in the form of a small intentionally designed perturbation, which could lead to misclassifications, making these models unusable, especially in applications where security is critical. The best defense against these attacks, so far, is adversarial training (AT), which improves the model’s robustness by augmenting the training data with adversarial examples. In this work, we show that the performance of AT can be further improved by employing the neighborhood of each adversarial example in the latent space to make additional targeted augmentations to the training data. More specifically, we propose a robust selective data augmentation (RSDA) approach to enhance the performance of AT. RSDA complements AT by inspecting the quality of the data from a robustness perspective and performing data transformation operations on specific neighboring samples of each adversarial sample in the latent space. We evaluate RSDA on MNIST and CIFAR-10 datasets with multiple adversarial attacks. Our experiments show that RSDA gives significantly better results than just AT on both adversarial and clean samples.

Keywords Deep neural networks · Robustness · Adversarial attacks · Data augmentation · Adversarial training Abbreviations

RSDA Robust selective data augmentation AT Adversarial training

DL Deep learning

FGSM Fast gradient sign method

IFGSM Iterative fast gradient sign method BIM Basic iterative method

PGD Projected gradient descent NT Normal training

1 Introduction

With the growing popularity of Deep Learning (DL), DL- based systems are being applied in a wide variety of areas [1, 2]. While the performance of DL-based systems is compelling for their widespread adoption and deployment, they have their own limitations. When these systems are introduced to the adversary, the large number of domains they serve often become vulnerable [3]. These systems are susceptible to adversarial attacks, which are specific input samples to DL-models that an attacker has designed to cause a misclassification. Moreover, the generated adversarial examples for one model can be transferred to attack other models [4]. Different defense methods have been proposed to protect models against these attacks. Among these methods, Adversarial Training (AT) [5] is the most

* Bader Rasheed

b.rasheed@innopolis.university Asad Masood Khattak Asad.Khattak@zu.ac.ae Adil Khan

a.khan@innopolis.ru; a.m.khan@hull.ac.uk Stanislav Protasov

s.protasov@innopolis.ru Muhammad Ahmad mahmad00@gmail.com

1 Institute of Data Science and Artificial Intelligence, Innopolis University, Universitetskaya Str, Innopolis 420500, Russia

2 School of Computer Science, University of Hull, Hull HU67RX, UK

3 College of Technological Innovation, Zayed University, Abu Dhabi Campus, 144534 Abu Dhabi, United Arab Emirates

4 Department of Computer Science, National University of Computer and Emerging Sciences, Chiniot-Faisalabad Campus, Islamabad 35400, Pakistan

(3)

popular defense, which works by training the model not only on clean samples but on generated adversarial samples as well. AT is considered a data augmentation method, where we augment the training data with samples, lead to improving the robustness of the model.

Data augmentation has a proven efficiency in increasing models generalization by expanding the training dataset with creating additional synthesized samples. Manag- ing the quality of training data plays an essential role in improving models performance. To achieve this, many data augmentation has been provided [6]. Such techniques improve the generalization of the model on out- of-distribution samples, and adversarial samples belong to this family of samples. The area of data augmentation has drawn a remarkable improvement recently [7]. Differ- ent methods have been proposed to augment data for any modality in a generic way [6, 8]. However, it is important to consider not using all data samples for augmenting, but only important samples that cause an increment in the generalization of the model [9]. To achieve this goal, the dataset should be analyzed to find important influential samples for predicting a test sample.

Our goal is to design a robust selective data augmentation (RSDA) approach for optimizing training data that leads to increasing the neural networks models’ robustness without a big drop in the accuracy on clean samples. The idea is to generate new training samples as a shift of other original samples in the direction that helps in increasing the model’s robustness. The generated samples are not intended to be produced to resemble the original samples but instead, have features that increase the model’s generalization on adversarial examples. The approach is supposed to be model- agnostic and works for any data modality.

We propose RSDA to train the model on robustly aug- mented training data to achieve the most accurate model on both clean and adversarial samples. For each training example, we construct an adversarial example, find the training samples that lead to this incorrect classification of the adversarial sample in the adversarial class, and apply transformations in the latent space in the directions that lead to less incorrect classification without affecting the accuracy on the test set. The approach stems from the point that there could be different training samples from the original training sample that formulate the decision boundary of the model and lead to this incorrect classification.

Accordingly, in this paper, we propose a novel robust selective data augmentation method, which uses data transformations in the latent space to boost AT. Instead of simply considering the classification loss on adversarial and clean samples, as in AT, we consider also finding an optimum transformation of the specific samples, which helps in decreasing the sample space for adversarial examples. The overall architecture consists of three components: a feature

generator, a discriminator, and a classifier. These components help to produce a robust compact latent space.

In short, the contributions of this paper are directed at improving the generalization of AT on both adversarial and clean samples by formulating the problem as a data augmentation task where different transformations are performed in the latent space on samples that lead to better robustness. Since these transformations are performed in the latent space, our method can be applied to any input modality, including images, texts, graphs, etc. We evaluate RSDA using MNIST and CIFAR-10 datasets on FGSM, PGD, BIM adversarial attacks. Experimental results show that RSDA enhances the generalization of AT on all conducted tests.

2 Related Work

In this section, we review the existing works in deep learning robustness, automatic data augmentation, and finding influential samples since RSDA depends on these works to boost adversarial training.

2.1 Adversarial Attacks and Defenses

The vulnerability of DL models to adversarial attacks was first introduced by [10]. The existence of adversarial attacks was explained by the low occurrence probability of adversarial samples in real data, which makes classifying them difficult. After that, Fast Gradient Sign Method (FGSM) attack was introduced in [5] and the existence of adversarial samples was attributed to the high dimensional linearity of DL models. FGSM is very fast because it performs one-step attack, but sometimes its success rate is low. Therefore, an iterative FGSM (I-FGSM) was proposed by [11] where the loss function is increased in multiple small steps instead of one large step. Subsequently, many adversarial attacks algo- rithms were proposed including the basic iterative method (BIM), projected gradient descent (PGD) [11], Carlini and Wagner (C &W) attacks [12], and others.

Meanwhile, many defenses against adversarial attacks were introduced recently, including heuristic and certificated defense [13]. Heuristic defense refers to a defense that increases the robustness of the model against specific attacks without giving theoretical guarantees. The most effective heuristic defense is adversarial training (AT) [5], which augments the training data with adversarial samples generated by the previously mentioned attacks. Empirically, PGD adversarial training achieves the best accuracy against a wide range of adversarial attacks on several datasets [14].

Many other heuristic defenses were proposed in the literature which mainly depends on performing transformations and denoising in the input space and feature space, but these defenses are shown to be helpless against adaptive attacks

(4)

[15]. On the other hand, certified defenses can provide a guarantee for their lowest accuracy under a pre-defined group of adversarial attacks. A popular certified defense is to formulate an adversarial polygon and to convexly relax the problem to define its upper bounds [16]. This upper bound guarantees that no attack with specific limitations can surpass the certificated attack success rate. However, these defenses are still restricted to small datasets and small models, making them inapplicable to real life scenarios.

Since AT is the most successful applicable defense, we use it as a baseline and propose a new method to boost it.

2.2 Automatic Data Augmentation

In practice, many transformations are performed on the original samples to augment the training data. For image data, for example, different processing techniques are used like flipping, cropping, color shifting, and others. The success of image data augmentation motivated applying data augmentation in all possible domains where DL is applied.

To make the augmentations universal for all data modali- ties, [8] proposes a generic automated data augmentation approach called MODALS. MODALS tries to perform a smooth transition from real samples in the embedding space to artificial samples with preserving their class identity.

MODALS exploits four universal automated data transformations in the latent space: Hard example interpolation, Hard example extrapolation, Gaussian Noise, and Differ- ence Transform. Deep learning models, however, may not work properly if they are trained on all training data, and the model generalization can be instead enhanced by drop- ping some unfavorable samples [9]. In the first round of the model training, the unfavorable samples are inspected, then they are dropped, and then retrain the model from scratch with the reduced training dataset in the second round. Find- ing important examples could also be done early in training instead of waiting until the end of training, as in [17].

For finding important examples early in training, a scoring method is used to identify important and difficult examples early in training, and prune the training dataset without enor- mous sacrifices in test accuracy. Another direction is to learn the importance or weight of the samples and assign these weights to training examples based on their gradient directions, as in [18]. Ren et al. [18] performs validation at every training iteration to determine the example weights of the current batch and assigns importance weights to examples in every iteration.

2.3 Influential Samples

ML systems are complicated, and explaining where the model came from to make the predictions is very challeng- ing. This problem can be tackled using influence functions to

identify training points most responsible for a given prediction [19]. To formalize the impact of a training point on a prediction, the authors of [19] ask the counterfactual: “what would happen if this training point did not exist, or if the values of this training point were changed slightly?”. The authors study the effect of a training sample on a test sample by approximating the effect of removing this training sample on the loss of the test sample. Another approach for finding prediction important samples is by decomposing the pre- activation prediction of a neural network into a linear combination of activations of training points, with the weights corresponding to what is called “representer values”, which thus capture the importance of that training point on the learned parameters of the network with positive representer values corresponding to excitatory training points, and negative values corresponding to inhibitory point [20]. A positive representer value indicates that a similarity to that training point is excitatory, while a negative representer value indicates that a similarity to that training point is inhibitory. A new approach (TracIn) for calculating the influence of a training sample on the prediction of a test sample is introduced in [21] inspired by the fundamental theorem of calculus. The fundamental theorem of calculus decomposes the difference between a function at two points using the gradients along the path between the two points. Analogously, TracIn decomposes the difference between the loss of the test point at the end of training versus at the beginning of training along the path taken by the training process. For a particular training example x_i , the authors approximate the idealized influence by summing up an approximation in all the iterations in which x_i was used to update the parameters.

In this paper, we take advantage of the success of the previously mentioned methods to arrive at a robust augmentation approach to enhance AT. We will compare different approaches for finding influential samples and discuss which approach might work better in the case of adversarial attacks. Also, we propose different data augmentations inspired by [8].

3 The Proposed Approach

In this section, we describe our proposed RSDA approach to boost AT. We are given an original training dataset D_tr = (x_(i), y_(i))^m_i=1 , where x_(i)∈X is the input sample, y_(i) is the corresponding label, and m is the size of the dataset. In clean scenarios, we need to learn a predictive function f ∶X→Y that reduces the loss on test dataset D_tst . However, the function f is not robust against adversarial attacks. Adversarial training considers reducing the loss on both original samples from X and adversarial samples from X^adv by augmenting the dataset D_tr with perturbed data X^adv received from attacking the original dataset. However, AT

(5)

augments the dataset with only adversarial labels and does not consider adding any other artificial samples that may increase the robustness. RSDA makes two observations to boost the model’s robustness. Firstly, we observe that for each x∈X^adv there may be a set of samples S⊆D_tr that, if altered, may increase the model robustness against x.

Examples of such samples are mislabeled samples, multi- labelled samples, and highly similar samples from two different classes, etc. Secondly, we observe that if the latent space of a deep neural network is learned properly, the class region is mostly convex and clustered. This allows to make smooth alterations to the selected samples by preforming linear transformations on them in the latent space without changing their class identity. We empirically prove that these observations are valid using MNIST and CIFAR datasets.

Accordingly, in this work, we propose, implement and validate a novel RSDA approach to boost/complement AT.

In RSDA, for each training example x∈D_tr , first we construct an adversarial sample xâdv . Next, we find its corresponding S that should be altered to help the model correctly classify xâdv . Finally, for each x∈S we move it in the direction that leads to a reduction in the model’s classification error on xâdv without affecting its accuracy on the test set.

The three steps of RSDA are shown using a toy example in Fig. 1. The approach stems from the point that for each adversarial sample, there could be some interesting samples in the training data that formulate the decision boundary of the model near the adversarial sample. Identifying these samples and making some targeted alterations to them might assist the model in correctly classifying the corresponding adversarial sample. Therefore, to implement RSDA we need to solve three issues:

1. First, we need to generate the adversarial samples on the fly during training. For each training sample x∈D_tr , we

can move in the gradients’ direction that maximizes the loss (similar to FGSM), or we can use stronger attack methods (Like PGD, BIM). We discuss this in Sect. 3.1.

2. For each adversarial sample x^adv , we need to find the influential training set S. Finding S should be done effec- tively without the need to retrain the model for every sample. In Sect. 3.1.1, we discuss our approach for finding S.

3. Finally, we need to find the direction in which to move the influential samples. The movement should lead to not only classifying the adversarial examples correctly, but also to preserving the loss on clean samples as much as possible. The possible directions in which such samples can be moved are discussed in Sect. 3.1.2.

3.1 Finding Adversarial Samples

Several attack methods have been introduced in literature to find x^adv . In our work, we use and compare three such techniques. The simplest method is fast gradient sign method (FGSM) [5]. It tries to maximize the loss function by finding the gradients of the loss with respect to the input sample and update along the direction of the gradient with a restric- tion on the L_∞ norm of perturbation so that the difference between adversarial and clean sample is imperceptible. The L_∞ norm is used over other L_p norms because it restricts the maximum amount of perturbation added, and this makes the perturbation less perceptible, especially for images, but we can easily generalize the attack to other L_p norms.

Mathematically:

x^adv=x+𝜖 ∗ sign(∇xJ(h(x), y_true)) (1) s.t ∶‖

‖‖x^adv−x‖

‖‖^∞≤Δ.

XX X X

X X X XXXX X X X X X XX

X X X XXXX XX X X X X X X

X X X X XXXX XX X X X X

(a) (b) (c) (d)

XX X X X X X XXXX X X X X X

Fig. 1 Our proposed RSDA approach presented in a toy example. a A binary classifier classifying between two classes, o and x. b Execut- ing an adversarial attack on a sample from the o class, resulting in an adversarial sample x^adv (with red color) from the x class. c Finding influential set S (with blue color) for the adversarial sample, and find-

ing the proper directions (represented with blue arrows) in which to move these influential samples. d Moving the influential samples and retraining the model so that the adversarial sample is classified correctly from the o class (the old decision boundary is represented with dashed line and the new decision boundary with solid line)

(6)

The second method that we employ is a stronger iterative attack called the basic iterative method (BIM) [11]. It is similar to FGSM, but runs for multiple iterations. It creates iterative perturbations as:

where t determines the iteration number, 𝜖 is the attack step or the attack learning rate, and clip(input, a, b) restricts the adversarial sample to reside in the range [a, b], and how this constraint is satisfied depends on the L_p norm used.

Finally, we also use the Projected Gradient Descent (PGD) [11] attack, which also works in an iterative man- ner and restricts the maximal perturbation by projecting the perturbed sample into a feasible area. Differently from BIM, which initializes the first point as the original sample, PGD initializes the first point randomly within the area around the original sample. The noisy initialization of PGD leads to a stronger attack that converges better [11].

3.1.1 Finding Influential Samples

Among those discussed previously in related works, we study three potential approaches that might be applied in the study to find influential set S:

1. Using influential functions [19], which is based on finding the change in the parameters of the model as a training sample is removed from the training set. To formalize the impact of a training point on a prediction, the authors of [19] ask the counterfactual: “what would happen if this training point did not exist, or if the values of this training point were changed slightly?”.

The authors study the effect of a training sample on a test sample by approximating the effect of removing this training sample on the loss of the test sample. However, one of the problems of this method is that it assumes that the neural network is optimally trained before finding the influence, but optimality in deep neural networks can rarely be relied upon. We tested the utilization of this method to find influential samples for adversarial samples, but this method is extremely slow compared to the other two methods we discuss here. Considering that we will find adversarial samples and their influence samples (2) x^adv₀ =x

(3) xâdv_t+1=xâdv_t +𝛼∗ sign(∇_xâdv

t J(h(x^adv_t ), y_true))

(4) xâdv_t+1= clip(xâdv_t+1, xâdv_t+1−𝜖, xâdv_t+1+𝜖),

iteratively during training, we decided to exclude this method from the experiments.

2. Using Representer Point Selection [20], which decom- pose the pre-softmax logits values into a linear combination of training point activations. The representer values are the weights in this linear combination, and they correspond to the importance of each training sample.

Positive representer values indicate excitatory influence of the training sample on the prediction of a test sample, while a negative value means it has a negative influence.

The decomposition looks like: 𝜙(x_t,𝜃^∗) =∑n

i k(a_i, x_t, x_i) where 𝜃^∗ is the optimal parameters of the model, n is the total number of training samples, 𝜙 is the logits layer of the neural network, a_i= −_2𝜆n¹ ^𝜕L(x_𝜕𝜙(xⁱ^,yⁱ^,𝜃)

i,𝜃) , and k(x_t, x_i, a_i) =a_if_i^Tf_t , and f_i is the last intermediate features layer for input x_i . Given such a representer theorem, k(x_t, x_i, a_i) can be seen as the contribution of the training data x_i on the testing prediction of x_t . Such for- mulation provides insight both as to why the neural network prefers a particular prediction, as well as why it does not, which is typically difficult to obtain via other sample-based explanations. This approach works for any model with linear matrix multiplication before the activation 𝜎 , and with L2 weight decay applied as a regu- larization term in the loss function.

3. Estimating Training Data Influence by TracIn (Tracing Gradient Descent) [21], which approximates the influence of a training sample on a test sample by finding how much the loss at the test point changes when training the model with the specific training sample only. So, training samples that reduce the loss are proponents and samples that cause an increment in the loss are oppo- nents. In order not to replay the training process for each training sample, we use first-order heuristic approximation via checkpoints and the primary notion of influence of a training sample z on a test sample z^′ becomes:

TracInCP(z, z^�) =∑k

i=1𝜂_i▿L(w_t_i, z).▿L(w_t_i, z^�) where k is the number of training iterations, 𝜂_i is the step size used between checkpoints i−1 and i.

We use the same naming convention used in representer point selection [20]. We term the group of training samples from S that have positive influence on the prediction of a test sample for a particular class as excitatory samples, and we term the group of training samples from S that have negative influence as inhibitory samples. Using this terminology, the excitatory samples for a test sample excite the activation values and push the prediction towards a particular class, while the inhibitory samples

(7)

suppress the activation values and indicate training points that lead the network away from a particular class.

3.1.2 Finding Movement Direction

Since we aim for RSDA to work for any data modality, performing label-preserving augmentation in the input space will be inappropriate for discrete data, such as text, graphs, etc. Therefore, we consider applying different modality-agnostic transformations in the latent space. The idea is to find a direction to move the sample in the latent space and apply it to the sample such that augmenting the training data with that newly gained sample leads to an increment in the robustness without changing the class identity.

Thus, to achieve robust automatic data augmentations, we need to learn a continuous latent space where we can apply class-preserving transformations, and we need to find effective movement direction. We leverage and compare the following four latent space transformations on the influential samples to find the best movement direction:

1. Excitatory samples interpolation: Let z^c_i be the latent representation of the x_i sample in class c, and let z^c_i^�

adv be the latent representation of the adversarial sample com- ing from an adversarial attack on x_i in adversarial class c^′ . Let U= {u}^q_j=1 be the latent representation of the q excitatory samples for x^c_i^�

adv (U contains the first q samples that lead to the adversarial sample being classified in class c^′ ). Let P= {p}^q_l=1 the latent representation of the q excitatory samples for x^c_i

adv (P contains the first q samples that lead to the adversarial sample being classified in the correct class c). For every sample u_i∈U , we find the nearest sample p∈P such that u_i and p have the same ground-truth class., then we move u_i towards p. Therefore, excitatory samples interpolation is expressed as:

where P^c represents items from P that have the same ground-truth class of u_i and 𝜆₁ is a scaling factor.

2. Inhibitory samples interpolation: Let Υ = {𝜐}^q_j=1 be the latent representation of the q inhibitory samples for x^c_i^� ( Υ contains the first q samples that prevent the adver-adv

sarial sample from being classified in adversarial class c^′ ). Let K= {k}^q_l=1 be the latent representation of the q inhibitory samples for x^c_i

adv (K contains the first q samples (5)

̂

u_i=u_i+𝜆₁(p−u_i);

(6) p= argmin

p∈P^c

p.u_i

‖p‖��u_i��

that prevent the adversarial sample from being classified in the correct class c). For every sample k_i from K, we find the nearest sample 𝜐 in Υ such that k_i and 𝜐 have the same ground-truth class. Then, we move k_i towards 𝜐 . Therefore, inhibitory samples interpolation is expressed as:

where K^c represents items from K that have the same ground-truth class of k_i and 𝜆₂ is a scaling factor.

3. Excitatory samples class center interpolation and extrapolation: We attempt to send samples that cause the adversarial sample being classified incorrectly in a direction that reduces this influence. We perturb the excitatory sample u_i by sending it towards its class mean:

On the other side, we attempt to send samples that cause the adversarial sample being classified correctly in a direction that increases this influence. We perturb the excitatory sample p_i by extrapolating it from its class mean:

4. Inhibitory samples class center interpolation and extrapolation: We attempt to send samples that prevent the adversarial sample from being classified correctly in a direction that reduces this influence. We perturb the inhibitory sample k_i by sending it towards its class mean:

On the other side, We attempt to send samples that prevent the adversarial sample from being classified incorrectly in a direction that increases this influence. We perturb the inhibitory sample 𝜐_i by extrapolating it from its class mean:

4 Model

Our goal is to train a classification model with different data augmentations on the latent space to achieve the best performance on clean and adversarial samples. Given an (7) k̂_i=k_i+𝜆₂(𝜐−k_i);

(8) 𝜐= argmin

𝜐∈K^c

𝜐.k_i

‖𝜐‖��k_i�

�

(9)

̂

u_i=u_i+𝜆₃(𝜇_c1−u_i)

(10)

̂

p_i=p_i+𝜆₄(p_i−𝜇_c2)

(11) k̂_i=k_i+𝜆₅(𝜇_c3−k_i)

(12)

̂

𝜐_i=𝜐_i+𝜆₆(𝜐_i−𝜇_c4)

(8)

input space X and an output space Y, our classification model can be divided into a feature generator G_f , a classifier G_y , and a discriminator G_d . The feature extractor takes a sample from the input space and maps it to the latent space:

G_f(x,𝜃) ∶X→Z where Z is the latent space, and then the classifier maps form Z to Y. Thus, during training, for each input sample x_i∈X , we use the current state of G_f, G_y to generate the adversarial example xâdv_i , yâdv_i , then we find the previously explained excitatory and inhibitory influential samples set S for the adversarial sample xâdv_i , and then the label-preserving transformation operations are performed on the latent sample of these influential samples to produce Z^′ . We employ three losses to produce the required augmentation results. The overall architecture is shown in Fig. 2, and the algorithm is given in Algorithm 1.

4.1 Adversarial Loss

To produce class-preserving data augmentation, we need our latent space to preserve continues transformation inside each class in the latent space, such that the produced data do not exist in invalid regions outside the demanded class. This problem can be solved by enforcing the latent samples to be produced from a Gaussian distribution similar to VAE [22].

Hence, we add a discriminator G_d(z,𝜙) component to the model to discriminate the samples produced by the feature generator in the latent space and samples produced from Gaussian distribution. By playing this game between the feature generator and the discriminator, the feature generator will be enforced to produce samples that are sampled from a Gaussian distribution to fool the discriminator. Thus, the adversarial loss of the feature generator L_adv(𝜃) and the discriminator L_G

d(𝜙) becomes

4.2 Triplet Loss

As mentioned before, to accomplish label-preserving robust transformations, we need to decrease the probability of the (13) L_adv(𝜃) = −1

M

∑M i=1

logG_d(z_i)

L (14)

G_d(𝜙) = − 1 M

∑M i=1

(logG

d(𝜖i) + log[1−G

d(z

i)]);∀𝜖i∼N(0,I)

data being outside the class hull in the latent space corresponding to a valid sample of this class. Thus, the class distribution in the latent space should be as compact as possible. Triplet loss helps to achieve that by pushing latent representations of the same class together and pushing away representations of other classes [23]. Triplet loss operates on three samples as input: an anchor (in our case the latent representation of the sample) z_a , a positive sample z_p which is the latent representation of a sample has the same class as the anchor, and a negative latent representation z_n that has a different class from the anchor. Triplet loss works by minimizing the distance in the feature space between the anchor sample and positive sample and maximizing the distance of the anchor sample z^a and negative sample. Mathematically, the triplet loss function is defined as follows:

where m is the margin by which the distance between the anchor and the positive sample is at least larger than the distance between the anchor and the negative sample.

4.3 Classification Loss

The ultimate goal is to have the input sample being classified correctly, whether the sample is adversarial or clean. As mentioned before, the feature extractor G_f operates on the input sample to get z_i , then we augment z_i using one of the augmentation strategies to get z^′ , and the classifier G_y operates on z^′ as an input and produces the final class for the input sample. The classifier contains many fully connected layers. The classification loss is thus given by:

H(.) here is the classification loss function, and we use the cross-entropy loss, and z^′ is the latent representation for any sample from the clean and adversarial domains (Fig. 2).

Our final objective function becomes as follow:

And we should not forget also that we have the discriminator G_d(z,𝜙) component of the model that is trained to reduce the discriminator loss L_G

d(𝜙) mentioned in the adversarial loss subsection.

L_trip= ∑ (15)

z^a,z^p,zⁿ

max(d(z^a, z^p) −d(z^a, zⁿ) +m, 0),

L_cls= ∑ (16)

(x_i,y_i)∈(D_c∪D^d_a)

H(G_y(z^�), y_i)

(17) min

𝜃_g,𝜃_c

{𝜆₇L_cls+𝜆₈L_adv+𝜆₉L_trip}

(9)

5 Experiments

In principle, since we are applying the augmentation in the latent space, our method can be applied to any dataset and any adversarial attack. For comparison against Adversar- ial Training, we focus on two datasets: MNIST [24] and CIFAR10 [25], and three adversarial attacks: PGD, BIM, and FGSM. For all experiments, we normalize the pixel values to the range [0, 1].

5.1 Experiment Setup

In every training iteration, we use FGSM, PGD, and BIM to generate three untargeted adversarial samples on the fly. To evaluate the method performance, we compare RSDA with:

1. Normal Training (NT) with cross-entropy loss [26] on the clean training data.

Fig. 2 The architecture of the proposed method

(10)

2. Adversarial Training (AT) with the cross-entropy loss on the clean training data and the adversarial examples from the FGSM, PGD, and BIM.

For each dataset, we train a vanilla model (NT), Adversarial Training (AT), and the four kinds of selective augmentation represented before, with perturbation 𝜖 for the generated adversarial samples, and evaluate these models on FGSM, PGD, and BIM attacks bounded by the same 𝜖 . We consider L_∞ as a measure of perturbation in all attacks. The experiments were implemented on a single GeForce GTX 1080 Ti.

In principle, any conventional image classification model can be used. The features’ generator comprises a stack of convolutional layers, while the discriminator and classifier comprise a stack of fully connected layers. While any optimization method can be used for training, we choose Adam optimization [27] for training all the components with a batch size of 1024, 200 epochs and ( 𝛽₁=0.9,𝛽₂=0.99 ).

The learning rate starts as 0.005 and decayed by 2 every 30 epochs. After training, the discriminator can be removed and the robust feature generator and the classifier can be used instead of the conventional image classifier.

Table 1 Components

architecture for MNIST Feature extractor Classifier Discriminator

Conv2d(3, 64, 5) nn.Linear(32*4*4, 100) Linear(32*4*4, 128) +Relu

BatchNorm2d(64) nn.BatchNorm1d(100) + Relu Dropout()

MaxPool2d(2) + LeakyReLU Dropout() Linear(128, 64) +Relu

Conv2d(64, 32, 5) nn.Linear(100, 50) Dropout()

BatchNorm2d(32) nn.BatchNorm1d(100) + Relu Linear(64, 64) +Relu

MaxPool2d(2) + LeakyReLU nn.Linear(50, 10) Linear(64, 1)

Fig. 3 The test sample from class 2 with the excitatory and the inhibitory samples

Fig. 4 The adversarial sample classified in class 1 with the excitatory samples leading to misclassifying the sample as an adversarial sample and the inhibitory samples that prevent the adversarial sample from being classified into the adversarial 1 class

Fig. 5 The adversarial sample classified in class 1 with the excitatory samples leading to classifying the sample correctly as a clean sample from class 2 and the inhibitory samples that prevent the adversarial sample from being classified correctly as a clean sample from class 2

(11)

5.2 Experimental Results 5.2.1 Results on MNIST

Since MNIST is a comparatively easy dataset, we use simple network architectures for the different components, as shown in Table 1.

The allowed adversarial perturbation 𝜖 , in this case, is 0.3, and the maximum number of iteration for BIM and PGD is 30.Finding influential samples To examine the efficiency of finding influential samples set S, we present an example of a test image of digit 2 and we perform an untargeted PGD attack ending in an adversarial sample classified by the classifier as digit 1. The first thing we notice from Fig. 3 is that the majority of the excitatory samples for the clean test sample using both Representer and TracIn are from the same class (Digit 2 in our example) which seems reasonable since the samples from this class formulate the decision boundary near the test sample. However, more interestingly, we see that from the four most inhibitory samples for the clean test sample, three of them using Representer and TracIn are from digit 1 which is the same class as that of the untargeted adversarial sample derived from the test sample, and these same samples appear in the most excitatory samples for the adversarial sample being classified in the adversarial class 1, as shown in Fig. 4. Furthermore, they appear in the most

inhibitory samples that prevent the adversarial sample from being classified in the clean sample, as shown in Fig. 5. This is more evident using TracIn than with Representer. This experiment was repeated many times, and we noticed the following pattern.

• The adversarial class for a particular clean sample appears noticeably in the inhibitory samples for the clean samples.

• Among the inhibitory samples for the adversarial sample being classified in the adversarial class (Fig. 4), we see the majority are from class 2 (the clean class) and class 6, and in this case class 6 is the second most probable class for untargeted adversarial attack after class 1 (The untargeted attack finds the easiest way to fool the model, which in this case leads to an adversarial sample from original class 2 being classified in class 1, so the easiest adversarial class is 1. The next most easy class is 6).

• This means that what prevents the adversarial sample from being classified in the adversarial class is the excitatory samples for the clean sample and the excitatory samples for the second proper adversarial sample. This makes our proposed method work properly since mag- nifying the clean sample’s excitatory samples not only leads to increasing the accuracy on clean samples, but also reduces the effect of adversarial samples

Finding movement direction We take q to be equal to 10.

As we enlarge q, we get better performance, but training the model needs more time. We test different values for 𝜆₁ , 𝜆₂ , 𝜆₃ and 𝜆₄ . Obviously, there is a tradeoff between robustness and accuracy here. As we increase 𝜆 , we get farther from the real latent sample and decrease the effect of the samples that cause the adversarial sample to be classified incorrectly; but also at the same time, we produce latent representations that are less realistic. We choose here 𝜆₁=𝜆₂=𝜆₃=𝜆₄=0.2

Results To see what the best method we can use for choosing the influential samples, we repeat the experiments using both Representer and TracIn. The accuracy results are reported in Table 2 for Representer, and in Table 3 for TacIn, and the best results are shown in bold. Using both, NT has the best accuracy on the clean data, but has the worst robustness or accuracy on adversarial samples. The adversarial accuracy on clean samples is almost the same between AT and our method. The best accuracy is achieved with excitatory samples class center interpolation and extrapolation.

Our methods reduce the decrement of clean accuracy significantly. From Tables 2 and 3, we notice that both Representer and TracIn give good results, although TracIn works better when the augmentation is related to inhibitory samples.

Table 2 Accuracy results on MNIST using TracIn

Defense Clean% FGSM BIM PGD

Vanilla 98.63 11.33 10.56 10.14

AT 97.44 91.05 89.13 87.09

Excitatory samples interpolation 97.83 91.02 89.83 87.11 Inhibitory samples interpolation 98.31 91.11 89.45 87.14 Excitatory samples class center

interpolation and extrapolation 98.52 91.91 91.09 88.81 Inhibitory samples class center

interpolation and extrapolation 98.34 91.44 91.52 87.65

Table 3 Accuracy results on MNIST using Representer

Vanilla 98.63 11.33 10.56 10.14

AT 97.44 91.05 89.13 87.09

(12)

5.2.2 Results on CIFAR10

Since classifying CIFAR10 is harder than MNIST, we use the VGG architecture. The convolution layers compose the feature extractor, the last fully connected layers form the classifier, and the discriminator, which do not change from MNIST as shown in Table 4.

Here, the maximum allowed perturbation is 𝜖=0.031 and the number of iterations for BIM and PGD is 30.

Finding influential samples We repeat the same experiment we performed on MNIST, we present an example of a test sample from class 9 (truck) and perform an untargeted PGD attack ending in an adversarial sample classified by the classifier as class 1 (automobile). The same observations we noticed on MNIST are confirmed here. We notice from Fig. 6 that the majority of the excitatory samples for the clean test sample using both Representer and TracIn are from the same class (Truck). We also observe that from

Table 4 Components

architecture for CIFAR10 Feature Extractor Classifier Discriminator

Conv2d(3, 64, 3) nn.Linear(32*4*4, 100) Linear(32*4*4, 128) +Relu

BatchNorm2d(64) +Relu nn.BatchNorm1d(100) + Relu Dropout()

Conv2d(64, 64, 3) Dropout() Linear(128, 64) +Relu

BatchNorm2d(64) +Relu nn.Linear(100, 50) Dropout()

MaxPool2d(2) nn.BatchNorm1d(100) + Relu Linear(64, 64) +Relu

Dropout() nn.Linear(50, 10) Linear(64, 1)

Conv2d(64, 128, 3) BatchNorm2d(128) +Relu Conv2d(128, 128, 3) BatchNorm2d(128) +Relu MaxPool2d(2) + Dropout() Conv2d(128, 256, 3) BatchNorm2d(256) +Relu Conv2d(256, 256, 3) BatchNorm2d(256) +Relu MaxPool2d(2) + Dropout() Conv2d(256, 512, 3) BatchNorm2d(512) +Relu Conv2d(512, 512, 3) BatchNorm2d(512) +Relu MaxPool2d(2) + Dropout()

Fig. 6 The test sample from class nine (truck) with the excitatory and the inhibitory samples

Fig. 7 The adversarial sample classified in class 1 (automobile) with the excitatory samples leading to misclassifying the sample as an adversarial sample and the inhibitory samples that prevent the adversarial sample from being classified into the adversarial 1 class

(13)

the four most inhibitory samples for the clean test sample, three of them using Representer and four of them using Tra- cIn are from class 1 (automobile) which is the same class of the untargeted adversarial sample derived from the test

sample, and these same samples appear in the most excitatory samples for the adversarial sample being classified in the adversarial class one as shown in Fig. 7, and they appear in the most inhibitory samples that prevent the adversarial sample from being classified in the clean sample in Fig. 8 (This is more evident using TracIn than with Representer).

This experiment was repeated many times, and we noticed a tendency that the adversarial class appears noticeably in the inhibitory samples for the clean sample. Among the inhibitory samples for the adversarial sample being classified in the adversarial class (Fig. 7), we see the majority are from class 9 (the clean class) and class 8 (ship), and in this case class eight is the second most probable class for untargeted adversarial attack after class 1.

Finding movement direction The same configurations are applied as in MNIST

Results The accuracy results are reported in Table 5 for Representer, and in Table 6 for TacIn. Using both, NT has the best accuracy on the clean data, but has the worse robustness or accuracy on adversarial samples. However, our augmentations enhance the performance of AT on both clean samples and adversarial samples. The best accuracy is achieved with excitatory samples class center interpolation and extrapolation. However, our methods reduce the decrement of clean accuracy significantly. From Tables 2 and 6, we notice that both Representer and TracIn give good results, although TracIn works better when the augmentation is related to inhibitory samples.

6 Ablation Study

We conduct an ablation study to show the effect of additional losses we propose. We also show the impact of adding these additional losses to AT in the first row of table 7. We used excitatory samples class center interpolation and extrapolation augmentation and used TracIn for finding the influential samples.

The experiments on CIFAR show that both adversarial loss and triplet loss have positive impact on improving the model performance against FGSM adversarial attacks in both At and RSDA. However, the impact that these losses add is larger in

Fig. 8 The adversarial sample classified in class 1 (automobile) with the excitatory samples leading to classifying the sample correctly as a clean sample from class 9 (The truck class) and the inhibitory

samples that prevent the adversarial sample from being classified correctly as a clean sample from class 9

Table 5 Accuracy results on CIFAR using TracIn

Best results are shown in bold

Vanilla 83.35 14.99 15.62 11.07

AT 82.12 64.22 61.12 49.35

Table 6 Accuracy results on CIFAR using representer

Best results are shown in bold

Vanilla 83.35 14.99 15.62 11.07

AT 82.12 64.22 61.12 49.35

Table 7 Comparison of model’s average accuracy on FGSM adversarial samples when trained with different losses

Best result is shown in bold

Loss Lcls Lcls+Ladv Lcls+Ltrip Lcls+Ladv+Ltrip

AT (without aug) 64.22 64.38 64.84 65.05

with aug 63.55 64.81 65.63 RSDA 67.19

(14)

RSDA since our method take advantage of these looses to provide correct robust augmentation in the latent space.

7 Conclusion

In this paper, we design a RSDA approach to boost the performance of adversarial training on adversarial samples and clean samples. The proposed approach reduces the effect of adversarial attacks by finding robust transformations in the feature embedding space. The investigation of samples that affect the robustness of the models is estab- lished theoretically and proved experimentally. The experimental results show that RSDA increases the generalization of the model on both clean and adversarial samples.

Acknowledgements This research has been financially supported by The Analytical Center for the Government of the Russian Fed- eration (Agreement No. 70-2021-00143 dd. 01.11.2021, IGK 000000D730321P5Q0002). This research work was also supported by Zayed University Office of Research.

Author Contributions all authors have made substantial contributions to the conception or design of the work.

Funding No funding was received for conducting this study.

Availability of Data Previously reported datasets were used to support this study and are available at DOI: 10.1109/MSP.2012.2211477 and DOI:10.1.1.222.9220. These prior studies and datasets are cited at rel- evant places within the text.

Code Availability Not applicable.

Declarations

Conflict of interest The authors declare that they have no conflicts of interest to report regarding the present study.

Ethics approval Not applicable.

Consent to participate Not applicable.

Consent for publication Not applicable.

Open Access This article is licensed under a Creative Commons Attri- bution 4.0 International License, which permits use, sharing, adapta- tion, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.

References

1. Neu, D.A., Lahann, J., Fettke, P.: A systematic literature review on state-of-the-art deep learning methods for process prediction.

Artif. Intell. Rev. 1–27 (2021). https:// doi. org/ 10. 1007/ s10462- 021- 09960-8. arXiv: 2101. 09320

2. Khattak, A., Khan, A., Ullah, H., Asghar, M.U., Arif, A., Kundi, F.M., Asghar, M.Z.: An efficient supervised machine learning technique for forecasting stock market trends. EAI Springer Innov. Commun. Comput. (2022). https:// doi. org/ 10. 1007/

978-3- 030- 75123-4_7

3. Rasheed, B., Khan, A., Kazmi, S.M.A., Hussain, R., Piran, M.J., Suh, D.Y.: Adversarial attacks on featureless deep learning mali- cious URLs detection. Comput. Mater. Contin. 68(1), 921–939 (2021). https:// doi. org/ 10. 32604/ cmc. 2021. 015452

4. Papernot, N., McDaniel, P., Goodfellow, I.: Transferability in machine learning: from phenomena to black-box attacks using adversarial samples (2016). arXiv: 1605. 07277

5. Arrieta, A.B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., García, S., Gil-López, S., Molina, D., Benjamins, R.: Explainable artificial intelligence (xai): concepts, taxonomies, opportunities and challenges toward responsible ai.

Inf. Fusion 58, 82–115 (2020)

6. Shorten, C., Khoshgoftaar, T.M.: A survey on Image Data Aug- mentation for Deep Learning. J. Big Data 6(1), 1–48 (2019).

https:// doi. org/ 10. 1186/ s40537- 019- 0197-0

7. Khan, A., Fraz, K.: Post-training iterative hierarchical data augmentation for deep networks. In: Advances in Neural Informa- tion Processing Systems, vol. 2020-Decem (2020). (visited on 05.01.2023). https:// proce edings. neuri ps. cc/ paper/ 2020/ hash/

07417 7d3eb 6371e 32c16 c55a3 b8f70 6b- Abstr act. html

8. Cheung, T.-H., Yeung, D.-Y.: Modals: Modality-Agnostic Auto- mated Data Augmentation in the Latent Space. In: Iclr, pp. 1–18 (2021). (visited on 05.01.2023). https:// github. com/ james tszhim/

modals

9. Wang, T., Huan, J., Li, B.: Data dropout: Optimizing training data for convolutional neural networks. In: Proceedings-International Conference on Tools with Artificial Intelligence, ICTAI, vol.

2018-Novem, pp. 39–46 (2018). (visited on 05.01.2023). https://

doi. org/ 10. 1109/ ICTAI. 2018. 00017. https:// ieeex plore. ieee. org/

abstr act/ docum ent/ 85760 15/

10. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. In: 2nd International Conference on Learning Representa- tions, ICLR 2014-Conference Track Proceedings (2014). arXiv:

1312. 6199

11. Kurakin, A., Goodfellow, I.J., Bengio, S.: Adversarial examples in the physical world. In: Artificial Intelligence Safety and Security, pp. 99–112. Chapman and Hall/CRC, (2018)

12. Carlini, N., Wagner, D.: Towards Evaluating the Robustness of Neural Networks. In: Proceedings-IEEE Symposium on Security and Privacy, CW, pp. 39–57 (2017). https:// doi. org/ 10. 1109/ SP.

2017. 49. http:// nicho las. carli ni. com/ code/ nn

13. Liang, H., He, E., Zhao, Y., Jia, Z., Li, H.: Adversarial attack and defense: a survey. Electronics (2022). https:// doi. org/ 10. 3390/ elect ronic s1108 1283

14. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.:

Towards deep learning models resistant to adversarial attacks. In:

6th International Conference on Learning Representations, ICLR 2018-Conference Track Proceedings. International Conference on Learning Representations, ICLR (2018). arXiv: 1706. 06083 15. Carlini, N., Wagner, D.: Adversarial examples are not eas-

ily detected: bypassing ten detection methods. In: AISec