Enhanced sgRNA On-Target Cleavage Efficacy Prediction using Conditional GANs
Item Type Conference Paper
Authors Fanaras, Konstantinos;Antoniadis, Charalampos;Massoud, Yehia Mahmoud
Citation Fanaras, K., Antoniadis, C., & Massoud, Y. (2023). Enhanced sgRNA On-Target Cleavage Efficacy Prediction using Conditional GANs. 2023 IEEE International Symposium on Circuits and Systems (ISCAS). https://doi.org/10.1109/iscas46773.2023.10181826
Eprint version Post-print
DOI 10.1109/iscas46773.2023.10181826
Publisher IEEE
Rights This is an accepted manuscript version of a paper before final publisher editing and formatting. Archived with thanks to IEEE.
Download date 2023-12-02 17:47:30
Link to Item http://hdl.handle.net/10754/693182
Konstantinos Fanaras, Charalampos Antoniadis and Yehia Massoud Innovative Technologies Laboratories (ITL),
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia {charalampos.antoniadis, yehia.massoud}@kaust.edu.sa
Abstract—The wide usage of the Clustered Regularly Interspaced Short Palindromic Repeats associated with the Cas9 enzyme (CRISPR/Cas9) system, which is one of the latest genome-editing methods, has shed light on pathways that were inconceivable before, from enhancing fruit nutrition to treating incurable diseases. The precise prediction of single guide RNA (sgRNA) on-target knockout efficacy poses a substantial challenge to the practical application of CRISPR/Cas9 systems. Although many Machine Learning (ML)-based techniques have yielded encouraging results, prediction accuracy still needs improvement.
CRISPR/Cas9 datasets do not contain many examples for training the models. Generative Adversarial Networks (GANs) are good at learning a model for generating new training data given a dataset. This work proposes a novel Machine Learning model that combines Convolutional Neural Networks and conditional GANs (coGANs) to predict sgRNA on-target knockout efficacy.
Experimental results showed that the proposed model could outperform other models proposed in the literature in regression and classification tasks for predicting the on-target sgRNA cleavage efficacy in CRISPR/Cas9 systems.
Index Terms—CRISPR/Cas9, sgRNA, CNN, Conditional GANs
I. INTRODUCTION
CRISPR/Cas9 is one of the latest and most promising gene-editing technologies we know today. In essence, the Clustered Regularly Interspaced Short Palindromic Repeats associated with the Cas9 enzyme (CRISPR/Cas9) system imitates the immune system of bacteria against DNA viruses. More specifically, a single guide RNA (sgRNA) molecule drives the Cas9 enzyme to the viral DNA segment by looking for the sequence in the DNA molecule that is complementary to the sgRNA.
The viral DNA segment is followed by a particular DNA sequence known as the Protospacer Adjacent Motif (PAM), which activates the cutting mechanism of the Cas9 enzyme: when the sgRNA is paired with the infected DNA segment, the Cas9 enzyme knocks out that segment.
A primary reason that hinders the straightforward use of the CRISPR/Cas9 system is that the sgRNA may bind to non-target DNA sites, known as off-target effects. In addition, it is helpful to know the cleavage efficacy at the on-target sites before performing an in-vivo experiment, which is time-consuming and may be harmful to the subject, or we may want to know the conditions that would increase the knockdown efficacy. Thus, research on CRISPR/Cas9 systems is interested in precisely predicting the off-target effects and the on-target knockdown efficacy of the sgRNA.
Numerous ML-based strategies [1]–[12] have been presented for predicting the cleavage propensity of a genomic region for a particular sgRNA with various epigenetic characteristics. DeepCrispr [1] uses a "one-hot" encoding for the sgRNA and epigenetic sequences and trains an AutoEncoder to learn suitable vector embeddings for these sequences. Then, the pre-trained Encoder of the AutoEncoder is employed as the embedding layer in two Deep Learning (DL) models, one for the on-target and one for the off-target sgRNA cleavage efficacy prediction. CNN-SVR in [2] first encodes the sgRNA and epigenetic information like DeepCrispr. Then, it uses two pipelines with CNN layers, one for the sgRNA nucleotides and the other for the epigenetic information, to extract a suitable feature vector from these data sequences. Next, a Random Forest (RF) model selects the most significant features from the extracted feature vector, and finally, predictions are made with a Support Vector Regression (SVR) model trained on the features selected by the RF. The CNN-XG model in [3] preserves exactly the same pipeline as [2] but substitutes the SVR model with an XGBoost model [13] to make predictions. The AttnToCrispr CNN model in [4] employs Transformer and CNN layers to learn effective features and then makes predictions using Fully Connected (FC) layers. Finally, CnnCrispr in [5] first embeds the sgRNA and epigenetic sequences into a vector space with the GloVe model and then extracts features from these trained embeddings using BiLSTM and CNN layers to build a predictive model.
Although the above ML models have shown remarkable results in predicting sgRNA on/off-target cleavage efficacy, their accuracy and generalizability still need to be enhanced. In this work, influenced by the works in [2] and [3], we propose a model that first extracts useful features from the sgRNA and epigenetic information sequences using CNNs and then makes predictions using a regression/classification model based on conditional GANs (coGANs), with the aim of improving on the on-target sgRNA cleavage efficacy prediction of previous works.
The rest of the paper is organized as follows. Section II presents the datasets, and explains the coGANs briefly.
Section III presents the proposed model for the prediction of the on-target cleavage efficacy and the architectures of the proposed coGANs. Finally, Section IV demonstrates the better predictions we can obtain by using the proposed model in both regression and classification tasks, while conclusions are drawn in Section V.
II. BACKGROUND
A. Datasets
First, we pre-trained the CNNs in our model on approximately 200,000 non-redundant sgRNAs with biologically meaningful knockout efficacies [1].
For the performance evaluation of the proposed model, CNN-coGAN, we used 1071 sgRNAs from four independent, experimentally validated human datasets of sgRNA on-target cleavage efficacy, namely HCT116, HEK293T, HELA, and HL60 [1]. Each data entry contains a 23-nt sgRNA sequence, four kinds of corresponding symbolic epigenetic features, and numerical and binary cleavage efficacy values.
B. Conditional GANs
GANs are Machine Learning models that learn how to produce new data that follow the distribution of the training data. A GAN consists of a Generator (G) model that generates new input data and a Discriminator (D) model that decides whether a newly generated example is genuine or not. The training of GANs is formulated as a min-max problem with a value function J(G, D) as follows:
$$\min_G \max_D J(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log(D(x))] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \tag{1}$$
where we adjust the parameters of G to minimize $\log(1 - D(G(z)))$, i.e., to confuse D into classifying the generated input $G(z)$ as genuine so that $\log(1 - D(G(z)))$ becomes as negative as possible, and adjust the parameters of D to maximize $\log(D(x))$ and $\log(1 - D(G(z)))$, i.e., to classify a sample drawn from $p_{\mathrm{data}}(x)$ as real and the generated input $G(z)$ as fake so that both $\log(D(x))$ and $\log(1 - D(G(z)))$ are as close to zero as possible.
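As a rough numerical illustration of Eq. (1), the value function can be estimated from a batch of discriminator outputs. The sketch below is only an assumption-laden toy: the function name `gan_value` and the example probabilities are made up for illustration and do not come from a trained model.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of J(G, D) in Eq. (1).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator pushes both terms toward zero.
strong_d = gan_value(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])
# A fooled discriminator (D(G(z)) near 1) makes the second term very negative.
fooled_d = gan_value(d_real=[0.99, 0.98], d_fake=[0.99, 0.98])
# An undecided discriminator outputs 0.5 everywhere, giving 2 * log(0.5).
undecided = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The three cases mirror the two opposing goals: D tries to move the value toward zero, while G tries to drive the second term toward large negative values.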
Conditional GANs (coGANs) [14] are a type of GAN that also exploits the labels in the dataset during the training process. Given a random vector and a label, the Generator learns how to generate new data following the same statistics as the portion of the training set that corresponds to the same label. Similarly to vanilla GANs, the training of coGANs is formulated as a min-max problem with a value function J(G, D) as follows:
$$\min_G \max_D J(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log(D(x|y))] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)))], \tag{2}$$
where now the model parameters of G and D are adjusted given a label $y$.
C. Conditional GANs for Regression
Based on the above definition, coGANs are well suited to data augmentation. However, although coGANs have been studied primarily for data augmentation (see Eq. (2)), little or no attention has been paid to regression problems. If we consider the following min-max problem,
$$\min_G \max_D J(G, D) = \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\log(D(y|x))] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|x)))], \tag{3}$$
where the model parameters of G and D are now adjusted given the input $x$, we can train G to learn the conditional density $p(y|x)$ of the output $y$. Effectively, $p(y|x)$ is the answer to the regression problem $y = f(x) + z$ with $z \sim \mathcal{N}(0, \Sigma)$, where the stochasticity in the regression model may stem from either missing or noisy input data [15] (see Fig. 1).
Fig. 1. Training a coGAN for regression. G receives x and z and outputs y_fake; D receives x together with either y_true or y_fake and decides whether the label is real or fake.
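To make the regression view concrete, the following sketch samples from a modeled p(y|x) by keeping x fixed and redrawing the noise z. The function `toy_generator` is a hypothetical stand-in with a hand-picked linear f(x) and a noise scale of 0.1; it is not the trained Generator of this work.

```python
import random
import statistics

def toy_generator(x, z):
    """Hypothetical stand-in for a trained G(z | x): a fixed f(x) plus noise z."""
    return 2.0 * x + 1.0 + z

def sample_conditional(x, n, rng):
    """Draw n samples from the modeled p(y | x) by resampling the noise z."""
    return [toy_generator(x, rng.gauss(0.0, 0.1)) for _ in range(n)]

rng = random.Random(0)
samples = sample_conditional(x=0.5, n=1000, rng=rng)
center = statistics.mean(samples)    # close to f(0.5) = 2.0
spread = statistics.stdev(samples)   # close to the assumed noise scale 0.1
```

For a fixed x the samples scatter around f(x), which is the sense in which the Generator represents a whole conditional distribution instead of a single point estimate.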
III. PROPOSED MODEL
The proposed model (CNN-coGAN) for the prediction of the on-target sg-RNA cleavage efficacy is shown in Fig. 2.
CNN-coGAN first harnesses (pre-trained) CNNs to extract features directly from the sequences of sgRNA nucleotides and the epigenetic information. Then, it uses a Random Forest model to select the subset of the extracted features that are most significant. Finally, these features are fed to a predictive model based on coGANs (more precisely, to the Generator model of the coGAN) that makes the sgRNA on-target efficacy predictions. More specifically, CNN-coGAN consists of two distinct CNN pipelines, one to extract features from the sgRNA genetic information and the other to extract features from the epigenetic information. The features discovered by these two pipelines are merged to form a single feature vector. The CNN layers are first trained on a benchmark dataset [1] with Dense layers placed after the Convolutional layers, so that the resulting Neural Network can be trained using the labels of the dataset. Then, the coGAN model is trained on the features selected by the RF model for 60 epochs with a batch size of 500.
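The layer sizes listed for each CNN pipeline in Fig. 2 (23 x 4 input, then 21 x 64, 11 x 64, 9 x 128, 5 x 128, 3 x 256, 2 x 256, and a flattened vector of 512) can be reproduced with simple length arithmetic. The sketch below assumes unpadded stride-1 convolutions with filter size 3 and max pooling with window size 2 that keeps the last partial window; these padding choices are inferred from the sizes in Fig. 2 and are not stated explicitly in the text.

```python
import math

def conv_len(n, kernel=3):
    """Output length of an unpadded, stride-1 1-D convolution."""
    return n - kernel + 1

def pool_len(n, window=2):
    """Output length of max pooling when the last partial window is kept."""
    return math.ceil(n / window)

def trace_shapes(length=23, filters=(64, 128, 256)):
    """Return the (length, channels) pair after every conv and pool layer."""
    shapes = []
    for f in filters:
        length = conv_len(length)
        shapes.append((length, f))
        length = pool_len(length)
        shapes.append((length, f))
    return shapes

shapes = trace_shapes()
flat = shapes[-1][0] * shapes[-1][1]  # size of the flattened feature vector
```

Under these assumptions the trace ends at a 2 x 256 map, whose 512 entries match the input size of the first fully connected layer.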
The input to the CNN-coGAN model is two 4×23 matrices, one for the sgRNA part and one for the epigenetic part. For the sgRNA part, the input is a one-hot encoding of a sequence of 23 nucleotides. There are four types of nucleotides, namely A, G, C, and T. Consequently, four channels, one per nucleotide, are needed to encode them, with a 1 in one of the four channels indicating the presence of the corresponding nucleotide while the rest of the channels are zero. For the epigenetic part, the input is again a one-hot encoding of a 23-long sequence of epigenetic characteristics. Four different epigenetic characteristics are considered in this work (CTCF, Dnase, H3K4me3, and RRBS in Fig. 2). Therefore, we use four channels corresponding to the different epigenetic characteristics at every location, with each channel taking the value 1 when the corresponding characteristic is found there or 0 if not.
Fig. 2. The proposed model architecture for the prediction of the sgRNA on-target cleavage efficacy. Each of the two CNN pipelines (one for the sgRNA sequence, one for the epigenetic sequence) maps a 23 x 4 input through Conv layer 1 (64 filters, filter size 3; 21 x 64), Pool layer 1 (max pool, window size 2; 11 x 64), Conv layer 2 (128 filters, filter size 3; 9 x 128), Pool layer 2 (max pool, window size 2; 5 x 128), Conv layer 3 (256 filters, filter size 3; 3 x 256), Pool layer 3 (max pool, window size 2; 2 x 256), a Flatten layer (512), and Fully Connected layers of 256, 128, 64, and 32 units with Dropout. The outputs of the two pipelines are joined by a fully connected layer, passed to Random Forest feature selection by importance, and then to the regression coGAN.
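The two 4×23 input matrices described above can be sketched as follows. The row order A, G, C, T follows the order in which the text lists the nucleotides and is otherwise an assumption, as are the helper names, the string-based input format for the epigenetic tracks, and the example sequence.

```python
NUCLEOTIDES = "AGCT"  # row order is an assumption; the text lists A, G, C, T

def one_hot_sgrna(seq):
    """Encode a 23-nt sgRNA as a 4 x 23 binary matrix (one row per nucleotide)."""
    if len(seq) != 23:
        raise ValueError("expected a 23-nt sequence")
    return [[1 if base == nt else 0 for base in seq] for nt in NUCLEOTIDES]

def encode_epigenetics(tracks):
    """Stack four binary tracks of length 23 into a 4 x 23 matrix.

    tracks maps a feature name to a string of 23 '0'/'1' characters
    (a hypothetical input format chosen for this illustration).
    """
    order = ("CTCF", "Dnase", "H3K4me3", "RRBS")
    return [[int(c) for c in tracks[name]] for name in order]

sgrna = "ACGTACGTACGTACGTACGTAGG"  # made-up sequence for illustration
matrix = one_hot_sgrna(sgrna)
epi = encode_epigenetics({
    "CTCF": "1" * 23,
    "Dnase": "0" * 23,
    "H3K4me3": "10" * 11 + "1",
    "RRBS": "0" * 23,
})
```

Every column of the sgRNA matrix contains exactly one 1, whereas a column of the epigenetic matrix may contain several 1s when more than one characteristic is present at that position.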
A. Architecture of the Generator
The Generator model in our proposed regression coGAN consists of an input layer, six hidden layers, and an output layer (see Fig. 3). The input layer receives the 15-dimensional feature vector selected by the Random Forest model in CNN-coGAN along with 1-dimensional additive Gaussian noise with zero mean and a standard deviation of one. The hidden layers are all fully connected layers with 100, 75, 75, 75, 75, and 50 neurons, respectively, using the Exponential Linear Unit (ELU) activation function. Finally, the last layer consists of one neuron corresponding to the 1-dimensional output. For the training process, we chose the Adam optimizer with a learning rate of 0.0001.
Fig. 3. The architecture of the Generator. Inputs x and z pass through hidden layers of 100, 75, 75, 75, 75, and 50 neurons to a single output y_fake.
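The Generator layout can be illustrated with a small forward pass. This is only a sketch under several assumptions: the weights are random and untrained, the 15 features and the noise value are concatenated into one 16-dimensional input, ELU uses alpha = 1, every layer has a bias, and the output neuron is left linear (the text does not state an output activation). All names are hypothetical.

```python
import math
import random

GEN_LAYERS = [16, 100, 75, 75, 75, 75, 50, 1]  # 15 features + 1 noise input

def elu(v):
    """Exponential Linear Unit with alpha = 1."""
    return v if v > 0 else math.exp(v) - 1.0

def init_mlp(sizes, rng):
    """Random weights and zero biases for a fully connected network."""
    return [([[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]

def forward(params, inputs):
    """ELU on hidden layers; the final layer is left linear (an assumption)."""
    h = list(inputs)
    for i, (weights, biases) in enumerate(params):
        h = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        if i < len(params) - 1:
            h = [elu(v) for v in h]
    return h

rng = random.Random(0)
params = init_mlp(GEN_LAYERS, rng)
x = [rng.random() for _ in range(15)]  # placeholder for the 15 selected features
z = rng.gauss(0.0, 1.0)                # noise with zero mean and unit std
y_fake = forward(params, x + [z])[0]
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in params)
```

With these layer sizes and one bias per neuron, the network has 30,226 trainable parameters, which gives a sense of how small the regression head is compared with the CNN feature extractors.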
B. Architecture of the Discriminator
The Discriminator model in our proposed regression coGAN has a similar architecture to the Generator model (see Fig. 4). The input layer receives the 15-dimensional feature vector selected by the Random Forest model in CNN-coGAN along with a 1-dimensional y-label, which may be a fake label coming from the Generator model or the true label from the training set. Next, the input is passed to the six hidden layers. The hidden layers are all fully connected layers with 100, 50, 50, 50, 50, and 25 neurons, respectively, using the ELU activation function. Finally, the last layer consists of one neuron corresponding to the 1-dimensional output that decides whether the y-label given as input is true or fake. For the training process, we chose the Adam optimizer with a learning rate of 0.001.
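The pairing of features with true or generated labels at the Discriminator input can be sketched as below. The target convention (1 for a true label, 0 for a generated one), the helper name, and the example values are assumptions made for illustration.

```python
def discriminator_batch(features, y_true, y_fake):
    """Build (input, target) pairs for one discriminator update.

    Each input is the 15 selected features followed by one y value.
    Target 1 marks a true label from the training set, 0 a generated one.
    """
    batch = []
    for x, yt, yf in zip(features, y_true, y_fake):
        batch.append((list(x) + [yt], 1))
        batch.append((list(x) + [yf], 0))
    return batch

features = [[0.1] * 15, [0.2] * 15]   # made-up feature vectors
batch = discriminator_batch(features, y_true=[0.8, 0.3], y_fake=[0.5, 0.6])
```

Each feature vector therefore appears twice, once with its measured efficacy and once with the Generator's output, so the Discriminator has to judge the label given the features.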
IV. EXPERIMENTAL RESULTS
In this section, we compare the accuracy of CNN-coGAN in predicting the on-target sgRNA cleavage efficacy with that of DeepCrispr [1], CNN-SVR [2], CNN-XG [3], and AttnToCrispr CNN [4]. We consider two types of tests for the evaluation of the proposed model, in accordance with the information provided in the datasets. The first test is a regression problem and measures the accuracy of the model in predicting the efficacy of the sgRNA cleavage, i.e., a floating-point number between 0 and 1. However, for different cell types, the effectiveness of the cleavage for the same efficacy value may differ, so the second test is a binary classification problem and measures the accuracy of the model in classifying the sgRNA cleavage into the effective or ineffective class for given subjects.
Since G in our model learns $p(y|x)$, i.e., for a given $x$ we obtain a whole distribution, we need a rule to arrive at a single $y$ value. We harness the Mean Absolute Error (MAE) metric, which is defined as
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|,$$
and minimize it with respect to $\hat{y}_i$; the minimizer is the median of $p(y|x_i)$. More specifically, we generate a sample of 100 $y$ values for the given $x_i$ and take the median of the sample points to be $\hat{y}_i$ [15].
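The median-based point prediction can be sketched as follows. The sampler `toy_sampler` is a hypothetical skewed stand-in for drawing y values from the trained Generator; only the use of 100 draws and the median comes from the text.

```python
import random
import statistics

def predict_point(sample_fn, x, n_samples=100):
    """Collapse the modeled p(y | x) to one value: the median of n_samples draws."""
    draws = [sample_fn(x) for _ in range(n_samples)]
    return statistics.median(draws), draws

def mae(targets, predictions):
    """Mean absolute error between paired targets and predictions."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)

rng = random.Random(1)

def toy_sampler(x):
    """Hypothetical skewed sampler standing in for the trained Generator."""
    return x + rng.expovariate(5.0)

y_hat, draws = predict_point(toy_sampler, x=0.4)
# Among constant predictions, the sample median gives the lowest MAE on the draws.
mae_median = mae(draws, [y_hat] * len(draws))
mae_mean = mae(draws, [statistics.mean(draws)] * len(draws))
```

On a skewed sample the mean and the median differ, and the MAE of the median is never larger than that of the mean, which is why the median is the natural point estimate under this metric.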
A. Evaluation of Regression Task
We used the Pearson correlation coefficient as the metric for the regression problem. Table I¹ shows the Pearson correlation coefficient results of our proposed model, CNN-coGAN, and the other models we consider in our experimental analysis. Note that although AttnToCrispr CNN achieves a better Pearson correlation coefficient than the proposed model on the HCT116 and HEK293T datasets, it is worse than the proposed model by 7% on average.
In summary, the proposed model achieves a Pearson correlation coefficient that is on average 3% better than CNN-XG, 5% better than CNN-SVR, and 7% better than AttnToCrispr CNN in the regression task.
¹ The numbers in bold in Tables I and II indicate the highest score.
Fig. 4. The architecture of the Discriminator. Inputs x and y pass through hidden layers of 100, 50, 50, 50, 50, and 25 neurons to a single real/fake output.
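For reference, the Pearson correlation coefficient used in the regression evaluation can be computed as in the sketch below. The measured and predicted values are made up for illustration and are not taken from the datasets.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

measured = [0.10, 0.35, 0.50, 0.80, 0.95]   # made-up efficacy values
predicted = [0.20, 0.30, 0.55, 0.70, 0.90]  # made-up model outputs
r = pearson(measured, predicted)
```

A value of 1 means the predictions are a perfect increasing linear function of the measurements, so higher values in Table I indicate better regression performance.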
TABLE I
MODEL COMPARISON WITH THE STATE OF THE ART MODELS IN REGRESSION TASK USING THE PEARSON CORRELATION COEFFICIENT.
Dataset    Proposed   CNN-XG   CNN-SVR   AttnToCrispr CNN
HCT116     0.6696     0.6370   0.6125    0.728
HEK293T    0.7417     0.7098   0.7000    0.846
HELA       0.6247     0.6079   0.5878    0.568
HL60       0.5913     0.5849   0.5719    0.311
Average    0.6568     0.6349   0.6230    0.6132
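The percentages quoted for the regression task appear to be relative differences between the values in the "Average" row of Table I; that reading is an assumption, and the sketch below simply checks it against the tabulated numbers.

```python
AVERAGES = {  # "Average" row of Table I
    "Proposed": 0.6568,
    "CNN-XG": 0.6349,
    "CNN-SVR": 0.6230,
    "AttnToCrispr CNN": 0.6132,
}

def relative_gain(ours, baseline):
    """Relative improvement of ours over baseline, in percent."""
    return 100.0 * (ours - baseline) / baseline

gains = {name: relative_gain(AVERAGES["Proposed"], value)
         for name, value in AVERAGES.items() if name != "Proposed"}
```

Rounded to whole percentages, the gains come out as 3, 5, and 7 for CNN-XG, CNN-SVR, and AttnToCrispr CNN, in line with the figures given in the text.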
B. Evaluation of Classification Task
We used the AUROC score as the metric for the classification task. Table II compares the AUROC results obtained from the proposed model and the other models we consider in our experimental analysis. Note that the proposed model outperforms the other models on all datasets for the classification task. More specifically, the proposed model exhibits a better AUROC score on average by 0.7% against CNN-XG, by 4% against CNN-SVR, and by 17% against DeepCrispr.
TABLE II
MODEL COMPARISON WITH THE STATE OF THE ART MODELS IN CLASSIFICATION TASK USING THE AUROC SCORE.
Dataset    Proposed   CNN-XG   CNN-SVR   DeepCrispr
HCT116     0.9817     0.9732   0.9304    0.874
HEK293T    0.9962     0.9905   0.9834    0.961
HELA       0.9723     0.9714   0.9296    0.782
HL60       0.9842     0.9706   0.9309    0.739
Average    0.9836     0.9764   0.9435    0.722
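The AUROC score used in the classification evaluation equals the probability that a randomly chosen effective sgRNA receives a higher predicted score than a randomly chosen ineffective one. A minimal sketch of that pairwise definition follows; the labels and scores are made up, and a production evaluation would more likely rely on an established library routine.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for an effective sgRNA, 0 otherwise. Ties count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]                 # made-up binary efficacy labels
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]     # made-up predicted efficacies
area = auroc(labels, scores)
```

In this toy example eight of the nine positive-negative pairs are ranked correctly, so the score is 8/9; a perfect ranking gives 1 and a random one gives about 0.5.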
V. CONCLUSION
In this work, we proposed a model for the prediction of the on-target sgRNA cleavage efficacy using conditional GANs.
More specifically, after extracting a feature vector using two pipelines of CNN layers, one for the sgRNA sequence and one for the epigenetic sequence, and then picking the most important features from that vector using a Random Forest model, we employed coGANs suitably tailored for regression.
Experimental results showed that the proposed model yields an average improvement of 3% to 7% in the Pearson correlation coefficient in regression tasks and of 0.7% to 17% in AUROC in classification tasks on four commonly available datasets (HCT116, HEK293T, HELA, and HL60) compared to other recently published models.
REFERENCES
[1] Q. Liu, X. Cheng, G. Liu, B. Li, and X. Liu, “DeepCRISPR: optimized CRISPR guide RNA design by deep learning,” Genome Biol, vol. 19, 2018.
[2] G. Dai, Z. Dai, and X. Dai, “A Novel Hybrid CNN-SVR for CRISPR/Cas9 Guide RNA Activity Prediction,” Front. Genet, vol. 10, 2020.
[3] B. Li, D. Li, and X. Liu, “CNN-XG: A Hybrid Framework for sgRNA On-Target Prediction,” Biomolecules, vol. 12, no. 3, 2022.
[4] Q. Liu, D. He, and L. Xie, “Prediction of off-target specificity and cell-specific fitness of CRISPR-Cas System using attention boosted deep learning and network-based gene feature,” PLOS Comput. Biol, vol. 15, 2019.
[5] Q. Liu, X. Cheng, G. Liu, B. Li, and X. Liu, “Deep learning improves the ability of sgRNA off-target propensity prediction,” BMC Bioinform, vol. 21, 2020.
[6] J. Doench, N. Fusi, M. Sullender, M. Hegde, E. Vaimberg, K. Donovan, I. Smith, Z. Tothova, C. Wilen, and R. W. et al, “Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9,” Nat Biotechnol, vol. 34, pp. 184–191, 2016.
[7] H. Kim, S. Min, M. Song, S. Jung, J. Choi, Y. Kim, S. Lee, S. Yoon, and H. Yoon, “Deep learning improves prediction of CRISPR–Cpf1 guide RNA activity,” Nat. Biotechnol, vol. 36, pp. 239–241, 2018.
[8] K. Rahman and M. Rahman, “CRISPRpred: A flexible and efficient tool for sgRNAs on-target activity prediction in CRISPR/Cas9 systems,” PLoS ONE, vol. 12, 2017.
[9] J. Lin and K. Wong, “Off-target predictions in CRISPR-Cas9 gene editing using deep learning,” Bioinformatics, vol. 34, 2018.
[10] L. Xue, B. Tang, W. Chen, and J. Luo, “Prediction of CRISPR sgRNA Activity Using a Deep Convolutional Neural Network,” J. Chem. Inf. Model, vol. 59, pp. 615–624, 2018.
[11] J. Listgarten, M. Weinstein, B. Kleinstiver, A. Sousa, J. Joung, J. Crawford, K. Gao, L. Hoang, M. Hoang, and J. D. et al, “Prediction of off-target activities for the end-to-end design of CRISPR guide RNAs,” Nat. Biomed. Eng, vol. 2, pp. 38–47, 2018.
[12] L. Wang and J. Zhang, “Prediction of sgRNA on-target activity in bacteria by deep learning,” BMC Bioinform, vol. 20, 2019.
[13] T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” in Proceedings of the 22nd ACM SIGKDD International Conference, 2016, pp. 785–794.
[14] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” CoRR, 2014.
[15] K. Aggarwal, M. Kirchmeyer, P. Yadav, S. S. Keerthi, and P. Gallinari, “Regression with conditional GAN,” CoRR, 2019.