First report on chemometric modeling of tilapia fish aquatic toxicity to organic chemicals: Toxicity data gap filling


Science of the Total Environment 907 (2024) 167991

Available online 26 October 2023


Siyun Yang, Supratik Kar*

Chemometrics and Molecular Modeling Laboratory, Department of Chemistry, Kean University, 1000 Morris Avenue, Union, NJ 07083, USA

HIGHLIGHTS

• Robust QSAR and q-RASAR models were developed for two Tilapia genera covering three species.

• Models identified features causing tilapia toxicity, aiding eco-friendly chemical design.

• The models offer insights into chemicals' toxicity to major Tilapia species.

• In silico models for predicting aquatic toxicity of future chemicals.

• Developed models predicted toxicity for 297 external chemicals for three Tilapia species.

ARTICLE INFO

Editor: Beatrice Opeolu

Keywords: Aquatic toxicity; LC50; QSAR; q-RASAR; USEPA; Tilapia; Risk assessment

ABSTRACT

The Toxic Substances Control Act (TSCA) mandates the Environmental Protection Agency (EPA) to document chemicals entering the US. Due to the vast range of toxicity endpoints, experimental toxicological testing of all chemicals is not feasible. To address this, in silico methods like QSAR and read-across are strategically used to prioritize testing for chemicals lacking ecotoxicity data. Aquatic toxicity is one of the most critical endpoints: it directly affects aquatic species, mainly fish, and affects humans directly through drinking water and indirectly through fish as food. Therefore, we have employed the ToxValDB database to curate acute LC50 toxicity data for three Tilapia species covering two different genera, Tilapia being an ideal organism for aquatic toxicity testing.

Employing the curated dataset, we have developed multiple robust and predictive QSAR and quantitative read-across structure-activity relationship (q-RASAR) models for Tilapia zillii, Oreochromis niloticus, and Oreochromis mossambicus, which helped to understand the toxicological mode of action (MoA) of the modeled chemicals and to predict the aquatic toxicity of new untested chemicals, followed by toxicity data gap filling. The best three QSAR models showed encouraging statistical quality in terms of determination coefficient R² (0.94, 0.74, and 0.77), leave-one-out cross-validated Q² (0.90, 0.67, and 0.70), and predictive capability in terms of R²pred (0.95, 0.77, and 0.74) for the T. zillii, O. niloticus, and O. mossambicus datasets, respectively. The best models were then used to predict the aquatic toxicity, in terms of pLC50, of 297 untested organic chemicals for the three major Tilapia species; the predicted pLC50 values (LC50 in M) ranged from 1.841 to 8.561, supporting environmental risk assessment.

* Corresponding author at: Department of Chemistry, Kean University, 1000 Morris Avenue, Union, NJ 07083, USA.

E-mail address: [email protected] (S. Kar).


https://doi.org/10.1016/j.scitotenv.2023.167991

Received 30 August 2023; Received in revised form 2 October 2023; Accepted 19 October 2023


1. Introduction

Tilapia is one of the most consumed fish, ranked 4th in the USA market. It is a good source of protein (29 g protein/4 oz. serving) and omega-3 and is low in fat, which makes it comparable to seafood such as mahi-mahi, lobster, and yellow-fin tuna. Its very affordable price distinguishes it from other fish and seafood. Beyond the USA, tilapia is an equally affordable and popular fish throughout the world: according to the Food and Agriculture Organization of the United Nations, tilapia is now farmed in over 135 countries (Food and Agriculture Organization of the United Nations, 2014; Uddin et al., 2021). According to a report dated 16 January 2023, the global tilapia industry reached a revenue of US$ 14.1 billion and is expected to reach US$ 22.3 billion over the next 10 years, a 58.2 % increase (Fact.MR, 2023).

The species selected for this study include Tilapia zillii (also known as redbelly tilapia or Coptodon zillii) within the genus Tilapia, as well as Oreochromis niloticus (Nile tilapia) and Oreochromis mossambicus (Mozambique tilapia) within the genus Oreochromis. All of these fish belong to the Cichlidae family and were originally classified under the genus Tilapia; however, in recent decades, Oreochromis was reclassified out of the Tilapia group (Nagl et al., 2001). These fish are distinguished by their adaptability, rapid growth, and nutritional value. Widely consumed in numerous Asian countries, tilapia serves as a crucial dietary protein source (Engle et al., 2023). Yet, as global tilapia consumption rises and aquaculture practices intensify, concerns mount over potential exposure of these fish to diverse chemical contaminants.

Pharmaceuticals and other anthropogenic chemicals can infiltrate aquatic ecosystems via improper disposal and wastewater discharge (Khan et al., 2023). Assessing the potential toxicity of such substances to tilapia species emerges as a fundamental step for ensuring their ongoing survival, upholding ecological equilibrium, and safeguarding human health.

The United States Environmental Protection Agency (US EPA) oversees the release of chemical contaminants throughout the US. In Europe, the European Chemicals Agency (ECHA), and in the US, the “Code of Federal Regulations (40 CFR)” or “Title 40” of the US EPA, regulate the release of various hazardous substance groups, including pharmaceutical products. These authorities advocate the use of alternative testing strategies (ATS) to minimize animal testing and encourage the application of in silico tools, primarily read-across and quantitative structure-activity relationships (QSAR), for regulatory testing purposes (United States Environmental Protection Agency, 2022; Khan et al., 2023; Madden et al., 2020). QSAR modeling plays a crucial role in delivering information on the physicochemical properties, environmental fate, and human health impacts of chemical compounds (Lessigiarska et al., 2006). Regulatory agencies are developing and evaluating advanced predictive models to assess the physical, chemical, and biological properties of individual chemical substances using applications tailored to decision-making frameworks for safety assessments (Kar et al., 2020; Kar et al., 2022). Employing QSAR modeling in toxicological predictions can facilitate the identification of potential adverse effects of chemical compounds, thus contributing to risk assessment, chemical screening, and prioritization processes (Valerio et al., 2007).

The Read-Across Structure-Activity Relationship (RASAR) is a novel approach that combines the advantages of both QSAR and Read-Across algorithms within a machine learning framework, aiming to improve predictive ability and interpretability in identifying essential chemical features (Banerjee et al., 2022). The quantitative RASAR (q-RASAR) extension incorporates similarity and error-based descriptors, further enhancing the method. Compared to traditional QSAR analysis and Read-Across-based predictions, q-RASAR models demonstrate improved predictive capability and lower mean absolute error (Banerjee et al., 2023). This approach achieves a balance between simplicity, interpretability, transferability, and reproducibility while maintaining a high level of predictive accuracy.

In this study, we focused on the three tilapia species previously mentioned, using the US EPA's ToxValDB database for analysis. We employed the small dataset modeler software (http://teqip.jdvu.ac.in/QSAR_Tools/DTCLab/), DTC-QSAR (https://dtclab.webs.com/software-tools), and RASAR-related software (https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home) to construct QSAR and q-RASAR models for predicting their toxicity, followed by identification of the major structural and physicochemical features. A total of 297 externally sourced molecules (Kar and Roy, 2010) were evaluated using the developed models, resulting in the identification of the five common most toxic and the five common least toxic molecules. Our findings provide valuable insights into the potential toxicological effects of these chemical compounds on the selected tilapia species and pave the way for further research and improved risk assessment strategies.

2. Materials and method

2.1. Dataset for modeling

To perform the modeling, we collected the experimental acute toxicity data for all Tilapia species from the US EPA's ToxValDB database (a combination of the ECOTOXicology Knowledgebase (ECOTOX) of the US EPA and the European Chemicals Agency (ECHA) database) (Agency USEPA, 2023; Judson, 2019). The current version of ToxValDB is accessible through the US EPA's CompTox Chemicals Dashboard (“CompTox”) (https://comptox.epa.gov/dashboard) (Williams et al., 2017). Details of the studied dataset are given in Tables S2 to S7 of the Supplementary materials. The experimental covariates were as follows: study type, mortality; study duration, 1 h-4 h; exposure route, static and renewal; exposure method, drinking water; types of chemicals, industrial organic chemicals and pharmaceuticals. From the processed dataset, we then excluded all metals and salts so as to model only organic chemicals. After further processing, the final datasets contained 30, 57, and 74 molecules for T. zillii, O. niloticus, and O. mossambicus, respectively. The toxicity data are expressed as -log10(LC50), or pLC50, with LC50 in molar (M) units throughout the manuscript.
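The pLC50 transformation can be illustrated with a short Python sketch. It assumes that the raw LC50 is available in mg/L and that the molecular weight is known; the source does not state the units of the raw records, and the function name and the example compound are hypothetical.

```python
import math


def plc50_from_mg_per_l(lc50_mg_per_l: float, mol_weight_g_per_mol: float) -> float:
    """Convert an LC50 given in mg/L (assumed unit) to pLC50 = -log10(LC50 in mol/L)."""
    if lc50_mg_per_l <= 0 or mol_weight_g_per_mol <= 0:
        raise ValueError("LC50 and molecular weight must be positive")
    # mg/L -> g/L -> mol/L
    lc50_molar = (lc50_mg_per_l / 1000.0) / mol_weight_g_per_mol
    return -math.log10(lc50_molar)


# Hypothetical compound: LC50 = 1 mg/L, molecular weight = 100 g/mol
example_plc50 = plc50_from_mg_per_l(1.0, 100.0)
```

On this scale a larger pLC50 corresponds to a lower lethal concentration, that is, to a more toxic chemical.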

2.2. Descriptor calculation

Chemical descriptors were calculated using PaDEL-Descriptor (Yap, 2011). Molecular physicochemical properties, E-state indices, extended topochemical atom (ETA) indices, atom-type electrotopological state indices, and other 1D and 2D descriptors were calculated with this software. Approximately 1800 descriptors in total were calculated and used for model building.

2.3. Small dataset modeler

In our study, we employed the small dataset modeling technique (http://teqip.jdvu.ac.in/QSAR_Tools/DTCLab/), which uses the double cross-validation (DCV) method for modeling small datasets without separating them into training and test sets (Ambure et al., 2019). This approach omits the generation of a “modeling set” in the inner loop and instead generates all possible combinations of validation and calibration sets. The tool allows users to define the number of compounds in the validation set (r), determining the calibration and validation sets accordingly (De and Roy, 2021).
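The idea of generating every calibration/validation combination for a chosen validation-set size r can be sketched in a few lines of Python. This is only an illustration of the enumeration, not the Small Dataset Modeler code; the function name and the toy sizes are assumptions.

```python
from itertools import combinations


def calibration_validation_splits(n_compounds, r):
    """Yield every (calibration, validation) index split with r validation compounds."""
    all_indices = set(range(n_compounds))
    for validation in combinations(range(n_compounds), r):
        # Every compound not held out for validation is used for calibration
        calibration = sorted(all_indices - set(validation))
        yield calibration, list(validation)


# Toy case: 5 compounds, 2 of them held out in each split
splits = list(calibration_validation_splits(n_compounds=5, r=2))
```

The number of splits equals the binomial coefficient C(n, r), so it grows quickly with dataset size, which is one reason exhaustive enumeration of this kind suits small datasets.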

2.4. Dataset splitting and QSAR model development

All three datasets were randomly divided into training and test sets in a 70:30 ratio. A descriptor reduction process was then implemented using genetic algorithms (GA), reducing the descriptors for the three datasets to 48, 41, and 45, respectively, based on recurrent descriptor occurrences within the initial GA model population. It is important to emphasize that the test set was not incorporated during feature selection, to eliminate any potential bias in model selection.
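A random 70:30 division of this kind can be sketched as follows. The sketch uses a plain seeded shuffle from the Python standard library; the function name, the seed, and the rounding rule are assumptions, and the splitting tool actually used by the authors is not specified here.

```python
import random


def split_train_test(compound_ids, train_fraction=0.7, seed=0):
    """Randomly split compound identifiers into training and test lists."""
    ids = list(compound_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Training/test sizes obtained for datasets of 30, 57, and 74 compounds
sizes = {n: tuple(len(part) for part in split_train_test(range(n))) for n in (30, 57, 74)}
```

With this rounding rule the three dataset sizes give 21/9, 40/17, and 52/22 compounds, which agrees with the NTr/NTest values listed in Table 3. Feature selection would then be run on the training list only.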

The optimal descriptor combination was identified using the best subset selection (BSS) method, available at http://teqip.jdvu.ac.in/QSAR_Tools/. Subsequently, the descriptors chosen in the selected models were used to build multiple linear regression (MLR) models (Khan et al., 2019) with the tool accessible via http://teqip.jdvu.ac.in/QSAR_Tools/, with the aim of establishing a more reliable model and mitigating the likelihood of inter-correlation among descriptors.

2.5. Calculation of q-RASAR descriptors

The Read-Across Structure-Activity Relationship (RASAR) combines Read-Across and QSAR concepts for q-RASAR analysis, requiring calculation of similarity and error-based RASAR descriptors for training and test sets (Banerjee and Roy, 2022; Luechtefeld et al., 2018). The RASAR-Desc-Calc-v2.0 tool (https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home) calculates 15 descriptors using one of three similarity-based approaches. The RA function, a composite function involving similarity measures, can be incorporated within a linear model generation framework (Banerjee et al., 2022). Other calculated descriptors, including SD_Activity, SE, CVact, MaxPos, MaxNeg, Abs Diff, Avg.Sim, SD_Similarity, CVsim, gm (Banerjee-Roy coefficient), gm*Avg.Sim, gm*SD_Similarity, Pos.Avg.Sim, and Neg.Avg.Sim, contribute to prediction confidence or to assessing the activity probability of a query compound. High SD_Activity and SD_Similarity values indicate reduced prediction confidence due to significant dispersion. Details of the mentioned descriptors are given in Table S1 of the Supplementary Materials.
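As a rough illustration of how similarity-based read-across quantities of this kind can be derived, the Python sketch below computes a similarity-weighted prediction and a few dispersion measures for one query compound from its closest training compounds. It is not the RASAR-Desc-Calc-v2.0 implementation: the function names, the Laplacian-kernel form exp(-gamma x L1 distance), the value of gamma, the number of neighbours k, and the exact weighting formulas are all assumptions made for illustration.

```python
import math


def laplacian_similarity(x, y, gamma=1.0):
    """Laplacian-kernel similarity between two descriptor vectors (assumed form)."""
    l1_distance = sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-gamma * l1_distance)


def read_across_descriptors(query, train_X, train_y, k=3, gamma=1.0):
    """Similarity-weighted summary of the k training compounds closest to the query."""
    scored = [(laplacian_similarity(query, x, gamma), y) for x, y in zip(train_X, train_y)]
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    sims = [s for s, _ in nearest]
    total = sum(sims)
    if total == 0.0:
        raise ValueError("query is too dissimilar to every training compound")
    # Similarity-weighted mean activity (analogous to a read-across prediction)
    weighted_mean = sum(s * y for s, y in nearest) / total
    # Similarity-weighted spread of the neighbours' activities
    weighted_sd = math.sqrt(sum(s * (y - weighted_mean) ** 2 for s, y in nearest) / total)
    avg_sim = total / len(sims)
    sd_sim = math.sqrt(sum((s - avg_sim) ** 2 for s in sims) / len(sims))
    return {
        "weighted_mean_activity": weighted_mean,
        "weighted_sd_activity": weighted_sd,
        "avg_similarity": avg_sim,
        "sd_similarity": sd_sim,
    }


# Hypothetical two-descriptor training set with pLC50-like responses
toy_X = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
toy_y = [2.0, 4.0, 8.0]
toy_descriptors = read_across_descriptors([0.0, 0.0], toy_X, toy_y, k=2)
```

A large spread among the neighbours' activities or similarities in such a summary signals that the close training compounds disagree, which is the intuition behind treating high SD_Activity and SD_Similarity values as a sign of lower prediction confidence.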

2.6. q-RASAR feature selection and model development

Upon calculating RASAR descriptors for the training and test sets, these were combined with previously available structural descriptors, and feature selection was conducted using the BestSubsetSelection_v2.1 tool (https://dtclab.webs.com/software-tools). This tool generates all possible model combinations for a user-defined number of descriptors while adhering to the inter-correlation cutoff. The final MLR-based q-RASAR model was assessed using the MLRPlusValidation 1.3 tool (https://dtclab.webs.com/software-tools). Various validation metrics reported in Table 3 adhered to OECD Guidelines (https://www.oecd.org/chemicalsafety/risk-assessment/validationofqsarmodels.htm). Inter-correlation of the bivariate model's two descriptors was analyzed using MINITAB 14 (https://minitab.informer.com/14.1/). q-RASAR utilizes composite functions like the RA function and gm, which act as latent variables, extracting information from various structural or physicochemical variables. Consequently, q-RASAR models can be developed with limited RASAR descriptors, improving model degrees of freedom (Banerjee et al., 2022). Univariate or bivariate q-RASAR models may enhance model predictive ability and increase prediction confidence.

2.7. Validation, applicability domain, and Y-randomization

The QSAR and q-RASAR models were validated using the goodness-of-fit (R²), leave-one-out cross-validation as the internal validation measure (Q²_LOO), the r_m² metrics, and the mean absolute error based on 95 % of the data (MAE(95 %)). All the statistical metrics are mathematically defined in Table 1. The applicability domain (AD) was also studied using the leverage approach, generating Williams plots (Gramatica, 2007). Y-randomization was performed to check whether the developed model was obtained by chance (Roy et al., 2015). The Y-randomization process involved running the model's calculations 100 times after shuffling the dependent variable while keeping the original independent variables constant. This process was carried out using the ‘MLR Y-Randomization Test 1.2’, an open-access tool available at https://dtclab.webs.com/software-tools. After the Y-randomization process, the average values of two metrics, R² and Q², were computed for the 100 randomly generated models. To establish the validity of the developed model, both the average R² and Q² values were expected to be less than 0.5. This criterion serves as a threshold for judging whether the model's predictive performance is significantly better than what could be expected by chance alone: if both averages fall below 0.5, the developed model's predictive ability is considered significant and not merely a result of random chance.

Table 1

Metrics defining the statistical quality of the regression-based QSAR and q-RASAR models.

Parameter: Determination coefficient (R²)
Equation: $R^2 = 1 - \dfrac{\sum (Y_{obs} - Y_{pred})^2}{\sum (Y_{obs} - \bar{Y}_{training})^2}$
Description: This metric assesses the fit quality of a regression model by gauging the difference between observed and predicted data. The highest possible value of R² is 1, signifying an ideal correlation. $Y_{obs}$ represents the observed response values for the training set, while $Y_{pred}$ signifies the predicted response values for the training set compounds. $\bar{Y}_{training}$ is the average observed response for the compounds in the training set.

Parameter: Leave-one-out cross-validation (Q²_LOO)
Equation: $Q^2_{LOO} = 1 - \dfrac{\sum (Y_{obs(training)} - Y_{pred(training)})^2}{\sum (Y_{obs(training)} - \bar{Y}_{training})^2}$
Description: The cross-validated R², denoted Q², is used for internal validation. $Y_{obs(training)}$ represents the observed response, while $Y_{pred(training)}$ is the predicted response for the training set molecules using the leave-one-out (LOO) method.

Parameter: Mean absolute error (MAE)
Equation: $MAE = \dfrac{1}{n} \times \sum \left| Y_{obs} - Y_{pred} \right|$
Description: This is often referred to as the average absolute error (AAE) and is viewed as a more suitable measure of errors for predictive modeling research.

Parameter: r_m² metrics
Equation: $\overline{r_m^2} = \dfrac{r_m^2 + r_m'^2}{2}$ and $\Delta r_m^2 = \left| r_m^2 - r_m'^2 \right|$, where $r_m^2 = r^2 \times \left(1 - \sqrt{r^2 - r_0^2}\right)$ and $r_m'^2 = r^2 \times \left(1 - \sqrt{r^2 - r_0'^2}\right)$
Description: The value $r^2$ represents the squared correlation coefficient between observed and predicted response values. $r_0^2$ and $r_0'^2$ are the squared correlation coefficients when the regression line passes through the origin, the latter obtained by interchanging the axes. For a prediction to be deemed acceptable, all $\Delta r_m^2$ metrics should preferably be lower than 0.2, provided that the value of $\overline{r_m^2}$ is more than 0.5.

Parameter: Predictive R², or R²_pred, or Q²_ext(F1)
Equation: $Q^2_{ext(F1)} = 1 - \dfrac{\sum (Y_{obs(test)} - Y_{pred(test)})^2}{\sum (Y_{obs(test)} - \bar{Y}_{training})^2}$
Description: This metric is used to evaluate external predictability. It gauges the correlation between the observed and predicted data of the test set. $Y_{obs(test)}$ represents the observed response, while $Y_{pred(test)}$ indicates the predicted response for the test set molecules. $\bar{Y}_{training}$ signifies the average observed response of the training set.

Parameter: Q²_ext(F2)
Equation: $Q^2_{ext(F2)} = 1 - \dfrac{\sum (Y_{obs(test)} - Y_{pred(test)})^2}{\sum (Y_{obs(test)} - \bar{Y}_{test})^2}$
Description: It helps in judging the predictivity of a model using the test set mean ($\bar{Y}_{test}$).
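The R² and Q²_LOO definitions in Table 1 and the Y-randomization check can be sketched in plain Python. To stay self-contained, the sketch fits a single-descriptor least-squares line rather than a multi-descriptor MLR model, which is a simplification; the function names and the toy data are hypothetical, and this is not the code of the ‘MLR Y-Randomization Test 1.2’ tool.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one descriptor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y):
    """R^2 = 1 - SS_residual / SS_total on the training data."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


def q_squared_loo(x, y):
    """Q^2_LOO: each compound is predicted by a model fitted without it."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - press / ss_tot


def y_randomization(x, y, n_runs=100, seed=0):
    """Average R^2 and Q^2_LOO over models refitted on shuffled responses."""
    rng = random.Random(seed)
    r2_values, q2_values = [], []
    for _ in range(n_runs):
        shuffled = list(y)
        rng.shuffle(shuffled)  # descriptors stay fixed, responses are permuted
        r2_values.append(r_squared(x, shuffled))
        q2_values.append(q_squared_loo(x, shuffled))
    return sum(r2_values) / n_runs, sum(q2_values) / n_runs


# Hypothetical data: one descriptor with a nearly linear response
toy_x = [float(i) for i in range(12)]
toy_y = [2.0 * xi + 1.0 + (0.3 if i % 2 else -0.3) for i, xi in enumerate(toy_x)]
model_r2, model_q2 = r_squared(toy_x, toy_y), q_squared_loo(toy_x, toy_y)
random_r2, random_q2 = y_randomization(toy_x, toy_y)
```

For a genuine relationship, the original model's R² and Q² should be high while the averages over the shuffled models stay well below the 0.5 threshold mentioned above.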

2.8. External dataset for data gap filling

The external dataset, which consists of 297 organic chemicals collected from an earlier study (Kar and Roy, 2010) and is reported in the Supplementary materials file, was used to predict aquatic toxicity for all three Tilapia species.

All the prediction reliability is also checked by ‘Prediction Reliability Tool’ discussed in Section 2.9. The tool helped us not only to identify the prediction reliability through AD parameter but also gave qualitative prediction parameter in terms of ‘Good’, ‘Moderate’ and ‘Bad’.

2.9. Prediction reliability check

We assessed the prediction reliability of the external dataset consisting of 297 chemicals using the Prediction Reliability Indicator (PRI) tool (Roy et al., 2018). The PRI tool estimates the reliability of model predictions for test and true external sets by categorizing them into three groups: good (composite score 3), moderate (composite score 2), and bad (composite score 1), based on three criteria: 1) the mean absolute error of leave-one-out predictions for the 10 closest training compounds for each query molecule; 2) the applicability domain in terms of similarity, based on the standardization approach; and 3) the proximity of the predicted value of the query compound to the experimental mean training response. The tool calculates an optimum weightage for each criterion based on the percentage of correct predictions for a test set with known observed response values. Alternatively, users can manually select the weightage. In this study, we used the most frequently appearing weightage scheme of 0.5:0:0.5. The Prediction Reliability Indicator tool is freely available at http://dtclab.webs.com/software-tools.

3. Results and discussion

3.1. QSAR and q-RASAR models for undivided dataset

For each species, one QSAR and one q-RASAR model were developed, giving a set of six mathematical models built with interpretable 1D and 2D descriptors. Because the datasets are small, we initially developed models using the entire datasets to evaluate their statistical quality. Based on the internal validation metrics, the QSAR models performed better than the q-RASAR models. It is worth noting, however, that q-RASAR models are traditionally used to improve predictive capability, which is mostly judged through external validation criteria. All the developed models passed the thresholds for goodness-of-fit and for the internal validation criteria Q²_LOO and Q²_LMO. The detailed statistical qualities of all models for the undivided datasets can be found in Table 2.

3.2. Mechanistic interpretation of best models for undivided dataset

3.2.1. Model 1 (T. zillii) for undivided dataset

The first descriptor, AATS5e, belongs to the autocorrelation type, which is based on the weighted average of atomic Sanderson electronegativities, lagged by a topological distance of 5. The descriptor indicates that an increase in the weighted electronegativities will have a positive contribution to the pLC50 values. The second descriptor, VR3_DzZ, is a 3D-MoRSE descriptor that represents the atomic van der Waals volumes weighted by atomic valence connectivity. The descriptor's positive contribution means that an increase in the atomic van der Waals volumes and valence connectivity will result in higher pLC50 values, which is indicative of increased toxicity. The third descriptor, VP-3, is a vertex degree descriptor and reflects the total number of atoms with three connected vertices in the molecular structure. The positive contribution of this descriptor indicates that the presence of more atoms with three connected vertices will increase the pLC50 value.

3.2.2. Model 2 (O. niloticus) for undivided dataset

The first descriptor, nCl, represents the number of chlorine atoms in the compound. The positive contribution of this descriptor suggests that an increase in the number of chlorine atoms results in a higher pLC50 value, indicating greater toxicity to O. niloticus. The second descriptor, MIC2, is a 2D autocorrelation descriptor that considers the atomic mass at topological distance 2. The negative contribution of this descriptor implies that compounds with greater mass at a topological distance of 2 exhibit reduced pLC50 values, meaning lower toxicity levels. The third descriptor, SIC2, is another 2D autocorrelation descriptor, which focuses on the atomic Sanderson electronegativity at topological distance 2. The positive contribution of SIC2 suggests that an increase in atomic electronegativity at this distance leads to higher pLC50 values, corresponding to increased toxicity. The fourth descriptor, MATS4i, is a topological descriptor that represents the Moran autocorrelation of lag 4 weighted by ionization potential. The negative contribution of MATS4i means that compounds with higher ionization potential values at a topological distance of 4 have lower pLC50 values and, therefore, lower toxicity. The final descriptor, ATS0p, is a topological descriptor that reflects the Broto-Moreau autocorrelation of lag 0 weighted by polarizability. The positive contribution of this descriptor indicates that an increase in polarizability values at a topological distance of 0 leads to higher pLC50 values, signifying increased toxicity levels.

3.2.3. Model 3 (O. mossambicus) for undivided dataset

The QSAR model for O. mossambicus incorporates five descriptors to predict the pLC50 values, which represent the toxicity levels of the compounds. The first descriptor, SHCsats, signifies the sum of hydrophobic carbon atoms within saturated chains. The positive contribution of SHCsats suggests that an increase in hydrophobic carbon atoms leads to higher pLC50 values, indicating greater toxicity to O. mossambicus. The second descriptor, AATS1m, is an autocorrelation descriptor that reflects the average weighted atomic mass at a topological distance of 1. The positive contribution of AATS1m indicates that compounds with a greater average weighted mass at this distance exhibit increased pLC50 values, signifying higher toxicity levels. The third descriptor, ATSC7e, is a topological descriptor representing the centered Broto-Moreau autocorrelation of lag 7 weighted by atomic Sanderson electronegativity. The positive contribution of ATSC7e implies that an increase in the weighted electronegativity at a topological distance of 7 results in higher pLC50 values, corresponding to increased toxicity. The fourth descriptor, SRW6, is a molecular walk count descriptor that measures the total number of walks of length 6 in the compound. The positive contribution of SRW6 means that compounds with a higher number of walks of length 6 have increased pLC50 values and, therefore, greater toxicity. The final descriptor, maxdssC, represents the maximum topological distance between pairs of carbon atoms in the compound. The positive contribution of maxdssC suggests that an increase in the maximum topological distance between carbon pairs leads to higher pLC50 values, indicating increased toxicity levels.


3.3. QSAR and q-RASAR models for divided dataset

3.3.1. Models quality

The divided datasets were used to develop one QSAR and one q-RASAR model for each species, giving a total of six models (Models 7–12). For T. zillii and O. mossambicus, the QSAR models (Models 7 and 9) outperform the corresponding q-RASAR models (Models 10 and 12) in terms of both internal and external validation. For O. niloticus, the QSAR (Model 8) and q-RASAR (Model 11) models are comparable in terms of external validation, while Model 8 is slightly better than Model 11 on the internal validation criteria. Therefore, for prediction and interpretation, we considered Models 7, 8 and 9 for T. zillii, O. niloticus, and O. mossambicus, respectively. For stringent validation of the developed models, we also computed the Golbraikh and Tropsha criteria (Golbraikh and Tropsha, 2002); all the models passed the stipulated threshold values for the different metrics, which can be found in Table S2 of the Supplementary materials.

The scatter plots (Fig. 1a, c, e) of the observed and predicted pLC50 values show that all dataset compounds lie close to the line of best fit, which further supports the acceptability of the models. To check the prediction reliability for the test set compounds, we also performed the AD analysis by the leverage method and generated Williams plots (Fig. 1b, d, f). Inspection of these plots shows that all test compounds fall within the applicability domain of the respective models; the predictions for all test compounds of all three models can therefore be considered reliable. For O. niloticus, Fig. 1d shows two training compounds with leverage values greater than the critical leverage; these compounds behave as influential observations and are considered X outliers.
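For readers unfamiliar with the leverage approach, the quantities usually plotted in a Williams plot can be written out as below. The text does not state the formulas used, so the expressions follow the common convention (critical leverage of three times the number of model parameters divided by the number of training compounds) and should be read as an assumption; the symbols $\mathbf{X}$, $\mathbf{x}_i$, $p$, and $n$ are introduced here for illustration.

```latex
% Leverage of compound i, with X the training descriptor matrix
% (n compounds, p descriptors) and x_i the descriptor row vector of compound i
h_i = \mathbf{x}_i^{\top} \left( \mathbf{X}^{\top} \mathbf{X} \right)^{-1} \mathbf{x}_i

% Commonly used critical leverage (assumed convention)
h^{*} = \frac{3\,(p + 1)}{n}

% Worked example under this convention for Model 7 (p = 3, n = 21 training compounds)
h^{*} = \frac{3 \times (3 + 1)}{21} \approx 0.571
```

Under this convention a compound with $h_i > h^{*}$ lies outside the structural domain of the model, which corresponds to the X outliers mentioned for the O. niloticus training set.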

We also performed a Y-randomization test to check whether the models were obtained by chance. After shuffling the response values, 100 random models were generated; the average R² and Q² of those random models are 0.15 and −0.33, 0.14 and −0.23, and 0.10 and −0.16 for Models 7, 8 and 9, respectively, which are much lower than the acceptable limit of 0.5 for both parameters. Details of the modeled descriptors, the AD, and the values for all random models can be found in Tables S3 to S8 of the Supplementary materials. These results suggest that the models were not the outcome of chance correlation.

The correlation of the response with the modeled descriptors was also examined using correlation plots for Models 7, 8 and 9 (Fig. 2). For Model 7, strong positive correlations exist between the response variable pLC50(Y) and all three descriptors: VP-3 (0.82), VR3_DzZ (0.70), and AATS5e (0.86). This suggests that increases in these descriptor values correspond to increases in pLC50(Y). Among the descriptors, VP-3 has a moderately positive correlation (0.37) with VR3_DzZ and a strong positive correlation (0.70) with AATS5e. The correlation between VR3_DzZ and AATS5e is moderately positive (0.44). For Model 8, pLC50(Y) has a moderately strong positive correlation with TIC2 (0.43) and SRW5 (0.51), indicating that an increase in these descriptor values is associated with an increase in pLC50(Y). In contrast, pLC50(Y) exhibits only a weak negative correlation with MATS4i (−0.11) and no evident correlation with ATSC2m (−0.01); these relationships are too weak to support firm conclusions. For Model 9, the robust positive correlation between pLC50(Y) and maxsssCH (0.56), AATS1m (0.55), and SHCsats (0.52) indicates that an increase in these descriptor values is associated with an increase in pLC50(Y). pLC50(Y), on the other hand, exhibits a strong negative correlation with IC4 (−0.66), indicating that a decrease in IC4 values corresponds to an increase in pLC50(Y). The weakly positive correlation between pLC50(Y) and ATSC7e (0.22) indicates a potential, albeit limited, association.

Table 2

QSAR and q-RASAR based models' statistical quality and mathematical equations for undivided dataset.

QSAR

Model 1 (T. zillii): pLC50(Y) = −4.615 (±1.11) + 1.031 (±0.155) AATS5e + 0.094 (±0.012) VR3_DzZ + 0.25 (±0.043) VP-3
Dataset size (N) = 30; No. of descriptors = 3; R² = 0.942; Q²LOO = 0.923; Q²LMO = 0.922; MAE(95 % data) = 0.888

Model 2 (O. niloticus): pLC50(Y) = −3.151 (±1.197) + 0.636 (±0.099) nCl − 0.114 (±0.027) MIC2 + 11.182 (±2.174) SIC2 − 2.816 (±0.863) MATS4i + 0.069 (±0.008) ATS0p
Dataset size (N) = 57; No. of descriptors = 5; R² = 0.691; Q²LOO = 0.635; Q²LMO = 0.634; MAE(95 % data) = 0.511

Model 3 (O. mossambicus): pLC50(Y) = −0.943 (±0.667) + 0.48 (±0.1) SHCsats + 0.01 (±0.002) AATS1m + 0.277 (±0.073) ATSC7e + 0.621 (±0.119) SRW6 + 0.585 (±0.169) maxdssC
Dataset size (N) = 74; No. of descriptors = 5; R² = 0.764; Q²LOO = 0.724; Q²LMO = 0.723; MAE(95 % data) = 0.618

q-RASAR

Model 4 (T. zillii): pLC50(Y) = −4.615 (±1.11) + 1.031 (±0.155) AATS5e + 0.094 (±0.012) VR3_DzZ + 0.25 (±0.043) VP-3
Dataset size (N) = 30; No. of descriptors = 3; R² = 0.895; Q²LOO = 0.863; Q²LMO = 0.862; MAE(95 % data) = 0.377

Model 5 (O. niloticus): pLC50(Y) = 4.014 (±0.337) − 3.752 (±1.017) MATS4i + 0.009 (±0.003) TIC2 + 8.922 (±1.568) gm*SD Similarity
Dataset size (N) = 57; No. of descriptors = 3; R² = 0.568; Q²LOO = 0.511; Q²LMO = 0.510; MAE(95 % data) = 0.648

Model 6 (O. mossambicus): pLC50(Y) = −5.462 (±1.665) + 0.052 (±0.024) VR3_DzZ + 0.978 (±0.294) AATS5e + 0.494 (±0.181) RA function(LK)
Dataset size (N) = 74; No. of descriptors = 3; R² = 0.707; Q²LOO = 0.668; Q²LMO = 0.667; MAE(95 % data) = 0.591

Table 3

QSAR and q-RASAR based models' statistical quality and mathematical equations for divided dataset.

QSAR

Model 7 (T. zillii): pLC50(Y) = −5.044 (±1.635) + 0.243 (±0.054) VP-3 + 0.089 (±0.016) VR3_DzZ + 1.099 (±0.230) AATS5e
NTr/NTest = 21/9; R² = 0.94; Q²(LOO) = 0.90; r_m²(LOO) = 0.86; Δr_m²(LOO) = 0.05; Q²F1 = 0.95; Q²F2 = 0.94; r_m²(test) = 0.85; Δr_m²(test) = 0.05

Model 8 (O. niloticus): pLC50(Y) = 1.662 (±0.57) + 0.018 (±0.003) TIC2 − 3.078 (±1.015) MATS4i + 0.436 (±0.094) SRW5 − 0.001 (±0.0001) ATSC2m + 0.014 (±0.003) AATS2m
NTr/NTest = 40/17; R² = 0.74; Q²(LOO) = 0.67; r_m²(LOO) = 0.55; Δr_m²(LOO) = 0.16; Q²F1 = 0.77; Q²F2 = 0.75; r_m²(test) = 0.67; Δr_m²(test) = 0.16

Model 9 (O. mossambicus): pLC50(Y) = 0.530 (±0.512) + 1.487 (±0.428) maxsssCH + 0.637 (±0.138) IC4 + 0.272 (±0.085) ATSC7e + 0.380 (±0.119) SHCsats + 0.011 (±0.003) AATS1m
NTr/NTest = 52/22; R² = 0.77; Q²(LOO) = 0.70; r_m²(LOO) = 0.59; Δr_m²(LOO) = 0.17; Q²F1 = 0.74; Q²F2 = 0.72; r_m²(test) = 0.63; Δr_m²(test) = 0.20

q-RASAR

Model 10 (T. zillii): pLC50(Y) = −0.076 (±0.804) + 1.020 (±0.126) RA function(LK) + 0.012 (±1.981) CVact(LK) − 0.293 (±1.345) SD similarity(LK)
NTr/NTest = 21/9; R² = 0.84; Q²(LOO) = 0.69; r_m²(LOO) = 0.59; Δr_m²(LOO) = 0.07; Q²F1 = 0.91; Q²F2 = 0.87; r_m²(test) = 0.81; Δr_m²(test) = 0.10

Model 11 (O. niloticus): pLC50(Y) = 1.669 (±0.589) + 0.018 (±0.003) TIC2 − 3.0153 (±1.045) MATS4i − 0.001 (±0.0001) ATSC2m + 0.014 (±0.003) AATS2m + 6.630 (±1.552) gm*SD Similarity
NTr/NTest = 40/17; R² = 0.73; Q²(LOO) = 0.66; r_m²(LOO) = 0.54; Δr_m²(LOO) = 0.16; Q²F1 = 0.77; Q²F2 = 0.75; r_m²(test) = 0.65; Δr_m²(test) = 0.18

Model 12 (O. mossambicus): pLC50(Y) = −0.529 (±0.487) + 0.161 (±0.088) ATSC7e + 0.007 (±0.003) AATS1m + 0.923 (±0.116) RA function(LK)
NTr/NTest = 52/22; R² = 0.74; Q²(LOO) = 0.70; r_m²(LOO) = 0.58; Δr_m²(LOO) = 0.20; Q²F1 = 0.60; Q²F2 = 0.57; r_m²(test) = 0.47; Δr_m²(test) = 0.08

Fig. 1. Scatter plots and Williams plots for Models 7 (a, b), 8 (c, d) and 9 (e, f).

3.3.2. Mechanistic interpretations

Fig. 2. Correlation plots (a), (b) and (c) of the modeled descriptors for Models 7, 8 and 9, respectively.

The QSAR model (Model 7) for T. zillii employs three descriptors to predict pLC50 values, which indicate the toxicity of the compounds. The first descriptor, VP-3, is a vertex degree descriptor that represents the total count of atoms in the molecular structure with three connected vertices. Comparing Endrin (VP-3 value: 10.08) with Morpholine (VP-3 value: 0.85) makes the relationship between the descriptor and pLC50 evident: Endrin, with the higher VP-3 value, exhibits a pLC50 of 8.09, while Morpholine, with a considerably lower VP-3 value, shows a pLC50 of 1.94. This is consistent with our hypothesis that an increase in the VP-3 value (more atoms with three connected vertices) results in higher pLC50 values, signifying greater toxicity. The second descriptor, VR3_DzZ, is a 3D-MoRSE descriptor that accounts for the atomic van der Waals volumes weighted by atomic valence connectivity. Its positive contribution indicates that as the atomic van der Waals volumes and valence connectivity increase, pLC50 values rise, denoting increased toxicity. Comparing p,p-DDT (VR3_DzZ value: 29.65) with O-Xylene (VR3_DzZ value: 2.79) shows that the compound with the higher VR3_DzZ value, p,p-DDT, has a higher pLC50 than O-Xylene, which has a pLC50 of 3.43. This supports our initial observation that as VR3_DzZ values increase, so does the toxicity of the compound. The third descriptor, AATS5e, belongs to the autocorrelation type and calculates the weighted average of atomic Sanderson electronegativities lagged by a topological distance of 5. Phenthoate, with an AATS5e value of 8.19, exhibits a pLC50 of 7.15, considerably greater than the pLC50 of 2.84 shown by Toluene, which has an AATS5e value of 6.85. This further confirms that as the weighted electronegativities increase, the compound becomes more toxic (Fig. 3). Molecules with a higher count of atoms with three connected vertices (higher VP-3 values) might be more prone to bioactive interactions with target proteins, leading to increased toxicity. Similarly, a higher VR3_DzZ value might suggest larger van der Waals surface areas, which can enhance molecular contacts with biological targets. The increased weighted electronegativity reflected in AATS5e values may indicate the molecule's propensity to engage in polar interactions, which in turn can affect its toxicological profile.
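To make the sign-based reasoning above concrete, the Model 7 equation from Table 3 can be evaluated term by term. This is only an illustrative sketch: the coefficients are those reported for Model 7, but the function name and the descriptor values of the example compound are hypothetical, since the text does not list all three descriptor values for any single chemical.

```python
# Coefficients of Model 7 (T. zillii) as reported in Table 3.
MODEL7_INTERCEPT = -5.044
MODEL7_COEFFS = {"VP-3": 0.243, "VR3_DzZ": 0.089, "AATS5e": 1.099}


def predict_plc50(descriptors):
    """Return the predicted pLC50 and the contribution of each descriptor term."""
    contributions = {
        name: coef * descriptors[name] for name, coef in MODEL7_COEFFS.items()
    }
    return MODEL7_INTERCEPT + sum(contributions.values()), contributions


# Hypothetical descriptor values for an illustrative compound (not from the dataset).
example_compound = {"VP-3": 2.0, "VR3_DzZ": 10.0, "AATS5e": 7.0}

plc50, terms = predict_plc50(example_compound)
print(round(plc50, 3))
for name, value in terms.items():
    print(name, round(value, 3))
```

Since all three coefficients are positive, raising any descriptor value raises the predicted pLC50, which is the pattern the compound comparisons in the text describe.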

Fig. 3. Mechanistic interpretation of QSAR Model 7 for Tilapia zillii.

Model 8 for O. niloticus employs five descriptors to predict pLC50 values, representing toxicity levels. The first, TIC2, is an autocorrelation-type descriptor based on atomic Sanderson electronegativity at a topological distance of 2. Carbosulfan, with a TIC2 value of 243.23, manifests a pLC50 of 6.35, suggesting substantial toxicity. In contrast, Formaldehyde, with a TIC2 value of 0, shows a much lower pLC50 value of 2.27. The positive coefficient for TIC2 indicates that as the value of this descriptor increases, the predicted toxicity also rises. MATS4i is a topological descriptor that calculates the Moran autocorrelation of lag 4 weighted by ionization potential. Oxadiazon, with a MATS4i value of 0.3, exhibits a pLC50 of 3.75, whereas Triazophos, with a negative MATS4i value, presents a higher pLC50 of 6.95. This negative relationship indicates that compounds with a lower MATS4i value tend to be more toxic. SRW5 is a molecular walk count descriptor that measures the total number of walks of length 5 in the compound. Dieldrin, with an SRW5 value of 4.51, manifests a pLC50 of 7.38, indicating significant toxicity. In stark contrast, Hexazinone, with an SRW5 value of 0, has a considerably lower pLC50 of 3. The positive coefficient for SRW5 suggests that as the total number of molecular walks of length 5 increases, so does the predicted toxicity. ATSC2m is a topological descriptor that represents the centered Broto-Moreau autocorrelation of lag 2 weighted by atomic mass, and it has a negative influence on pLC50 values. For instance, Oxadiazon has an ATSC2m value of 1226.49 and a pLC50 of 3.75, while Pretilachlor, with a negative ATSC2m value of −145.39, has a significantly higher pLC50 of 8.19. The negative coefficient for ATSC2m in the model implies that as this descriptor decreases, the predicted toxicity increases. AATS2m is an autocorrelation descriptor that calculates the average weighted atomic mass at a topological distance of 2, and it has a positive influence on pLC50 values. Comparing Endosulfan and Formaldehyde shows that an elevated AATS2m value correlates with higher toxicity. This positive relationship might arise because larger atomic masses in a given topology can lead to increased molecular interactions that promote toxicity (Fig. 4). Compounds with higher TIC2 and AATS2m values might have enhanced electronegative or atomic mass interactions that augment toxicity. Conversely, compounds with lower ATSC2m and MATS4i values could exhibit molecular features that make them more prone to bioactive interactions. The SRW5 descriptor, indicating molecular complexity, suggests that more complex molecules tend to have a higher potential to interact with biological targets, resulting in augmented toxicity.

Fig. 4. Mechanistic interpretation of QSAR Model 8 for Oreochromis niloticus.

Model 9 for O. mossambicus utilizes five descriptors to predict pLC50 values, indicating toxicity levels. The first descriptor, maxsssCH, represents the maximal number of singly bonded carbons in the compound. It has a positive effect on pLC50 values, indicating that an increase in maxsssCH correlates with increased toxicity. Aldrin, with a maxsssCH value of 1.18, manifests a pLC50 of 7.8, signifying considerable toxicity. Conversely, Urea, which has a maxsssCH value of 0, exhibits a much lower pLC50 of 1.76. This affirms that as the number of singly bonded carbons in a molecule rises, the predicted toxicity also increases. The second descriptor, IC4, is an information content descriptor that measures the structural complexity of the compound at a topological distance of 4, and its positive contribution indicates that a higher IC4 value results in greater toxicity and increased pLC50 values. Rotenone, with an IC4 value of 5.11, presents a pLC50 of 6.73, indicating notable toxicity. In contrast, Methanol, with an IC4 value of 0, has a pLC50 of only 0.32. The positive contribution of IC4 confirms that molecules with higher structural complexity are typically associated with increased toxicity. The third descriptor, ATSC7e, represents the centered Broto-Moreau autocorrelation of lag 7 weighted by atomic Sanderson electronegativity, with a positive contribution indicating that a higher ATSC7e value is associated with increased toxicity. p,p-DDT, with an ATSC7e value of 3.00, records a pLC50 of 7.18, reflecting significant toxicity. MCPA, however, with a negative ATSC7e value of −2.41, demonstrates a lower pLC50 of 2.56. This suggests that an elevated ATSC7e value correlates with heightened toxicity. The fourth descriptor, SHCsats, refers to the total number of hydrophobic carbon atoms within saturated chains; its positive influence suggests that more hydrophobic carbon atoms result in higher pLC50 values, indicating increased toxicity. For instance, Malathion, with an SHCsats value of 3.57, shows a pLC50 of 5.16, indicating a strong toxic profile. In contrast, 2,4-Dichlorophenoxyacetic acid, with an SHCsats value of 0, has a lower pLC50 of 2.22. This reiterates that an increase in hydrophobic carbon atoms within saturated chains typically leads to heightened toxicity. The fifth descriptor, AATS1m, is an autocorrelation descriptor that calculates the average weighted atomic mass at a topological distance of 1. Its positive contribution indicates that compounds with greater AATS1m values have higher pLC50 values, indicating greater toxicity. Endosulfan, with an AATS1m value of 221.97, exhibits a pLC50 of 8.19, signaling profound toxicity. In comparison, Toluene, with an AATS1m value of 0, yields a pLC50 of 3, indicating that molecules with a higher weighted atomic mass at a topological distance of 1 tend to be more toxic (Fig. 5). The mechanistic explanations for these observations might be rooted in various molecular interactions and structural complexities. Compounds with a higher count of singly bonded carbons might possess structural elements that increase the chances of bioactive interactions, enhancing their toxicity. Additionally, molecules with higher structural complexity, reflected in elevated IC4 values, might be more apt to engage in multi-faceted interactions with biological systems. The weighted atomic mass and hydrophobicity of certain atoms further fine-tune these interactions, either promoting or reducing the toxic effects.

3.4. External dataset prediction

Fig. 5. Mechanistic interpretation of QSAR Model 9 for Oreochromis mossambicus.

The predictive capabilities of both the small dataset QSAR models (undivided dataset) and the classical QSAR models (dataset divided into training and test sets) were evaluated using a total of 297 external compounds. Specifically, Models 1, 2, 3, 7, 8 and 9 were used for prediction. The quantitative predictions and prediction reliability, along with AD information, for each species can be found in Tables S9 to S11. The required descriptors for the external compounds were calculated using the PaDEL-Descriptor software. We have reported the ten most toxic and the ten least toxic chemicals (Fig. 6) for T. zillii, O. niloticus and O. mossambicus in the external dataset, considering only predictions categorized as ‘good’ and ‘within the application domain’ and employing the average predictions of Models 1 and 7, 2 and 8, and 3 and 9, respectively. These top 10 highest and lowest toxicants for the individual species are listed in the Supplementary Materials in Tables S12 and S15, respectively. In addition, a combined analysis revealed the top five common least toxic (Table S14) and most toxic (Table S15) chemicals across all three tilapia fish species (Fig. 6).
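The selection procedure described above (keep only reliable, in-domain predictions, average the two models for a species, then rank) can be sketched as follows. The record layout, field names, and chemical names are hypothetical and chosen for illustration; they do not reproduce the format of Tables S9 to S11.

```python
def consensus_ranking(records, top_n=10):
    """Average two model predictions per chemical and rank the reliable ones.

    Each record is a dict with the hypothetical keys 'name', 'pred_a', 'pred_b',
    'quality' and 'in_domain'. Only records flagged 'good' and inside the
    applicability domain are kept. Returns (most_toxic, least_toxic) lists of
    (name, mean pLC50) tuples, each ordered from the most extreme value inward.
    """
    kept = [
        (rec["name"], (rec["pred_a"] + rec["pred_b"]) / 2.0)
        for rec in records
        if rec["quality"] == "good" and rec["in_domain"]
    ]
    ranked = sorted(kept, key=lambda item: item[1], reverse=True)
    return ranked[:top_n], ranked[-top_n:][::-1]


# Hypothetical external-set predictions for one species.
records = [
    {"name": "chemical_A", "pred_a": 7.0, "pred_b": 7.4, "quality": "good", "in_domain": True},
    {"name": "chemical_B", "pred_a": 2.0, "pred_b": 2.2, "quality": "good", "in_domain": True},
    {"name": "chemical_C", "pred_a": 9.0, "pred_b": 9.0, "quality": "moderate", "in_domain": True},
    {"name": "chemical_D", "pred_a": 5.0, "pred_b": 5.5, "quality": "good", "in_domain": False},
    {"name": "chemical_E", "pred_a": 4.0, "pred_b": 4.4, "quality": "good", "in_domain": True},
]

most_toxic, least_toxic = consensus_ranking(records, top_n=2)
print([name for name, _ in most_toxic])
print([name for name, _ in least_toxic])
```

In this sketch, chemical_C has the highest raw prediction but is excluded because its reliability flag is not 'good', which mirrors why the reported rankings are restricted to reliable, in-domain predictions.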

When considering all predictions that fall within the application domain and meet the required standard of predictive accuracy, we observe noteworthy patterns for T. zillii, O. niloticus, and O. mossambicus. In the case of T. zillii, the predicted pLC50 values for the external chemicals range from 2.424 to 8.561; heptachlor epoxide is predicted to be the most toxic chemical, whereas 1,2-ethanediol is predicted to induce the least toxicity. For O. niloticus, the predicted pLC50 values range from 2.697 to 7.771. Here, cyfluthrin emerges as the most toxic chemical, with a pLC50 value of 7.719, while benzene is the least toxic, with a value of 2.697. The pLC50 values for O. mossambicus are predicted to fall between 1.841 and 7.845. As observed for T. zillii, heptachlor epoxide is expected to be the most toxic chemical to O. mossambicus, whereas ethanol appears to exert the least toxic effect. Toxicity values are expressed in molar units.
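For readers who want the predicted values on a concentration scale, the conversion can be sketched as below. It assumes the usual convention that pLC50 is the negative base-10 logarithm of LC50 expressed in mol/L; the function names are hypothetical.

```python
import math


def plc50_to_lc50_molar(plc50):
    """Convert pLC50 to LC50 in mol/L, assuming pLC50 = -log10(LC50 [mol/L])."""
    return 10.0 ** (-plc50)


def lc50_molar_to_plc50(lc50_molar):
    """Convert LC50 in mol/L back to pLC50 under the same assumption."""
    return -math.log10(lc50_molar)


# Extremes of the predicted pLC50 ranges reported above for the external set.
for plc50 in (1.841, 8.561):
    print(plc50, f"{plc50_to_lc50_molar(plc50):.3e}")
```

Under this convention each unit of pLC50 corresponds to a tenfold change in LC50, so the span from 1.841 to 8.561 covers nearly seven orders of magnitude in lethal concentration.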

4. Overview and conclusions

This work presents the first mathematical QSAR and q-RASAR models of aquatic toxicity for a fish species consumed globally as a significant source of animal protein. The importance of the study stems from the direct relationship between chemical toxicity to tilapia fish, human consumption, and environmental impact. The major novelty and significance of the present research are as follows:

1. Understanding Toxicological Mechanism of Action (MoA):

•The developed mathematical models provide a comprehensive understanding of how studied chemicals affect individual tilapia species at a toxicological level.

•The models offer insights into the structural and physicochemical properties of the chemicals responsible for toxicity in tilapia species.

2. Designing Safer Chemicals:

•The models' ability to identify specific features of studied chemicals contributing to tilapia species' toxicity aids in the creation and design of safer and environmentally friendly chemicals in the future.

3. Robustness and Applicability of Models:

•The study highlights the robustness and practicality of the developed QSAR and q-RASAR models for toxicity and risk assessment.

•These models successfully predict toxicity for compounds not originally part of the model development, showcasing their versatility.

4. Prediction and Assessment of Untested Chemicals:

•The developed QSAR and q-RASAR models enable the prediction of aquatic toxicity for 297 untested chemicals across three major Tilapia species.

Fig. 6. The ten least and ten most toxic chemicals for each species, and the five least and five most toxic chemicals common to all three species, in the external dataset.