

RESEARCH ARTICLE

Considering smoking status, coexpression network analysis of non–small cell lung cancer at different cancer stages, exhibits important genes and pathways

Zahra Mortezaei | Mahmood Tavallaei | Sayed Mostafa Hosseini

Human Genetic Research Center, Baqiyatallah University of Medical Sciences, Tehran, Iran

Correspondence

Sayed Mostafa Hosseini, Human Genetic Research Center, Baqiyatallah University of Medical Sciences, Sheikh Bahaee Avenue, Molla Sadra, Vanak Square, Postal code: 14351‐16471, Tehran, Iran.

Email: [email protected]

Abstract

Non–small cell lung cancer (NSCLC) is the most common subtype of lung cancer among smokers, nonsmokers, women, and young individuals. Tobacco smoking and the stage of the NSCLC play important roles in cancer evolution and call for different treatments. The lack of effective therapeutic options for the NSCLC draws special attention to targeted therapies that take genetic alterations into account. In this study, we used RNA‐Seq data to compare expression levels of RefSeq genes and to find genes with similar expression levels. We applied the “Weighted Gene Co‐expression Network Analysis” method to three different datasets to create coexpressed genetic modules related to the smoking status and the stages of the NSCLC. Our results indicate seven genetic modules with important associations with the smoking status and cancer stages. Based on the investigated genetic modules and their biological interpretation, we then identified 13 new candidate genes and 7 novel transcription factors associated with the NSCLC, the smoking status, and cancer stages. We then examined those results using other datasets and interpreted them biologically to highlight important genes related to the smoking status and the metastatic stage of the NSCLC, which can provide crucial information about cancer evolution. Our genetic findings can also serve as therapeutic targets for different clinical conditions of the NSCLC.

KEYWORDS

cancer stages, coexpression network analysis, non–small cell lung cancer (NSCLC), RNA‐Seq, smoking status, transcription factors

1 | INTRODUCTION

Non–small cell lung cancer (NSCLC) is a subtype of lung cancer and is the most common type of lung cancer in smokers, nonsmokers, women, and young individuals. The NSCLC is one of the leading causes of cancer‐related death worldwide. Although outstanding progress has been made in several areas of oncology, the prognosis of the NSCLC remains unknown and its mechanism is still under investigation.

Tobacco smoking and environmental tobacco smoke are important risk factors for the NSCLC. The fact that a fraction of nonsmokers develop the NSCLC indicates that genetic factors may play important roles in determining susceptibility to the NSCLC.^1-4

Early‐stage diagnosis of the NSCLC is usually not possible, and the therapeutic options for that kind of cancer are poorly effective. For example, the NSCLC is usually resistant to chemotherapy.⁵ Therefore, understanding the molecular etiology and the choices for nonsurgical treatments such as targeted therapies will be useful. Some previous studies demonstrated biomarkers for the NSCLC.⁶ Carcinogenesis is the outcome of complex mechanisms, including the interaction between genes with similar expression levels.⁷ Thus, it can be useful to find novel molecular biomarkers that can predict clinical outcomes and cancer stages. By affecting the chemical reactions that involve a specific molecular biomarker, targeted therapy can prevent the spread and growth of the NSCLC.^8,9

Treatment of the NSCLC depends on factors such as cancer stage, lung function, and patient condition. In comparison with other types of lung cancer, the NSCLC usually involves one area of the lung. For that reason, testing cancerous tissue for specific genetic abnormalities or mutations can be useful for targeted therapy. Staging is one way to describe the location of a cancer, its spread, and its effect on other parts of the body. Knowledge about the stages of lung cancer can help to recommend appropriate treatment plans.

In lung cancer, only certain stages can be cured. Understanding the genetic variations associated with each stage of the NSCLC can therefore inform personalized treatment through genes and targeted therapies.^10,11

RNA‐Seq technology is a useful tool for transcriptome analysis. In comparison with the microarray platform, it sheds more light on the pathological mechanisms of different cancers and has the benefit of less background noise.¹² In the past, most studies concentrated on differentially expressed genes, and less attention has been paid to genes with similar expression patterns, although their analysis can yield biologically meaningful results. For example, Li et al¹³ used RNA‐Seq data and compared gene expression among cancer samples considering smoking and nonsmoking histories. More recently, systems biology approaches have been used to analyze RNA‐Seq data, to extract coexpressed genes, and to explore relationships between gene sets and clinical features.^14,15 To our knowledge, only a limited number of studies have focused on the coexpression network analysis of the NSCLC considering clinical information related to the smoking history and the cancer stages.

The aim of the current study is to employ the systems biology approaches to jointly analyze the RNA‐Seq data and the clinical information and to identify key genes relating to some clinical features like the smoking status and the cancer stages. Transcription factors, matrix metalloproteinases (MMP) and signaling pathways in association with the important genes were also identified.

Our findings can be proposed as some therapeutic options for the NSCLC with different smoking status and cancer stages.

2 | MATERIAL AND METHODS

2.1 | RNA‐Seq and clinical data

Transcriptome sequencing data from 87 Korean NSCLC patients were downloaded from the National Center for Biotechnology Information (NCBI)¹⁶ Gene Expression Omnibus (GEO) under accession number GSE40419. In that dataset, expression levels were measured for 36,742 RefSeq genes using aligned RNA sequencing reads. The number of reads aligned to each gene was then normalized using Reads Per Kilobase per Million mapped reads (RPKM).¹⁷ Among the 36,742 genes, we selected the 4000 most variable genes for further analysis.
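The normalization and gene-filtering steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used to produce the GSE40419 values: the function names are hypothetical, and ranking genes by sample variance of the untransformed expression values is an assumption, since the text does not state how variability was measured.

```python
import numpy as np

def rpkm(counts, gene_lengths_bp):
    """Reads Per Kilobase per Million mapped reads.

    counts: (genes x samples) array of aligned read counts.
    gene_lengths_bp: length of each gene in base pairs.
    """
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float)[:, None] / 1e3
    # Total mapped reads per sample, in millions.
    per_million = counts.sum(axis=0, keepdims=True) / 1e6
    return counts / lengths_kb / per_million

def top_variable_genes(expr, n_top):
    """Row indices of the n_top genes with the largest variance across samples."""
    variances = np.var(np.asarray(expr, dtype=float), axis=1, ddof=1)
    order = np.argsort(variances)[::-1]
    return np.sort(order[:n_top])
```

With the numbers from the text, the filter would be called as `top_variable_genes(expr, 4000)` on the 36,742-row expression matrix.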

For the 87 individuals with RNA‐Seq data from cancer specimens, validated clinical information, including the smoking status and the stage of the NSCLC, was also provided.

The smoking history from 87 patients was grouped into smoking and nonsmoking status. Also, the clinical information indicating stages of the NSCLC has been grouped into two stages. We grouped stages 1 and 2 into the initial step, and stages 3 and 4 of the NSCLC into the metastasis step. We also downloaded and utilized GSE84339 NSCLC dataset from the NCBI, GEO, which contains measured expression levels of 24555 genes among 12 individuals and also the TCGA dataset that contains 20532 RefSeq genes among 179 samples.^18,19
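The two-group encoding of the stages can be sketched as below. The label format (a stage number optionally followed by a letter, such as "1A") is an assumption made for illustration; the actual encoding of the clinical table is not given in the text, and both function names are hypothetical.

```python
def stage_group(stage_label):
    """Map a stage label such as "1A" or 3 to the two groups used in the text."""
    # Assumes the label starts with the stage number 1-4.
    stage_number = int(str(stage_label).strip()[0])
    if stage_number in (1, 2):
        return "initial"
    if stage_number in (3, 4):
        return "metastasis"
    raise ValueError(f"unexpected stage label: {stage_label!r}")

def encode_binary(labels, positive):
    """Turn group labels into a 0/1 trait vector for correlation analysis."""
    return [1 if label == positive else 0 for label in labels]
```

A trait vector built this way (for example `encode_binary(groups, "metastasis")`) can then be correlated with module eigengenes.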

2.2 | Coexpression network construction

To identify genes highly relevant to the given clinical information, the overall connectivity and the gene expression levels were compared using coexpression correlations. We used the Weighted Gene Coexpression Network Analysis (WGCNA) approach for network construction and module detection.²⁰ With that approach, network modules are detected from the expression correlation patterns among genes. Highly correlated genes within a network module are likely to be functionally related or involved in similar biological processes. In addition, a higher correlation between a genetic module and a clinical status indicates that the genes of that module are more closely related to the given clinical information.

For each pair of genes, Pearson’s correlation was calculated between their expression profiles, which indicates the linear dependency between them. Gene networks were then created in which the pairwise Pearson correlations between gene expressions represent the edges. The absolute value of the Pearson correlation $c_{mn}$ between genes $m$ and $n$, raised to the soft‐thresholding power $\beta = 7$,

$$a_{mn} = |c_{mn}|^{\beta},$$

gives the weighted adjacency matrix of the coexpression network. The soft‐thresholding power was chosen according to the scale‐free topology fit index, which indicates how closely the network topology follows the scale‐free property, and it was used to reduce noise in the correlations represented by the adjacency matrix. The power giving high similarity with a scale‐free network was selected using the pickSoftThreshold command in the WGCNA package of R.²⁰
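As an illustration of the soft-thresholding step, the weighted adjacency (absolute Pearson correlation raised to the power β) can be computed as in the sketch below. This is a Python approximation for exposition only; the analysis in the text was carried out with the WGCNA package of R, and the function name and the samples-by-genes layout are assumptions.

```python
import numpy as np

def soft_threshold_adjacency(expr, beta=7):
    """Weighted adjacency a_mn = |c_mn| ** beta.

    expr: (samples x genes) expression matrix.
    beta: soft-thresholding power (7 in the text).
    """
    # Pearson correlation between every pair of gene columns.
    corr = np.corrcoef(np.asarray(expr, dtype=float), rowvar=False)
    return np.abs(corr) ** beta
```

Raising the correlations to a power keeps strong correlations close to 1 while pushing weak ones toward 0, which is how the step suppresses noisy edges without a hard cut-off.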

The connectivity of the coexpression network was calculated using the topological overlap matrix (TOM), which is constructed from the adjacency matrix and captures topological similarity. Using a weighted sum, the TOM accounts for shared neighbors, and a low value indicates a weaker connection. The dissimilarity matrix (1 − TOM) was then used as a distance to generate a hierarchical clustering tree of genes. Next, the dynamicTreeCut command in the WGCNA package of R was used with a cut‐off value of 20 to create genetic modules with a minimum size of 20.
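The text does not spell out the TOM formula. The sketch below assumes the unsigned topological overlap commonly used with WGCNA, TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 − a_ij), where l_ij sums the products of adjacencies to shared neighbors and k_i is the connectivity of gene i. The function names are hypothetical.

```python
import numpy as np

def topological_overlap(adjacency):
    """Unsigned topological overlap matrix from a weighted adjacency matrix."""
    a = np.array(adjacency, dtype=float)
    np.fill_diagonal(a, 0.0)          # ignore self-connections
    shared = a @ a                    # l_ij = sum over u of a_iu * a_uj
    k = a.sum(axis=1)                 # connectivity of each gene
    min_k = np.minimum.outer(k, k)
    tom = (shared + a) / (min_k + 1.0 - a)
    np.fill_diagonal(tom, 1.0)
    return tom

def tom_dissimilarity(adjacency):
    """Distance used for hierarchical clustering: 1 - TOM."""
    return 1.0 - topological_overlap(adjacency)
```

The resulting dissimilarity matrix is what a hierarchical clustering routine would take as input before the tree is cut into modules.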

2.3 | Module‐trait association

In this study, module‐trait associations were estimated between the genetic modules and the trait vectors by comparing the modules according to their level of association with the given traits. Clinically significant modules were identified using correlations between the module eigengenes and the clinical traits. A module eigengene is the first principal component of the expression data of a module, derived from Principal Component Analysis (PCA), which helps to reduce the data meaningfully. The module eigengene summarizes the genes inside a genetic module, so each module can be represented by the expression values of its eigengene. The clinical traits chosen in this study are the smoking status and the stage of the NSCLC. Based on the existing clinical information, samples were grouped into smoking/nonsmoking statuses and beginning/metastasis stages.
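A module eigengene and its correlation with a clinical trait can be sketched as follows. The standardization of each gene and the sign convention (orienting the eigengene along the average expression) are assumptions of this sketch, and the function names are hypothetical; genes with constant expression would need to be removed beforehand.

```python
import numpy as np

def module_eigengene(expr_module):
    """First principal component of a module.

    expr_module: (samples x genes) expression of the genes in one module.
    Returns one value per sample.
    """
    x = np.asarray(expr_module, dtype=float)
    x = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)   # standardize each gene
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    eigengene = u[:, 0]
    # The sign of a principal component is arbitrary; orient it so that it
    # follows the average expression of the module.
    if np.corrcoef(eigengene, x.mean(axis=1))[0, 1] < 0:
        eigengene = -eigengene
    return eigengene

def module_trait_correlation(eigengene, trait):
    """Pearson correlation between a module eigengene and a trait vector."""
    return np.corrcoef(eigengene, np.asarray(trait, dtype=float))[0, 1]
```

Computing `module_trait_correlation` for every module and every trait gives the kind of module-by-trait matrix that is visualized as a heatmap.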

Two important parameters used to indicate and compare the correlations between the genetic modules and the clinical traits are module membership and gene significance. Using those parameters, we can identify, in each module, genes that are highly significant for a specific trait and also have high module membership. The module membership is the correlation between the module eigengene and the expression profile of a gene. It quantifies how close a gene is to a given module, and genes with high module membership are representative of the overall expression profile of the module. The gene significance is used to quantify the correlation between a clinical trait and the expression profile of each individual gene.

We used those parameters to compare the genetic modules based on their correlations with the clinical traits. A high correlation between module membership and gene significance indicates that the most important module members are often also the genes most significantly associated with the trait.
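The two parameters can be written as column-wise Pearson correlations, as in the sketch below. The helper and function names are hypothetical, and the samples-by-genes layout is an assumption.

```python
import numpy as np

def _column_correlations(expr, vector):
    """Pearson correlation of every gene column of expr with one vector."""
    x = np.asarray(expr, dtype=float)
    v = np.asarray(vector, dtype=float)
    xc = x - x.mean(axis=0)
    vc = v - v.mean()
    return (xc.T @ vc) / (np.linalg.norm(xc, axis=0) * np.linalg.norm(vc))

def module_membership(expr, eigengene):
    """Correlation of each gene with the module eigengene."""
    return _column_correlations(expr, eigengene)

def gene_significance(expr, trait):
    """Correlation of each gene with a clinical trait vector."""
    return _column_correlations(expr, trait)
```

Correlating the two resulting vectors across the genes of one module (often after taking absolute values) gives the membership-versus-significance relationship described above.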

2.4 | Genetic module analysis

In addition to constructing the network of coexpressed genes, we also used text mining in coexpression‐related public databases to find genes previously reported to have expression levels similar to the module members identified in this study. For this part of the analysis we used COXPRESdb (coxpresdb.jp), a database that contains previously reported coexpressed genes, their functions, biological pathways, and further annotations.²¹ After detecting genes coexpressed with the module members identified in this study, we examined those correlations and checked whether any of those genes has previously been reported as a lung cancer gene in public databases such as NCBI. When module members are coexpressed with previously reported lung cancer genes, we can highlight them as genes with potentially important associations with the NSCLC.

In this study, we also searched for metastatic genes among the module members. To do so, we looked for the metastatic gene lists of the NSCLC that exist in public databases such as NCBI and UCSC.²² In addition, it has been reported previously that an individual gene may be associated with the metastasis stages of different cancers.²³ Thus, we searched for our module members among breast cancer metastatic genes using “Mammaprint” as a breast cancer risk assessment database.²⁴ As the next step in the genetic module analysis, we looked at the protein classes of all module members and checked whether any module contains a transcription factor (TF), an important genetic marker that can regulate the expression levels of other genes.

2.5 | Functional and pathway enrichment analysis

In this step, we looked for the biological meaning behind the module members using the online Database for Annotation, Visualization and Integrated Discovery (DAVID),²⁵ a powerful enrichment analysis tool that provides a comprehensive biological view of our genetic findings. We also used the Protein Analysis Through Evolutionary Relationships (PANTHER)²⁶ database to classify the genes according to their molecular functions, biological processes, and cellular components. Some of the molecular functions and biological processes found in this step have previously been reported in the literature to be involved in lung cancer initiation or progression. On that basis, when the molecular functions or biological processes of module members are associated with lung cancer, we could detect new candidate genes in that module that may be associated with the NSCLC; these are subject to further biological and experimental analyses to validate the significance of their association.
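The general idea behind such enrichment tools is an over-representation test: given a module, is a pathway annotated to more of its genes than expected by chance? The sketch below uses a plain hypergeometric tail probability for illustration. It is not the statistic implemented by DAVID or PANTHER, which apply their own tests and corrections, and the function name is hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_annotated, n_module, n_overlap):
    """P(X >= n_overlap) for an over-representation test.

    n_background: number of genes considered in total.
    n_annotated:  background genes carrying the annotation (e.g. a pathway).
    n_module:     genes in the module being tested.
    n_overlap:    module genes carrying the annotation.
    """
    upper = min(n_annotated, n_module)
    total = comb(n_background, n_module)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_module - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

A small p-value suggests that the annotation is over-represented in the module; in practice the p-values of many pathways would also need a multiple-testing correction.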

3 | RESULTS AND DISCUSSION

3.1 | Preprocessing and outlier removal

Initially, we selected the 4000 most variable genes among the 36,742 RefSeq genes of the GSE40419 data for further analysis. Then, using the expression profiles of the samples, the samples were clustered by hierarchical clustering to detect outliers. As indicated in Figure S1, only one sample, “LC_S38”, was identified as an outlier and removed from the dataset for further analyses. For the remaining samples, the correlations between the clinical information and the samples are shown in Figure S2. We then verified that the network to be constructed by the coexpression network analysis has a scale‐free topology: the scale‐free fit index reached a value above 0.8. The scale‐free property is one of the fundamental properties of biological networks and indicates their biological meaningfulness.
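A scale-free fit index of this kind is usually obtained by regressing the log frequency of connectivity values on the log connectivity and reporting the R² of the fit. The sketch below follows that idea under stated assumptions: equal-width binning into ten bins and an ordinary least-squares fit are choices of this illustration, and the exact procedure used by the WGCNA package may differ. The function name is hypothetical.

```python
import numpy as np

def scale_free_fit_index(connectivity, n_bins=10):
    """R^2 and slope of log10 p(k) versus log10 k.

    connectivity: one connectivity value per gene, e.g. the row sums of the
    weighted adjacency matrix minus the self-connection.
    """
    k = np.asarray(connectivity, dtype=float)
    counts, edges = np.histogram(k, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = (counts > 0) & (centers > 0)      # logs need positive values
    log_k = np.log10(centers[keep])
    log_p = np.log10(counts[keep] / counts.sum())
    slope, intercept = np.polyfit(log_k, log_p, 1)
    r_squared = np.corrcoef(log_k, log_p)[0, 1] ** 2
    return r_squared, slope
```

For an approximately scale-free network, the slope is negative and R² is close to 1; the thresholds quoted in the text (above 0.8 or 0.9) refer to this kind of R² value.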

For the other GEO dataset, with accession number GSE84339, we also initially selected the 4000 most variable genes, and one sample was removed as an outlier, as indicated in Figure S3. The coexpression network constructed from that data showed a scale‐free fit index of 0.8. For the network analysis of the TCGA data, we likewise initially selected the 4000 most variable genes; four samples were identified as outliers (Figure S4), and the scale‐free fit index for that data is above 0.9.

3.2 | Coexpressed genetic modules

As a result of the coexpression network analysis of the GSE40419 data, we identified eight genetic modules of different sizes, defined by the number of genes they contain. A heatmap of the TOM among all genes is shown in Figure S5. The heatmap can be used to visualize a weighted network. In that plot, the rows and columns are single genes, and the figure displays adjacencies or topological overlaps. The strength of the correlations between the genes in the genetic modules is represented by the intensity of the red color in the heatmap. Light and dark colors represent low and high adjacencies or overlaps, respectively. The figure shows that the most highly coexpressed genes are in the Black, Yellow, and Turquoise modules, followed by the Red and Brown modules.

Genes that were not assigned to any module form the Grey module. The hierarchical clustering dendrogram of the selected genes in that data, with the modules obtained after cutting the branches of the tree, is shown in Figure S6. According to our findings, we have a Black module with 28 genes, a Blue module with 70 genes, a Brown module with 46 genes, a Grey module with 3417 genes, a Red module with 31 genes, a Turquoise module with 339 genes, and a Yellow module with 36 genes. The Grey module, which contains genes not assigned to any coexpressed genetic module, should not be considered for further analyses.

For the other GEO dataset, with accession number GSE84339, we obtained five genetic modules. The hierarchical clustering of the selected genes is shown in Figure S7. According to the results of that analysis, genes are distributed among modules with 1776, 1263, 584, 324, and 53 members. The results from the TCGA data indicated 9 genetic modules with 1636, 903, 717, 349, 208, 103, 46, 24, and 14 members, as shown in Figure S8.

3.3 | Candidate new genes in association with NSCLC

As the next step after module verification, we looked for modules in the GSE40419 data containing significant genes that were previously reported in association with lung cancer. We then looked for those findings in the other datasets, including the GSE84339 from the GEO and the TCGA data. We identified that, in the GSE40419 data, the Red module contains the CYP2A6,²⁷ CYP4Z2P,²⁸ FAM196B,²⁹ FAM192A,³⁰ KCNT1,³¹ and MLLT11³² genes, and the Turquoise module contains the C7orf57,³³ CAPSL,³⁴ CCDC78,³⁵ CD300LG,³⁶ CYP2F1,³⁷ DLEC1,³⁸ DNAH6,³⁹ FAM92B,⁴⁰ LEFTY2,⁴¹ PI16,⁴² PPM1E,⁴³ PZP,⁴⁴ ROPN1L,⁴⁵ RSPH1,⁴⁶ SCGB1A1,⁴⁷ SCGB3A1,⁴⁸ SPAG6,⁴⁹ and TACR2⁵⁰ genes, all previously reported in association with lung cancer.

We then examined the other members of the Red and Turquoise modules of the GSE40419 data, together with their molecular functions and biological processes, to see whether any other gene could be a suitable candidate for involvement in the initiation or development of the NSCLC. As a result, we found the MLLT11, FAM19A2, RP1L1, CYP4Z2P, CLCA2, BPIFB1, ISL1, EBF3, MKX, HSF5, ESRRG, and RASD1 candidate genes as new genetic findings in association with the NSCLC.

We then looked for the results of this step in the other datasets (GSE84339 from the GEO and the TCGA data). As a result, we identified that the MLLT11, FAM19A2, RP1L1, CYP4Z2P, ISL1, EBF3, MKX, ESRRG, and RASD1 genes were also detected in the modules of the other datasets.

Among the mentioned genes, we identified that the FAM196B, MLLT11, and RP1L1 genes in the Red module of the GSE40419 data were also included in modules created from the GSE84339 data. Among the reported genes in the Turquoise module of the GSE40419 data, the ISL1, EBF3, MKX, and ESRRG genes are present only in the GSE84339 dataset from the GEO, and the RASD1 gene only in the TCGA data. We also found that the CYP4Z2P gene in the Turquoise module of the GSE40419 data appears in modules created from both the GSE84339 data and the TCGA data. Those results are shown in detail in Table 1. The first column of that table lists the genes from the Red or the Turquoise module of the GSE40419 data that have significant associations with the NSCLC. The second column indicates the other datasets used in this study that confirmed the results obtained from the GSE40419 dataset. The third and fourth columns show the molecular functions and the biological processes and pathways of those genes that may have some effect on the NSCLC.
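The cross-dataset check summarized in the second column of Table 1 amounts to a simple membership filter. The sketch below transcribes that column by hand and lists the candidate genes seen in at least one dataset other than GSE40419; the variable and function names are hypothetical.

```python
# "Datasets" column of Table 1, transcribed by hand.
candidate_datasets = {
    "MLLT11": ["GSE40419", "GSE84339"],
    "FAM19A2": ["GSE40419", "GSE84339"],
    "RP1L1": ["GSE40419", "GSE84339"],
    "CYP4Z2P": ["GSE40419", "GSE84339", "TCGA"],
    "CLCA2": ["GSE40419"],
    "BPIFB1": ["GSE40419"],
    "ISL1": ["GSE40419", "GSE84339"],
    "EBF3": ["GSE40419", "GSE84339"],
    "MKX": ["GSE40419", "GSE84339"],
    "HSF5": ["GSE40419"],
    "ESRRG": ["GSE40419", "GSE84339"],
    "RASD1": ["GSE40419", "TCGA"],
}

def replicated_genes(datasets_by_gene, discovery="GSE40419"):
    """Genes detected in at least one dataset besides the discovery dataset."""
    return [
        gene
        for gene, datasets in datasets_by_gene.items()
        if any(name != discovery for name in datasets)
    ]

replicated = replicated_genes(candidate_datasets)
```

The filter leaves out CLCA2, BPIFB1, and HSF5, which appear only in the GSE40419 data.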

The existing relationships between the new candidate genes mentioned in this section and human cancers other than lung cancer are also given in the sixth column of Table 1.^32,51,57-64

The main source we used for this part was the DisGeNET database.⁶⁸ Comparing the diseases associated with the new candidate genes can help in selecting genes to be checked experimentally to prove their association with the NSCLC. Then, we looked at the expression of those candidate genes in different cell lines of the lung. We report the results of this part in the fifth column of Table 1. The main sources we used for this part of our analysis were the UCSC genome browser²² and the Encyclopedia of DNA Elements (ENCODE)⁶⁷ databases. As indicated in the results of this part, the expression level of the BPIFB1 gene in the lung is very high, and the expression of the HSF5 gene in the lung and the bronchus‐fibroblast‐of‐lung is relatively high. We also found that the MLLT11, FAM19A2, RP1L1, MKX, ESRRG, and RASD1 genes are expressed in the fibroblast of the lung, but their expression levels are low. Based on the results of this part, we could propose some novel candidate genes, reported in Table 1, that have some association with the NSCLC; these are subject to future experimental examination and validation in mouse models before they can be used clinically in humans.

3.4 | Biological pathways and newly candidate transcription factors

For each identified module of the GSE40419 data, considering the module‐trait associations, we characterized the module with respect to the smoking status and the stages of the NSCLC. As shown in Figure 1, the Red and Turquoise modules are related to the beginning of the NSCLC, but through different pathways. Using the PANTHER database, we found that the Red module is mostly involved in the Nicotine degradation pathway,⁶⁹ and the Turquoise module in P53 pathway feedback loops 2,⁷⁰ Inflammation mediated by chemokine and cytokine signaling pathway,⁷¹ Enkephalin release, Gonadotropin‐releasing hormone receptor pathway,⁷² and Opioid proenkephalin pathway.⁷³

Our results also indicated that the Brown module of the GSE40419 data has some important genes that are activated by smoking, and we then looked for their biological pathways and processes. We identified that this module contains important genes involved in the CCKR signaling map, Plasminogen activating cascade, Apoptosis signaling pathway,^74,75 Blood coagulation, P53 pathway, Interleukin signaling pathway,^76-78 Insulin/IGF pathways,⁷⁹ Inflammation mediated by chemokine and cytokine signaling pathway,^80,81 and Gonadotropin‐releasing hormone receptor pathway.⁸²

We also concluded that some genes inside the Black module of the GSE40419 data are related to cancer progression and to smoking, which can activate their negative functions. We then looked for important genes inside that module. This genetic module was found to be mostly involved in the Lysine biosynthesis and Synaptic vesicle trafficking pathways.⁸³ Our results also indicated that the Green and Yellow modules of the GSE40419 data are stage‐specific: they contain genes that are specific to stage 1 of the NSCLC. We found that the Green module contains key genes involved in the EGF receptor signaling pathway,⁸⁴ and the Yellow module genes are involved in the FGF signaling pathway, the P53 pathway, and the Metabotropic glutamate receptor group II pathway.^77,85,86 The results of this part are summarized in Table 2.

We then looked for TFs in those modules that can increase cancer risk. Our results indicate that the Brown module of the GSE40419 data has NR4A3, the Yellow module has POU5F1B, and the Turquoise module has ISL1, EBF3, MKX, HSF5, and ESRRG as important TFs. The newly identified TFs associated with the NSCLC, its stages, and the smoking status are summarized in Table 3. We then checked our findings in other datasets, including the GSE84339 from the GEO and the


TABLE 1 Newly candidate genes in association with the NSCLC

| Gene, chromosome | Datasets | Molecular function | Biological processes, pathways | Expression level in lung cell lines | Associated human diseases |
| --- | --- | --- | --- | --- | --- |
| MLLT11, 1 | GSE40419, GSE84339 | Protein binding | Extrinsic apoptotic signaling pathway, positive regulation of the apoptotic process, positive regulation of mitochondrial depolarization, DNA‐templated | Low expression level in fibroblast‐of‐lung and NCI‐H460 | Leukemia, mammary neoplasm, breast carcinoma^32,51 |
| FAM19A2, 12 | GSE40419, GSE84339 | | | Low expression level in fibroblast‐of‐lung and NCI‐H460 | Tobacco use disorder, mental impairment^52 |
| RP1L1, 8 | GSE40419, GSE84339 | | Axoneme assembly, intracellular signal transduction, photoreceptor cell development, photoreceptor cell maintenance, retina development in the camera‐type eye, visual perception | Low expression level in fibroblast‐of‐lung | Occult macular dystrophy, retinal disease, retinitis pigmentosa, age‐related macular degeneration^53 |
| CYP4Z2P, 1 | GSE40419, GSE84339, TCGA | Heme binding, iron ion binding, oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen | Oxidation‐reduction process | No expression in lung cell lines | Breast carcinoma, malignant neoplasm of breast, tumor angiogenesis^54 |
| CLCA2, 1 | GSE40419 | Chloride channel activity, intracellular calcium activated chloride channel activity, ligand‐gated ion channel activity, metal ion binding, metalloendopeptidase activity | Cell adhesion, chloride transmembrane transport, ion transmembrane transport, proteolysis | No expression in lung cell lines | Malignant neoplasm of breast, breast carcinoma, neoplasm metastasis, mammary neoplasm, prostate carcinoma^55,56 |
| BPIFB1, 20 | GSE40419 | Lipid binding, molecular function | The antimicrobial humoral response, innate immune response in the mucosa, negative regulation of toll‐like receptor 4 signaling pathway | Highly expressed in the lung | Nasopharyngeal carcinoma, primary malignant neoplasm of lung, carcinoma of the lung, pulmonary cystic fibrosis, malignant neoplasm of lung, lung diseases, interstitial^57,58 |
| ISL1, 5 | GSE40419, GSE84339 | RNA polymerase II activating transcription factor binding, bHLH transcription factor binding, enhancer sequence‐specific DNA binding, estrogen receptor binding, ligand‐dependent nuclear receptor binding, metal ion binding, promoter‐specific chromatin binding | Atrial septum morphogenesis, axon regeneration, cardiac cell fate determination, cellular response to glucocorticoid stimulus, endocardial cushion morphogenesis, heart development, innervation, mesenchymal cell differentiation, negative regulation of canonical Wnt signalling pathway, negative regulation of epithelial cell proliferation, negative regulation of inflammatory response, negative regulation of intracellular estrogen receptor signalling pathway, outflow tract morphogenesis, pancreas development, pharyngeal system development | No expression in lung cell lines | Bladder exstrophy, bipolar disorder, bladder neoplasm, recurrent urinary tract infection, heart disease^59,60 |
| EBF3, 10 | GSE40419, GSE84339 | RNA polymerase II proximal promoter sequence‐specific DNA binding, metal ion binding, protein dimerization | Multicellular organism development, positive regulation of transcription by RNA polymerase II, DNA‐templated | No expression in lung cell lines | Squamous cell carcinoma, Alzheimer’s disease, pediatric acute lymphoblastic leukemia, degenerative polyarthritis^61 |
| MKX, 10 | GSE40419, GSE84339 | Sequence‐specific DNA binding | Muscle organ development, regulation of transcription, DNA‐templated | Low expression level in fibroblast‐of‐lung | |
| HSF5, 17 | GSE40419 | DNA binding transcription factor activity, sequence‐specific DNA binding | Regulation of transcription, DNA‐templated | Almost highly expressed in the lung and bronchus‐fibroblast‐of‐lung | Melanoma^62 |
| ESRRG, 1 | GSE40419, GSE84339 | AF‐2 domain binding, RNA polymerase II regulatory region sequence‐specific DNA binding, retinoic acid receptor activity, steroid hormone receptor activity, zinc ion binding | Positive regulation of transcription by RNA polymerase II, DNA‐templated, retinoic acid receptor signaling pathway, steroid hormone mediated signaling pathway | Low expression level in the lung and fibroblast‐of‐lung | Malignant neoplasm of breast, adenocarcinoma, anoxia, colorectal cancer, tobacco use disorder, breast carcinoma^63 |
| RASD1, 17 | GSE40419, TCGA | GTP binding, GTPase activity, protein binding | G‐protein coupled receptor signaling pathway, negative regulation of transcription, DNA‐templated, nitric oxide mediated signal transduction | Low expression level in the lung and fibroblast‐of‐lung | Prostatic neoplasms, Aicardi‐Goutieres syndrome, diabetes, malignant neoplasm of breast, mammary tumorigenesis^64 |

Note: Newly candidate genes in the Red and Turquoise modules of the GSE40419 data having an association with NSCLC are identified using coexpression network analysis and some biological databases like Gene Ontology (GO)^65,66 for enrichment analysis. Those results are examined using the GSE84339 data from the GEO and the TCGA data and are indicated in this table. The expression levels of candidate genes in the lung cell lines have been reported using ENCODE database^67 and also their association with some human diseases using DisGeNET database.^68
Abbreviation: NSCLC, non–small cell lung cancer.


FIGURE 1 Module‐trait association. The figure illustrates the association between the genetic modules and some clinical information from the GSE40419 data, including the smoking status and the cancer stages. The intensity of the red and green colors represents positive and negative correlations, respectively.

TABLE 2 Module pathways. Genetic modules created from a coexpression network analysis of the GSE40419 dataset, biological pathways which are more related to those genetic modules and their association with the clinical traits

| Module | Pathway | Clinical traits |
| --- | --- | --- |
| Red | Nicotine degradation pathway | Beginning of lung adenocarcinoma |
| Turquoise | P53 pathway feedback loops 2, inflammation mediated by chemokine and cytokine signaling pathway, enkephalin release, gonadotropin‐releasing hormone receptor pathway, opioid proenkephalin pathway | Beginning of lung adenocarcinoma |
| Brown | CCKR signaling map, plasminogen activating cascade, apoptosis signaling pathway, blood coagulation, P53 pathway, interleukin signaling pathway, insulin/IGF pathways, inflammation mediated by chemokine and cytokine signaling pathway, gonadotropin‐releasing hormone receptor pathway | Activated by smoking |
| Black | Lysine biosynthesis and synaptic vesicle trafficking pathways | Cancer progression, activated by smoking |
| Green | EGF receptor signaling pathway | Beginning of lung adenocarcinoma |
| Yellow | FGF signaling pathway, P53 pathway and metabotropic glutamate receptor group II pathway | Beginning of lung adenocarcinoma |


TABLE 3 Newly candidate TFs in relation with the NSCLC. TFs from analyzing the GSE40419 data that can be new candidates for the NSCLC, their related modules and biological pathways they involve

| Transcription factor | Module, database | Biological pathway | Interacted proteins | Important proteins in association with lung diseases |
| --- | --- | --- | --- | --- |
| NR4A3 | Brown (GSE40419) | Axon guidance, cellular respiration, cellular response to catecholamine stimulus, cellular response to corticotropin‐releasing hormone stimulus, cellular response to leptin stimulus, common myeloid progenitor cell proliferation, gastrulation, hippocampus development, mast cell degranulation, mesoderm formation, positive regulation of cell cycle, positive regulation of epithelial cell proliferation, positive regulation of glucose transmembrane transport | STX17, TAF15, HIVEP1, PSMD2, PSMC1, EXT1, TCF12, TFG, EWSR1, FUS | PSMD2 (malignant neoplasm of the lung) |
| POU5F1B | Yellow (GSE40419) | Regulation of transcription, DNA‐templated | PRDM14, TMEM75, FAM84B, TCF7L2, HLA‐C | TCF7L2 (adenocarcinoma) |
| ISL1 | Turquoise (GSE40419), Blue (GSE84339) | Atrial septum morphogenesis, axon regeneration, cardiac cell fate determination, cellular response to glucocorticoid stimulus, endocardial cushion morphogenesis, heart development, innervation, mesenchymal cell differentiation, negative regulation of canonical Wnt signalling pathway, negative regulation of epithelial cell proliferation, negative regulation of inflammatory response, negative regulation of intracellular estrogen receptor signalling pathway, outflow tract morphogenesis, pancreas development, pharyngeal system development | LDB1, GIP, LHX4, SSBP3, LDB2, SHH, PAX6, NKX2‐5, BMP4, SSBP2 | LDB1 (squamous cell carcinoma), SSBP3 (tobacco use disorder) |
| EBF3 | Turquoise (GSE40419), Blue (GSE84339) | Multicellular organism development, positive regulation of transcription by RNA polymerase II, DNA‐templated | MAPRE2, PRDM8, MAPRE1, PRDM1, PRDM6, PRDM5, PRDM4, ZNF423, TMTC1, MEN1 | MAPRE1 (primary malignant neoplasm of the lung) |
| MKX | Turquoise (GSE40419), Pink (GSE84339) | Muscle organ development, regulation of transcription, DNA‐templated | TCF15, SCXA, SCXB, COL1A1, PBX4, MYOD1, ARMC4, TNXB, TNMD | |
| HSF5 | Turquoise (GSE40419) | Regulation of transcription, DNA‐templated | NAP1L4, NAP1L6, NAP1L2, NAP1L3, NAP1L5, NAP1L1, CDC5L, ZSWIM6, ZSWIM5, ASF1B | CDC5L (adenocarcinoma), ZSWIM5 (tobacco use disorder) |
| ESRRG | Turquoise (GSE40419), Blue (GSE84339) | Positive regulation of transcription by RNA polymerase II, DNA‐templated, retinoic acid receptor signaling pathway, steroid hormone mediated signaling pathway | NCOA1, MED1, PNRC2, TLE1, NRIP1, PPARGC1A, ENSG00000215700, SLC7A1, NCOR1, NCOR2 | TLE1 (malignant neoplasm of the lung) |

Note: Interacted proteins with the mentioned TFs have been indicated using STRING (Jensen et al 2009) database and also their association with some human diseases using DisGeNET database.^68
Abbreviation: NSCLC, non–small cell lung cancer.

TCGA dataset. As shown in the second column of Table 3, some of the genetic findings at this stage were also reproduced in other datasets. As indicated in that table, MMP3,⁸⁷ FGF18,⁸⁸ ISL1,⁸⁹ EBF3,⁹⁰ and MKX⁹¹ were reproduced and confirmed in other datasets.

Given the importance and validation of our results, we propose some important TFs associated with the NSCLC as candidates to be checked experimentally in laboratory models before any clinical use. Next, for these TFs, we looked for the classes and groups of genes that they could affect. To that end, the proteins interacting with the TFs were identified using the STRING database,⁹² and the results are listed in the fourth column of Table 3. We then examined the relationships between these interacting proteins and human diseases as reported in the DisGeNET database.⁶⁸
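The cross‐referencing step described above can be sketched as a simple lookup. The following is a minimal illustration only, not the procedure used in this study: the partner lists and disease labels are copied from three rows of Table 3, while the dictionary and function names are hypothetical and no database is queried.

```python
# Interaction partners per TF, copied from Table 3 (STRING column).
# The variable names are hypothetical and chosen only for this sketch.
string_partners = {
    "NR4A3": ["STX17", "TAF15", "HIVEP1", "PSMD2", "PSMC1",
              "EXT1", "TCF12", "TFG", "EWSR1", "FUS"],
    "POU5F1B": ["PRDM14", "TMEM75", "FAM84B", "TCF7L2", "HLA-C"],
    "HSF5": ["NAP1L4", "NAP1L6", "NAP1L2", "NAP1L3", "NAP1L5",
             "NAP1L1", "CDC5L", "ZSWIM6", "ZSWIM5", "ASF1B"],
}

# Disease labels per protein, copied from the last column of Table 3.
disease_annotations = {
    "PSMD2": "malignant neoplasm of the lung",
    "TCF7L2": "adenocarcinoma",
    "CDC5L": "adenocarcinoma",
    "ZSWIM5": "tobacco use disorder",
}


def annotate_partners(partners_by_tf, annotations):
    """For each TF, keep only the partners that carry a disease annotation."""
    return {
        tf: {p: annotations[p] for p in partners if p in annotations}
        for tf, partners in partners_by_tf.items()
    }


annotated = annotate_partners(string_partners, disease_annotations)
for tf, hits in annotated.items():
    print(tf, hits)
```

Keeping the two sources as separate mappings mirrors the two‐step description in the text: partners are collected first and only then filtered by disease association.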

3.5 | Metastatic and other important genes in NSCLC

As a result of our analyses, we found one important metastatic gene, FGF18, in the Yellow module of the GSE40419 data. Some genes with important roles in the metastatic stage of the NSCLC can also drive metastasis in breast cancer, as reported in the MammaPrint database, which contains important genes associated with breast cancer. As reported previously, the Yellow module is more involved in the beginning stage of the NSCLC. This result, in combination with the metastatic genes found in that module, indicates that metastatic genes, which can spread lung cancer to other parts of the body, already have important roles at the beginning of cancer. This suggests that gene‐based treatments may be relevant from the beginning stage of cancer.

In the next step, searching the coxpresdb.jp database²¹ for important genes associated with the NSCLC, we found that the RP1L1 gene in the Red module is coexpressed with genes involved in cancer pathways. We also found that the FGF18 gene in the Yellow module is coexpressed with genes in cancer pathways. These findings were also observed in the coexpression network analyses of other datasets (GSE84339 from the GEO and the TCGA data) and can be used to propose new candidate genes associated with the NSCLC, subject to further analysis for gene therapies. Our results also showed that the Black module of the GSE40419 data contains an important gene, TEPP, which is coexpressed with MMP. This finding was also observed in the network analyses of the GSE84339 dataset from the GEO and of the TCGA data. In the Brown module, we found MMPs including MMP25 and MMP3. As reported previously, the Brown and Black modules were more strongly related to the smoking status. Therefore, our results illustrate new candidate genes associated with the NSCLC, particularly with respect to the smoking status. The results of this part are also shown in Table 4 and are subject to further experimental validation in laboratory models, such as the mouse, before they can be used in human clinical applications.

TABLE 4 Newly candidate metastatic genes in association with the NSCLC

| Metastatic gene | Databases | Expression level in lung cell lines | Associated diseases |
| --- | --- | --- | --- |
| FGF18 | GSE40419, GSE84339, TCGA | Relatively low expression level in the lung and fibroblast‐of‐lung | Malignant mesothelioma, polycystic ovary syndrome, diaphragmatic hernia, cleft lip, colonic neoplasms |
| RP1L1 | GSE40419, GSE84339 | Very low expression level in fibroblast‐of‐lung | Occult macular dystrophy, retinal diseases, retinitis pigmentosa, age‐related macular degeneration, hereditary macular dystrophy |
| CYP4Z2P | GSE40419 | No expression in lung cell lines | Breast carcinoma, malignant neoplasm of breast, tumor angiogenesis |
| TEPP | GSE40419, GSE84339 | Relatively low expression level in the lung and fibroblast‐of‐lung | Crohn disease, inflammatory bowel diseases |
| MMP25 | GSE40419 | Intermediate expression level in the lung | Colonic neoplasms, chronic hepatitis C, cleft lip, cleft palate, liver cirrhosis, multiple sclerosis, amelogenesis imperfecta, colon carcinoma, odontogenic tumors |
| MMP3 | GSE40419, GSE84339, TCGA | Relatively low expression level in fibroblast‐of‐lung | Experimental arthritis, hyperalgesia, coronary artery disease, mammary neoplasms, astrocytoma, bipolar disorder, schizophrenia, coronary restenosis |

Note: This table reports the metastatic genes resulting from our coexpression network analysis of the GSE40419 data, together with their expression levels in lung cell lines and the human diseases associated with them. The other datasets in which these findings were validated are also listed.

Abbreviation: NSCLC, non–small cell lung cancer.

The expression levels of the genes reported in this section were also retrieved for lung cell lines from the ENCODE database.⁶⁷ As indicated in Table 4, the MMP25 gene has an intermediate expression level in the lung, whereas the FGF18, TEPP, and MMP3 genes have relatively low expression levels in the lung or the fibroblast‐of‐lung. We also found that the RP1L1 gene has a very low expression level in the fibroblast‐of‐lung. We then looked for diseases associated with the genes reported in this section using the DisGeNET database.⁶⁸ Given their association with the NSCLC, the metastatic genes reported in this part, including MMP25, FGF18, TEPP, MMP3, and RP1L1, are candidates for future experimental validation, initially in mouse models, before any clinical use in humans.
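The coexpression relationships discussed in this section rest on pairwise correlation between gene expression profiles. As a minimal sketch, assuming an unsigned WGCNA‐style adjacency (the absolute Pearson correlation raised to a soft‐thresholding power), the idea can be illustrated on synthetic data. The power of 6, the gene names, and the simulated profiles are assumptions for illustration and are not taken from this study.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def adjacency(x, y, beta=6):
    """Unsigned soft-threshold adjacency: |cor(x, y)| ** beta.

    beta = 6 is an assumed value for this sketch, not a setting from the paper.
    """
    return abs(pearson(x, y)) ** beta


# Synthetic profiles over 50 samples (purely illustrative).
random.seed(0)
shared = [random.gauss(0, 1) for _ in range(50)]           # common driver
gene_a = [s + 0.1 * random.gauss(0, 1) for s in shared]    # follows the driver
gene_b = [s + 0.1 * random.gauss(0, 1) for s in shared]    # follows the driver
gene_c = [random.gauss(0, 1) for _ in range(50)]           # unrelated gene

print("A-B adjacency:", adjacency(gene_a, gene_b))
print("A-C adjacency:", adjacency(gene_a, gene_c))
```

Raising the correlation to a power keeps strongly coexpressed pairs close to 1 while pushing weak, possibly spurious correlations toward 0, which is why tightly coexpressed pairs stand out in such a network.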

4 | CONCLUSION

In summary, we identified genetic modules containing genes with similar expression levels using RNA‐Seq data from Korean patients (GSE40419). We also analyzed the GSE84339 data from the GEO and the TCGA data using coexpression network analysis. From these modules, we identified novel genes and TFs associated with NSCLC. We also used clinical information on smoking status and NSCLC stage and showed how it relates to the genetic modules. Based on these relationships between the genetic modules and the clinical information, we identified new genes specific to the smoking status and to different stages of NSCLC. For example, our results highlight important genes related to the beginning of cancer or to cancer metastasis. To further validate our results, we then interpreted our genetic findings biologically using different databases. Next, to nominate specific results of this analysis for experimental validation, we reported the expression level of each candidate gene in lung cell lines and, on that basis, sought to further explain our genetic findings biologically.

The results of this analysis provide new genetic factors associated with NSCLC that may serve as candidates for novel targeted therapies, an effective therapeutic option for this kind of cancer.
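The relationship between a genetic module and a clinical trait can be sketched as follows, as a minimal illustration under stated assumptions rather than the pipeline used in this study. The module is summarized by its eigengene (the first principal component of the standardized expression of its genes, here obtained by power iteration), and the eigengene is then correlated with the trait. The data are synthetic, the 0/1 coding of smoking status is hypothetical, and all function names are chosen for this sketch.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))


def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def module_eigengene(genes, n_iter=100):
    """First principal component scores (one per sample) of a module.

    genes: list of per-gene expression lists measured on the same samples.
    """
    z = [standardize(g) for g in genes]                  # genes x samples
    n_genes, n_samples = len(z), len(z[0])
    loadings = [1.0 / math.sqrt(n_genes)] * n_genes      # starting vector

    def scores_for(v):
        return [sum(v[g] * z[g][s] for g in range(n_genes))
                for s in range(n_samples)]

    for _ in range(n_iter):                              # power iteration
        scores = scores_for(loadings)
        loadings = [sum(z[g][s] * scores[s] for s in range(n_samples))
                    for g in range(n_genes)]
        norm = math.sqrt(sum(c * c for c in loadings))
        loadings = [c / norm for c in loadings]

    scores = scores_for(loadings)
    # Orient the eigengene so that it follows the average module expression.
    mean_profile = [sum(z[g][s] for g in range(n_genes)) / n_genes
                    for s in range(n_samples)]
    if pearson(scores, mean_profile) < 0:
        scores = [-x for x in scores]
    return scores


# Synthetic example: 40 samples, hypothetical coding 0 = nonsmoker, 1 = smoker.
random.seed(1)
smoking_status = [0] * 20 + [1] * 20
module_genes = [[2.0 * t + 0.5 * random.gauss(0, 1) for t in smoking_status]
                for _ in range(5)]                       # 5 trait-driven genes

eigengene = module_eigengene(module_genes)
module_trait_r = pearson(eigengene, smoking_status)
print("module-trait correlation:", module_trait_r)
```

Summarizing each module by a single profile reduces the comparison with clinical information to one correlation per module and trait, which keeps the number of tests small compared with testing every gene separately.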

CONFLICTS OF INTERESTS

The authors declare that there is no conflict of interests.

ORCID

Sayed Mostafa Hosseini http://orcid.org/0000-0003-3716-8133

REFERENCES

1. Landi MT, Dracheva T, Rotunno M, et al. Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PLoS One. 2008;3(2):e1651.

2. Okazaki I, Ishikawa S, Ando W, Sohara Y. Lung adenocarcinoma in never smokers: problems of primary prevention from aspects of susceptible genes and carcinogens. Anticancer Res. 2016;36(12):6207‐6224.

3. Okazaki I, Ishikawa S, Sohara Y. Genes associated with susceptibility to lung adenocarcinoma among never smokers suggest the mechanism of disease. Anticancer Res. 2014;34(10):5229‐5240.

4. Thu KL, Vucic EA, Chari R, et al. Lung adenocarcinoma of never smokers and smokers harbor differential regions of genetic alteration and exhibit different levels of genomic instability. PLoS One. 2012;7(3):e33003.

5. Sørensen JB, Hansen HH. Chemotherapy in adenocarcinoma of the lung. Cancer Surv. 1989;8(3):671‐679.

6. Calvayrac O, Pradines A, Pons E, Mazières J, Guibert N. Molecular biomarkers for lung adenocarcinoma. Eur Respir J. 2017;49:1‐17. https://doi.org/10.1183/13993003.01734‐2016

7. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. Systematic determination of genetic network architecture. Nat Genet. 1999;22(3):281‐285.

8. Chan BA, Hughes BGM. Targeted therapy for non‐small cell lung cancer: current standards and the promise of the future. Transl Lung Cancer Res. 2015;4(17):36‐54. https://doi.org/10.3978/j.issn.2218‐6751.2014.05.01

9. Pillai RN, Ramalingam SS. Advances in the diagnosis and treatment of non‐small cell lung cancer. Mol Cancer Ther. 2014;13:557‐565. https://doi.org/10.1158/1535‐7163.MCT‐13‐0669

10. Bianchi F, Nuciforo P, Vecchi M, et al. Survival prediction of stage I lung adenocarcinomas by expression of 10 genes. J Clin Invest. 2007;117(11):3436‐3444. https://doi.org/10.1172/JCI32007

11. Shao W, Wang D, He J. The role of gene expression profiling in early‐stage non‐small cell lung cancer. J Thorac Dis. 2010;2:89‐99.

12. Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature. 2009;458(7239):719‐724. https://doi.org/10.1038/nature07943

13. Li Y, Xiao X, Ji X, Liu B, Amos CI. RNA‐seq analysis of lung adenocarcinomas reveals different gene expression profiles between smoking and nonsmoking patients. Tumour Biol. 2015;36:8993‐9003. https://doi.org/10.1007/s13277‐015‐3576‐y

14. Giulietti M, Occhipinti G, Principato G, Piva F. Weighted gene co‐expression network analysis reveals key genes involved in pancreatic ductal adenocarcinoma development. Cell Oncol. 2016;39:379‐388. https://doi.org/10.1007/s13402‐016‐0283‐7

15. Yuan L, Chen L, Qian K, Qian G, Wu C‐l, Wang X. Genomics data co‐expression network analysis identified six hub genes in association with progression and prognosis in human clear cell renal cell carcinoma (ccRCC). Genom Data. 2017;14:132‐140. https://doi.org/10.1016/j.gdata.2017.10.006