Information Sciences
A new hybrid ensemble feature selection framework for machine learning-based phishing detection system
Kang Leng Chiew a, Choon Lin Tan a,∗, KokSheik Wong b, Kelvin S.C. Yong c, Wei King Tiong a
a Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, Kota Samarahan, Sarawak 94300, Malaysia
b School of Information Technology, Monash University Malaysia, Bandar Sunway, Selangor 47500, Malaysia
c Department of Electrical and Computer Engineering, Faculty of Engineering and Science, Curtin University, CDT 250, Miri, Sarawak 98009, Malaysia
Article history: Received 25 March 2018; Revised 22 January 2019; Accepted 24 January 2019; Available online 24 January 2019
Keywords: Phishing detection; Feature selection; Machine learning; Ensemble-based; Classification; Phishing dataset
Abstract
This paper proposes a new feature selection framework for machine learning-based phishing detection systems, called the Hybrid Ensemble Feature Selection (HEFS). In the first phase of HEFS, a novel Cumulative Distribution Function gradient (CDF-g) algorithm is exploited to produce primary feature subsets, which are then fed into a data perturbation ensemble to yield secondary feature subsets. The second phase derives a set of baseline features from the secondary feature subsets by using a function perturbation ensemble.
The overall experimental results suggest that HEFS performs best when it is integrated with the Random Forest classifier, where the baseline features correctly distinguish 94.6% of phishing and legitimate websites using only 20.8% of the original features. In another experiment, the baseline features (10 in total) utilised on Random Forest outperform the set of all features (48 in total) used on SVM, Naive Bayes, C4.5, JRip, and PART classifiers. HEFS also shows promising results when benchmarked using another well-known phishing dataset from the University of California Irvine (UCI) repository. Hence, HEFS is a highly desirable and practical feature selection technique for machine learning-based phishing detection systems.
© 2019 Elsevier Inc. All rights reserved.
1. Introduction
Phishing is a form of cyber-attack that utilises counterfeit websites to steal sensitive user information such as account login credentials, credit card numbers, etc. Throughout the world, phishing attacks continue to evolve and gain momentum. In June 2018, the Anti-Phishing Working Group (APWG) reported as many as 51,401 unique phishing websites [3]. Another report by RSA estimated that global organisations suffered losses amounting to $9 billion due to phishing incidents in 2016 [5]. These statistics show that the existing anti-phishing solutions and efforts are not truly effective.
The most widely deployed anti-phishing solution is the blacklist warning system, found in conventional web browsers such as Google Chrome and Mozilla Firefox. The blacklist system queries a central database of already-known phishing URLs, thus it is unable to detect newly launched phishing websites [20,29]. A more promising and intelligent anti-phishing
∗ Corresponding author.
E-mail addresses: [email protected] (K.L. Chiew), [email protected] (C.L. Tan), [email protected] (K. Wong), [email protected] (K.S.C. Yong), [email protected] (W.K. Tiong).
https://doi.org/10.1016/j.ins.2019.01.064
solution is the machine learning-based detection system. This technique capitalises on a large number of indicators, more commonly known as features, giving it the flexibility to recognise new phishing websites.
The accuracy of a machine learning-based phishing detection system depends primarily on the selected features. Most anti-phishing researchers focus on proposing novel features or optimising classification algorithms [13,16], such that developing a proper feature selection and analysis technique is not their main agenda. As such, irrelevant features may still exist, increasing the cost (i.e., storage, power, training time, etc.) of a technique without contributing to the overall accuracy.
Thus, a systematic feature selection framework is needed to identify a compact set of features that are truly effective.
In general, there are two main techniques for selecting features, namely, the wrapper method and the filter measure. The wrapper method runs iteratively, where each run involves generating a feature subset and evaluating it via classification (i.e., training and testing). When evaluating a larger set of features, the number of iterations in the wrapper method scales up exponentially and becomes computationally impractical for real-time applications. On the other hand, filter measures are metrics computed from statistical and information theories, which can reflect the goodness of each individual feature without invoking any particular classifier [1,28]. Thus, in this study, we focus on filter measures, which are far less computationally intensive and hence more practical for feature selection.
Previous phishing detection studies that utilise filter measures for feature selection are incomplete. Filter measures, on their own, can only rank the features according to importance. Obtaining a reduced feature subset requires choosing a cut-off rank, where only the features positioned above this cut-off rank are selected. To the best of our knowledge, optimisation of the feature cut-off rank has never been explored before in the literature. There is no systematic framework to direct filter measures-based feature selection in identifying the optimal cut-off rank.
In this paper, a novel algorithm called Cumulative Distribution Function gradient (CDF-g) is proposed as an automatic feature cut-off rank identifier. The CDF-g algorithm exploits patterns in the distribution of filter measure values to determine the optimal cut-off rank, leading to a highly effective and compact set of baseline features. To further improve the baseline features' stability and reduce overfitting, we extended the CDF-g algorithm into a hybrid ensemble structure consisting of data perturbation and function perturbation techniques. Data perturbation entails subsampling the dataset into multiple partitions, while function perturbation involves applying different filter measures on the same dataset [19]. We termed the complete feature selection framework as the Hybrid Ensemble Feature Selection (HEFS).
This paper makes the following contributions:
(a) Exploiting a novel algorithm, i.e., CDF-g, within a hybrid ensemble structure to produce a stable and robust subset of features;
(b) Achieving competitive detection accuracy while using reduced machine processing resources;
(c) Eliminating the need for human intervention by using a fully computable technique to automatically select the most effective phishing features;
(d) Providing a flexible and robust feature selection framework that can be utilised universally on different datasets;
(e) Accelerating the development of accurate and efficient machine learning-based phishing detection systems by applying the proposed feature selection framework and the recommended best performing classifier;
(f) Prescribing the baseline features (i.e., best feature subset) as the de facto standard for anti-phishing researchers to benchmark newly proposed machine learning techniques, and;
(g) Constructing a large publicly accessible machine learning dataset that is highly relevant to most anti-phishing research.
The remainder of the paper is organised as follows: Section 2 highlights the motivations of this paper. Section 3 discusses related works and their limitations on feature selection with regards to phishing detection applications. Section 4 puts forward the proposed feature selection framework. Section 5 describes the experimental setup, presents the results, and discusses the major findings. Finally, Section 6 concludes this paper.
2. Motivation
Our work in this paper is driven by the following motivations:
(a) There is no existing framework to reliably determine the cut-off rank (i.e., cut-off point) for the features ranked by filter measures, thus researchers resort to the most convenient way, which is to arbitrarily select a cut-off rank (e.g., top-10, top-20, etc.) [1,21,28]. Obviously, such an arbitrary selection is not an optimised technique because it is prone to overstating or understating the cut-off rank. Hence, we are motivated to explore the optimisation of feature dimensionality.
(b) Different machine learning datasets vary in characteristics such as feature type, number of features, etc. Through our preliminary experiments, we found that the effectiveness of the existing feature selection techniques in [22] and [27] is dataset-dependent (i.e., not robust), and they have failed to identify the appropriate feature cut-off rank when applied on our dataset. Thus, we are inspired to study further the filter measure values, which led to the proposal of a flexible and intelligent algorithm called CDF-g.
(c) In phishing detection, there is no common agreement on which classifier is more preferable when the objective is to maximise detection accuracy. For example, a number of studies have suggested that the Random Forest classifier outperformed other classifiers [14,18,24,25]. However, some recent papers still continue to employ other classifiers (i.e., not Random Forest) in their machine learning-based phishing detection systems [7,8,15,23]. Hence, the question of "which classifier is the most predictive one" remains a huge subject of debate. For that, we are motivated to provide more insights into this issue by performing a large scale benchmarking, where six classifiers are tested with different feature subset combinations derived from various filter measure rankings.
3. Related works
In this section, we review the recent phishing detection techniques that are based on feature selection, and evaluate different classification models.
Toolan and Carthy [28] investigated the selection of features among 40 features gathered from prior techniques designed for the detection of spam and phishing. Their primary analysis involves ranking features across 3 different datasets using Information Gain (IG), with the aim of identifying a representative set of features. By performing an intersection (i.e., AND) function over the top-10 IG rankings of the three datasets, they identified 9 consistent features.
However, no justification is provided as to why they considered only the top-10 IG ranked features. In addition, the authors ran a C5.0 classifier over 3 feature sets with varying IG values, and showed that the feature sets with higher IG values perform better than those with lower IG values. Nevertheless, the study in [28] is incomplete, as it only assessed the performance of a single filter measure, namely IG, from a general perspective.
In Khonji et al.'s work [14], they benchmarked the performance of feature selection methods for phishing email classification. Several conventional feature evaluators are compared, which include filter measures (i.e., IG and Relief-F), Correlation-Based Feature Selection (CFS), and the wrapper method. Results showed that the wrapper method coupled with the best-first forward searching method outperforms feature subsets derived by IG and Relief-F, while CFS performs the worst among them. In addition, the experimental results also suggest that the Random Forest classifier is most predictive, and it consistently outperforms the C4.5 and SVM classifiers. However, as highlighted in Section 1, the wrapper method is computationally expensive, which prohibits its widespread usage in machine learning-based phishing detection. Therefore, current research direction on feature selection should focus more on filter measures, which are less intensive to compute.
A similar study is conducted by Basnet et al. [4], which evaluated only two common feature selection techniques, namely, CFS and the wrapper method. The authors tested two feature space searching techniques (i.e., genetic algorithm and greedy forward selection) on features derived from the webpage itself as well as third party sources such as search engines. Performance evaluation of the feature subsets is conducted using Naive Bayes, Logistic Regression and Random Forest classifiers.
Experimental results revealed that the wrapper method achieves higher detection accuracies when compared to CFS, which is consistent with the results reported in [14]. However, the authors acknowledged that the wrapper method is indeed computationally more intensive, thus preventing it from being a feasible approach in feature selection applications.
Qabajeh and Thabtah [21] assessed a total of 47 features for phishing email detection using IG, Chi-Square and CFS. For both IG and Chi-Square ranked features, the authors noticed a larger reduction of filter measure values which appeared between the 20th and 21st feature, and utilised this gap in the filter measure values as the cut-off rank to select the top 20 features as the reduced feature set. Results indicate that the detection accuracy remains stable when using the reduced feature set. However, it is unclear how to identify the cut-off rank from a computational point of view. Further tests are conducted using 12 common features derived by intersecting the feature sets of IG, Chi-Square and CFS, achieving a decrease of only 0.28% in average accuracy when compared to the full feature set. Thus, their study has demonstrated the merits of filter measures in reducing feature dimensionality without sacrificing classification accuracy.
Recently, Thabtah and Abdelhamid [27] exploited Chi-Square and IG to benchmark the features for phishing website detection. The authors suggested a more systematic way of identifying the cut-off ranks for features ranked by IG and Chi-Square. A threshold-based rule set is proposed, which defines the cut-off rank as two successive features having at least 50% difference in values of IG and Chi-Square. If a cut-off rank occurs below the recommended minimum value of the filter measure (i.e., 10.83 for Chi-Square and 0.01 for IG), it will be discarded. In another phishing detection work, Rajab [22] utilised the same rule set to determine cut-off ranks for their feature set evaluation. Nevertheless, the rule set proposed in [22] and [27] is not robust and may fail to identify the appropriate cut-off ranks for certain datasets when the reduction in filter measure values is more uniform. As such, the current filter measures-based feature selection studies in the phishing detection field lack a systematic approach to identify the optimal cut-off rank.
4. The proposed feature selection framework
Fig. 1 shows an overview of the proposed feature selection framework. For better understanding, we employ a top-down presentation approach, where the major components and processes in Fig. 1 are firstly introduced without going in depth on details of the related algorithm. At this stage, it is sufficient to treat the CDF-g component as a black box that functions as a feature cut-off rank identifier (a detailed explanation of the CDF-g component is presented in Section 4.2).
4.1. The hybrid ensemble strategy
In the field of feature selection, there are two major types of ensemble techniques, namely data perturbation and function perturbation. Data perturbation applies the same feature selection technique on multiple subsets of a dataset, while function perturbation applies multiple feature selection techniques on the same set of data. Combining both data perturbation and function perturbation results in a hybrid perturbation ensemble technique. A number of studies have suggested that the ensemble strategy could be a promising approach to improve classification performance and to obtain a more stable subset of features [6,19,21,28]. Recently, Hadi et al. [12] also explored the benefit of the dataset partitioning concept and agreed on its potential in reducing the complexity of the classifier. Thus, these studies have motivated us to propose the Hybrid Ensemble
Fig. 1. Overview of the proposed feature selection framework.
Feature Selection (HEFS) framework, which consists of two cycles, namely, the data perturbation cycle and the function perturbation cycle.
Formally, the proposed HEFS framework is described as follows. Let the j-th partition and k-th filter measure be denoted as P_j and FM_k, respectively, as shown in Fig. 1. We begin by describing the process flow of the data perturbation cycle. The full dataset (i.e., complete samples and complete features) is randomised before being divided by stratified subsampling into J partitions. Randomisation refers to the shuffling of the sample rows, and is utilised to ensure that the samples in each partition are more balanced.
For a dataset partition P_j, the filter measure FM_k computes the filter measure values based on the raw feature values in that partition. The filter measure values are sorted and passed to the next stage, where the CDF-g algorithm is then applied to produce the feature cut-off rank, τ_{j,k}. After repeating the aforementioned procedure for j = 1, 2, …, J, a list of feature cut-off ranks {τ_{1,k}, τ_{2,k}, …, τ_{J,k}} is created for each dataset partition by using filter measure FM_k. The mean of the feature cut-off ranks, i.e., τ̄_k, is obtained using Eq. (1), and referenced on {P_1, P_2, …, P_J} to generate an array of primary feature subsets {FS_{1,k}, FS_{2,k}, …, FS_{J,k}}. Specifically, each primary feature subset FS_{j,k} consists of the top-τ̄_k features in P_j sorted according to filter measure values. The data perturbation ensemble component applies an intersection operation, as shown in Eq. (2), on the array of primary feature subsets to produce a secondary feature subset FS_k. In this intersection operation, only features that are common among the primary feature subsets are retained. By considering the consensus of all partitions, it helps to discard any irregular features that may exist in certain partitions. A feature that is truly predictive is most guaranteed to exist in all dataset partitions, thus enabling it to be selected (through the intersection operation) as part of the secondary feature subset. This concludes a single data perturbation cycle.
\bar{\tau}_k = \frac{1}{J} \sum_{j=1}^{J} \tau_{j,k}. \quad (1)

FS_k = \bigcap_{j=1}^{J} FS_{j,k}. \quad (2)
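As a concrete sketch, one data perturbation cycle for a single filter measure can be expressed as a mean cut-off rank (Eq. (1)) followed by a set intersection (Eq. (2)). The rankings, cut-off ranks, and feature names below are illustrative placeholders, not values from the paper's experiments.

```python
# Sketch of one data perturbation cycle for a single filter measure FM_k.
# `rankings` holds, per partition P_j, the feature names sorted by
# descending filter measure value; `cutoffs` holds the per-partition
# cut-off ranks produced by CDF-g. All values here are illustrative.

def data_perturbation_ensemble(rankings, cutoffs):
    # Eq. (1): mean feature cut-off rank over the J partitions.
    tau_bar = round(sum(cutoffs) / len(cutoffs))
    # Primary feature subsets FS_{j,k}: top tau_bar features per partition.
    primary = [set(r[:tau_bar]) for r in rankings]
    # Eq. (2): intersection keeps only features common to all partitions,
    # discarding irregular features that appear in some partitions only.
    return set.intersection(*primary)

rankings = [
    ["PctExtHyperlinks", "NumDash", "NumDots", "UrlLength"],
    ["PctExtHyperlinks", "NumDash", "UrlLength", "NumDots"],
    ["NumDash", "PctExtHyperlinks", "NumDots", "AtSymbol"],
]
secondary = data_perturbation_ensemble(rankings, cutoffs=[2, 2, 2])
print(sorted(secondary))  # -> ['NumDash', 'PctExtHyperlinks']
```

A feature appearing in the top ranks of only some partitions (e.g. `NumDots` above) is dropped by the intersection, which is exactly the consensus behaviour described in the text.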
On the other hand, the function perturbation cycle can be defined as running the data perturbation cycle over K different filter measures, which produces an array of secondary feature subsets {FS_1, FS_2, …, FS_K} that is ready to be fed into the function perturbation ensemble component. The function perturbation ensemble component aggregates its inputs via a union operation, as shown in Eq. (3), to obtain the best feature subset, i.e., the baseline feature set. Through the function perturbation cycle, the intelligence of different filter measures can be leveraged, thus leading to a baseline feature set that is less susceptible to overfitting.
\text{Baseline feature set} = \bigcup_{k=1}^{K} FS_k. \quad (3)
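The function perturbation ensemble of Eq. (3) is then a plain union over the K secondary subsets. The subset contents below are illustrative placeholders, not the paper's intermediate results.

```python
# Eq. (3): the function perturbation ensemble takes the union of the K
# secondary feature subsets, one per filter measure, so that features
# favoured by any of the filter measures survive into the baseline set.

def function_perturbation_ensemble(secondary_subsets):
    return set().union(*secondary_subsets)

fs_ig = {"PctExtHyperlinks", "NumDash"}             # e.g. Information Gain
fs_chi = {"PctExtHyperlinks", "NumSensitiveWords"}  # e.g. Chi-Square
fs_su = {"NumDash", "SubmitInfoToEmail"}            # e.g. Symmetrical Uncertainty
baseline = function_perturbation_ensemble([fs_ig, fs_chi, fs_su])
print(sorted(baseline))
```

Note the asymmetry of the two cycles: intersection (Eq. (2)) enforces agreement across partitions, while union (Eq. (3)) pools the complementary strengths of the different filter measures.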
4.2. CDF-g algorithm — An automatic feature cut-off rank identifier
The goal of selecting a subset of features in machine learning-based phishing detection is to reduce the feature space without compromising the detection accuracy. In feature subset selection approaches that utilise filter measures, determining an optimal cut-off rank is crucial in achieving the aforementioned goal. The cut-off rank is defined as the specific threshold rank in an ordered list of filter measure values, where any features located beyond the cut-off rank are considered irrelevant and therefore discarded. Fig. 2 illustrates the cut-off rank τ, where the selected feature subset is surrounded by a discontinuous rectangle. As highlighted in Sections 2 and 3, the existing cut-off rank identification techniques [21,22,27] are found to be unstable and not robust. Hence, it is highly desirable to devise a new approach that is more flexible in determining the optimal cut-off rank. In this section, we present a novel algorithm named CDF-g, which aims at overcoming the limitations of existing feature cut-off rank identification techniques.
Fig. 2. Feature subset selection using cut-off rank.
4.2.1. CDF-g concepts and definitions
In this section, we provide a theoretical background on the Cumulative Distribution Function (CDF) and describe how it can be adapted in our proposed feature selection algorithm. Let X be a discrete random variable and x represent a possible value of the random variable. The probability that the discrete random variable X takes on a particular value x is typically called the probability density function or the frequency function, as expressed below:
P(X = x). \quad (4)

The CDF of the random variable X is defined as Eq. (5):

F_X(t) = P(X \leq t). \quad (5)

The notation F_X(t) is a function of t that represents the CDF for the random variable X.
Adapting the CDF in our feature selection context, the discrete random variable is assumed to represent the value of the filter measure computed for each feature. Thus, in this paper, we leveraged the CDF curve to visualise the cumulative distribution of filter measure values. By analysing the gradient changes in the CDF curve, we can identify plateau regions along the range of filter measure values. Plateau regions are the sections of a curve with zero gradients, which indicate that the successive values have relatively larger gaps. In the context of our feature selection work, a plateau region within a CDF curve of filter measure values represents separation between two feature subsets that differ significantly in terms of predictive power. As such, our approach exploits this concept to pinpoint the cut-off rank of the top ranked features. We termed the aforementioned CDF gradient approach as CDF-g.
To compute the gradient at any point R_i on the CDF curve, we used central differences for the interior points, and one-sided (forward or backward) differences at the boundaries, as shown in Eq. (6):

G(R_i) = \begin{cases} \dfrac{F_X(t_{i+1}) - F_X(t_i)}{h}, & \text{if } i = 1; \\ \dfrac{F_X(t_{i+1}) - F_X(t_{i-1})}{2h}, & \text{if } 1 < i < n; \\ \dfrac{F_X(t_i) - F_X(t_{i-1})}{h}, & \text{if } i = n; \end{cases} \quad (6)

where the default distance h = 1.0, and n denotes the number of points on the CDF curve.
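The finite-difference scheme of Eq. (6) is the same one `numpy.gradient` uses with `edge_order=1`: one-sided differences at the boundaries and central differences in the interior. The CDF values below are illustrative, not taken from the paper's data.

```python
import numpy as np

# Illustrative CDF values sampled at uniformly spaced points t_i (h = 1.0).
cdf = np.array([0.1, 0.3, 0.5, 0.5, 0.5, 0.9, 1.0])

# Eq. (6): one-sided differences at i = 1 and i = n, central differences
# for 1 < i < n; edge_order=1 makes numpy use the same one-sided form.
grad = np.gradient(cdf, 1.0, edge_order=1)

# A zero gradient marks a plateau region of the CDF curve, here the flat
# stretch 0.5, 0.5, 0.5 in the middle of the array.
print(grad)
```

The interior entry at index 3 is (0.5 − 0.5)/(2h) = 0, which is the plateau signal CDF-g searches for.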
The pseudo-code for applying CDF-g to identify the feature cut-off rank is shown in Algorithm 1, and we further explain it through an example. For instance, using IG as the filter measure, we compute the values of IG for the full set of features. The computed values are then normalised and sorted in increasing order. Next, a CDF curve is calculated and plotted using the sorted values. Based on the gradient of the CDF curve, the first plateau located above 50% of the cumulative value is identified and its corresponding normalised IG value is mapped to the feature ranking to obtain the cut-off rank, as shown by the red vertical dotted line in Fig. 3. The scatter plot in Fig. 3 provides a clearer understanding of the relationship between the cut-off rank and the feature distribution along the range of normalised IG values.
In plotting the complete CDF curve, it is necessary to set the number of uniform intervals, called bins. For example, if bins = 15 and 1.2 ≤ x_i ≤ 12.3, the plot points on the CDF curve are computed by running Eq. (5) over t = {1.20, 1.94, 2.68, …, 12.30}. In our study, we empirically set the number of bins to be twice the number of features and 50% as the threshold for locating the first plateau, where these parameters have provided optimal trade-offs between feature space reduction and classification accuracy.
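The binning step can be sketched as follows; `compute_cdf` is a hypothetical helper (not from the paper's code) mirroring the worked example, where bins = 15 over values in [1.2, 12.3] gives the evaluation points 1.20, 1.94, 2.68, …, 12.30.

```python
# Hypothetical ComputeCDF helper: evaluate the empirical CDF of Eq. (5)
# at the end points of `bins` uniform intervals spanning the value range.

def compute_cdf(values, bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    ts = [lo + i * width for i in range(bins + 1)]
    n = len(values)
    # F_X(t) = P(X <= t), estimated from the observed values.
    return ts, [sum(v <= t for v in values) / n for t in ts]

# With bins = 15 and 1.2 <= x <= 12.3, the evaluation points run
# 1.20, 1.94, 2.68, ..., 12.30, matching the worked example above.
# The five sample values themselves are illustrative.
ts, cdf = compute_cdf([1.2, 3.0, 3.1, 7.5, 12.3], bins=15)
print(round(ts[1], 2), round(ts[2], 2), round(ts[-1], 2))
```

Fifteen uniform intervals yield sixteen end points, and the CDF reaches 1.0 at the final point since every value lies at or below the maximum.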
4.3. Features preparation
Features utilised in existing phishing website detection studies can be categorised into two major types, namely: (a) internal features, and; (b) external features. Internal features are derived from the webpage URL and HTML source code, all of which can be directly acquired from the webpage itself. On the other hand, external features come from querying third
Algorithm 1 Feature cut-off rank identification using CDF-g.
Input: V, b (V = feature vector file, b = bin size)
Output: τ (τ = feature cut-off rank)
1: Begin
2: FM_values = ComputeFilterMeasure(V)
3: FM_values = Normalise(FM_values)
4: FM_values = Sort(FM_values)
5: CDF_values = ComputeCDF(FM_values, b)
6: for all c ∈ CDF_values do
7:   if c > 0.5 then
8:     CDF_gradient = ComputeGradient(c)
9:     if CDF_gradient = 0.0 then
10:      τ = MapToFeatureRanking(c)
11:      break
12:    end if
13:  end if
14: end for
15: End
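Under the stated parameter choices (bins set to twice the number of features, 50% plateau threshold), Algorithm 1 can be sketched in Python as below. The mapping from the plateau back to a feature count is our own reading of MapToFeatureRanking, the filter measure values are illustrative, and ComputeFilterMeasure is assumed to have run already.

```python
import numpy as np

def cdf_g_cutoff(fm_values, bins):
    """Sketch of Algorithm 1: return the feature cut-off rank tau."""
    v = np.sort(np.asarray(fm_values, dtype=float))       # Sort
    v = (v - v.min()) / (v.max() - v.min())               # Normalise
    ts = np.linspace(0.0, 1.0, bins + 1)                  # uniform bin edges
    cdf = np.searchsorted(v, ts, side="right") / v.size   # ComputeCDF, Eq. (5)
    grad = np.gradient(cdf, ts[1] - ts[0], edge_order=1)  # ComputeGradient, Eq. (6)
    for i in range(cdf.size):
        # first plateau (zero gradient) above 50% of the cumulative value
        if cdf[i] > 0.5 and grad[i] == 0.0:
            # MapToFeatureRanking: keep features above the plateau value
            return int(np.sum(v > ts[i]))
    return v.size  # no plateau found: keep every feature

# Illustrative filter measure values: two strongly predictive features
# separated from six weak ones by a wide gap, i.e. a CDF plateau.
fm = [0.02, 0.03, 0.04, 0.05, 0.06, 0.08, 0.85, 0.95]
print(cdf_g_cutoff(fm, bins=2 * len(fm)))  # bins = twice the feature count
```

The wide gap between 0.08 and 0.85 produces a flat stretch in the CDF above the 50% threshold, so the sketch returns a cut-off rank of 2, keeping only the two strong features.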
Fig. 3. Example of locating the plateau on the CDF curve of IG values.
party services such as domain registries, search engines, WHOIS records, etc. In this work, our benchmarking focuses solely on internal features. The reasons for excluding external features are threefold:
(a) The webpage URL and HTML source code are the most basic data available in phishing datasets. Using commonly accessible features ensures that our study is relevant to most anti-phishing researchers.
(b) External data are highly volatile. For example, the results of search engines change frequently.
(c) Access to external resources is unstable and unpredictable. For example, the public API for querying the Google PageRank metric was officially deprecated in 2016, thus affecting many phishing detection techniques [10,23,30] that depend on the PageRank feature.
To automate our feature extraction, we leverage the Selenium WebDriver1, a script-based browser automation framework. A Python script containing predefined parsing rules is executed, which in turn directs the browser to load the webpage. Next, the parsing rules search the rendered webpage content, extract the feature values, and save them to a text file. The advantage of using an actual browser for feature extraction is added robustness against variations in HTML content compared to parsing approaches based on regular expressions. In addition, it closely emulates the environment in which users normally interact with websites.
1http://www.seleniumhq.org/projects/webdriver/
Overall, a total of 48 features are extracted from webpage URLs and HTML source code. These features were selected after a comprehensive review of the related works on machine learning-based phishing website detection, based on 11 research papers published between 2007 and 2016. A detailed listing of the features is provided in the Appendix section.
5. Results analysis and discussions
5.1. Dataset preparation
In order to gather the required features, we collected websites in two separate sessions, i.e., between January and May 2015 and between May and June 2017. Specifically, we selected 5000 phishing webpages based on URLs from PhishTank2 and OpenPhish3, and another 5000 legitimate webpages based on URLs from Alexa4 and the Common Crawl5 archive.
The collection of webpages is automated using the GNU Wget6 tool and a Python script. Apart from the complete HTML documents, we also downloaded the related resources (e.g., images, CSS, JavaScript) to ensure all downloaded webpages can be rendered properly in the browser. In addition, the screenshot of every webpage is stored for further inspection and filtering.
The downloaded dataset was further processed to remove defective webpages which failed to load, or "Error 404" pages, in both the phishing and legitimate datasets. Duplicate instances of webpages are also removed. After filtering the samples, feature extraction is performed based on the descriptions in Section 4.3. The complete dataset is available for download as a Waikato Environment for Knowledge Analysis (Weka7) machine learning software Attribute-Relation File Format (ARFF) file at [26].
5.2. Experimental setup
Experiment processes such as calculating the filter measure values and running classifications (i.e., training and testing) are performed using Weka. It should be noted that all the classifiers deployed use the default parameters as specified by Weka. Other studies have explored the benefits of fine tuning or optimising these parameters. However, our aim is to show the impact of feature selection and assess its effect over different types of classifiers, without going in depth on fine tuning the classifier parameters. The proposed CDF-g algorithm is implemented in the Python programming language using the third party package called NumPy8.
Since we have constructed a large dataset, we set the total number of dataset partitions J = 10. This is because each obtained partition can still have a reasonable amount of phishing and legitimate webpage samples to properly train a classifier. As for the number of filter measures to be utilised for function perturbation, we set K = 3, to prevent the final baseline feature set from being inflated by conflicting features due to differences in filter measure rankings.
In running the experiments, we followed a classification approach similar to the train-test split. For each partition, 70% of the data is utilised for training while 30% is reserved for testing. The classification accuracy for each partition is computed as follows:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (7)

where TP, TN, FP and FN denote true positive, true negative, false positive and false negative, respectively. Lastly, the accuracies for all partitions are averaged to produce a single score. Experiments are conducted on a desktop computer equipped with an Intel Core i5 2.5 GHz CPU, 6 GB RAM and Windows 7 Home Premium 64-bit operating system.
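Eq. (7) in code, using illustrative confusion counts (e.g. a 94.6% run over a 3000-sample test split; these are not the paper's actual per-partition counts):

```python
# Accuracy per Eq. (7), computed from the confusion matrix counts.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative counts for one partition's 3000-sample test split.
score = accuracy(tp=1419, tn=1419, fp=81, fn=81)
print(f"{score:.1%}")  # -> 94.6%
```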
5.3. Evaluation I — Performance of various classifiers on different feature subsets derived from filter measure rankings
In this section, we evaluate the detection performance of various classifiers on incremental feature subsets. Classifiers that are commonly found in Weka are selected, namely, SVM, Naive Bayes, C4.5, Random Forest, JRip, and PART. The filter measures used to rank the features and generate the incremental feature subsets are Information Gain, Relief-F, Symmetrical Uncertainty, Chi-Square, Gain Ratio, and Pearson Correlation Coefficient. Results are presented subsequently in Figs. 4, 5, 6, 7, 8, and 9.
Based on the results, the performances of different classifiers can be seen to converge and stabilise at different stable states of accuracy. Here, the stable state of accuracy is defined as the maximum accuracy level that remains constant even as more features are being added. Therefore, it is imperative to identify the optimal top-n feature subset, which is the bare
2https://www.phishtank.com/
3https://www.openphish.com/
4https://www.alexa.com/
5http://commoncrawl.org/
6https://www.gnu.org/software/wget/
7https://www.cs.waikato.ac.nz/ml/weka/
8https://pypi.python.org/pypi/numpy/
Fig. 4. Performance of Chi-Square ranked feature subsets.
Fig. 5. Performance of Gain Ratio ranked feature subsets.
minimum number of features needed to achieve the stable state of accuracy that a specific classifier can offer. As such, the novel CDF-g algorithm is proposed to meet this demand by automatically determining the optimal top-n feature subset without any prior classifications, where the results are presented later in Section 5.5.
Comparing the stable states of accuracy among the classifiers, Random Forest appears to consistently outperform all remaining classifiers by scoring above 95% accuracy. Therefore, our finding here is consistent with the earlier studies in [14,18,24,25] which also claim that the best classifier in terms of accuracy is Random Forest.
5.4. Evaluation II — Fitness of filter measure with selected classifier
Capitalising on the best classifier recommendation in Evaluation I, another experiment is conducted to analyse the effects of using different filter measures with Random Forest. From this experiment, we provide insights on which filter measure is the fittest one to be utilised with Random Forest. Here, the fitness of a filter measure is gauged by its initial accuracy (i.e., the accuracy of the top-1st feature ranked by the filter measure) and how fast it reaches the stable state of accuracy when the features are incrementally added according to the ranking of that filter measure. The results are shown in Fig. 10.
Based on Fig. 10, Information Gain and Chi-Square performed the best by attaining the same initial accuracy of 90.5%, followed by Symmetrical Uncertainty at 88.9%. The favourable results of Information Gain could be attributed to the fact
Fig. 6. Performance of Information Gain ranked feature subsets.
Fig. 7. Performance of Pearson Correlation Coefficient ranked feature subsets.
that the Random Forest classifier and Information Gain are both founded on similar information metrics, i.e., the entropy.
Our results also correlate with [21], where the performance of Information Gain and Chi-Square are found to be comparable.
On the other hand, Gain Ratio achieved a relatively low initial accuracy of 81.8%. The least fit filter measures with Random Forest seem to be Relief-F and Pearson Correlation Coefficient, which only achieved 76.9% initial accuracy. The performance of Relief-F is consistent with the results reported in [14], where feature subsets based on Relief-F exhibited higher error rates as compared to Information Gain when the Random Forest classifier is used.
Therefore, based on the analysis results, it is reasonable to conclude that Information Gain, Chi-Square, and Symmetrical Uncertainty are the top-3 fittest filter measures to be used in combination with the Random Forest classifier.
5.5. Evaluation III — Performance of baseline features
Through the proposed CDF-g algorithm and HEFS framework, we managed to identify a set of baseline features (i.e., the best feature subset) that capitalises on up-to-date phishing schemes and contributes significantly towards the phishing detection rate. This evaluation aims to benchmark the performance of the baseline feature set against the full feature sets. Based on the finding from Evaluation II, we have selected Chi-Square, Information Gain, and Symmetrical Uncertainty as the filter measures in the function perturbation cycle. The final baseline feature set contains 10 features, which are listed in Table 1.
Fig. 8. Performance of Relief-F ranked feature subsets.
Fig. 9. Performance of Symmetrical Uncertainty ranked feature subsets.
Evaluation results shown in Table 2 suggest that the baseline features using Random Forest outperform all full feature sets that use C4.5, JRip, PART, SVM, and Naive Bayes. This proves that the derived baseline features, when coupled with the Random Forest classifier, are highly effective in distinguishing between phishing and legitimate websites. When comparing baseline and full features on the same classifier (Random Forest), the baseline features only experienced a minimal accuracy deterioration of 1.57% while achieving a massive 79.2% reduction in feature space. In short, the evaluation results have validated the effectiveness of the proposed HEFS framework, which yielded a highly effective set of baseline features. The baseline features can be prescribed as the de facto standard for anti-phishing researchers to evaluate newly proposed machine learning techniques.
A brief analysis of the baseline features has revealed interesting findings. It is surprising that some frequently promoted features in existing phishing detection studies are not chosen as baseline features by our proposed feature selection approach. Currently, features such as NumDots, UrlLength, AtSymbol, NoHttps, and IpAddress are still regarded as essential features for phishing detection. However, this study has shown otherwise; indeed, the use of these features contributes little to the accuracy of a phishing detection technique. This is potentially due to the evolution of phishing trends in recent years, where phishers are employing new schemes to evade detection. Hence, these features have eventually become obsolete. As such, our findings in this evaluation should serve as an urgent notice to anti-phishing researchers to discontinue the use of these outdated features.
Fig. 10. Performance of Random Forest on different filter measure ranked features.
Table 1
Baseline feature set.
FrequentDomainNameMismatch
PctNullSelfRedirectHyperlinks
PctExtNullSelfRedirectHyperlinks
PctExtResourceUrlsRT
NumNumericChars
ExtMetaScriptLinkRT
PctExtHyperlinks
SubmitInfoToEmail
NumDash
NumSensitiveWords
Table 2
Performance comparison between baseline and full features.
Classifier Feature set Number of features Accuracy (%)
Random Forest Full 48 96.17
Random Forest Baseline 10 94.60
C4.5 Full 48 94.37
JRip Full 48 94.17
PART Full 48 94.13
SVM Full 48 92.20
Naive Bayes Full 48 84.10
Table 3
Average runtime per sample for classification phase.
Feature selection technique Number of features Classification time (ms)
Chi-Square 37 0.0670
Information Gain 34 0.0673
Symmetrical Uncertainty 46 0.0783
HEFS 10 0.0523
5.6. Evaluation IV — Runtime analysis comparison
This evaluation benchmarks the runtime performance of HEFS against existing feature selection techniques, namely the filter measures. Note that for each respective filter measure, we only report the runtime performance of its best feature subset, i.e., the subset that achieved the highest accuracy.
Table 3 shows the average classification time (i.e., testing time) per sample, where 1500 phishing and 1500 legitimate data samples are evaluated. Results indicate that the baseline feature set produced using HEFS has the lowest testing time, which is reasonable since reducing the number of features also reduces the computational complexity of the classifier. Thus, the proposed HEFS framework is empirically shown to be advantageous when dealing with large datasets and real-time applications.
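The per-sample testing time reported in Table 3 can be measured with a simple wall-clock loop over the evaluation set. The sketch below uses a toy stand-in classifier purely for illustration; the actual measurement in our experiments was performed with the trained classifiers.

```python
import time

def average_classification_time_ms(predict, samples):
    """Average wall-clock prediction time per sample, in milliseconds."""
    start = time.perf_counter()
    for sample in samples:
        predict(sample)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / len(samples)

# Toy stand-in classifier: labels a sample as phishing when the sum of its
# feature values exceeds a threshold (illustrative only, not a real model).
predict = lambda features: sum(features) > 5.0

# 1500 phishing + 1500 legitimate samples, each with 10 baseline features
samples = [[0.1] * 10 for _ in range(3000)]
avg_ms = average_classification_time_ms(predict, samples)
```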
Table 4
Performance benchmarking between the proposed HEFS framework and FACA [11].
Classifier Feature set Number of features Accuracy (%)
FACA Full 30 92.40
Random Forest Full 30 94.27
Random Forest Baseline (derived from HEFS) 5 93.22
5.7. Evaluation V — Benchmarking on another machine learning dataset
To further validate the effectiveness of the proposed technique, the HEFS framework is benchmarked against a recent phishing detection technique by Hadi et al. [11]. The authors proposed a new classification algorithm called the Fast Associative Classification Algorithm (FACA) and evaluated it on a large phishing dataset from the UCI machine learning repository [17]. The dataset, which consists of 11,055 instances and 30 different features, has been widely utilised for benchmarking purposes in phishing detection research. Further details on the features and dataset descriptions are given in [17]. The motivation to select [11] for benchmarking is two-fold: (a) the Associative Classification (AC) algorithm employs a built-in feature selection process in the form of rule pruning, which is highly relevant to our proposed feature selection framework, and (b) AC is a promising and competitive classification approach that can be considered one of the state-of-the-art classification techniques [1,2,12].
The benchmarking results shown in Table 4 suggest that our proposed HEFS framework outperforms the feature selection and classification model in [11] while reducing feature dimensionality by as much as 83.33%. In other words, assuming that FACA has indeed performed feature selection (in the form of rule pruning), its accuracy when tested on the full features (30 in total) is still inferior to that of the baseline features (5 in total) derived using our proposed HEFS framework. The suboptimal performance of FACA may imply that its classification algorithm is less powerful, or it may also suggest that the feature selection (rule pruning) in FACA is not truly effective. In addition, the promising results in this evaluation suggest that the proposed HEFS is robust and can thus flexibly adapt to different datasets.
6. Conclusion and future work
In this work, a new feature selection framework called HEFS is proposed, where existing filter measures are leveraged to find an effective subset of features for utilisation in machine learning-based phishing detection. As part of HEFS, we propose a novel algorithm called CDF-g to automatically determine the optimal number of features. The CDF-g algorithm works by exploiting patterns in the distribution of filter measure values, thus enabling it to flexibly adapt to different datasets. The CDF-g is extended and integrated in a hybrid ensemble structure to yield a highly effective set of baseline features. Results suggest that the baseline features perform best when integrated with the Random Forest classifier, achieving a competitive accuracy of 94.6% using only 20.8% of the original number of features. In addition, the baseline features utilised on Random Forest outperform the full feature sets used on SVM, Naive Bayes, C4.5, JRip, and PART classifiers. Thus, the proposed feature selection framework is proven to be more advantageous than current approaches, which tend to be computationally expensive. The practical implications of this research include: (a) enhancing the detection accuracy and computational efficiency of machine learning-based phishing detection systems; (b) providing a fully automatic, flexible and robust feature selection framework that can be utilised universally on different datasets to produce highly effective optimal feature subsets; (c) prescribing a set of baseline features for future benchmarking of machine learning-based anti-phishing techniques, and; (d) accelerating the development of powerful and compact phishing detection systems by applying the proposed feature selection framework and the recommended best performing classifier.
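The core intuition behind CDF-g, exploiting the distribution of filter measure values to choose a cut-off point automatically, can be illustrated with a deliberately simplified sketch. The version below cuts the ranked list at the single largest drop between consecutive scores; the actual CDF-g procedure described in this paper is more elaborate, so treat this only as an interpretation of the idea.

```python
# Simplified cut-off sketch inspired by CDF-g: rank the filter-measure
# scores in descending order and keep everything before the largest gap.

def cutoff_by_largest_gap(scores):
    """Return how many top-ranked features to keep, cutting at the
    largest drop between consecutive sorted scores."""
    ranked = sorted(scores, reverse=True)
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    return gaps.index(max(gaps)) + 1

# Hypothetical filter-measure scores for seven features: three strong
# features are clearly separated from the remaining four weak ones.
scores = [0.91, 0.88, 0.85, 0.31, 0.29, 0.12, 0.10]
k = cutoff_by_largest_gap(scores)  # cuts at the 0.85 -> 0.31 drop
```

A fixed top-k rule would need the analyst to guess k per dataset; a distribution-driven cut-off like this adapts to wherever the scores actually separate, which is the flexibility the conclusion attributes to CDF-g.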
In future, it could be worthwhile to investigate the impact of feature selection using other classification algorithms such as associative classification. Researchers may also explore feature transformation as another viable feature reduction approach. In addition, future anti-phishing works should consider discontinuing the use of some obsolete features as reported in Section 5.5, which are found to be no longer effective.
Acknowledgements
The funding for this project is made possible through the research grant obtained from UNIMAS under the Dana Pelajar PhD [Grant No: F08/DPP/1649/2018]. This work is also supported by the Sarawak Foundation Tun Taib Scholarship Scheme.
A sincere thank you to Gladys Yee-Kim Tan for her diligent proofreading and language assistance on this paper.
Appendix A
Table A1
Phishing website features.
No. Identifier Value type Description
1 NumDots Discrete Counts the number of dots in webpage URL [1,31].
2 SubdomainLevel Discrete Counts the level of subdomain in webpage URL [13,30].
3 PathLevel Discrete Counts the depth of the path in webpage URL [9].
4 UrlLength Discrete Counts the total characters in the webpage URL [16,17].
5 NumDash Discrete Counts the number of "-" in webpage URL [9,16,17,30].
6 NumDashInHostname Discrete Counts the number of "-" in hostname part of webpage URL [9,16,17,30].
7 AtSymbol Binary Checks if "@" symbol exists in webpage URL [9,13,16,17,30].
8 TildeSymbol Binary Checks if "∼" symbol exists in webpage URL [8].
9 NumUnderscore Discrete Counts the number of "_" in webpage URL [9].
10 NumPercent Discrete Counts the number of "%" in webpage URL [8].
11 NumQueryComponents Discrete Counts the number of query parts in webpage URL [8].
12 NumAmpersand Discrete Counts the number of "&" in webpage URL [8].
13 NumHash Discrete Counts the number of "#" in webpage URL [8].
14 NumNumericChars Discrete Counts the number of numeric characters in the webpage URL [9].
15 NoHttps Binary Checks if HTTPS exists in webpage URL [9,13,15–17].
16 RandomString Binary Checks if random strings exist in webpage URL [8].
17 IpAddress Binary Checks if IP address is used in hostname part of webpage URL [10,13,16,17,30].
18 DomainInSubdomains Binary Checks if TLD or ccTLD is used as part of subdomain in webpage URL [30].
19 DomainInPaths Binary Checks if TLD or ccTLD is used in the path of webpage URL [13,17,30].
20 HttpsInHostname Binary Checks if HTTPS is obfuscated in hostname part of webpage URL.
21 HostnameLength Discrete Counts the total characters in hostname part of webpage URL [15].
22 PathLength Discrete Counts the total characters in path of webpage URL [15].
23 QueryLength Discrete Counts the total characters in query part of webpage URL [15].
24 DoubleSlashInPath Binary Checks if "//" exists in the path of webpage URL [23].
25 NumSensitiveWords Discrete Counts the number of sensitive words (i.e., "secure", "account", "webscr", "login", "ebayisapi", "signin", "banking", "confirm") in webpage URL [10,30].
26 EmbeddedBrandName Binary Checks if brand name appears in subdomains and path of webpage URL [30]. Brand name here is assumed to be the most frequent domain name in the webpage HTML content.
27 PctExtHyperlinks Continuous Counts the percentage of external hyperlinks in webpage HTML source code [13,16,17].
28 PctExtResourceUrls Continuous Counts the percentage of external resource URLs in webpage HTML source code [13,16,17].
29 ExtFavicon Binary Checks if the favicon is loaded from a domain name that is different from the webpage URL domain name [17].
30 InsecureForms Binary Checks if the form action attribute contains a URL without HTTPS protocol [30].
31 RelativeFormAction Binary Checks if the form action attribute contains a relative URL [30].
32 ExtFormAction Binary Checks if the form action attribute contains a URL from an external domain [16,17].
33 AbnormalFormAction Categorical Checks if the form action attribute contains a "#", "about:blank", an empty string, or "javascript:true" [17].
34 PctNullSelfRedirectHyperlinks Continuous Counts the percentage of hyperlink fields containing an empty value, a self-redirect value such as "#", the URL of the current webpage, or some abnormal value such as "file://E:/" [13,17,30].
35 FrequentDomainNameMismatch Binary Checks if the most frequent domain name in HTML source code does not match the webpage URL domain name.
36 FakeLinkInStatusBar Binary Checks if HTML source code contains the JavaScript command onMouseOver to display a fake URL in the status bar [16,17].
37 RightClickDisabled Binary Checks if HTML source code contains a JavaScript command to disable the right-click function [16,17].
38 PopUpWindow Binary Checks if HTML source code contains a JavaScript command to launch pop-ups [8,16,17].
39 SubmitInfoToEmail Binary Checks if HTML source code contains the HTML "mailto" function [17].
40 IframeOrFrame Binary Checks if iframe or frame is used in HTML source code [17].
41 MissingTitle Binary Checks if the title tag is empty in HTML source code [8].
42 ImagesOnlyInForm Binary Checks if the form scope in HTML source code contains no text at all but images only.
43 SubdomainLevelRT Categorical Counts the number of dots in hostname part of webpage URL. Applies rules and thresholds to generate value [16].
44 UrlLengthRT Categorical Counts the total characters in the webpage URL. Applies rules and thresholds to generate value [16,17].
45 PctExtResourceUrlsRT Categorical Counts the percentage of external resource URLs in webpage HTML source code. Applies rules and thresholds to generate value [17].
46 AbnormalExtFormActionR Categorical Checks if the form action attribute contains a foreign domain, "about:blank" or an empty string. Applies rules to generate value [17].
47 ExtMetaScriptLinkRT Categorical Counts percentage of meta, script and link tags containing external URL in the attributes. Applies rules and thresholds to generate value [17].
48 PctExtNullSelfRedirectHyperlinksRT Categorical Counts the percentage of hyperlinks in HTML source code that use different domain names, start with "#", or use "JavaScript ::void(0)". Applies rules and thresholds to generate value [17].
Supplementary material
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.ins.2019.01.064.
References
[1] N. Abdelhamid, A. Ayesh, F. Thabtah, Phishing detection based associative classification data mining, Expert Syst. Appl. 41 (13) (2014) 5948–5959.
[2] M. Aburrous, M.A. Hossain, K. Dahal, F. Thabtah, Associative classification techniques for predicting e-banking phishing websites, in: Proceedings of the International Conference on Multimedia Computing and Information Technology (MCIT), Sharjah, United Arab Emirates, 2010, pp. 9–12.
[3] Anti-Phishing Working Group, Phishing Activity Trends Report, 2nd Quarter 2018, 2018, Retrieved from http://docs.apwg.org/reports/apwg_trends_report_q2_2018.pdf, accessed 16.11.18.
[4] R.B. Basnet, A.H. Sung, Q. Liu, Feature selection for improved phishing detection, in: Proceedings of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems (IEA/AIE), Dalian, China, 2012, pp. 252–261.
[5] H. Bleau, 2017 Global Fraud and Cybercrime Forecast, 2016, Retrieved from https://www.rsa.com/en-us/blog/2016-12/2017-global-fraud-cybercrime-forecast, accessed 02.11.17.
[6] V. Bolón-Canedo, K. Sechidis, N. Sánchez-Maroño, A. Alonso-Betanzos, G. Brown, Insights into distributed feature ranking, Inf. Sci. (Ny) (2018).
[7] K.L. Chiew, E.H. Chang, S.N. Sze, W.K. Tiong, Utilisation of website logo for phishing detection, Comp. Secur. 54 (2015) 16–26.
[8] X.M. Choo, K.L. Chiew, D.H.A. Ibrahim, N. Musa, S.N. Sze, W.K. Tiong, Feature-based phishing detection technique, J. Theor. Appl. Inf. Technol. 91 (1) (2016) 101–106.
[9] H.M.A. Fahmy, S. Ghoneim, PhishBlock: A hybrid anti-phishing tool, in: Proceedings of the International Conference on Communications, Computing and Control Applications (CCCA), Hammamet, Tunisia, 2011, pp. 1–5.
[10] S. Garera, N. Provos, M. Chew, A.D. Rubin, A framework for detection and measurement of phishing attacks, in: Proceedings of the 5th ACM Workshop on Recurring Malcode (WORM), Alexandria, Virginia, USA, 2007, pp. 1–8.
[11] W. Hadi, F. Aburub, S. Alhawari, A new fast associative classification algorithm for detecting phishing websites, Appl. Soft Comput. 48 (2016) 729–734.
[12] W. Hadi, G. Issa, A. Ishtaiwi, ACPRISM: Associative classification based on PRISM algorithm, Inf. Sci. (Ny) 417 (2017) 287–300.
[13] M. He, S.-J. Horng, P. Fan, M.K. Khan, R.-S. Run, J.-L. Lai, R.-J. Chen, A. Sutanto, An efficient phishing webpage detector, Expert Syst. Appl. 38 (10) (2011) 12018–12027.
[14] M. Khonji, A. Jones, Y. Iraqi, An empirical evaluation for feature selection methods in phishing email classification, Comp. Syst. Sci. Eng. 28 (1) (2013) 37–51.
[15] M. Moghimi, A.Y. Varjani, New rule-based phishing detection method, Expert Syst. Appl. 53 (2016) 231–242.
[16] R.M. Mohammad, F. Thabtah, L. McCluskey, An assessment of features related to phishing websites using an automated technique, in: Proceedings of the International Conference for Internet Technology and Secured Transactions, London, UK, 2012, pp. 492–497.
[17] R.M. Mohammad, F. Thabtah, L. McCluskey, Phishing Websites Data Set, 2015, Retrieved from http://archive.ics.uci.edu/ml/datasets/Phishing+Websites, accessed 02.11.17.
[18] H.H. Nguyen, D.T. Nguyen, Machine learning based phishing web sites detection, in: Proceedings of the International Conference on Advanced Engineering Theory and Applications (AETA), Ton Duc, Ho Chi Minh City, Vietnam, 2016, pp. 123–131.
[19] B. Pes, N. Dessì, M. Angioni, Exploiting the ensemble paradigm for stable feature selection: a case study on high-dimensional genomic data, Inform. Fusion 35 (2017) 132–147.
[20] S. Purkait, Examining the effectiveness of phishing filters against DNS based phishing attacks, Inform. Comp. Secur. 23 (3) (2015) 333–346.
[21] I. Qabajeh, F. Thabtah, An experimental study for assessing email classification attributes using feature selection methods, in: Proceedings of the 3rd International Conference on Advanced Computer Science Applications and Technologies (ACSAT), Amman, Jordan, 2014, pp. 125–132.
[22] K.D. Rajab, New hybrid features selection method: A case study on websites phishing, Secur. Comm. Netw. 2017 (2017).
[23] G. Ramesh, I. Krishnamurthi, A comprehensive and efficacious architecture for detecting phishing webpages, Comp. Secur. 40 (2014) 23–37.
[24] O.K. Sahingoz, E. Buber, O. Demir, B. Diri, Machine learning based phishing detection from URLs, Expert Syst. Appl. 117 (2019) 345–357.
[25] A. Subasi, E. Molah, F. Almkallawi, T.J. Chaudhery, Intelligent phishing website detection using random forest classifier, in: Proceedings of the International Conference on Electrical and Computing Technologies and Applications (ICECTA), Ras Al Khaimah, United Arab Emirates, 2017, pp. 1–5.
[26] C.L. Tan, Phishing Dataset for Machine Learning: Feature Evaluation, Mendeley Data, v1, 2018, Retrieved from https://doi.org/10.17632/h3cgnj8hft.1, accessed 20.05.18.
[27] F. Thabtah, N. Abdelhamid, Deriving correlated sets of website features for phishing detection: A computational intelligence approach, J. Inform. Knowl. Manag. 15 (04) (2016) 1650042.
[28] F. Toolan, J. Carthy, Feature selection for spam and phishing detection, in: Proceedings of the eCrime Researchers Summit (eCrime), Dallas, Texas, USA, 2010, pp. 1–12.
[29] G. Varshney, M. Misra, P.K. Atrey, A survey and classification of web phishing detection schemes, Secur. Comm. Netw. 9 (18) (2016) 6266–6284.
[30] G. Xiang, J. Hong, C.P. Rose, L. Cranor, CANTINA+: A feature-rich machine learning framework for detecting phishing web sites, ACM Trans. Inf. Syst. Secur. 14 (2) (2011) 21.
[31] H. Zuhair, A. Selamat, M. Salleh, The effect of feature selection on phish website detection, Int. J. Adv. Comp. Sci. Appl. 6 (10) (2015) 221–232.