Contents lists available at ScienceDirect

Neurocomputing

journal homepage: www.elsevier.com/locate/neucom

Classification using link prediction

Seyed Amin Fadaee, Maryam Amir Haeri

Department of Computer Science and Information Technology, Amirkabir University of Technology, Iran

Article info

Article history:
Received 21 February 2019
Revised 24 May 2019
Accepted 5 June 2019
Available online 13 June 2019
Communicated by Prof. H. Zhang

Keywords:
Classification
Link prediction
Graph representation
Local similarity measure
Similarity-based techniques

Abstract

Link prediction in a graph is the problem of detecting the missing links or the ones that would be formed in the near future. Using a graph representation of the data, we can convert the problem of classification to the problem of link prediction, which aims at finding the missing links between the unlabeled data (unlabeled nodes) and their classes. To our knowledge, despite the fact that numerous algorithms use the graph representation of the data for classification, none use link prediction as the heart of their classifying procedure. In this work, we propose a novel algorithm called CULP (Classification Using Link Prediction) which uses a new structure, namely the Label Embedded Graph (LEG), and a link predictor to find the class of the unlabeled data. Different link predictors, along with Compatibility Score (a new link predictor we propose that is designed specifically for our settings), have been used and showed promising results for classifying different datasets. This paper further improves CULP by designing an extension called CULM, which uses a majority vote (hence the M in the acronym) procedure with weights proportional to the predictions' confidences, to use the predictive power of multiple link predictors and also exploit the low level features of the data. Extensive experimental evaluations show that both CULP and CULM are highly accurate and competitive with the cutting edge graph classifiers and general classifiers.

© 2019 Elsevier B.V. All rights reserved.

1. Introduction

Classification is an old problem in machine learning and pattern recognition that aims at finding a correct mapping between data and their corresponding labels. This mapping would then be used to derive the class of the unlabeled data [1].

This field is still highly active in the literature, and many algorithms have been proposed to correctly classify the data. Most of the classification algorithms aim at finding a decision boundary in the feature space for distinguishing the data belonging to different classes; however, as more complex data require more complex algorithms, these approaches can fail to capture the true relations in the data.

One of the approaches that has recently gained popularity in the literature is classification of the unlabeled instances using the graph representation of the data. Data can be represented in different forms, one of which is a graph. In this setting, the data is first converted to a graph via a similarity function in the feature space; then unlabeled data is classified by incorporating a graph property. These graph properties are called high level features, which give more insight into the data compared to the low level features.

Corresponding author.

E-mail addresses: [email protected] (S.A. Fadaee), [email protected] (M. Amir Haeri).

Classification using graph representation is studied extensively in numerous works [2–9]. These works use graph properties such as clustering coefficient, modularity, importance, PageRank and others to classify the unlabeled data, and they tend to achieve more accurate results compared to the classifiers that classify based on the low level features of the data. This approach has been used in text classification [10], hyperspectral image classification [11,12], image classification [2,8], handwritten digits recognition [3] and other areas.

Link prediction is the problem of predicting the missing links in a graph, or the ones that would be formed in the near future [13]. Using the graph representation of the data, we can treat classification as a link prediction problem in an intuitive way, where we try to find the link between an unlabeled node and its corresponding class. To our knowledge, there is no work in the literature that uses link prediction to solve the problem of classification; however, the use of classification to solve link prediction has been studied extensively [13].

In this work, we propose an algorithm called CULP (an acronym for Classification Using Link Prediction) that takes a different look at the classification problem through a link prediction approach. As we will elaborate in the paper, CULP uses a graph called LEG that models the data in an intuitive way that is suitable for link prediction.

Any link predictor can be used to derive the class of the unlabeled node in CULP, and we propose a new local measure called Compatibility Score that is designed to improve the accuracy of link prediction and, consequently, classification.

https://doi.org/10.1016/j.neucom.2019.06.026
0925-2312/© 2019 Elsevier B.V. All rights reserved.

As much insight as high level features provide for capturing the patterns present in the data, exploiting the low level features alongside them would further improve the predictive power of a graph classifier, and different researchers incorporate this idea in their work [2,4]. This is why we further improved CULP and propose the CULM extension: a majority vote system (hence the M in the acronym) with weights proportional to the probabilities of the predictions. This extension uses multiple link predictors along with a low level classifier. As we will see, both the CULP and CULM algorithms derive highly accurate results which are competitive with low level classifiers and other graph based classification methods.

The rest of the paper is organized as follows. In the next section, a review of the general domains used in this paper is presented; this preliminary section elaborates on the problem of link prediction, similarity measures in vector space, methods of converting data to a graph, and the problem of classification. After that, a section of related works is given, summarizing recent works that use the graph representation of the data for classification. Next, the CULP algorithm is presented in full detail, elaborating on the LEG (Label Embedded Graph) structure, the classification procedure which uses link prediction, our novel link predictor Compatibility Score, the time complexity, and a toy example to demonstrate CULP. Finally, the CULM extension is presented, followed by our extensive experimental results to put our proposed algorithms into perspective. At the end, the conclusion to the paper and the aims for future work are presented.

2. Preliminaries

To fully understand CULP, a grounding for the details comprising this algorithm should be set. In this section, a general review of graph theory concepts and notations, along with the definition of the link prediction problem in complex networks, is given. After that, an overview of some of the most important similarity measures is presented; following this, the different ways of converting data to a graph are discussed. Finally, at the end of this section, the problem of classification is defined.

2.1. Link prediction

Given a set of vertices V and a set of edges E containing pairs (i, j) where i, j ∈ V, the data structure G(V, E) can be defined as a graph. If the elements of E are ordered pairs, G is considered to be a directed graph. In an undirected graph, (i, j) ∈ E implies that (j, i) ∈ E. Regardless of the directionality of the graph, node j is a neighbor of node i if (i, j) ∈ E. For a node i, Γ(i) denotes the set of the neighbor nodes of i.

For the graph G, the adjacency matrix A_G, or simply A, is defined as an N×N matrix with zero-one elements, where N = |V|. For any entry of A, A_{i,j} = 1 if and only if (i, j) ∈ E. In an undirected graph, by definition A = Aᵀ. As our focus in this paper is on undirected graphs, for the sake of simplicity we use "graph" to mean an undirected graph.

The degree of a node i in a graph can be derived using |Γ(i)|. For any graph, the cardinality |E| can be obtained by summing over the degrees of all nodes using Eq. (1), where N = |V|:

    |E| = (1/2) Σ_{i=1..N} |Γ(i)|    (1)

The problem of link prediction in a graph arises when the goal is to predict, for the currently absent links (0 entries in A), the probability of link formation in the future. There are many score functions for link prediction; these functions usually compute the local similarity between the nodes to derive the scores. One of the simplest techniques is known as Common Neighbors (CN) [14]. Using this approach, the prediction scores can be derived using the following:

    λ_{i,j} = |Γ(i) ∩ Γ(j)|    (2)

Eq. (2) simply counts the number of common neighbors of nodes i and j to derive a score for their link formation.

Another approach to find the link formation score was introduced by Adamic and Adar [15]; it uses the degrees of common neighbors as features for prediction, and it can be written as

    λ_{i,j} = Σ_{γ ∈ Γ(i)∩Γ(j)} 1 / log|Γ(γ)|    (3)

Eq. (3) is known as the Adamic-Adar score (AA). This score penalizes the features by their logarithm and uses these features for deriving the prediction scores. Another famous approach for tackling the problem of link prediction is the Resource Allocation Index (RA) [16], which simulates the transition of resources between nodes i and j. This index is defined as Eq. (4):

    λ_{i,j} = Σ_{γ ∈ Γ(i)∩Γ(j)} 1 / |Γ(γ)|    (4)

This index is quite similar to AA; however, it does not use the logarithm function, which further reduces the effect of nodes with high degree. This has the benefit of penalizing high degree common nodes: in a lot of networks, these nodes provide little insight for link prediction, as they are connected to a lot of other nodes in the graph.
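To make the CN, AA and RA scores concrete, here is a minimal Python sketch (our own illustration, not code from the paper's released implementation), assuming the graph is stored as a dict mapping each node to its set of neighbors:

```python
import math

def common_neighbors(G, i, j):
    # CN (Eq. 2): count the shared neighbors of i and j.
    return len(G[i] & G[j])

def adamic_adar(G, i, j):
    # AA (Eq. 3): weight each common neighbor by 1/log(degree).
    # A common neighbor is adjacent to both i and j, so its degree
    # is at least 2 and the logarithm is positive.
    return sum(1.0 / math.log(len(G[g])) for g in G[i] & G[j])

def resource_allocation(G, i, j):
    # RA (Eq. 4): weight each common neighbor by 1/degree.
    return sum(1.0 / len(G[g]) for g in G[i] & G[j])
```

On the same graph, RA always assigns a smaller contribution than AA to a common neighbor of degree above 2, which is the stronger penalty on high degree nodes discussed above.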

In this work, we propose a new similarity function for the purpose of link prediction, called Compatibility Score, which is discussed further in the paper.

2.2. Similarity measures

Any data point x with numeric features x_f, where 1 ≤ f ≤ d, can be regarded as a vector in a d-dimensional space. This view enables the measurement of the similarities between data points using conventional similarity measures. As we are going to utilize a similarity measure in converting our data to a graph (discussed in the next segment), we provide an overview of some of these measures.

Having our data matrix X, with n rows and d columns and each row being a data vector, the Cosine similarity can be defined as the following:

    s_{i,j} = (X_i · X_j) / (||X_i||_2 ||X_j||_2)    (5)

where ||x||_2 denotes the Euclidean norm of the vector x, which is derived by the following:

    ||x||_2 = ( Σ_{f=1..d} x_f^2 )^{1/2}

Following the above equation, the Euclidean distance between any two d-dimensional vectors can be written as:

    φ_{i,j} = ( Σ_{f=1..d} (X_{i,f} − X_{j,f})^2 )^{1/2}    (6)

Utilizing the Euclidean distance, another similarity measure, namely the Inverse Euclidean, can be defined using:

    s_{i,j} = 1 / (φ_{i,j} + ε)    (7)


In Eq. (7), the ε term is a small number used to avoid division by zero in case of identical vectors. Another prominent distance in linear algebra is what is known as the absolute or Manhattan distance (Eq. (8)); by substituting Eq. (8) into Eq. (7), the Inverse Manhattan similarity function is defined:

    φ_{i,j} = Σ_{f=1..d} |X_{i,f} − X_{j,f}|    (8)
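These measures translate directly into code. A small Python sketch (our own illustration; the function names are ours, and ε defaults to a tiny constant):

```python
import math

def cosine_similarity(x, y):
    # Eq. (5): dot product over the product of Euclidean norms.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def inverse_euclidean(x, y, eps=1e-9):
    # Eqs. (6)-(7): 1 / (Euclidean distance + eps).
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (dist + eps)

def inverse_manhattan(x, y, eps=1e-9):
    # Eqs. (7)-(8): 1 / (absolute distance + eps).
    dist = sum(abs(a - b) for a, b in zip(x, y))
    return 1.0 / (dist + eps)
```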

2.3. Converting data to graph

Any vector based data can be represented as a graph. Doing this changes the structure of the data, which enables us to compute high level features.

Two of the most used procedures for converting data to a graph are the r-Radius and kNN methods [17].

Using a similarity measure s (e.g. the cosine similarity discussed in the previous segment) and a data matrix X, we can use either of these two algorithms to convert the data into a graph. In r-Radius, an edge is created between every pair of data points that have a similarity higher than a predefined threshold r. Another approach is using the k-nearest neighbors to form the graph: if (based on s) X_i is in the k-nearest neighbors of X_j, the edge (i, j) is created.

Due to the fact that the kNN relation is not symmetric, this approach would generally result in a directed graph. However, the same principle can be used to create an undirected graph, as in Algorithm 1. Using this approach, if X has N instances, the number of undirected edges |E| in the created graph is bounded by Nk/2 ≤ |E| ≤ Nk. CULP uses an undirected kNN modeling of the data for the task of classification.

Algorithm 1: Undirected kNN conversion function for the data matrix X and similarity measure s.

    function kNN-Convert(X, s, k)
        E ← {}
        for i, j ∈ X do
            if i ∈ kNN(s, j) or j ∈ kNN(s, i) then
                E ← E ∪ (i, j)
            end if
        end for
        return E
    end function
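Algorithm 1 can be sketched in Python as follows. This is our own naive illustration (ranking all pairs costs O(N² log N); ties in similarity are broken by sort order, and `s` can be any of the similarity functions above):

```python
def knn_convert(X, s, k):
    # Undirected kNN conversion (Algorithm 1), 0-based indices:
    # the edge (i, j) exists if i is among the k most similar points
    # to j, or vice versa.
    n = len(X)
    knn = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: s(X[i], X[j]), reverse=True)
        knn.append(set(ranked[:k]))
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if j in knn[i] or i in knn[j]}
```

With the "or" rule each point contributes at most k edges, but two points may nominate each other, which is exactly where the Nk/2 ≤ |E| ≤ Nk bounds above come from.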

2.4. Classification

Suppose there are two sets of data: X, with n instances and d features for each instance, is the set of our labeled data. The labels of X are denoted by y, where y_i ∈ {1, 2, ..., C} with C being the number of classes. Each pair (X_i, y_i) makes up our training data. The other set of data is X^(u), with m instances and again d features for each instance; these are the unlabeled or test data.

The classification problem aims at finding a mapping X_i^(u) → ŷ_i for every i ∈ {1, ..., m}. In other words, we are trying to find a proper label for each of the unlabeled instances in X^(u). If C = 2, this is called binary classification, and if C > 2, the problem is called multi-class classification [1].

Classifiers like kNN or Decision Trees can naturally handle multi-class classification problems; however, some classifiers like SVM are inherently designed for the binary classification task, and upgrading them to handle multi-class classification requires using One vs. All or One vs. One approaches [1].

In One vs. All, C classifiers are trained, and each classifier has the task of deciding whether an instance belongs to a particular class or not. The One vs. One approach is done by training C(C−1)/2 classifiers to classify an instance into either of two classes among all of the C classes.

3. Related works

Graph based classification has recently gained popularity, and numerous works [2–8] focus on using this approach instead of the classical methods of classification. These methods can capture complex patterns in the data and generate high level features to guide the classification procedure; furthermore, they can usually be modified to utilize the low level features of the data as well.

In [2], a random walker is used to classify unlabeled instances on the graph embedding of the data. This graph is represented by a weight matrix of similarities. The random walk process is continued until convergence, and the new data receives the label through a weighted majority vote between the labels of the top η nodes with highest probabilities. This method takes the similarity among the data points into account with a single network for the dataset, along with the structural changes an unlabeled instance causes on the networks created for each class. The complexity of the method is O(n²); however, as the authors claimed, using sparse representations such as a kNN network and a graph construction method based on Lanczos bisection [18], this complexity can be reduced to between O(n^1.06) and O(n^1.33).

Another system is proposed in [9], in which a graph is created for the training instances of each class; then, using the spatio-structural differential efficiency measure proposed in the paper, a test instance is connected to some of the nodes in each graph. The label of the data would be the class of the graph in which the test data has the highest importance. The importance is characterized by Google's PageRank measure of the network. The spatio-structural differential efficiency measure in [9] considers both physical and topological properties of the data, and the complexity of the proposed method is again O(n²), which is once more reduced to between O(n^1.06) and O(n^1.33) by using the graph construction method based on Lanczos bisection.

A hybrid method is proposed in [3] that aids a typical classifier (such as kNN, SVM or Naive Bayes) by using high level features. These high level features are the differences of some graph properties before and after inserting a new instance into the graph representation of the data of each class. The graph of each class is constructed using a combination of the r-radius and kNN graph conversion methods. The graph properties used in their work are assortativity, network clustering coefficient and average degree. The label for the test instance is generated by a weighted combination of low level and high level features. The authors extended their work in [4] by using two more high level features, namely Normalized Average Distance among vertices and coreness variability, and by using a stacking procedure to learn the weight for each feature. Also, [5] extends the same work by discarding the use of any classical classifier and using a scheme that takes low level feature techniques into account to filter irrelevant graphs of some of the classes.

The authors of [6] proposed a framework for classification using the k-Associated Optimal Graph for modeling the data and Bayes' theorem for computing a posterior probability for each class to classify new instances. Similar to the kNN graph conversion method, the k-Associated Optimal Graph computes the similarity of a data point with all of the training data; however, it forms an edge only if the points belong to the same class. This results in having multiple components (and possibly more than one component for a class). The method furthermore tries to find a local k for each class so that the resulting components get the maximal Purity (a measure based on the average degree of a component). This way, the process of finding the parameter k is conducted automatically, which also makes the complexity of the framework O(n²). Another paper also uses the k-Associated graph along with the high level classification method of [3] to classify new instances.

Other methods using different graph measures have been produced as well. Neto and Zhao [7] use dynamic entropy for each weighted graph produced by r-radius, where the weights denote the distance between data points. Cupertino et al. [8] utilize the modularity measure for classifying a new instance that belongs to a pattern set of the same object in the training data. The label is derived by creating a kNN graph for each pattern set and choosing the label of the graph with the lowest modularity change after insertion of the new data. Both of the methods in [7,8] have a complexity of O(n²).

The graph based classification methods in the literature mostly have three characteristics in common. Firstly, they create a different graph for each class of the data; this approach prevents finding meaningful patterns that may be formed by the similarities between points in different classes.

The second aspect these algorithms have in common is that they treat test instances individually, adding them to the graph of each class and measuring a graph property before and after the insertion. This makes the prediction of a new instance inefficient in the presence of a large amount of test data.

Lastly, the properties that these algorithms use for finding the differences before and after the insertion of the unlabeled data (e.g. clustering coefficient, average path length, etc.) are time consuming, and their computation times usually depend on the graph size, which can make them infeasible for large datasets.

Our proposed algorithm CULP and its extension CULM solve the first and second issues by employing a novel graph representation called LEG, which treats classes as nodes, along with the training and test instances, in a unified object; this is discussed further in the paper. As for the third problem, since the label of a test instance is derived using link prediction measures (as discussed in the previous section), the classification of the unlabeled data is faster than in similar methods.

4. CULP algorithm

CULP (Classification Using Link Prediction) is a classification method that aims to gain a higher accuracy in the multi-class classification task by exploiting the similarity among the data points. This algorithm employs the power of graph representation and link prediction methods in complex networks to deal with this problem.¹ The overall structure of CULP consists of 2 stages:

1. Creating the LEG structure G from the data
2. Classifying the test data using G

In the first step, we model our data into an augmented graph data structure called LEG (Label Embedded Graph), which we call G. G is a heterogeneous graph which incorporates the data, the classes and the similarity between them as a unified object.

A LEG essentially contains 3 sets of nodes and 2 sets of links. The different types of nodes in G are training nodes, testing nodes and class nodes; a link between two data nodes denotes similarity between them, and a link between a training node and a class node denotes the class membership of that node.

After creating G, we can convert the classification problem to the problem of predicting the class membership link of a testing node. By utilizing a link prediction algorithm in the next step, a membership score for every testing-class pair of nodes is computed.

Each of the membership scores acts as a posterior probability; a label is chosen for a testing node based on these scores.

¹ The complete code of CULP in Python can be found at github.com/aminfadaee/culp.

The CULP procedure is depicted in Algorithm 3. In the next segments, each of the steps of the proposed algorithm is covered in more detail.

4.1. LEG representation

The first step toward classification using CULP is creating the LEG representation. A LEG is a heterogeneous graph with three sets of nodes:

• Training nodes (V_l)
• Testing nodes (V_u)
• Class nodes (V_c)

and two sets of edges:

• Similarity edges (E_s)
• Class membership edges (E_c)

Each set of nodes corresponds to its analogous set of data, i.e. V_l contains n nodes, V_u contains m nodes and V_c contains C nodes.

The class membership edges are created based on the labeled data. E_c contains edges (i, j) where i ∈ V_l and j is the node representation of y_i, meaning that each training node is connected (without direction) to its corresponding class node. It should be noted that, since the labels for the test data are not available, E_c contains only pairs of nodes from V_l and V_c.

Unlike E_c, the members of E_s are not obtained so trivially. E_s is responsible for incorporating the similarities between instances of our data, and the edges in this set are obtained by using a graph conversion algorithm. In this work, the undirected version of the kNN graph conversion (Algorithm 1) is used.

Edges in E_s primarily connect two nodes in V_l, or a node from V_u to one in V_l. However, there is no constraint on having an edge between two nodes in V_u, meaning that we can find the similarity between unlabeled data points and connect them as well (as we have done in this work).

If the unlabeled data is not available at first, or in case of a new unlabeled node x^(u), this node is first added to the set V_u; after that, the similarity edges between this node and the other nodes of the graph are created through a linear similarity computation.

After creating all of the sets of nodes and edges, we can define the LEG G(V, E), where V = V_l ∪ V_u ∪ V_c and E = E_s ∪ E_c. Although G is inherently heterogeneous, we can treat it as a simple undirected graph. The procedure for creating G is summarized in Algorithm 2.

Algorithm 2: LEG construction function for the data X^(l), the labels y and the unlabeled data X^(u), with parameter k and the similarity function s.

    function LEG(X^(l), X^(u), y, s, k)
        X ← X^(l) ∪ X^(u)
        V_l ← {1, 2, ..., n}    // Nodes are represented by numbers
        V_u ← {n+1, n+2, ..., n+m}
        V_c ← {n+m+1, n+m+2, ..., n+m+C}
        E_c ← {}
        for i ∈ {1, 2, ..., n} do
            E_c ← E_c ∪ (i, n+m+y_i)
        end for
        E_s ← kNN-Convert(X, s, k)
        V ← V_l ∪ V_u ∪ V_c
        E ← E_s ∪ E_c
        return G(V, E)
    end function

This algorithm takes the labeled and unlabeled data along with the parameter k and the similarity measure s, and produces G as the output.
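The construction in Algorithm 2 can be sketched in Python as follows (our own illustration; it inlines a compact version of the kNN conversion of Algorithm 1 and assumes the labels in y are coded 1..C):

```python
def knn_convert(X, s, k):
    # Undirected kNN conversion (Algorithm 1), 0-based indices.
    n = len(X)
    knn = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: s(X[i], X[j]), reverse=True)
        knn.append(set(ranked[:k]))
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if j in knn[i] or i in knn[j]}

def build_leg(X_l, X_u, y, s, k):
    # LEG construction (Algorithm 2): training nodes 1..n, testing
    # nodes n+1..n+m, class nodes n+m+1..n+m+C.
    n, m, C = len(X_l), len(X_u), max(y)
    X = list(X_l) + list(X_u)
    V_l = set(range(1, n + 1))
    V_u = set(range(n + 1, n + m + 1))
    V_c = set(range(n + m + 1, n + m + C + 1))
    # Membership edges: training node i links to class node n+m+y_i.
    E_c = {(i, n + m + y[i - 1]) for i in V_l}
    # Similarity edges from the kNN conversion, shifted to 1-based ids.
    E_s = {(i + 1, j + 1) for (i, j) in knn_convert(X, s, k)}
    return (V_l, V_u, V_c), E_s | E_c
```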


There are always n edges belonging to E_c. The number of edges in E_s, however, has an upper and a lower bound. The minimum number of possible edges in E_s is obtained when the kNN relation is symmetric for every pair of points in X (X = X^(u) ∪ X^(l)), meaning that ∀i, j: i ∈ kNN(j) ↔ j ∈ kNN(i). The maximum number of edges in E_s, on the other hand, is obtained when the kNN relation is not symmetric for any pair of nodes in X. Using these, the bounds on the number of edges in a LEG can be derived as Eq. (9):

    n + (k/2)(n + m) ≤ |E| ≤ n + k(n + m)    (9)

By the bounds in Eq. (9), it can be stated that G gives us a new low memory cost representation of the data. The memory for the original data is of O(n×d + m×d + n) for X^(l), X^(u) and y, but since it is usually the case that k << d for high dimensional data, LEG saves a lot of memory compared to using the original data for the task of classification.

Another aspect of LEG is the fact that we are incorporating all of our labeled data, unlabeled data and class labels in a unified structure, which enables us to find the labels of the test data via simple and efficient graph properties, specifically the link prediction methods covered in the next segment.

4.2. Classification

As stated before, in classification the goal is to find a mapping X_i^(u) → ŷ_i for every i ∈ {1, ..., m}. Using the LEG representation, this problem can be reformatted as finding, for each i ∈ V_u, the j for which the probability of (i, j) ∈ E_c is maximized.

The new formulation means that edges will be added to the set E_c by predicting the most probable membership link for every test node. This can easily be done via the link prediction methods discussed before.

Using a local similarity measure λ for link prediction (e.g. the Adamic-Adar index), this problem can be solved using the following:

    ∀i ∈ V_u:  E_c ← E_c ∪ (i, j*),  where j* = argmax_{j ∈ V_c} λ_{i,j}    (10)

Although more complex link prediction methods (random walk, average path length, etc.) can be used to solve the problem, the local similarity measures are not only extremely fast and efficient to compute, but also derive competitively accurate results, as will be discussed in the experiments. The pseudocode of CULP is depicted in Algorithm 3.

Algorithm 3: CULP Algorithm.

    function CULP(X, X^(u), y, s, k, λ)
        G ← LEG(X, X^(u), y, s, k)
        ŷ ← {}
        for i ∈ V_u do
            j* ← argmax_{j ∈ V_c} λ_{i,j}
            ŷ_i ← j* − (n + m)
        end for
        return ŷ
    end function
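Algorithm 3 then reduces to a few lines. A sketch under the same conventions as our earlier illustrations (nodes 1..n are training, n+1..n+m testing, n+m+1..n+m+C classes; `predictor` is any λ, here Common Neighbors):

```python
def culp_classify(V_u, V_c, E, n, m, predictor):
    # CULP (Algorithm 3): connect each test node to the class node with
    # the highest link prediction score; the label is that class node's
    # id minus the offset n + m.
    adj = {}
    for a, b in E:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    y_hat = {}
    for i in V_u:
        j_star = max(V_c, key=lambda j: predictor(adj, i, j))
        y_hat[i] = j_star - (n + m)
    return y_hat

def cn(adj, i, j):
    # Common Neighbors (Eq. 2) as the link predictor.
    return len(adj.get(i, set()) & adj.get(j, set()))
```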

4.3. Compatibility score

In this work, a novel local score for link prediction is formed which is designed specifically for the task of classification. This new similarity function is called Compatibility Score and, like the Adamic-Adar and Resource Allocation scores, penalizes the common neighbors; however, this penalization is done differently.

Fig. 1. Using AA or RA for predicting the formation of (i, j_1) in both LEGs would result in the same score; however, node γ in the first case is more valuable for the prediction.

Both the AA and RA scores can be unfair in some instances, meaning that they can over-penalize a valuable common neighbor or give the same score to two inherently different nodes. Take the two LEGs in Fig. 1 for example (i ∈ V_u; γ, a, b, c ∈ V_l; and j_1, j_2 ∈ V_c). In both cases, the goal is to find the score for the (i, j_1) link. AA and RA would both penalize node γ in the same way (a penalty of 5 for RA and log(5) for AA); however, in the first LEG the node γ is more valuable than that of the second LEG, due to the fact that three neighbors of this node (a, b, c) are also connected to node j_1.

When trying to predict the score for the formation of a link between nodes i and j with a common neighbor between them, namely γ, two sets of edges can be defined starting from γ: compatible edges and incompatible edges.

Compatible edges for node γ are the ones connecting γ to nodes which are themselves connected to the destination of the candidate link (j in this case). We can define incompatible edges as all the other edges, which are not compatible.

Now the cardinality of the incompatible edges, or the incompatibility penalty, for node γ (a common neighbor of nodes i and j) can be defined as the following:

    δ(i, j, γ) = |Γ(γ)| − |Γ(γ) ∩ Γ(j)|    (11)

Using Eq. (11), the Compatibility Score (CS for short) is formally defined as Eq. (12). In this equation, both δ(i, j, γ) and δ(j, i, γ) are used for the prediction of (i, j) to make the score symmetric, so that λ_{i,j} = λ_{j,i}:

    λ_{i,j} = Σ_{γ ∈ Γ(i)∩Γ(j)} [ 1/δ(i, j, γ) + 1/δ(j, i, γ) ]    (12)

Using the Compatibility Score for the cases of Fig. 1, the score for link (i, j_1) can be computed as 0.7 in LEG 1 and 0.4 in LEG 2. This is the desired outcome, as the score in LEG 1 is now higher. In the experiments, a more detailed comparison of CS with other link prediction methods is done.
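Eqs. (11) and (12) transcribe directly into Python. This is our own sketch; note that when the link (i, j) is absent, the edge from γ back to i is itself incompatible with respect to j (and symmetrically for j with respect to i), so each δ is at least 1 and the divisions are safe:

```python
def compatibility_score(adj, i, j):
    # CS (Eq. 12): for every common neighbor g of i and j, add
    # 1/delta(i, j, g) + 1/delta(j, i, g).
    def delta(dest, g):
        # Eq. (11): number of g's edges whose endpoints are NOT
        # neighbors of the candidate link's destination `dest`.
        return len(adj[g]) - len(adj[g] & adj[dest])
    return sum(1.0 / delta(j, g) + 1.0 / delta(i, g)
               for g in adj[i] & adj[j])
```

In the small graph below, the lone common neighbor has four edges, two of which point into the destination's neighborhood, giving δ = 2 on one side and δ = 4 on the other, hence a score of 1/2 + 1/4.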


4.4. Time complexity analysis

In this subsection, the time complexity of finding the class membership edge of a test node is analyzed. The main component in finding the correct link is the local similarity measure λ which is used for link prediction. These local measures find the score in time proportional to the degrees of their source and destination nodes. In CULP, the source node i belongs to V_u and the destination node j belongs to V_c. So the first step in analyzing the time of finding a class membership edge is finding the average degrees of nodes in V_u and V_c.

The degree of node j is the number of labeled nodes connected to it, or more specifically n_j, the number of data points with the class of node j; however, for the degree of i a more detailed analysis is needed. As stated before, Eq. (1) holds in any undirected graph and can be rewritten as the following:

    |E| = (1/2) [ Σ_{i ∈ V_c} |Γ(i)| + Σ_{i ∈ V_l} |Γ(i)| + Σ_{i ∈ V_u} |Γ(i)| ]    (13)

Since the degrees of the class nodes sum up to the number of labeled data n, it can be substituted in the above equation; on the other hand, if we treat each node in V_u as having average degree D, we can state that nodes in V_l would have average degree D + 1 (since each of them also has a membership edge). Using all these, the above formula can be rewritten in the following manner:

    |E| = (1/2) (n + n(D + 1) + mD)

    |E| = n + nD/2 + mD/2    (14)

As stated before, the number of edges in a LEG is bounded by an upper and a lower bound, derived in Eq. (9). Now, using Eqs. (14) and (9), the upper bound of D can be obtained from:

    n + nD/2 + mD/2 = k(n + m) + n  ⟹  D = 2k    (15)

and its lower bound from:

    n + nD/2 + mD/2 = (k/2)(n + m) + n  ⟹  D = k    (16)

Consequently, the average degree of labeled and unlabeled nodes is O(k), and that of class nodes is O(n). Common Neighbors, Adamic-Adar and Resource Allocation all have the complexity of finding the common neighbors of the source and destination, which is the intersection of the neighborhoods of the two nodes. The Compatibility Score, however, first finds the common neighbors and then does two intersections for each of the nodes in the common neighbor set.

If done efficiently, the intersection of two sets with sizes a and b can be obtained in O(min(a, b)) on average. Using this, the complexity of finding the score in a LEG for the formation of a link between i and j is O(k) when Common Neighbors, Adamic-Adar or Resource Allocation is used, and O(k²) when Compatibility Score is used. Since k is usually small (in our experiments 1 ≤ k ≤ 35), it is safe to state that the link prediction is done in constant time; also, as there are C nodes in V_c, predicting the labels of m instances takes O(mC) time after creating the LEG.

Fig. 2. Toy example demonstrating CULP. A: the set of data belonging to 2 classes and a test point in red. B: LEG graph of the data.

4.5. Toy example

In this subsection, a simple classification problem is solved using CULP to demonstrate the steps involved in this algorithm. The data is presented in Fig. 2-A as two classes. The white points represent the data of class 1 and the dark points belong to class 2. The problem is finding the correct label of the red point (point i).

The first step is choosing a similarity function s and a value for the parameter k for forming the graph. Here we chose k = 2 and the Euclidean similarity (discussed in the preliminaries section).

Now the node sets can be defined as V_c = {j_1, j_2}, V_u = {i}, and all the other points as the set V_l. By creating the edges in E_c and E_s as shown in Algorithm 2, the LEG in Fig. 2-B can be derived. As can be seen, in this graph every node except for i is connected to one of the class nodes j_1 and j_2 (white nodes) by dotted links, and the black links represent the edges of E_s.

Looking at the graph, it can be seen that node i is connected to nodes a, b and c. This means these nodes would assist in finding the label for node i. Using these nodes, the scores for the edges (i, j_1) and (i, j_2) can be obtained with each of the scores discussed before as λ. The results of computing these scores are depicted in Table 1.

The results of all the link predictors in Table 1 show that the score for the link (i, j_2) is higher. This prediction matches the pattern perceived by looking at the data in Fig. 2-A and is the correct prediction.


Table 1
Scores computed by 4 different link predictors for the toy example of Fig. 2.

    λ     (i, j_1)       (i, j_2)        Prediction
    CN    1              2               2
    AA    1/log(4)       2/log(3)        2
    RA    1/4            2/3             2
    CS    1/2 + 1/4      2(1/2 + 1/3)    2

5. CULM extension

As we stated in the time complexity analysis subsection and demonstrated in the toy example of the previous section, once the LEG structure is formed, the prediction of links can be done instantly. Knowing this, and the fact that there are different options in choosing the link predictor λ, the question arises: why not use all of our predictors and somehow combine their predictive capabilities to assist us in finding the best membership link for a test node?

The next question arises after we analyze the related works done in the field of classification using complex network representations. A good portion of these methods are capable of incorporating or exploiting the low level features of the data to enhance the classification performance. How can we modify our framework CULP to exploit the low level features of the data as well as the high level features?

The answer to both of these questions lies in our extension to the CULP algorithm, which we call the CULM extension. CULM increases the predictive capabilities of CULP by using a weighted majority vote procedure (hence the M, as in Majority, at the end instead of P).

Instead of using only one link predictor λ, we will use an array Λ of link predictors. Each link predictor λ, when used, gives a score to the links (i, j) for all j ∈ V_c. We can use all of these scores to estimate the probability p of our prediction's correctness, as Eq. (17):

    p_ŷ = λ_{i,j*} / Σ_{j ∈ V_c} λ_{i,j}    (17)

In this equation yˆ is the label corresponding to j and j is computedusingEq.(10)oftheprevioussection.UsingEq.(17)we canassignconfidencetothepredictionof

λ

.Whenusingmultiple

predictors, it isobvious that a

λ

with higherconfidence ismore

reliable.Weare goingtousetheseprobabilities toassignweights toeachofthe

λ

sin .Thiswayinsteadofusingasimplemajority

vote, a weighted voting procedure can be used. In a weighted majority vote procedure,few predictions are aggregated. Eachof these predictionhas an individual weight whichstates the value of their vote;finally thevoting inthis settingwould be done as Algorithm4.
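The confidence of Eq. (17) is simply the winning class's score normalized over all candidate class links. A minimal sketch, assuming a predictor's scores arrive as a mapping from class label to score (the function name is illustrative):

```python
def prediction_confidence(scores):
    """Eq. (17) sketch: score of the winning class link divided by the
    sum of the scores lambda assigns to all candidate class links."""
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())
```

For example, scores of {'A': 2, 'B': 1, 'C': 1} give the prediction 'A' with confidence 2 / 4 = 0.5.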

Algorithm 4 Weighted Majority Voting Algorithm.

function VOTE(Y, W)
    L ← {0}^C
    for y ∈ Y and w ∈ W do
        L_y ← L_y + w
    end for
    ŷ ← argmax(L)
    return ŷ
end function

In Algorithm 4, Y is the set containing the predicted labels of each of the predictors, W is the respective weights of the labels, and L is a set with C elements which keeps track of the accumulated weight for each of the classes. Using this algorithm enables us not only to use the labels predicted by multiple link predictors, but also to incorporate any classical classifier ψ with a suitable weight. This way, the low level features of the data are exploited as well.
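Algorithm 4 can be sketched in a few lines of Python; this is an illustrative rendering, not the paper's code, with the class accumulator L realized as a dictionary:

```python
from collections import defaultdict

def vote(labels, weights):
    """Weighted majority vote (sketch of Algorithm 4): add each
    prediction's weight into its class bucket, then return the class
    with the largest accumulated weight."""
    totals = defaultdict(float)
    for y, w in zip(labels, weights):
        totals[y] += w
    return max(totals, key=totals.get)
```

For instance, vote([0, 1, 1, 0], [0.5, 0.3, 0.1, 0.2]) accumulates 0.7 for class 0 and 0.4 for class 1, so class 0 wins even though the raw vote count is tied.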

The next step is to define the weights for each of our predictors and for ψ. If ŷ_λ is the predicted label of the predictor λ for the unlabeled data x(u) and p^λ_ŷ is the probability of this prediction, the weight of predictor λ for x(u) can be defined as Eq. (18). Also, for the prediction of ψ on x(u), which can be denoted as ŷ_ψ, we can define the weight as Eq. (19).

w^λ_ŷ = α · p^λ_ŷ / Σ_λ p^λ_ŷ    (18)

w^ψ_ŷ = 1 − α    (19)

The α parameter, which is used in both equations, is provided by the user. This parameter controls the trade-off that CULM will make between the link predictors' labels and the prediction of the low level classifier.

The parameter α is chosen in the range 0 to 1; however, any value below 0.5 would result in neutralizing the vote of the CULM predictors. Also, if α = 1, the prediction is completely done by the CULM predictors and the low level classifier is ignored; so in general it can be stated that 0.5 ≤ α ≤ 1.
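Eqs. (18) and (19) together split a total weight of 1 between the link predictors (α, distributed proportionally to their confidences) and ψ (1 − α). A minimal sketch, with a hypothetical function name:

```python
def culm_weights(link_probs, alpha):
    """Sketch of Eqs. (18)-(19): each link predictor's confidence p is
    normalized over all predictors and scaled by alpha; the low level
    classifier psi receives the remaining 1 - alpha."""
    total = sum(link_probs)
    predictor_weights = [alpha * p / total for p in link_probs]
    psi_weight = 1 - alpha
    return predictor_weights, psi_weight
```

Note that the predictor weights always sum to α, so a very confident low level classifier can only dominate when the user lowers α toward 0.5.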

Now the CULM extension can be formally defined as the procedure captured in Algorithm 5. In this algorithm, after creating

Algorithm 5 CULM Algorithm.

function CULM(X, X(u), y, s, k, Λ, ψ, α)
    G ← LEG(X, X(u), y, s, k)
    ŷ ← {}
    for i in V_u do
        P ← {}
        Ŷ ← {ψ(X_i(u))}
        W ← {1 − α}
        for λ in Λ do
            j* ← argmax_{j∈Vc}(λ_{i,j})
            P ← P ∪ λ_{i,j*} / Σ_{j∈Vc} λ_{i,j}
            Ŷ ← Ŷ ∪ (j* − (n + m))
        end for
        for p ∈ P do
            W ← W ∪ α × p / Σ_{p∈P} p
        end for
        ŷ_i ← VOTE(Ŷ, W)
    end for
    return ŷ
end function

the LEG, each of the predictors in Λ produces a label and a probability. These probabilities and labels are then merged with those of the low level classifier ψ to form Y and W, which are passed to Algorithm 4 to produce the final label for the test instance.
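As an illustration of how this inner loop combines ψ with the link predictors, below is a minimal Python sketch for a single test node. The dictionary-of-scores input, the function name and the toy values are hypothetical, not the paper's API; the weighting follows Eqs. (17)-(19) and the final step is the weighted vote of Algorithm 4.

```python
def culm_predict(link_scores_per_predictor, psi_label, alpha):
    """Sketch of one iteration of Algorithm 5's outer loop.
    link_scores_per_predictor: one dict {class_label: score} per link
    predictor; psi_label: the low level classifier's prediction."""
    labels = [psi_label]
    confidences = []                      # p for each predictor, Eq. (17)
    for scores in link_scores_per_predictor:
        best = max(scores, key=scores.get)
        labels.append(best)
        confidences.append(scores[best] / sum(scores.values()))
    total = sum(confidences)
    # psi gets 1 - alpha (Eq. 19); predictors share alpha (Eq. 18)
    weights = [1 - alpha] + [alpha * p / total for p in confidences]
    tally = {}                            # weighted vote (Algorithm 4)
    for y, w in zip(labels, weights):
        tally[y] = tally.get(y, 0.0) + w
    return max(tally, key=tally.get)
```

With alpha = 0.9, two link predictors favoring class 'A' outweigh a low level classifier voting 'B', since ψ contributes only 0.1 of the total weight.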

As analyzed, the time complexity of predicting the labels of m instances using CULP is O(mC). CULM inherently repeats the prediction l times, with l being the number of link predictors in Λ, and then uses a majority vote. The prediction complexity is O(lmC) and the voting has the complexity of O(l); therefore, we can identify the CULM time complexity to be O(lmC + l + O(ψ)), with the O(ψ) part being the complexity of the low level classifier. Clearly, the overall time for CULM could differ greatly depending on the classifier used.


Table 2

Datasets used in deriving the results for CULP and CULM.

Dataset Instances Attributes Classes

Zoo 101 16 7

Hayes 132 4 3

Iris 150 4 3

Teaching 151 5 3

Wine 178 13 3

Sonar 208 60 2

Image 210 19 7

Glass 214 9 6

Thyroid 215 5 3

Ecoli 336 7 8

Libras 360 90 15

Balance 625 4 3

Pima 768 8 2

Vehicle 846 18 4

Vowel 990 10 11

Yeast 1,484 8 10

RedWine 1,599 11 6

Segment 2,100 19 7

Optical 5,620 64 10

Poker 25,010 10 10

6. Experimental results

In this section, we present the results of our proposed algorithms CULP and CULM on 20 different real datasets and compare them to classical classification methods as well as to the best classifiers of the related works in the domain of classification using complex networks.

The datasets used for our experiments are all obtained from the UCI machine learning repository [19]. These datasets include Zoo, Hayes-Roth (Hayes), Iris, Teaching Assistant Evaluation (Teaching), Wine, Sonar Mines vs. Rocks (Sonar), Image Segmentation training set (Image) and testing set (Segment), Glass Identification (Glass), Thyroid Disease (Thyroid), Ecoli, Libras Movement (Libras), Balance Scale (Balance), Pima Indians Diabetes (Pima), Statlog Vehicle Silhouettes (Vehicle), Vowel Recognition (Vowel), Yeast, Wine Quality Red (RedWine), Optical Recognition of Handwritten Digits (Optical) and Poker Hand (Poker). Each of these datasets, along with its number of instances, attributes and classes, is listed in Table 2.

6.1. CULP analysis

The reason behind choosing these datasets is the variety of both structure and domain among them. The size of these datasets ranges from 101 to 25,010 instances, which tests the practicality of our algorithms on both small and large datasets; the number of attributes varies from 4 to 90, which tests the proposed algorithms against both low and high dimensional datasets; and finally, there is a lot of variety in the number of classes in the datasets, which ranges from 2 up to 15.

This section is organized as follows: first, the experiment on CULP with different predictors as λ is presented; after that, the CULM algorithm is analyzed with 3 different low level classifiers; the following subsection discusses the effects of the α parameter; after that, a comparison of CULP and CULM with classical classifiers is demonstrated; and finally, CULP and CULM are compared against all the classical approaches and the similar works on classification using complex networks.

As the first experiment, different link predictors are used in CULP to compare the performance of each one on the datasets. For this experiment, the predictor λ is one of CN, AA, RA and CS, which are respectively defined in Eqs. (2), (3), (4) and (12).

The parameters used in CULM are k (1 ≤ k ≤ 35), λ (the link predictor, which is Common Neighbors, Resource Allocation, Adamic-Adar or Compatibility Score), the vector similarity function s, and α (0.5 ≤ α ≤ 1). For each link predictor and each dataset, the parameters are tuned. This tuning is done via a 10-Fold Cross Validation procedure. After finding the best parameters, 30 runs of 10-Fold Cross Validation are done, which amount to a total of 300 runs. Table 3 captures the results obtained with these settings.

In each cell of Table 3, the first number is the mean accuracy of the runs and the second number is their standard deviation. The number in the parentheses represents the best k obtained for each cell, and the bold cells are the best results obtained on a dataset.

As can be seen in Table 3, the Compatibility Score achieved the best results among the predictors; this is due to the fact that CS exclusively got the highest accuracy on the 6 datasets of Glass, Libras, Balance, Pima, Yeast and RedWine. In second place is the Resource Allocation index, which obtained the top accuracy for Zoo, Iris, Ecoli, Optical and Poker exclusively and achieved an identical best accuracy with the Adamic-Adar score on the Vowel dataset. The third best predictor is Common Neighbors, with the 5 datasets of Hayes, Teaching, Sonar, Thyroid and Vehicle on top, and finally Adamic-Adar for Wine, Image and Segment, plus the best result shared with RA for Vowel.

Analyzing the ks in these experiments, we can see that for the 10 datasets of Zoo, Hayes, Iris, Teaching, Wine, Image, Thyroid, Libras, Vehicle and Poker, the best k is identical for each predictor on a dataset; in Balance and Pima, however, the ks are noticeably different, with Common Neighbors having the highest k in both of them. In the rest of the datasets, the choice of k among different predictors differs by at most 1 (for Yeast it is 2).

6.2. CULM analysis

As the next experiment, the CULM algorithm is run on each of the datasets. The parameter α is tuned over the set {0.6, 0.7, 0.8, 0.9, 1}. Values below 0.6 for α are not used, to keep the results and comparisons fair (as stated before, any value below 0.5 for α zeros the effect of the CULP predictors, and experimentally the same holds for α = 0.5); this way we are sure that the link predictors are not completely overshadowed by the low level classifier. The other parameters of the algorithm are tuned as before, and again each cell is the result of 300 runs.

For a low level classifier to accompany the link predictors in CULM, three different algorithms have been chosen and used. These low level classifiers are LDA (Linear Discriminant Analysis), CART (Classification And Regression Trees) and multi-class SVM (Support Vector Machine) with RBF kernel.

Table 4 captures the results of this experiment. The first column holds the best results for each of the datasets using CULP (Table 3); the next three columns are the results of CULM with respectively LDA, CART and SVM as ψ, and in each of the cells in these columns the numbers in parentheses represent the k and α used in the runs. The last column in this table represents the accuracy gain achieved by using CULM instead of CULP. Each of the numbers in this column is obtained by comparing the best result obtained by CULM with the best result obtained by CULP for each dataset.

Looking at Table 4, it is clear that on the Thyroid dataset using CULM achieved no change in accuracy, and on the datasets Iris and Optical the accuracy deteriorates; however, on the other 17 datasets, CULM achieved a higher result.

CULM with SVM as its low level classifier achieved the best results on the 6 datasets of Sonar, Libras, Balance, Vowel, RedWine and Poker exclusively, and shares the best result on Thyroid with CULM-LDA and CULP. As the next best classifiers we have both CULM-CART and CULM-LDA, each with 5 exclusive best accuracy
