
3.8 Evaluation Methodology

3.8.3 Experiment Setup

The experiment is set up with 50% labeled data, and the results are averaged over 5 different train-test splits. We also extensively searched for optimal hyper-parameter values for all the models, using 20% of the training data as a validation set. More details on the hyper-parameter search space are provided next.
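A minimal sketch of this evaluation protocol, assuming scikit-learn and single-label nodes (the helper name `make_splits` is ours, not from the original implementation):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def make_splits(labels, n_splits=5, seed=0):
    """Yield (train, val, test) node-index splits: 50% of nodes are labeled
    for training, 20% of that training portion is held out for validation,
    and the remaining 50% of nodes form the test set."""
    labels = np.asarray(labels)
    nodes = np.arange(len(labels))
    for s in range(n_splits):
        train_idx, test_idx = train_test_split(
            nodes, test_size=0.5, stratify=labels, random_state=seed + s)
        train_idx, val_idx = train_test_split(
            train_idx, test_size=0.2, stratify=labels[train_idx],
            random_state=seed + s)
        yield train_idx, val_idx, test_idx
```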

Metrics. We report classification performance with Micro-F1 scores. Additionally, we define two aggregate metrics, viz. Rank and Penalty [123], to measure the overall performance of models across datasets. The Rank of a model is defined as the average position of the model when the results on each dataset are ordered in descending order. The Penalty of a model is defined as the average difference from the best-performing model on each dataset. The lower the Rank and Penalty, the better the performance. Let E and D be the sets of all embedding methods and datasets respectively, with e and d denoting members of these sets. R_{e,d} is the rank of an embedding method e on a dataset d when all competing methods are ranked on that dataset by their Micro-F1 (%) scores. Similarly, S_{e,d} is the Micro-F1 (%) score achieved by an embedding method e on a dataset d, so the Penalty measures how much e diverges from the best-performing model on d. The formulae for Mean Rank and Mean Penalty are,

$$MR_e = \frac{1}{|D|}\sum_{d \in D} R_{e,d} \qquad (3.12)$$

$$MP_e = \frac{1}{|D|}\sum_{d \in D}\Big(\max\{S_{e',d} : e' \in E\} - S_{e,d}\Big) \qquad (3.13)$$
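For illustration, both aggregates can be computed from a table of per-dataset Micro-F1 (%) scores; the sketch below assumes a pandas DataFrame `scores` with one row per embedding method and one column per dataset (names are ours, not from the original code):

```python
import pandas as pd

def mean_rank_and_penalty(scores: pd.DataFrame) -> pd.DataFrame:
    """scores: rows = embedding methods, columns = datasets, values = Micro-F1 (%)."""
    ranks = scores.rank(axis=0, ascending=False)   # R_{e,d}: 1 = best on dataset d
    mr = ranks.mean(axis=1)                        # Mean Rank, Eqn (3.12)
    penalty = scores.max(axis=0) - scores          # max_{e'} S_{e',d} - S_{e,d}
    mp = penalty.mean(axis=1)                      # Mean Penalty, Eqn (3.13)
    return pd.DataFrame({"Mean Rank": mr, "Mean Penalty": mp})
```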

Our model, USS-NMF, is clearly the overall winner across all the tasks [refer to the Performance Analysis in Section 5.9, Tables 3.4 and 3.5, and the other ablation studies]. However, we resort to these two aggregate scores to measure the consistency of all the models across datasets. We also test statistical significance with the Wilcoxon signed-rank test [124], the established test for comparing two models on multiple datasets.
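The test itself is available in SciPy; a sketch with hypothetical per-dataset Micro-F1 (%) scores for two models:

```python
from scipy.stats import wilcoxon

# Hypothetical Micro-F1 (%) values, aligned so that position i of both lists
# refers to the same dataset.
model_a = [72.1, 65.4, 80.3, 58.9, 69.7, 74.2]
model_b = [70.8, 64.9, 79.5, 57.2, 68.8, 73.0]

stat, p_value = wilcoxon(model_a, model_b)
print(f"W = {stat:.2f}, p = {p_value:.4f}")  # a small p-value indicates a significant difference
```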

Classifier. We learn an external Logistic Regression (LR) classifier to make predictions from a model's learned node representations. Although we can obtain label predictions internally for the supervised models by reconstructing the label matrix as in Eqn 3.2, i.e., by multiplying the label and node embeddings QU, we found that training a classifier on the node embeddings further improves performance.
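A sketch of this two-stage evaluation with scikit-learn (the random placeholders stand in for actual learned embeddings and labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Placeholders: U would be the learned node-embedding matrix and y the node labels.
rng = np.random.default_rng(0)
U = rng.normal(size=(200, 128))          # n_nodes x embedding dimension
y = rng.integers(0, 4, size=200)         # 4 hypothetical classes
train_idx, test_idx = np.arange(100), np.arange(100, 200)

clf = LogisticRegression(max_iter=1000)
clf.fit(U[train_idx], y[train_idx])
pred = clf.predict(U[test_idx])
print("Micro-F1:", f1_score(y[test_idx], pred, average="micro"))
```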

Implementation Details. The details of the hyper-parameters for our model and the baselines are provided below. The hyper-parameter search space for the different components of all the models experimented with here is tabulated in Table 3.2.

Matrix Factorization based methods:

| Co-efficients | NMF:S | NMF:S+Y | MMDW | MNMF | MNMF+Y | MF-Plan | NMF:S+Y+LS(S,Y) |
|---|---|---|---|---|---|---|---|
| Network | 0.1-10.0 | 0.1-10.0 | 0.1-10.0 | 0.1-10.0 | 0.1-10.0 | 0.1-10.0 | 0.1-10.0 |
| Label | NA | 0.1-10.0 | Max-margin loss based biased gradient: e^[-1, -2, -3, -4, -5] | NA | 0.1-10.0 | 0.1-10.0 | 0.1-10.0 |
| Cluster Factorization | NA | NA | NA | 0.1-10.0 | 0.1-10.0 | NA | NA |
| Cluster Learning | NA | NA | NA | 0.1-10.0 | 0.1-10.0 | NA | NA |
| Cluster Orthogonality | NA | NA | NA | 1e+(0, 4, 8) | 1e+(0, 4, 8) | NA | NA |
| Graph Laplacian Reg | NA | NA | NA | NA | NA | 0.1-10.0 | 0.1-10.0 |
| L2 Regularization | 0.1-10.0 | 0.1-10.0 | 0.1-10.0 | 0.1-10.0 | 0.1-10.0 | 0.1-10.0 | 0.1-10.0 |
| #Clusters | NA | NA | NA | #Labels(-1, +2) | #Labels(-1, +2) | NA | NA |
| #Experiments | 25 | 125 | 125 | 110 | 130 | 130 | 130 |

Random Walk / Other methods:

| Co-efficients | DeepWalk | ComE | Gemsec |
|---|---|---|---|
| p | [1.0] | NA | [0.1, 0.3, 0.5, 0.7, 1.0, 3.0, 5.0, 7.0, 10.0] |
| q | NA | Network: 0.1-10.0 | Same as p |
| Walk-Length | 80 | NA | 80 |
| No of Walks | 40, 80 | NA | 40, 80 |
| Learning Rate | NA | [0.001, 0.025, 0.625, 0.1] | Initial LR: [0.01, 0.1]; Minimal LR: [0.0001, 0.001] |
| Community Learning | NA | 0.1-10.0 | [0.01, 0.1, 1.0] |
| L2 Regularization | NA | NA | 0.1-10.0 |
| #Clusters | NA | #Labels(-1, +2) | #Labels(-1, +2) |
| #Experiments | 2 | 125 | 204 |

Table 3.2 Hyper-parameter range search for Baselines

The range 0.1-10 refers to the set [0.1, 0.5, 1.0, 5.0, 10.0]. We selected 25 values for k as #Labels(-1, +2), i.e., increasing by 2 at the upper end and decreasing by 1 at the lower end from a dataset's actual number of labels q (inclusive).

Datasets are grouped by size: small (|V| ≤ 1k) and large (|V| > 1k).

| Co-efficients | USS-NMF (Effective range): Small | USS-NMF (Effective range): Large | USS-NMF (Entire range): All Datasets |
|---|---|---|---|
| Network | 1, 5 | 1, 5, 10 | 0.1-10 / [0.1, 0.5, 1.0, 5.0, 10.0] |
| Label | 0.1, 1 | 0.1, 0.5, 1 | 0.1-10 |
| Cluster Factorization | 0.1, 1 | 0.1, 0.5, 1 | 0.1-10 |
| Cluster Learning | 10 | 10 | 0.1-10 |
| Cluster Orthogonality | 1e+(0, 4) | 1e+8 | 1e+(0, 4, 8) |
| Graph Laplacian Regularization | 0.5, 1 | 0.5, 1 | 0.1-10 |
| L2 Regularization | 1 | 1 | 0.1-10 |
| #Clusters | #Labels | #Labels | #Labels(-1, +2) |
| #Experiments | 32 (Full search) | 54 (Full search) | 150 (Partial search) |

Table 3.3 Hyper-parameter search space for USS-NMF

We selected 25 values for k as #Labels(-1, +2), i.e., increasing by 2 at the upper end and decreasing by 1 at the lower end from a dataset's actual number of labels q (inclusive). We also report a generic range that works effectively in most cases. See the results in Table 3.10.
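As a small illustration of the #Labels(-1, +2) notation used in both tables, the candidate cluster counts for a dataset with q labels could be generated as follows (a sketch, not the authors' code):

```python
def cluster_grid(q):
    """Candidate cluster counts k for a dataset with q labels: q-1 .. q+2 inclusive."""
    return list(range(q - 1, q + 3))

print(cluster_grid(6))  # [5, 6, 7, 8]
```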

Lagrange Multipliers' Range. We first provide the details of the hyper-parameter search for the Lagrange multipliers, followed by model-specific details. For all the matrix factorization baselines, we vary the hyper-parameter values (the respective weights of each component in the objective function) in [0.1, 0.5, 1.0, 5.0, 10.0], except for Wikipedia.

For Wikipedia, we found that network information is far more important than the other supervision knowledge, so we varied the network coefficient in the range [10000, 1000, 100, 10] with the other weights in [0.001, 0.01, 0.1, 1.0]. We fixed the embedding dimension at 128 for all datasets except Blogcatalog, for which the dimension is set to 4096.
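A sketch of how such a grid could be enumerated (the coefficient names are illustrative, not the exact names in our implementation):

```python
from itertools import product

grid = [0.1, 0.5, 1.0, 5.0, 10.0]   # the "0.1-10.0" set used for most datasets

# Each combination of (network, label, L2) weights is one configuration to
# train and score on the validation split.
configs = [
    {"network": a_net, "label": a_lab, "l2": a_reg}
    for a_net, a_lab, a_reg in product(grid, repeat=3)
]
print(len(configs))  # 125, e.g. the #Experiments listed for NMF:S+Y in Table 3.2
```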

DeepWalk and MFDW. For the original random-walk based DeepWalk, we set the window size to 5. We also include MFDW (a.k.a. NMF:S), whose objective function for Matrix Factorized DeepWalk is given in Eqn 3.1, as we build our model incrementally on top of it.

Max-Margin DeepWalk (MMDW). In MMDW [58], a max-margin loss is incorporated into the objective function of MFDW to learn discriminative representations of vertices. It has one important hyper-parameter, the alpha-bias (η), which induces a max-margin loss-based bias into the random walk.

NMF:S+Y. We build a variant of MMDW that also incorporates supervised information into the node embeddings by jointly optimizing Eqn 3.1 and Eqn 3.2. It performs competitively with MMDW.

Planetoid and NMF:Planetoid. Planetoid [22] learns an embedding space for nodes by jointly enforcing label and neighborhood similarity; it uses random walks to enforce structural similarity. We derive a matrix-factorized version of Planetoid as an alternative baseline. It enforces the matrix E, i.e., train-label similarity, on the embedding space U, unlike ours in Eqn 3.5, which enforces label similarity on the cluster space.

$$\mathcal{O}_{\text{NMF:Planetoid}} = \mathcal{O}_{\text{NMF:S+Y}} + \mathrm{Tr}\{U\,\Delta(E)\,U^{T}\} \qquad (3.14)$$
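A small NumPy sketch of the extra term, under the assumption that Δ(E) denotes the graph Laplacian D_E − E of the train-label similarity matrix E and that U is the d × n embedding matrix (consistent with the reconstruction QU in Eqn 3.2):

```python
import numpy as np

def planetoid_regularizer(U, E):
    """Tr{ U Δ(E) Uᵀ } with Δ(E) = D_E - E, the graph Laplacian of E.
    U: d x n embedding matrix; E: n x n train-label similarity matrix (assumptions)."""
    laplacian = np.diag(E.sum(axis=1)) - E
    return np.trace(U @ laplacian @ U.T)
```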

MNMF and MNMF+Y. We build a semi-supervised variant of MNMF, viz. MNMF+Y, by jointly optimizing its objective function along with Eqn 3.2. The original MNMF factorizes a combination of first-order and cosine-similarity based second-order node proximity to learn node representations. Here, for the sake of a fair comparison, we instead use a combination of first-order and second-order transition-probability based proximity as S, following MMDW [58], as we did for all other comparable methods.

USS-NMF. We used the same range of hyper-parameters as stated in the last column of Table 3.3, but instead of searching for the optimal combination over the entire range (which is cumbersome), we performed a partial range search in steps: 1) network + label information weights; 2) cluster matrix factorization + cluster learning weights + orthogonality constraint; 3) label smoothing + L2 regularization weights; and finally 4) the number of clusters k for each dataset, varied from its actual number of labels q with an increment of 2 at the upper end and a decrement of 1 at the lower end (inclusive). In the first step, we fixed all other variables at 1.0 and k = q. In later steps, we set the already-searched parameters to the optimal values found in the previous steps while varying the variables under consideration. In Table 3.3, we give an effective value range for each coefficient, applicable to datasets of varying size. We make the following points based on our observations: