5.9 Results
5.9.1 Transductive Link Prediction
Among the baselines, GRAIL is designed for inductive link prediction in KGs: it samples enclosing subgraphs between each (source, target) pair and uses an edge type-aware attentive GNN with an edge-dropout strategy to learn edge embeddings.
5.9.1.1 2-Hop random neighborhood sampling

Tables 5.4 and 5.5 show the transductive link prediction performance of the candidate methods for 2-hop negative triples.
AUC-ROC    ACM      DBLP     DDB      IMDB     PubMed   FreeBase
GCN        79.701   82.643   80.865   89.183   70.904   76.993
GAT        79.468   81.532   76.564   86.589   71.904   75.131
RGCN       75.707   82.908   82.338   73.429   76.335   77.133
GATNE      76.285   83.348   77.984   88.23    75.06    71.979
HetGNN     88.784   83.915   78.349   86.951   71.632   70.203
HGT        75.691   85.136   87.393   75.024   79.007   73.244
HGN        75.205   86.11    83.879   89.04    78.911   72.033
GRAIL      88.753   88.205   90.804   90.565   95.059   71.676
MV-HRE     97.15    92.064   97.45    92.323   96.778   84.453

AUC-PR     ACM      DBLP     DDB      IMDB     PubMed   FreeBase
GCN        92.435   83.008   74.707   87.101   68.543   74.753
GAT        92.122   74.158   77.389   84.202   71.141   75.976
RGCN       78.751   84.648   79.033   78.269   69.445   74.692
GATNE      77.438   83.415   78.849   86.562   74.397   74.935
HetGNN     90.057   85.696   79.483   86.15    74.271   74.898
HGT        78.466   80.883   87.168   77.508   71.252   72.252
HGN        76.553   81.798   84.464   88.578   75.165   76.648
GRAIL      89.075   86.013   91.131   89.649   95.309   76.435
MV-HRE     97.368   92.069   97.224   92.702   96.781   83.774

Table 5.4 Transductive link prediction classification scores (%): 2-hop negative triples
MRR        ACM      DBLP     DDB      IMDB     PubMed   FreeBase
GCN        96.702   92.548   92.789   93.048   83.2     87.198
GAT        97.227   94.322   87.058   92.548   89.359   86.715
RGCN       97.693   95.344   94.268   86.5     82.059   92.11
GATNE      97.972   91.605   90.729   90.797   86.047   85.588
HetGNN     97.756   93.24    89.478   93.32    87.5     86.393
HGT        94.916   94.283   94.764   85.709   88.328   87.739
HGN        94.051   93.667   93.323   95.68    96.745   87.923
GRAIL      99.821   94.997   98.883   96.212   97.794   90.419
MV-HRE     99.961   95.069   99.82    97.012   98.529   94.203

HITS@1     ACM      DBLP     DDB      IMDB     PubMed   FreeBase
GCN        94.209   91.159   86.73    92.177   68.359   74.879
GAT        95.38    94.741   75.394   93.267   79.13    73.913
RGCN       95.584   92.802   89.159   73.665   69.706   84.541
GATNE      96.44    93.255   88.493   91.995   74.121   71.981
HetGNN     95.881   86.585   83.9     86.682   75.52    73.43
HGT        91.088   88.772   90.238   72.214   77.366   76.738
HGN        89.511   92.37    87.045   91.446   93.647   76.329
GRAIL      99.842   94.942   97.796   92.432   95.588   81.159
MV-HRE     99.921   95.451   99.64    93.662   97.059   88.406

Table 5.5 Transductive link prediction ranking scores (%): 2-hop negative triples
GCN and GAT consistently perform well and outperform HGN on the ACM dataset for all compared metrics. As GRAIL uses a structurally expressive GNN [24, 25] learned over enclosing subgraphs, it performs effectively and is the second-best model in our evaluation. Since GRAIL and MV-HRE are optimized with ranking objectives, they also consistently perform well on the ranking metrics. MV-HRE outperforms the second-best methods across datasets by an average of 5.66% and 0.796% in AUC-ROC and MRR scores, respectively. At the same time, MV-HRE beats GRAIL by an average of 5.859% and 1.078% in AUC-ROC and MRR scores, respectively. We attribute this performance improvement to the enriched attentive contextual triplet representation learning.
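For reference, the classification scores (AUC-ROC, AUC-PR) and ranking scores (MRR, HITS@1) reported in these tables can be computed along the following lines. This is a minimal sketch using scikit-learn utilities and illustrative function names; it is not the exact evaluation code used in this chapter, and AUC-PR is approximated here by average precision.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def ranking_metrics(pos_score, neg_scores):
    """MRR and HITS@1 contribution of one positive triplet ranked
    against the scores of its negative (corrupted) triplets."""
    rank = 1 + int(np.sum(np.asarray(neg_scores) >= pos_score))
    return 1.0 / rank, float(rank == 1)

def classification_metrics(pos_scores, neg_scores):
    """AUC-ROC and AUC-PR (approximated by average precision) over the
    pooled scores of positive and negative triplets."""
    y_true = np.concatenate([np.ones(len(pos_scores)), np.zeros(len(neg_scores))])
    y_score = np.concatenate([pos_scores, neg_scores])
    return roc_auc_score(y_true, y_score), average_precision_score(y_true, y_score)

# Toy usage with made-up scores.
print(ranking_metrics(0.92, [0.41, 0.88, 0.95]))                      # (0.5, 0.0)
print(classification_metrics(np.array([0.9, 0.8]), np.array([0.3, 0.6])))  # (1.0, 1.0)
```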
5.9.1.2 1-Hop random neighborhood sampling
AUC-ROC    ACM      DBLP     DDB      IMDB     PubMed   FreeBase
GCN        91.745   96.396   95.246   95.175   94.436   91.854
GAT        88.288   96.189   92.19    93.991   91.641   92.405
RGCN       97.925   93.583   87.129   83.483   82.978   91.379
GATNE      97.14    92.572   90.865   95.728   85.014   81.409
HetGNN     96.299   94.855   89.32    96.79    81.655   86.794
HGT        98.009   93.62    91.438   86.128   84.312   86.131
HGN        98.064   97.098   94.089   98.102   93.52    91.794
GRAIL      88.68    96.393   90.036   97.657   97.593   86.692
MV-HRE     97.612   98.903   97.979   99.063   98.626   94.711

AUC-PR     ACM      DBLP     DDB      IMDB     PubMed   FreeBase
GCN        93.36    96.043   94.165   93.83    92.929   89.751
GAT        89.663   95.023   90.429   93.049   90.97    90.768
RGCN       97.887   93.442   84.46    83.183   82.231   91.2
GATNE      97.426   92.665   89.46    96.278   83.487   80.866
HetGNN     96.494   94.27    87.422   97.455   83.162   85.79
HGT        97.983   92.726   90.88    87.939   86.541   87.482
HGN        97.971   97.034   94.659   98.332   90.36    92.562
GRAIL      88.781   95.85    90.101   98.996   97.908   86.926
MV-HRE     97.548   98.445   97.798   99.745   98.93    93.884

Table 5.6 Transductive link prediction classification scores (%): 1-hop negative triples
MRR        ACM      DBLP     DDB      IMDB     PubMed   FreeBase
GCN        98.083   99.45    99.198   99.383   98.912   96.338
GAT        98.071   99.104   98.793   99.252   99.044   95.358
RGCN       99.894   97.278   95.732   91.675   91.042   98.51
GATNE      98.062   96.428   93.816   96.371   92.937   88.928
HetGNN     98.97    97.552   95.09    98.551   94.051   90.095
HGT        99.959   96.959   96.639   94.812   94.777   89.544
HGN        99.859   98.826   97.331   99.196   98.236   95.97
GRAIL      99.965   99.962   98.685   99.46    99.786   97.848
MV-HRE     99.982   99.981   99.843   99.947   99.92    98.834

HITS@1     ACM      DBLP     DDB      IMDB     PubMed   FreeBase
GCN        98.671   98.906   98.471   98.786   97.824   96.675
GAT        97.715   98.215   97.616   98.544   98.121   95.272
RGCN       99.824   94.626   91.903   83.722   82.474   97.02
GATNE      98.553   94.22    89.055   97.193   86.348   85.507
HetGNN     98.454   95.121   90.281   97.115   88.293   89.901
HGT        99.785   93.975   93.612   89.943   89.752   87.567
HGN        99.753   97.678   94.737   98.393   96.495   92.239
GRAIL      99.93    99.923   98.37    98.92    99.572   95.695
MV-HRE     99.965   99.962   99.685   99.893   99.84    98.669

Table 5.7 Transductive link prediction ranking scores (%): 1-hop negative triples
Tables 5.6 and 5.7 show the transductive link prediction performance of the candidate methods for 1-hop negative neighbors. Although the margin of improvement is smaller than on the 2-hop negative triples, MV-HRE still outperforms the rest of the competing methods. In Table 5.6, HGN gives a competitive performance on the classification metrics and outperforms MV-HRE on ACM. However, MV-HRE remains the best or second-best performer across most of the datasets on the ranking metrics. RGCN consistently performs well on FreeBase. In this setting, the GNN baselines are generally more competitive than in the 2-hop case, since these methods are originally trained on positive and negative edges rather than on higher-order graph topologies.
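The distinction between 1-hop and 2-hop negative triples can be made concrete with a small sketch. The sampler below, written against a NetworkX graph with illustrative parameter names, corrupts the tail of a positive triplet with a node drawn from the k-hop neighborhood of the source; it is only an assumption about the general protocol, not the exact sampler used in our experiments.

```python
import random
import networkx as nx

def khop_negative_triple(G, src, rel, dst, k=2):
    """Corrupt the tail of (src, rel, dst) with a node drawn from the
    k-hop neighborhood of src that is not already connected to src."""
    # All nodes within k hops of src (the dict maps node -> distance).
    khop = set(nx.single_source_shortest_path_length(G, src, cutoff=k))
    candidates = [v for v in khop
                  if v not in (src, dst) and not G.has_edge(src, v)]
    if not candidates:
        # Fall back to a uniformly random corruption if the k-hop
        # neighborhood offers no valid candidate.
        candidates = [v for v in G.nodes
                      if v not in (src, dst) and not G.has_edge(src, v)]
    return src, rel, random.choice(candidates)

# Toy usage on a small graph: 0-1-2-3-4-5.
G = nx.path_graph(6)
print(khop_negative_triple(G, 0, "cites", 1, k=2))   # (0, 'cites', 2)
```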
5.9.1.3 Performance on Benchmark splits
                     2-Hop                               1-Hop
           LastFM           PubMed            LastFM           PubMed
           AUC      MRR     AUC      MRR      AUC      MRR     AUC      MRR
RGCN       57.21    77.68   78.29    90.26    81.9     96.68   88.32    96.89
GATNE      66.87    85.93   63.39    80.05    87.42    96.35   78.36    90.64
HetGNN     62.09    83.56   73.63    84       87.35    96.15   84.14    91
MAGNN      56.81    72.93   -        -        76.5     85.68   -        -
HGT        54.99    74.96   80.12    90.85    80.49    95.48   90.29    97.31
GCN        59.17    79.38   80.48    90.99    84.71    96.6    86.06    98.8
GAT        58.56    77.04   78.05    90.02    83.55    91.45   87.57    98.38
HGN        67.59    90.81   83.39    92.07    91.04    99.21   91.4     96.04
MV-HRE     90.954   98.243  88.147   94.41    96.667   100     93.596   98.796

Table 5.8 Performance on the Heterogeneous Graph Benchmark's transductive link prediction task [4]. Results of competing methods are taken from the benchmark.
In Table 5.8, we evaluate MV-HRE using the train/validation/test edge splits from the HGB benchmark [4]. Given the input triplets, we perform repeated random walks to generate representative path contexts; note that this does not guarantee that every input triplet is covered by a path context. The performance of all competing methods is reported from the original paper [4]. On the two datasets, LastFM and PubMed, MV-HRE offers significant margins of improvement in both classification and ranking scores, first for the 2-hop and then for the 1-hop negative sampling case. Specifically, with 2-hop negatives, MV-HRE beats HGN (the best link prediction performer on LastFM and PubMed) by 23.364% and 4.757% in AUC-ROC on LastFM and PubMed, respectively, and improves MRR scores by 2.5%-7.5%. MV-HRE also consistently outperforms HGN on the 1-hop negative sampling test instances, with 2%-5.5% improvements in AUC-ROC on LastFM and PubMed. This study strongly speaks for the usefulness of the various contexts in link prediction, as well as the robustness of MV-HRE across a diverse range of experimental settings.
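A minimal sketch of the repeated-random-walk step mentioned above is given below, assuming a NetworkX graph; the walk length, the number of restarts, and the function name are illustrative choices rather than the exact MV-HRE configuration, and, as noted, some triplets may end up without any path context.

```python
import random
import networkx as nx

def path_contexts(G, src, dst, num_walks=20, max_len=4):
    """Repeatedly walk from src and keep the walks that reach dst within
    max_len steps; the surviving walks act as path contexts for (src, dst)."""
    contexts = []
    for _ in range(num_walks):
        walk = [src]
        for _ in range(max_len):
            neighbors = list(G.neighbors(walk[-1]))
            if not neighbors:
                break
            walk.append(random.choice(neighbors))
            if walk[-1] == dst:
                contexts.append(tuple(walk))
                break
    return contexts   # may be empty: coverage of every triplet is not guaranteed

# Toy usage.
G = nx.karate_club_graph()
print(path_contexts(G, 0, 33)[:3])
```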
Clarification on high performance: Since MV-HRE is optimized for a ranking loss, its ranking scores are always on the higher side. MV-HRE's link prediction performance on PubMed is comparable with that of the other competitive methods. On LastFM, however, MV-HRE gives excellent link prediction results, far better than the other competing methods. We found a number of issues in the HGB-provided splits of the LastFM dataset: a large number of singleton nodes, skewness in the node degree distribution, less modular structure in the graph, and leakage of train edges. We assume that the overlapping train and test edges are the probable cause of the very high classification and ranking scores of HGN and MV-HRE on LastFM. We tried resolving the data leakage issue in LastFM, but this resulted in very small sample sizes for the train/validation/test edges. Therefore, we did not proceed further and report the results obtained on the original edge splits provided by HGB.
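The leakage mentioned above can be spotted with a simple overlap check between the train and test edge lists; the tuple representation and the undirected treatment of edges below are assumptions about how the HGB splits are stored.

```python
def edge_leakage(train_edges, test_edges):
    """Return the fraction of test edges that also occur in the train split,
    treating each edge as an unordered (src, dst) pair."""
    train = {frozenset(edge) for edge in train_edges}
    leaked = [edge for edge in test_edges if frozenset(edge) in train]
    return len(leaked) / max(len(test_edges), 1), leaked

# Toy usage: the reversed edge (2, 1) counts as leaked.
ratio, leaked = edge_leakage([(1, 2), (2, 3)], [(2, 1), (3, 4)])
print(f"{ratio:.0%} of test edges overlap with the train split")   # 50%
```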