Link Prediction in Social Networks Using Hierarchical Community Detection


IKT2015 7^th International Conference on Information and Knowledge Technology


Hasti Akbari Deylami Social Networks Lab

Faculty of Electrical and Computer Engineering University of Tehran, Tehran, Iran

hasti.akbari@ut.ac.ir

Masoud Asadpour Social Networks Lab

Faculty of Electrical and Computer Engineering University of Tehran, Tehran, Iran

asadpour@ut.ac.ir

Abstract— Social network analysis is an approach to the study of social structures. One of its important subfields is link prediction, which asks: given a snapshot of a network at the current time, which interactions among its members are likely to form in the future? The main purpose of this paper is to boost the performance of similarity-based link prediction methods by using community information. This information is derived from the structure of the graph, based on the number of community levels that two vertices have in common in a hierarchical representation of communities. To evaluate the performance of the proposed method, four datasets are used as benchmarks. The results suggest that community information often increases the efficiency and accuracy of link prediction.

Keywords: Social Network Analysis; Link Prediction; Hierarchical Community Detection

I. INTRODUCTION

Analyzing social networks has attracted the attention of many researchers. Link prediction is one of the main problems in social network analysis [1]. Link prediction means predicting the possibility that a connection will be established between two vertices in the future when there is no connection between them at the moment [2]. Link prediction has been applied in many relational networks, such as collaborations in co-authorship networks [3], prediction of being an actor [4], underground relationships between terrorists [5], and so on.

The simplest and most basic link prediction methods are similarity-based algorithms. In these methods, each pair of nodes $x$ and $y$ is assigned a score $\mathrm{sim}(x, y)$, a criterion function that defines the similarity between $x$ and $y$ [6]. Finding a criterion that captures similarity well plays an important role in providing appropriate and reliable link prediction results.

In this paper we try to boost the performance of several criteria based on similarity by considering the information of communities, which is extracted from the hierarchical community detection algorithms. Then we evaluate the impact of this information on the link prediction.

A brief description of previous research is presented in Section II. In Section III, the proposed method is described. Section IV contains the results of the proposed approach and a comparison of its impact against current methods. Finally, in Section V, we summarize the proposed method and its effects on link prediction methods and conclude.

II. RELATED WORKS

Link prediction methods can be divided into two general categories: unsupervised and supervised.

Unsupervised methods use only the structural characteristics of the network graph to perform link prediction [6]. In these approaches, a score serving as the similarity measure is assigned to each pair of nodes that have no link. Determining appropriate features and similarity measures is the main challenge in these approaches.

Usually, structural features of network graphs are extracted by local and global approaches [6]. In local approaches, the local structure of nodes, such as their degree, is the basis for link prediction, and calculating and extracting local features is remarkably fast. Global approaches, on the other hand, extract a feature vector for each node according to the overall graph structure. Although these methods have higher accuracy because they consider the entire information of the network, they have higher computational costs. Adamic Adar [10], Common Neighbor [11] and Katz [12] are some examples of unsupervised link prediction methods.
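The two local measures named above can be sketched in a few lines. This is a minimal illustration and not the code used in the experiments: it assumes the usual definitions (Common Neighbor counts shared neighbors, Adamic Adar sums the inverse log-degree of the shared neighbors), and the adjacency-dict graph and function names are hypothetical.

```python
import math


def common_neighbors(adj, x, y):
    """Number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])


def adamic_adar(adj, x, y):
    """Sum of 1 / log(degree) over the neighbors shared by x and y."""
    # A shared neighbor is adjacent to both x and y, so its degree is at
    # least 2 and the logarithm is strictly positive.
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])


# Toy undirected graph as a dict of neighbor sets (illustrative data only).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}

cn_ad = common_neighbors(adj, "a", "d")  # shared neighbors: b and c
aa_ad = adamic_adar(adj, "a", "d")       # both shared neighbors have degree 3
```

Both functions only inspect the neighborhoods of the two endpoints, which is why local measures of this kind stay cheap as the graph grows.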

Supervised link prediction methods attempt to predict links by learning parts of the link creation process of the network from past observations. These methods learn the parameters of a probabilistic model, or examine the evolution of a specific substructure in the network graph. In general, supervised algorithms often perform better than unsupervised ones; in some cases, however, their complexity and the time consumed in the training phase mean that they cannot be used on large-scale graphs and may only be practical for networks of a few thousand nodes. Probabilistic models and maximum likelihood methods are some examples of supervised link prediction methods. A more detailed review of these approaches is given in [6].

Calculating the similarity measure becomes time consuming as the size of social networks grows. For this reason, some estimation methods have been used [14]. Of course, estimation methods decrease the accuracy of prediction. To achieve higher accuracy, different combinations of similarity measures have been used in some articles. But the main question is whether we can utilize structural features to achieve better prediction in a more reasonable time. In [15], the structure of communities is used to improve the performance of link prediction methods. To do so, community information of the common neighbors of a pair of nodes is added to similarity-based link prediction methods. Our method differs from that work: instead of using community information of the common neighbors of a pair of nodes, we use community information of the pair of nodes themselves.

III. METHODS

Due to their simplicity and good performance, unsupervised methods have become widely used. As mentioned before, these methods use a similarity measure between nodes of the network. The rationale behind these methods is homophily (i.e. love of the same): the more similar two nodes are, the more likely they are to form a link in the future. A widely used similarity measure is the number or fraction of common neighbors, i.e. two nodes are similar if their friends are similar.

Ref. [15] shows that the valuable information hidden in graph communities can effectively help in improving the performance of link prediction methods. To do so, community information of common neighbors of a pair of nodes has been added to the similarity based link prediction methods.

In this paper, we use community information as well, but our method is different; we use community information of source and destination nodes while ref. [15] uses community information of common neighbors of source and destination nodes.

The rationale behind our method is that two nodes situated in the same community are more likely to form a link if they have not already done so. In other words, the concept of common neighbors is generalized: instead of the number of common neighbors, the common communities of the nodes are considered. If two nodes are co-members of more communities, they are more likely to form a friendship link.

In this paper we try to improve the accuracy of unsupervised methods while only slightly increasing the computational complexity. We think the increase in accuracy can justify the increase in complexity. To reach this goal, we first predict links for pairs of nodes using local or global information, and then augment this information with the number of common communities. To do this, the number of common communities is used as a weight in the similarity measures.

A criterion for determining similarity score between pairs of nodes of the network graph is suggested as follows:

$$\mathrm{score}(x, y) = \mathrm{sim}(x, y) \times \mathrm{CC}(x, y) \qquad (1)$$

where $\mathrm{sim}(x, y)$ is the similarity measure between vertices $x$ and $y$, calculated by an unsupervised link prediction method [6], and $\mathrm{CC}(x, y)$ is a function of the number of common communities between the two nodes $x$ and $y$, which is explained in Eq. (2). Finally, $\mathrm{score}(x, y)$ is the new similarity measure.
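As a small worked example of the weighting, assume Eq. (1) multiplies the base similarity by the common-community count, as the AA*CC style column labels in Tab. III suggest. The pair names and numbers below are invented for illustration, and `combined_score` is a hypothetical helper.

```python
def combined_score(sim_xy, cc_xy):
    """Base similarity weighted by the number of common community levels."""
    return sim_xy * cc_xy


# Hypothetical candidate pairs: (base similarity, common community levels).
candidates = {
    ("u", "v"): (3.0, 0),
    ("u", "w"): (2.0, 2),
    ("v", "w"): (2.5, 1),
}

scores = {pair: combined_score(s, c) for pair, (s, c) in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Under this reading, the pair with the highest base similarity, ("u", "v"), drops to the bottom of the ranking because its endpoints share no community, while ("u", "w") moves to the top.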

We use the hierarchical version of the Infomap algorithm [7] to find the network communities. In ref. [8], different community detection algorithms are compared, and Infomap is shown to outperform the other methods. The algorithm is based on information theory and random walks. Because it uses random walks, it can take all of the network information into account. Infomap models community detection as an optimization problem that minimizes the map equation, i.e. the expected per-step length of the code required to describe an infinitely long random walk. The basic method is described in detail in [9] and the hierarchical version is described in [7].
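For orientation, the two-level form of the map equation from [9] can be written as follows. It is quoted here only as background; the hierarchical version in [7], which is the one used in this paper, applies the same idea recursively across levels.

$$L(\mathsf{M}) = q_{\curvearrowright} H(\mathcal{Q}) + \sum_{i=1}^{m} p_{\circlearrowright}^{i} H(\mathcal{P}^{i})$$

Here $\mathsf{M}$ is a partition of the nodes into $m$ modules, $q_{\curvearrowright}$ is the per-step probability that the random walker switches modules, $H(\mathcal{Q})$ is the entropy of the module names, $p_{\circlearrowright}^{i}$ is the rate at which the codebook of module $i$ is used (movements within the module plus exits from it), and $H(\mathcal{P}^{i})$ is the entropy of that codebook.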

The output of Infomap is a tree that contains the communities and sub-communities of the graph. Figure 1 shows a sample output for a graph with 10 nodes. At the first level, the graph is divided into two communities containing nodes {1,2,3,4,5} and {6,7,8,9,10}. These are then divided into two and three sub-communities, consisting of {1,2}, {3,4,5} and {6,7,8}, {9}, {10}, respectively. On the right side of Fig. 1, the ids of the communities that each node belongs to are written, separated by “:”, with the highest-level community on the left.

For each pair of nodes, the number of common communities is calculated. To do so, we start from the highest level and count one if they are both in the same community, then move to the next level and add one if they are in the same sub-community. We continue counting until we come across different (sub) communities. In fig.1, for example, nodes 1 and 2 have two levels of communities in common, nodes 2 and 3 have one level in common, and nodes 2 and 6 have no common communities.

Figure 1. a) Communities and sub communities of a graph with 10 nodes, b) Output of Infomap algorithm for this graph

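The number of common communities is therefore calculated as follows:

$$\mathrm{CC}(x, y) = \max \{\, l : \delta(c_x^i, c_y^i) = 1 \ \text{for all}\ 1 \le i \le l \,\}, \qquad \delta(a, b) = \begin{cases} 1 & \text{if } a = b \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

with $\mathrm{CC}(x, y) = 0$ when the two nodes already differ at the first level. Here, $c_x^i$ is the community id of node $x$ at the $i$-th level and $c_y^i$ is the community id of node $y$ at the $i$-th level of the communities detected by the Infomap algorithm.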
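The counting procedure can be sketched as below. This is a minimal illustration under stated assumptions: each node is given a colon-separated string of community ids with the highest level first, in the style of Fig. 1, and the concrete path strings are invented to match the node groups described in the text. The function name `common_communities` is hypothetical.

```python
def common_communities(path_x, path_y):
    """Count the leading hierarchy levels at which two nodes share a community."""
    count = 0
    for cx, cy in zip(path_x.split(":"), path_y.split(":")):
        if cx != cy:
            break  # stop at the first level where the communities differ
        count += 1
    return count


# Hypothetical community paths for some nodes of Fig. 1 (top level first).
paths = {1: "1:1", 2: "1:1", 3: "1:2", 6: "2:1", 9: "2:2", 10: "2:3"}

cc_1_2 = common_communities(paths[1], paths[2])
cc_2_3 = common_communities(paths[2], paths[3])
cc_2_6 = common_communities(paths[2], paths[6])
```

Stopping at the first mismatch matters: nodes 1 and 6 carry the same id at the second level in this toy labeling, but they sit in different top-level communities, so they have no community in common.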

IV. EVALUATION AND RESULTS

To evaluate the proposed method, five well-known structural link prediction methods are used as baselines. The scores obtained by these methods are then augmented with the information extracted from communities, and the results are compared on four datasets.

A. Datasets

We have used four undirected unweighted networks:

1. Petster hamster network¹ contains friendships and family links between users of hamsterster.com website.

2. Facebook network² consists of a subset of friendship network from Facebook.

3. Enron email network³ represents all email communication between employees of Enron, aggregated on half a million emails.

4. Brightkite network⁴ contains friendship relations from Brightkite, a former location-based social network where users shared their locations.

A summary of these networks is given in Tab. I.

TABLE I. DATASET SUMMARY

Name            Vertices  Edges
Petsterhamster  2426      16631
Facebook        4039      88234
EmailEnron      36691     183830
Brightkite      58228     214078

B. Evaluation

In order to compare different methods, we randomly delete some edges from the network. The output of a link prediction algorithm is an ordered list of node pairs sorted in descending order of similarity score. We then calculate the top-N precision, i.e. the fraction of the N highest-ranked pairs that are among the deleted edges. In this paper, N is 100 unless stated otherwise.
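The top-N precision can be sketched as follows. This is an illustrative reading, taking the N highest-scoring candidate pairs and reporting the fraction of them that are deleted edges; the function name, the score dictionary and the toy data are hypothetical.

```python
def top_n_precision(scores, removed_edges, n=100):
    """Fraction of the n highest-scoring candidate pairs that are removed edges."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:n]
    # Edges are undirected, so compare pairs as unordered sets.
    removed = {frozenset(edge) for edge in removed_edges}
    hits = sum(1 for pair in ranked if frozenset(pair) in removed)
    return hits / n


# Toy example: four scored candidate pairs, two of which were removed edges.
scores = {("a", "b"): 0.9, ("a", "c"): 0.7, ("b", "d"): 0.4, ("c", "d"): 0.1}
removed_edges = [("b", "a"), ("c", "d")]

precision_at_2 = top_n_precision(scores, removed_edges, n=2)
```

In the toy data only one of the two top-ranked pairs, ("a", "b"), is a removed edge, so the precision at N = 2 is one half.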

C. Experimental Results

For evaluation purpose, each dataset is divided into two sets of train and test data. The process is done by randomly removing 10% of the edges from the dataset. The removed edges are considered as a reference for evaluation. For each dataset, the procedure has been repeated five times and the average precision is reported.
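The hold-out protocol described above can be sketched like this. It is a sketch under assumptions, not the authors' script: the helper names are hypothetical, the seeds are arbitrary, and `evaluate` stands for any callback that scores candidate pairs on the training edges and returns a precision against the removed edges.

```python
import random


def split_edges(edges, test_fraction=0.1, seed=None):
    """Randomly remove a fraction of the edges; return (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mean_precision_over_runs(edges, evaluate, runs=5):
    """Repeat the random split `runs` times and average the reported precision."""
    results = [evaluate(*split_edges(edges, seed=run)) for run in range(runs)]
    return sum(results) / len(results)


# Toy usage on a 20-edge path graph with a placeholder evaluation callback.
edges = [(i, i + 1) for i in range(20)]
train, test = split_edges(edges, seed=0)
mean_value = mean_precision_over_runs(edges, lambda tr, te: len(te) / len(edges))
```

Averaging over several random splits reduces the dependence of the reported precision on which particular edges happened to be removed.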

Five unsupervised link prediction methods have been used: Adamic Adar (AA) [10], Common Neighbor (CN) [11], Katz (K) [12], Rooted Page Rank (RPR) [3] and Prop Flow (PF) [13]. In [6], all of these methods are described in detail. The initial parameters of these methods are set as follows (similar to [3]): for Katz: $l = 5$, $\beta = 0.005$; for RPR: $\alpha = 0.15$; and for PF: $l = 5$.
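To make the role of the Katz parameters concrete, a truncated Katz score can be sketched as below. This is a minimal illustration and not the implementation used in the experiments: it assumes the two Katz values above are a maximum walk length of 5 and a damping factor of 0.005, and the function name and adjacency-dict graph are hypothetical.

```python
def truncated_katz(adj, x, y, beta=0.005, max_length=5):
    """Sum over k = 1..max_length of beta**k times the number of length-k walks x -> y."""
    walks = {x: 1}  # walks[v] = number of walks of the current length from x to v
    score = 0.0
    for k in range(1, max_length + 1):
        extended = {}
        for v, count in walks.items():
            for w in adj[v]:
                extended[w] = extended.get(w, 0) + count
        walks = extended
        score += (beta ** k) * walks.get(y, 0)
    return score


# Toy path graph a - b - c (illustrative data only).
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}

# One walk of length 2 and two walks of length 4 lead from a to c.
katz_ac = truncated_katz(adj, "a", "c", beta=0.5, max_length=4)
```

With a small damping factor such as 0.005, short walks dominate the sum, so the score behaves much like a weighted common-neighbor count with small corrections from longer walks.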

As can be seen in Tab. II, AA has the best precision on the first three datasets. We also observed in the experiments that, as the size of the datasets increases, CN and AA, which are local similarity measures, have lower execution times than K, RPR and PF, which are global similarity measures.

Tab. III shows the impact of adding community information on the results of the five similarity measures. As the results show, using community information improves the prediction performance on all datasets except Facebook (the only other decrease is RPR on EmailEnron, while K on Brightkite is unchanged). The effect is particularly visible for AA. At the moment we do not know why performance gets worse on the Facebook dataset; we will investigate this in future work.

To get a better idea of the performance of the two best methods, i.e. AA and CN, and their combined versions, their top-N precision is evaluated for varying N. The results are shown in Fig. 2 and Fig. 3. The figures again confirm that community information is quite effective for link prediction. This positive effect is seen in all datasets except for specific parts of the Facebook dataset.

TABLE II. TOP 100 PRECISION OF DIFFERENT SIMILARITY MEASURES.

Name            AA     CN     K      RPR    PF
Petsterhamster  0.628  0.522  0.466  0.544  0.526
Facebook        0.956  0.952  0.954  0.502  0.40
EmailEnron      0.156  0.128  0.084  0.076  0.048
Brightkite      0.822  0.882  0.878  0.04   0.064

TABLE III. TOP 100 PRECISION OF DIFFERENT SIMILARITY INDICES WITH COMMON COMMUNITIES.

Name            AA*CC  CN*CC  K*CC   RPR*CC  PF*CC
Petsterhamster  0.84   0.774  0.732  0.584   0.552
Facebook        0.90   0.91   0.95   0.29    0.30
EmailEnron      0.228  0.196  0.156  0.04    0.07
Brightkite      0.876  0.886  0.878  0.082   0.066

V. CONCLUSION

In this paper, the performance of link prediction on social networks is improved by using community information. The results show that the communities of a graph carry valuable information that can positively affect the prediction results. Combining the Infomap algorithm with a good measure like Adamic Adar yields a significant improvement in precision. As both methods are quite fast, the time complexity of the algorithm does not increase much.

Finally, in the context of existing research on link prediction, using community information improves the predictions, but the information available in communities still needs to be examined more fully. For example, more studies need to be done on the graph structure to determine the reasons for the different behavior of the Facebook dataset. Also, Infomap might not be the most suitable community detection algorithm for link prediction. This problem needs to be deeply investigated in future.

Figure 2. A comparison between AA measure and its combined version on different datasets

Figure 3. A comparison between CN measure and its combined version on different datasets

1 http://konect.uni-koblenz.de/networks/petster-hamster
2 http://snap.stanford.edu/data/egonets-Facebook.html
3 http://snap.stanford.edu/data/email-Enron.html
4 http://konect.uni-koblenz.de/networks/loc-brightkite_edges

REFERENCES

[1] B. Furht, Handbook of Social Network Technologies and Applications. Springer, 2010.

[2] L. Getoor and C. P. Diehl, “Link mining: a survey,” ACM SIGKDD Explorations Newsletter, vol. 7, no. 2, pp. 3-12, 2005.

[3] D. Liben-Nowell and J. Kleinberg, “The link-prediction problem for social networks,” Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1019-1031, 2007.

[4] J. O'Madadhain, J. Hutchins, and P. Smyth, “Prediction and ranking algorithms for event-based network data,” ACM SIGKDD Explorations Newsletter, vol. 7, no. 2, pp. 23-30, 2005.

[5] A. Clauset, C. Moore, and M. E. J. Newman, “Hierarchical structure and the prediction of missing links in networks,” Nature, vol. 453, pp. 98-101, 2008.

[6] L. Lu and T. Zhou, “Link prediction in complex networks: A survey,” Physica A: Statistical Mechanics and its Applications, 2010.

[7] M. Rosvall and C. T. Bergstrom, “Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems,” PLoS ONE, vol. 6, no. 4, e18209, 2011.

[8] A. Lancichinetti, S. Fortunato, “Community detection algorithms: a comparative analysis,” Phys. Rev. E, vol. 80, 056117, 2009.

[9] M. Rosvall and C. T. Bergstrom, “Maps of random walks on complex networks reveal community structure,” Proc. Natl. Acad. Sci. USA, vol. 105, pp. 1118-1123, 2008.

[10] L. A. Adamic and E. Adar, “Friends and Neighbors on the Web,” Social Networks, vol. 25, no. 3, pp. 211-230, July 2003.

[11] M. E. J. Newman, “Clustering and Preferential Attachment in Growing Networks,” Physical Rev. E, vol. 64, no. 2, July 2001.

[12] L. Katz, “A New Status Index Derived from Sociometric Analysis,” Psychometrika, vol. 18, no. 1, pp. 39-43, 1953.

[13] R. N. Lichtenwalter, J. T. Lussier, and N. V. Chawla, “New perspectives and methods in link prediction,” in KDD '10, 2010.

[14] H. H. Song, T. W. Cho, V. Dave, Y. Zhang, and L. Qiu, “Scalable proximity estimation and link prediction in online social networks,” in Proc. of IMC, 2009.

[15] S. Soundarajan and J. Hopcroft, “Using community information to improve the precision of link prediction methods,” in Proceedings of the 21st International Conference Companion on World Wide Web, New York: ACM Press, 2012, pp. 607-608.