Link Prediction in Social Networks Using Proximity-Based Algorithms


Abstract— There has been an overwhelming increase in social media users in today’s world. This ever-increasing data of the Social Network poses a challenge for Link Prediction analysis.

Link Prediction techniques predict associations between users that are not yet present but may form in the future. In Social Networks, Link Prediction can be employed to monitor social interactions and anomalies, to suggest friends to users, to analyze influence and to detect communities.

Link Prediction helps in retaining users for a longer duration and hence boosts the engagement rate. The more accurate the link prediction is, the higher the engagement rate of the application. Social networks like Facebook and e-business organisations such as Zomato and Amazon employ Link Prediction in various forms to boost their revenue and user experience. Various algorithms help in calculating the possibility of a link between entities, and the selection of an algorithm is based on the specific use-case requirements of the application. This paper discusses the Jaccard Coefficient and Resource Allocation proximity-based algorithms for Link Prediction. A comparative study is conducted for each of the algorithms, and it is observed that the combination of both algorithms yields a better result than either of them alone.

Index Terms—Link Prediction, Proximity algorithms for Link Prediction, Jaccard Coefficient, Resource Allocation, Common Neighbors, Adamic Adar, Preferential Attachment, Social Network Link Prediction

I. INTRODUCTION

The number of social media users is increasing at an alarming rate. The global network of users connected through social media for the exchange of ideas, views and information is vast. The connections of a person on social media can be acquaintances, friends, family or any other people that are a part of the network. A web presence can be achieved by a mere click of a button, which has facilitated online meetings, access to learning materials, and the sharing and promotion of creative work.

The interactions between the nodes can generate an enormous amount of data, such as posts, chats, likes, comments, tweets and shares. Working with such huge data can be challenging: problems arise while storing, pre-processing, retrieving and manipulating the data due to its varied structure.

Analysing this data to get valuable insights can be a tedious task.

Social Networks can be analysed for various revenue and insight generation purposes such as viral marketing, detecting sub-communities and tailoring product recommendations [1].

Link Prediction is one of the main techniques used for Social Network analysis. The number of connections a person has plays a vital role in deciding the amount of information exchanged by the person and, indirectly, the time spent by the user on the social media application. In social media networks, the engagement rate of an application decides how successful the application is. Link Prediction can help in increasing this factor and hence boost the popularity and usage of the application.

Link Prediction requires social networks to be structured as graphs [2], where the nodes represent the users and the edges represent the relationships between these nodes. The nodes can sometimes represent organizations and parties too; in that case the link describes the type of association between them. To work with these graphs and networks, a set of algorithms, tools and techniques that can provide accurate results is required. A single type of association can be considered for obtaining the required results. When multiple data associations are to be considered for analysis [3, 4], different types of links can be used to represent the type of association between the nodes. The proposed work is to analyse different Link Prediction metrics and to predict the missing or upcoming links between two nodes in a social network using the Resource Allocation and Jaccard Coefficient algorithms.

Consider a graph G = (V, E), where V is the set of nodes in the graph that represent the users and E is the set of existing edges between the nodes at time t, as shown in Fig. 1(a). The solid edges in the figure represent the existing connections between the users, while the dotted lines represent the links predicted by using Link Prediction metrics. The task is to predict E', the edges that do not exist at time t but are expected at time t+1, as shown in Fig. 1(b), using the Proximity-based algorithms.


(a) Graph at time t (b) Graph at time t+1

Fig. 1. Link Prediction [5]

A. Link Prediction

Any social network can be modelled as a graph with vertices and edges: the users are modelled as the nodes and the edges represent the connections between the users.

Link Prediction can also be applied to monitor chemical changes, wherein protein (compound) interactions can be monitored and the predicted future interactions could lead to the discovery of drugs. The spread of diseases can also be monitored by Link Prediction, which can be used to control the scale of contamination. Link Prediction is the method of discovering non-existing links between the users based on the analysis of the present links. It can be done using various algorithms [6]. The requirement of the business application dictates the selection of the algorithm for Link Prediction.

Fig. 2. Graph before and after Link Prediction

The image on the left of Fig. 2 depicts the initial stage, where user Sophia has connections with Maya and Maria, Adam is connected to Maria, and David, Maria and Maya are linked to each other. The image on the right of Fig. 2 shows the graph after it is subjected to Link Prediction. The technique employed for Link Prediction has recommended Adam and David as possible links for Sophia.
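The recommendation in Fig. 2 can be reproduced with a very simple rule. The sketch below is only an illustration: the text does not state which technique produced Fig. 2, so the "two-hop neighbors" rule and the helper name two_hop_candidates are assumptions.

```python
from collections import defaultdict

# Edges of the left-hand graph of Fig. 2, as described in the text.
edges = [
    ("Sophia", "Maya"), ("Sophia", "Maria"),
    ("Adam", "Maria"),
    ("David", "Maria"), ("David", "Maya"), ("Maria", "Maya"),
]

# Undirected adjacency: every user maps to the set of users they are linked to.
neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)


def two_hop_candidates(user):
    """Users reachable through one intermediate friend and not yet linked to `user`."""
    reachable = set()
    for friend in neighbors[user]:
        reachable |= neighbors[friend]
    return reachable - neighbors[user] - {user}


print(sorted(two_hop_candidates("Sophia")))  # ['Adam', 'David']
```

Both candidates share at least one existing connection with Sophia, which is the intuition that the proximity metrics of Section II quantify.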

The rest of the paper is organised as follows. The similarity metrics are discussed in Section II. The design of the proposed work, which explains the workflow used to obtain the future links of a network, is discussed in Section III. The results of the work are presented in Section IV, and the future scope of the research work is discussed in Section V.

II. RELATED WORK

The task of Link Prediction has attracted the attention of numerous research fields, including statistics, network science, machine learning, and data mining. Liben-Nowell and Kleinberg proposed link prediction models for social networks based on several graph proximity metrics [7]. Some of the Proximity-based metrics are discussed in this section, using the symbols listed in Table I.

TABLE I SYMBOL TABLE

Symbol Description of the Symbol

τ(x) Neighbors of node x

K_x Degree of the node x

A. Common Neighbor (CN)

The Common Neighbors metric [8] is one of the most popular metrics in Link Prediction challenges because of its simplicity. The nodes that both x and y have a direct relationship with are referred to as the common neighbors. A connection between x and y is easier to establish when there is a greater number of shared neighbors between them. The metric is based on the observation that two nodes are more likely to connect to one another when they have a node in common than when they have none.

For example, assume a node A in a network which is connected to nodes B, C and D. Node B is connected to C and E. Node C is connected to E. Since A and E share the common neighbors B and C, there is a high possibility of link formation between E and A, and hence this link can be recommended to the user. The common neighbor metric can be calculated for two nodes x and y as illustrated in (1).

$$CN(x, y) = |\tau(x) \cap \tau(y)| \tag{1}$$
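The example network described above can be checked with a few lines of Python. This is a minimal sketch; the variable tau mirrors the symbol τ(x) of Table I and the function name common_neighbors is chosen here for illustration.

```python
from collections import defaultdict

# Example network from the text: A-B, A-C, A-D, B-C, B-E, C-E.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "E"), ("C", "E")]

tau = defaultdict(set)  # tau[x] is the neighbor set of node x
for u, v in edges:
    tau[u].add(v)
    tau[v].add(u)


def common_neighbors(x, y):
    """Number of nodes directly connected to both x and y."""
    return len(tau[x] & tau[y])


print(common_neighbors("A", "E"))  # 2 (B and C)
print(common_neighbors("D", "E"))  # 0
```

The pair (A, E) scores higher than (D, E), so under this metric A and E are the more likely future link.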


B. Jaccard Coefficient (JC)

$$JC(x, y) = \frac{|\tau(x) \cap \tau(y)|}{|\tau(x) \cup \tau(y)|} \tag{2}$$

C. Preferential Attachment (PA)

In the Preferential Attachment algorithm [10], the score PA(x, y) depends on the degrees of the nodes x and y. The higher the degree of a node, the higher the probability of it receiving new links. It is based on the observation that users with a higher number of connections are more likely to create more connections in the future. It should be emphasized that this similarity index has the lowest computing complexity because it does not require any neighbor node information. This metric can be calculated using equation (3).

$$PA(x, y) = K_x \cdot K_y \tag{3}$$

D. Adamic Adar (AA)

This metric arose from the need to compare the similarity between web pages. It is developed with the help of the Jaccard coefficient; greater weight is given to common neighbors that have fewer neighbors. Uncommon qualities are given a higher weight, which improves on the straightforward counting of common features. The intuitive idea [11] of the metric is that infrequent traits are more informative. It can be calculated as shown in equation (4).

$$AA(x, y) = \sum_{z \in \tau(x) \cap \tau(y)} \frac{1}{\log K_z} \tag{4}$$
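Equation (4) can be sketched in Python as follows. The example graph is the small network used for the Common Neighbor metric; the natural logarithm is an assumption, since the text does not state the base, and the function name adamic_adar is illustrative.

```python
import math
from collections import defaultdict

edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "E"), ("C", "E")]

tau = defaultdict(set)  # tau[x] is the neighbor set of node x
for u, v in edges:
    tau[u].add(v)
    tau[v].add(u)


def adamic_adar(x, y):
    """Sum of 1 / log(K_z) over the common neighbors z of x and y."""
    # A common neighbor of two distinct nodes has degree >= 2, so log(K_z) > 0.
    return sum(1.0 / math.log(len(tau[z])) for z in tau[x] & tau[y])


score = adamic_adar("A", "E")  # common neighbors B and C, both of degree 3
print(round(score, 4))
```

For the pair (A, E) both common neighbors have degree 3, so the score equals 2 / ln 3, roughly 1.82.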

E. Resource Allocation (RA)

Tao Zhou, Linyuan Lü, and Yi-Cheng Zhang [12] introduced the RA algorithm in 2009 as part of a link prediction study. The RA measure is similar to the AA metric in appearance: both reduce the impact of high-degree common neighbors. The difference between AA and RA is that the RA metric [11] penalizes high-degree common neighbors more severely than the AA metric. The RA metric can be used in recommendation systems as it is known to reduce the diversity of the recommendation while maintaining the prediction accuracy [13]. The calculation of RA can be done as shown in equation (5).

$$RA(x, y) = \sum_{z \in \tau(x) \cap \tau(y)} \frac{1}{K_z} \tag{5}$$
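The statement that RA penalizes high-degree common neighbors more severely than AA can be illustrated by comparing the contribution of a single common neighbor under equations (4) and (5). The degrees used below are arbitrary illustrative values, and the natural logarithm is assumed for AA.

```python
import math


def aa_weight(k):
    """Contribution of one common neighbor of degree k under Adamic Adar."""
    return 1.0 / math.log(k)


def ra_weight(k):
    """Contribution of one common neighbor of degree k under Resource Allocation."""
    return 1.0 / k


for k in (2, 10, 100):
    print(f"K_z={k:>3}  AA={aa_weight(k):.4f}  RA={ra_weight(k):.4f}")

# How much weight is lost when the degree grows from 2 to 100.
aa_drop = aa_weight(2) / aa_weight(100)  # about 6.6
ra_drop = ra_weight(2) / ra_weight(100)  # about 50
print(aa_drop, ra_drop)
```

Under these assumptions a degree-100 neighbor counts roughly 6.6 times less than a degree-2 neighbor in AA, but about 50 times less in RA.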

as it provides relative

The social network considered for the analysis is Facebook; the dataset is collected from Stanford University [14] and is in the form of an edge list. The dataset contains the node id in the first column, and the second column has the node id to which the first-column node is connected. The sample dataset is illustrated in Table II. 60% of the dataset is reserved for training the model while the remaining 40% is used for testing. The division of the dataset is made using the package sklearn.model_selection. The edge list of the social network is processed in such a way that each node has its corresponding list of all connected nodes. The work is carried out in three phases. In the first phase only the Resource Allocation algorithm is taken into consideration for predicting the links, while in the second phase only the Jaccard Coefficient is taken into consideration. The third phase involves both the Resource Allocation and Jaccard Coefficient algorithms; these are fed to the model, which outputs the likelihood of link formation between the nodes.

TABLE II DATASET FORMAT

Node ID Connected Node ID

1 871

1 375

3 108

3 708

8 345

The graph dataset, which is collected in the form of a CSV file, is parsed and manipulated in such a way that all the unique nodes in the graph are identified. The unique nodes are then listed with their corresponding connected nodes in a file. The list of nodes to which the Resource Allocation and Jaccard Coefficient formulas are to be applied is generated using this file. The algorithms take the two input nodes and calculate the possibility of a link according to the formula. The formulas of the Jaccard Coefficient and Resource Allocation are given in equation (2) and equation (5) respectively.
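The parsing step described above can be sketched as follows, using the sample rows of Table II. This is a minimal sketch under stated assumptions: the comma delimiter, the treatment of edges as undirected, and the helper name load_adjacency are not specified in the text.

```python
import csv
import io
from collections import defaultdict

# Sample rows of Table II; a comma-separated layout is assumed here.
sample_csv = "1,871\n1,375\n3,108\n3,708\n8,345\n"


def load_adjacency(handle):
    """Map every unique node id to the set of node ids it is connected to."""
    adjacency = defaultdict(set)
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        u, v = int(row[0]), int(row[1])
        adjacency[u].add(v)
        adjacency[v].add(u)  # assumption: friendship edges are undirected
    return adjacency


adjacency = load_adjacency(io.StringIO(sample_csv))
print(len(adjacency))        # 8 unique nodes
print(sorted(adjacency[1]))  # [375, 871]
```

For the real dataset, the same function could be given an open file handle instead of the in-memory sample.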


Fig. 3. Workflow of Proposed Work

Algorithm 1: Jaccard Coefficient algorithm
Data: Nodes x and y in the graph
Result: Prediction value JC(x, y) between x and y
JC ← 0;
Compute τ(x)
Compute τ(y)
JC ← |τ(x) ∩ τ(y)| / |τ(x) ∪ τ(y)|
return JC

Algorithm 1 calculates the Jaccard Coefficient between nodes x and y. It is required to find out the neighbours of the nodes x and y. The common neighbours of x and y are found by the intersection operation, while the union yields the set of all the neighbours of x and y. The ratio of the size of the intersection to the size of the union gives the Jaccard Coefficient value between the nodes.
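A possible Python rendering of Algorithm 1 is given below. The adjacency dictionary, the function name jaccard_coefficient and the convention of returning 0 when both nodes have no neighbours are assumptions made for this sketch.

```python
def jaccard_coefficient(adjacency, x, y):
    """JC(x, y): shared neighbours divided by all neighbours of x and y."""
    tau_x = adjacency.get(x, set())
    tau_y = adjacency.get(y, set())
    union = tau_x | tau_y
    if not union:
        return 0.0  # assumed convention: two isolated nodes score 0
    return len(tau_x & tau_y) / len(union)


# Small illustrative network: A-B, A-C, A-D, B-C, B-E, C-E.
adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "E"},
    "C": {"A", "B", "E"},
    "D": {"A"},
    "E": {"B", "C"},
}

print(jaccard_coefficient(adjacency, "A", "E"))  # 2 shared out of 3 -> 0.666...
print(jaccard_coefficient(adjacency, "D", "E"))  # no shared neighbour -> 0.0
```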

Algorithm 2: Resource Allocation algorithm
Data: Create a graph G based on the chosen Social Network Dataset. Nodes x and y are in the graph
Result: Prediction value RA(x, y) between x and y
RA ← 0;
Compute τ(x)
Compute τ(y)
for z in τ(x) ∩ τ(y) do
    Calculate K_z /* Degree of node z */
    RA ← RA + 1/K_z
end
return RA

The Resource Allocation value between nodes x and y is computed by Algorithm 2. It is required to find out the neighbours of the nodes x and y. For every node z in the graph that is a neighbour of both x and y, the degree K_z of the node is calculated. The computed K_z values of the nodes are inverted and added. This sum yields the Resource Allocation value between the nodes x and y.
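Algorithm 2 can be sketched in Python in the same way. The adjacency dictionary and the function name resource_allocation are illustrative assumptions; the loop follows the steps of the algorithm directly.

```python
def resource_allocation(adjacency, x, y):
    """RA(x, y): sum of 1 / K_z over the common neighbours z of x and y."""
    ra = 0.0
    tau_x = adjacency.get(x, set())
    tau_y = adjacency.get(y, set())
    for z in tau_x & tau_y:
        k_z = len(adjacency[z])  # degree of the common neighbour z
        ra += 1.0 / k_z
    return ra


# Small illustrative network: A-B, A-C, A-D, B-C, B-E, C-E.
adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "E"},
    "C": {"A", "B", "E"},
    "D": {"A"},
    "E": {"B", "C"},
}

print(resource_allocation(adjacency, "A", "E"))  # 1/3 + 1/3
print(resource_allocation(adjacency, "A", "B"))  # only C is shared: 1/3
```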

A. XGBoost

Tianqi Chen developed XGBoost [15] initially as a research endeavour for the Distributed (Deep) Machine Learning Community (DMLC) organisation.

The XGBoost library is used to train the model as it is a scalable, distributed implementation of Gradient-Boosted Decision Trees (GBDT). Boosting trains models sequentially, with each one being trained to fix the mistakes made by the preceding one. The benefit of this iterative process is that new models are added with the intention of fixing errors introduced by earlier models, as shown in Fig. 4, i.e., the previous weaker models are taken into account for building a stronger model.

An XGBoost model often provides better accuracy than a single decision tree, but it does so at the expense of the inherent interpretability of decision trees. Tracing the path of a decision tree, for instance, is simple and clear, but it is tedious to follow the paths of hundreds or thousands of trees. Some model compression strategies allow turning an XGBoost model into a single "born-again" decision tree that roughly approximates the same decision function in order to achieve both performance and interpretability.

The iterative process where the error is calculated for a trained model and will be taken care of, by the next model in the successive iteration is demonstrated in Fig. 4.
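The iterative error-correcting idea can be illustrated with a toy example in plain Python. This is not the XGBoost library and not the model trained in this work: it is a simplified sketch of boosting with one-split regression stumps and squared error, and the function names, the data and the learning rate are all hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares stump: one threshold and one constant value per side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=5, learning_rate=0.5):
    """Each round fits a stump to the errors left by the previous rounds."""
    predictions = [0.0] * len(xs)
    errors = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return predictions, errors


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
predictions, errors = boost(xs, ys)
print(errors)  # mean squared error after each round
```

The mean squared error shrinks from round to round because every new stump is fitted to what the earlier stumps still get wrong.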

IV. RESULT ANALYSIS

A. Calculation of Link Probability

Fig. 4. XGBoost Iterative Process

The result of the research work should be a model that is able to predict the possibility of links between the nodes for an unknown dataset (test dataset) as accurately as possible, using the Resource Allocation and Jaccard Coefficient algorithms independently and also in combination. A threshold of 0.7 is selected; link probabilities above it are considered as predicted links. Fig. 5 shows the output of the work: the first column is the serial number given to the predicted edges, the second and third columns contain the node ids between which the link is predicted by the model, and the fourth column gives the possibility of link formation between the two nodes using the RA and JC algorithms together. Similarly, the edges are predicted for the network using the RA and JC algorithms separately.

Fig. 5. Calculation of possibility of Link Formation

Table III shows that, in Trial I, 2890 edges out of 3000 edges were predicted with a probability greater than 0.7 by considering the RA algorithm independently. Three trials are carried out for each phase and the accuracy of each trial is calculated as shown in Table IV.

TABLE III NUMBER OF PREDICTED LINKS

Metric Used Trial I Trial II Trial III

RA 2890 2909 2876

JC 2737 2694 2745

RA & JC 2864 2948 2951

TABLE IV COMPARISON OF ACCURACIES

Metric Used Trial I Trial II Trial III

RA 0.963 0.969 0.958

JC 0.912 0.898 0.915

RA & JC 0.954 0.982 0.983

The mean of accuracies of the metrics used, is represented in the form of a graph in Fig. 7 and it is observed that RA and JC together yield a better result than either of the individual metrics.
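The relationship between Table III and Table IV, and the means plotted in Fig. 7, can be checked with a short script. The text does not define the accuracy explicitly; the values in Table IV are consistent with dividing each count in Table III by the 3000 test edges and truncating to three decimals, and the sketch below relies on that observation as an assumption.

```python
TEST_EDGES = 3000

predicted = {  # Table III
    "RA": [2890, 2909, 2876],
    "JC": [2737, 2694, 2745],
    "RA & JC": [2864, 2948, 2951],
}
reported = {  # Table IV
    "RA": [0.963, 0.969, 0.958],
    "JC": [0.912, 0.898, 0.915],
    "RA & JC": [0.954, 0.982, 0.983],
}

# Accuracy in thousandths, truncated; integer arithmetic avoids rounding noise.
derived = {
    metric: [count * 1000 // TEST_EDGES for count in counts]
    for metric, counts in predicted.items()
}

# Mean accuracy over the three trials, as plotted in Fig. 7.
means = {metric: sum(values) / len(values) for metric, values in reported.items()}

for metric in predicted:
    print(f"{metric:8s} derived={derived[metric]} mean={means[metric]:.3f}")
```

Under this reading the mean accuracies are about 0.963 for RA, 0.908 for JC and 0.973 for the combination, which matches the observation that RA and JC together perform best.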

V. CONCLUSION

Any social network can be subjected to the constructed model by supplying the network in the required dataset format. The users or the entities in the network are considered as the nodes and the associations between them are considered as the edges. After the edge list is obtained from the graph, it can be manipulated to fit the desired format and supplied to the model. The model will successfully and efficiently predict the future links.

Future links for the considered Facebook dataset are successfully predicted based on the Proximity-based algorithms Jaccard Coefficient and Resource Allocation independently, and also by considering the Jaccard Coefficient and Resource Allocation algorithms together. Link Prediction can be extended to networks that have multiple associations, where different types of links can be considered for each type of association. Multiple links can be scrutinised to obtain the results of the desired factor. Real-time network datasets can be considered for analysis, which will help in revenue generation and advancement in the fields of Science and Technology.

Fig. 6. Predicted Edges Graph

Fig. 7. Metrics VS Mean of accuracies

REFERENCES

[1] M. Mohana, “Challenges and difficulties in social media analytics,” IARJSET, vol. 8, pp. 232–235, 06 2020.

[2] H. H. Song, T. W. Cho, V. Dave, Y. Zhang, and L. Qiu, “Scalable proximity estimation and link prediction in online social networks,” in Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement. New York, NY, USA: Association for Computing Machinery, 2009, p. 322–335. [Online]. Available: https://doi.org/10.1145/1644893.1644932

[3] P. Kumar and V. Venugopal, “Link prediction in the pinterest network,” 2016.

[4] T. Matek and S. Zebec, “Github open source project recommendation system,” 02 2016.

[5] P. Joshi, “A Guide to Link Prediction – How to Predict your Future Connections on Facebook,” https://www.analyticsvidhya.com/blog/2020/01/link-prediction-how-to-predict-your-future-connections-on-facebook/, 2020.

[6] M. Akhtar, I. Ahmad, I. Khalil, and S. Ahmed, “Missing link prediction in complex networks,” International Journal of Scientific and Engineering Research, vol. 9, pp. 82–87, 12 2018.

[7] D. Liben-Nowell and J. Kleinberg, “The link prediction problem for social networks,” Journal of the American Society for Information Science and Technology, vol. 58, 01 2003.

[8] I. Ahmad, M. Akhtar, S. Noor, and A. Shahnaz, “Missing link prediction using common neighbor and centrality based parameterized algorithm,” Scientific Reports, vol. 10, p. 364, 01 2020.

[9] L. Dong, Y. Li, H. Yin, H. Le, and M. Rui, “The algorithm of link prediction on social network,” Mathematical Problems in Engineering, vol. 2013, 01 2013.

[10] R. S. Ahmad Zareie, “Similarity-based link prediction in social networks using latent relationships between the users,” 2020.

[11] Z. Wu and Y. Li, “Link prediction based on multi-steps resource allocation,” in 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 1, 2014, pp. 355–360.

[12] T. Zhou, L. Lü, and Y.-C. Zhang, “Predicting missing links via local information,” The European Physical Journal B, vol. 71, no. 4, pp. 623–630, oct 2009. [Online]. Available: https://doi.org/10.1140%2Fepjb%2Fe2009-00335-8

[13] J. Ai, Y. Cai, Z. Su, K. Zhang, D. Peng, and Q. Chen, “Predicting user-item links in recommender systems based on similarity-network resource allocation,” Chaos, Solitons & Fractals, vol. 158, p. 112032, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0960077922002429

[14] J. Leskovec, “Social circles: Facebook,” http://snap.stanford.edu/data/ego-Facebook.html.

[15] M. Chen, Q. Liu, S. Chen, Y. Liu, C.-H. Zhang, and R. Liu, “Xgboost-based algorithm interpretation and application on post-fault transient stability status prediction of power system,” IEEE Access, vol. 7, pp. 13 149–13 158, 2019.