Social Network Link Prediction Algorithm Based on Node Similarity


Shoumeng Huang, University of Sanya, Sanya, Hainan, China

Liangu Ma*, Qiongtai Normal University, Haikou, Hainan, China

*Corresponding author

Abstract—With the rapid development of computer and network technology, social networking sites and smart terminals have led people to leave large amounts of real data on virtual networks, providing a data source for network analysis and research. As social networks continue to grow, people enjoy the convenience of a rich digital life but also face information overload: they cannot quickly and effectively extract the most relevant information from a very large volume of data. Predicting links that exist in a network but have not yet been observed, or that may form in the future, helps to compensate for missing network data and to understand the mechanisms of network evolution, which makes it an important problem in social network analysis. This paper studies social network link prediction algorithms based on node similarity. First, the literature research method is used to summarize the social network analysis process and social network link prediction. Experiments are then carried out on node-similarity-based link prediction algorithms, and the prediction accuracy of two classes of similarity indexes is compared. The experimental results show that the AUC of the LP and Katz indexes exceeds 90%, with most values close to 95%, indicating that the predictions of these two indexes are close to the truth. The LHNII index performs much worse: its AUC can fall below 0.5, which means its predictions are then less accurate than random guessing. The CN, AA, and RA indexes have higher AUC values than the other indexes. Among them, the RA index has the best prediction effect, performing best on the four data sets USAir, Jazz, Power Grid, and Yeast; it is followed by the AA and CN indexes, which have the best prediction effect on the C. elegans and Email data sets, respectively.

Keywords—Node Similarity, Social Network, Link Prediction, Prediction Accuracy

I. INTRODUCTION

Networks can describe many real-world systems, such as social systems, information systems, and biological systems. In a network, nodes represent entities and links represent the relationships between them [1, 2]. In recent years, with the rapid development of the Internet, more and more real network data has been created, downloaded, and processed [3, 4]. Extracting valuable patterns through network analysis has become a research focus, for example in community development, communication analysis, and contact analysis [5, 6]. Link prediction is one of the most important problems in link mining, and its application prospects are very broad [7, 8]. It can be used in recommendation systems to help people meet new friends and to help online marketers offer products of interest; it can be used to infer the complete structure of a network, to determine whom a patient may have been in contact with, and to identify individuals linked to terrorists; and in biology it can be used to discover interactions between proteins and other molecules [9, 10].

Many scholars have studied social network link prediction algorithms based on node similarity and achieved good results [11]. Zhang W J et al. suggested a common neighbor algorithm that uses the number of common neighbors of two target nodes as an indicator of whether a connection exists between them [11]. On this basis, more similarity algorithms based on node neighbors have been proposed. Among them is the AA (Adamic-Adar) algorithm described by Berahmand and Bouyer, which takes the degree of each common neighbor into account when defining node similarity [12].

This paper studies social network link prediction algorithms based on node similarity. First, the literature research method is used to summarize the social network analysis process and social network link prediction. Experiments are then carried out on node-similarity-based link prediction algorithms, and the prediction accuracy of two classes of similarity indexes (path-based and local-information-based) is compared.

II. SOCIAL NETWORK LINK PREDICTION ALGORITHM BASED ON NODE SIMILARITY

A. Research Method

1) Literature research: Reading domestic and foreign books and articles on social network link prediction algorithms based on node similarity makes it possible to trace the development of the research object from its source and to understand its current status. This provides a clear and structured theoretical basis for the in-depth development of this paper.

2) Quantitative analysis: Quantitative analysis is complementary to qualitative analysis. Quantitative analysis refers to formulating mathematical hypotheses, collecting data, and analyzing and testing them.

Qualitative analysis refers to studying the nature of the research object on the basis of subjective understanding, through investigation and literature analysis.


2022 IEEE Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC)


B. Social Network Analysis Process

1) Node clustering: Node clustering groups nodes according to their parameters (such as their own attributes) and their links to other nodes. After clustering, nodes with similar characteristics fall into the same category, and nodes whose characteristics differ greatly fall into different categories. Block modeling and spectral partitioning are the main node clustering methods.

2) Link-based node ranking: Link-based ranking is one of the main tasks of social network analysis. To rank nodes, it is necessary not only to identify and analyze the links between nodes in the network, but also to use algorithms to calculate each node's influence. Such a measurable ranking index is called centrality. According to the scope of the calculation, centrality indexes fall into two categories: global indexes and local indexes. When evaluating a node's influence on the network, a global index considers the node's connections to other influential nodes, whereas a local index uses the node's degree as the main factor.

3) Link-based node classification: Link-based node classification assigns nodes to specific categories using their links, and requires a graph model to represent the network. Adding node attribute information to a simple machine learning classifier is also useful for link-based node classification.

4) Link prediction: Using information such as known node attributes and known links between nodes, link prediction methods estimate the probability of a connection between nodes. With respect to time, link prediction falls into two categories: predicting links that already exist in the current social network but have not been discovered, and predicting links that do not currently exist but may be created in the future.

5) Subgraph discovery: Subgraph discovery, one of the main tasks of social network analysis, finds subgraphs within the overall network graph. It is studied from two aspects: finding subgraphs that occur with high frequency, and finding meaningful subgraphs by compression-based heuristic search.

6) Graph classification: Graph classification takes the entire network as the object of mining and analysis, and then analyzes and judges the nature and characteristics of the social network. There are three main methods for classifying graphs: feature-based mining, which searches all network graphs for frequently occurring features and classifies the graphs based on those features; inductive logic programming, which uses relations to describe high-level pattern information in graphs; and graph kernel functions, which can be defined and measured using the random walk method.

7) Generative models of graphs: Graph generation models are studied according to different types of dependency structure. Multiple node types, multiple link types, and dynamic networks are the relevant types of structure.

C. Representation of Social Network Link Prediction Data

1) Graph representation method: A social network can be described as a two-tuple G = (V, E), where V is the set of vertices in the graph G and E is the set of edges. Any element e = (a, b) ∈ E indicates that, at a certain time, there is an edge e between nodes a and b. If there are multiple edges between nodes a and b, they are denoted e1, e2, ..., en.

2) Matrix representation method: The adjacency matrix arranges the nodes of the graph along its rows and columns and reflects the adjacency relationships between nodes. The element aij of the matrix is the number of links between nodes i and j. For example, the adjacency matrix of a simple undirected graph G = (V, E) with n nodes is a symmetric square matrix of order n: if there is no link between nodes i and j, then aij = 0; if there is a link between nodes i and j, then aij = 1. The degree of a node is therefore the sum of all elements in the row or column of that node. For a directed graph, the sum of the elements in a node's column gives its in-degree and the sum of the elements in its row gives its out-degree; the adjacency matrix of a directed graph is not necessarily symmetric.
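The relationship between the adjacency matrix and node degrees can be illustrated with a minimal Python sketch. The function names (adjacency_matrix, out_degrees, in_degrees), the integer node labels, and the four-node example graph are illustrative assumptions and do not come from the paper, whose experiments use MATLAB.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n adjacency matrix (list of lists) from an edge list."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] += 1          # a[i][j] counts the links from node i to node j
        if not directed:
            a[j][i] += 1      # an undirected link is stored in both directions
    return a


def out_degrees(a):
    """Row sums: out-degree (or plain degree for an undirected graph)."""
    return [sum(row) for row in a]


def in_degrees(a):
    """Column sums: in-degree (or plain degree for an undirected graph)."""
    return [sum(col) for col in zip(*a)]


# Hypothetical example graph with four nodes labeled 0..3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
undirected = adjacency_matrix(4, edges)
directed = adjacency_matrix(4, edges, directed=True)

print(out_degrees(undirected))  # row sums equal column sums here
print(out_degrees(directed), in_degrees(directed))
```

For the undirected graph the matrix is symmetric, so row and column sums agree; for the directed graph they generally differ, which is why in-degree and out-degree are read from columns and rows respectively.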

D. Node Similarity Link Prediction Algorithm

1) Similarity algorithm based on node neighbors

a) AA algorithm: The AA algorithm was originally used in information retrieval to measure the similarity between two web pages, and it can now be used in link prediction. The two target nodes are treated like web pages: each node in their set of common neighbors is assigned a weight, and these weights are summed to obtain the similarity between the target nodes. The calculation formula of the AA algorithm is as follows:

$$\mathrm{sim}(x, y) = \sum_{z} \frac{1}{\log |\Gamma(z)|}$$ (1)

$$z \in \Gamma(x) \cap \Gamma(y)$$ (2)

2) Similarity algorithm based on node path

a) Shortest Distance algorithm: The Shortest Distance algorithm defines the similarity between two target nodes from the path information between them. It first finds the paths between the two target nodes, takes the shortest one, and uses the reciprocal of its length as the similarity between the target nodes. The closer the two target nodes are, the more likely a link exists between them. The specific formula is as follows:

$$\mathrm{sim}(x, y) = \frac{1}{\mathrm{Length}(\mathrm{shorts}(x, y))}$$ (3)

In the formula, shorts(x, y) represents the shortest path between node x and node y, and Length represents the length of the path.
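The two similarity definitions in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB implementation. It assumes the standard Adamic-Adar weight 1/log|Γ(z)| with the natural logarithm, where Γ(z) is the set of neighbors of z, and it assumes an unweighted, undirected graph stored as a dictionary of neighbor sets. The helper build_adj, the node labels, and the choice to return 0.0 for unreachable pairs are all illustrative assumptions.

```python
import math
from collections import deque


def build_adj(edges):
    """Undirected graph as a dict mapping each node to its set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def aa_similarity(adj, x, y):
    """AA index: sum of 1 / log(degree) over the common neighbors of x and y.

    For distinct x and y, every common neighbor has degree >= 2, so the
    logarithm is strictly positive.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])


def shortest_distance_similarity(adj, x, y):
    """Reciprocal of the shortest path length between distinct nodes x and y.

    Breadth-first search finds the shortest path in an unweighted graph.
    Returning 0.0 when no path exists is an assumption of this sketch.
    """
    if x == y:
        raise ValueError("x and y must be distinct nodes")
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        if u == y:
            return 1.0 / dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return 0.0


# Hypothetical five-node example graph.
adj = build_adj([("a", "b"), ("a", "c"), ("b", "c"),
                 ("b", "d"), ("c", "d"), ("d", "e")])
print(aa_similarity(adj, "a", "d"))
print(shortest_distance_similarity(adj, "a", "e"))
```

In this example, nodes a and d share the two neighbors b and c, each of degree 3, so their AA score is 2/ln 3, while a and e have no common neighbor and are separated by a shortest path of length 3.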

III. SOCIAL NETWORK LINK PREDICTION ALGORITHM EXPERIMENT BASED ON NODE SIMILARITY

A. Research Purpose

In order to compare the effectiveness of each link prediction index and to provide a basis for optimizing the existing indexes, this section uses MATLAB as a simulation tool to verify the three types of algorithms on six data sets.

B. Experimental Data

Generally, real-life networks can be divided into four categories: biological networks, social networks, technical networks and information networks. In order to prove the effectiveness of the algorithm, this paper selects six real network data sets from the above four types of data sets as the experimental data sets. The characteristics of each data set are as follows:

USAir network, an undirected, weighted network of United States airline routes

C. elegans network, an undirected, unweighted network of the worm C. elegans

Jazz network, an undirected, unweighted collaboration network of jazz musicians

Email network, the email communication network of URV University in Spain

Power Grid network, the power transmission network of the western United States

Yeast network, a protein molecular interaction network

IV. EXPERIMENTAL DATA ANALYSIS

A. Comparison of Link Prediction Accuracy Based on Path Similarity Index

In this paper, three path-based link prediction indexes (LP, Katz, and LHNII), with different parameter values, are tested on six data sets, with AUC as the evaluation indicator. The experimental results are shown in Table I.

TABLE I. COMPARISON OF LINK PREDICTION ACCURACY (AUC) BASED ON PATH SIMILARITY INDEX

Data set     LP     Katz   LHNII
USAir        94%    95%    59%
C. elegans   91%    91%    45%
Jazz         95%    94%    76%
Email        93%    94%    82%
Power grid   94%    95%    60%
Yeast        97%    97%    96%
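The paper does not state the formulas behind the LP and Katz columns of Table I. As a hedged sketch, the code below uses the definitions commonly given in the link prediction literature: the local path index S = A^2 + εA^3 and the Katz index S = βA + β^2 A^2 + β^3 A^3 + ..., here truncated at a maximum path length. The parameter values (eps, beta, max_len), the function names, and the four-node path graph are illustrative assumptions, and the paper's own parameter settings are not known. LHNII is omitted.

```python
def matmul(p, q):
    """Multiply two square matrices stored as lists of lists."""
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def local_path(a, eps=0.01):
    """LP index: paths of length 2 plus eps-weighted paths of length 3."""
    a2 = matmul(a, a)
    a3 = matmul(a2, a)
    n = len(a)
    return [[a2[i][j] + eps * a3[i][j] for j in range(n)] for i in range(n)]


def katz_truncated(a, beta=0.01, max_len=10):
    """Katz index truncated at max_len: sum of beta**l * A**l for l = 1..max_len.

    The full series converges only if beta is smaller than the reciprocal of
    the largest eigenvalue of A; the truncation here is a simplification.
    """
    n = len(a)
    score = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in a]      # current power of A, starting at A^1
    weight = beta                      # current weight, starting at beta^1
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                score[i][j] += weight * power[i][j]
        power = matmul(power, a)
        weight *= beta
    return score


# Hypothetical path graph 0 - 1 - 2 - 3.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
lp = local_path(A)
katz = katz_truncated(A)
print(lp[0][2], lp[0][3])
print(katz[0][2], katz[0][3])
```

On this path graph, both indexes score the pair (0, 2), which is two steps apart, higher than the pair (0, 3), which is three steps apart, reflecting the idea that shorter paths contribute more to similarity.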

As Table I and Figure 1 show, the AUC of the LP and Katz indexes exceeds 90% on every data set, and most values are close to 95%, indicating that the predictions of these two indexes are close to the true values. The LHNII index performs much worse: its AUC drops as low as 45% (C. elegans), and a value below 0.5 means that the prediction is less accurate than random guessing.

Figure 1. Comparison of link prediction accuracy based on path similarity index
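The paper reports AUC values but does not spell out how they are computed. A minimal sketch, assuming the definition commonly used in link prediction: AUC is the probability that a held-out (missing) link receives a higher similarity score than a nonexistent link, with ties counted as one half. The function name and the score lists below are hypothetical, and this version compares all pairs exhaustively rather than sampling.

```python
def link_prediction_auc(missing_scores, nonexistent_scores):
    """AUC = (n_higher + 0.5 * n_equal) / n over all (missing, nonexistent) pairs."""
    higher = 0
    ties = 0
    for m in missing_scores:
        for u in nonexistent_scores:
            if m > u:
                higher += 1
            elif m == u:
                ties += 1
    total = len(missing_scores) * len(nonexistent_scores)
    return (higher + 0.5 * ties) / total


# Hypothetical similarity scores for held-out links and for nonexistent links.
print(link_prediction_auc([2, 1], [1, 0]))
print(link_prediction_auc([1, 1], [1, 1]))
```

Under this definition, an index that assigns scores independently of the true links yields an AUC of about 0.5, which is why a value below 0.5 is read as worse than random prediction.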

B. Comparison of Link Prediction Accuracy of Local Information Similarity Index

In this paper, three link prediction indexes based on the similarity of local node information (CN, AA, and RA) are tested on six data sets, with AUC as the evaluation indicator. The experimental results are shown in Table II.

TABLE II. COMPARISON OF LINK PREDICTION ACCURACY (AUC) OF LOCAL INFORMATION SIMILARITY INDEX

Data set     CN     AA     RA
USAir        95%    96%    97%
C. elegans   91%    94%    95%
Jazz         86%    86%    86%
Email        96%    97%    98%
Power grid   95%    96%    97%
Yeast        91%    92%    92%
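The CN and RA indexes in Table II are not defined explicitly in the text. The sketch below assumes their usual definitions: CN counts the common neighbors of two nodes, and RA sums 1/|Γ(z)| over the common neighbors z, which differs from the AA index of equation (1) only in dropping the logarithm. The function names, the example graph, and the ranking helper are illustrative assumptions.

```python
from itertools import combinations


def common_neighbors(adj, x, y):
    """CN index: number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])


def resource_allocation(adj, x, y):
    """RA index: sum of 1 / degree over the common neighbors of x and y."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])


def rank_candidates(adj, score):
    """Score every unconnected node pair and sort from most to least similar."""
    pairs = [(x, y) for x, y in combinations(sorted(adj), 2) if y not in adj[x]]
    scored = [((x, y), score(adj, x, y)) for x, y in pairs]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical undirected graph as a dict of neighbor sets.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
for pair, value in rank_candidates(adj, resource_allocation):
    print(pair, round(value, 3))
```

Pairs at the top of the ranking are the candidate links the index considers most likely; in this example the pair (a, d), with two common neighbors, is ranked first by both CN and RA.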

Figure 2. Comparison of link prediction accuracy of local information similarity index

It can be seen from Figure 2 that the CN, AA, and RA indexes have higher AUC values than the other indexes. Among them, the RA index has the best prediction effect, performing best on the four data sets USAir, Jazz, Power Grid, and Yeast; it is followed by the AA and CN indexes, which have the best prediction effect on the C. elegans and Email data sets, respectively.

[Figure 1 plot: Comparison of link prediction accuracy based on path similarity index. Axis labels: AUC, percentage (0% to 120%). Series: LP, Katz, LHNII.]

[Figure 2 plot: Comparison of link prediction accuracy of local information similarity index. Axis labels: AUC, percentage (80% to 100%). Data sets: USAir, C. elegans, Jazz, Email, Power Grid, Yeast. Series: RA, AA, CN.]


V. CONCLUSION

Complex network analysis and link prediction are of great significance for research on all kinds of real-world networks and can be applied in practice. Fast and effective link prediction algorithms have always been a research focus for scientists and practitioners. This paper focuses on two existing classes of social network link prediction indexes based on node similarity, path-based and local-information-based, and evaluates their prediction accuracy through experiments.

The experimental results are as follows. First, in the comparison of path similarity indexes, the AUC of both the LP and Katz indexes exceeds 90%, and most values are close to 95%, indicating that the predictions of these two indexes are close to the true values. The LHNII index performs much worse: its AUC can fall below 0.5, which means its predictions are then less accurate than random guessing. Second, in the comparison of local information similarity indexes, the CN, AA, and RA indexes have higher AUC values than the other indexes. Among them, the RA index has the best prediction effect, performing best on the four data sets USAir, Jazz, Power Grid, and Yeast; it is followed by the AA and CN indexes, which have the best prediction effect on the C. elegans and Email data sets, respectively.

ACKNOWLEDGMENTS

Scientific research project of higher education in Hainan Province: Multi-source heterogeneous data fusion link prediction based on LBSN (Hnky2021-51).

REFERENCES

[1] Yu, C., Zhao, X., An, L., & Lin, X. (2017) “Similarity-based link prediction in social networks: a path and node combined approach”, Journal of Information Science, 43(5), 683-695.

[2] Zhang, L., Qi, Z., Lin, G., & Li, X. (2016) “Research on online social network information diffusion detection node selection algorithm based on the random walk model”, Journal of Computational and Theoretical Nanoscience, 13(1), 971-981.

[3] Liu, Z., Li, Y., & Liu, H. (2019) “Link prediction in evolving networks base on information propagation”, IEEE Access, 7, 140451-140459.

[4] Cai, L., Wang, J., He, T., Meng, T., & Li, Q. (2018) “A novel link prediction algorithm based on deepwalk and clustering method”, Journal of Physics: Conference Series, 1069(1), 012131.

[5] Yuan, L., Bin, J. L., Wei, Y. Z., Hu, Z., & Sun, P. (2021) “Transaction prediction in blockchain: a negative link prediction algorithm based on the sentiment analysis and balance theory”, Wireless Communications and Mobile Computing, 2021(1), 1-11.

[6] Rezaeipanah, A., Mokhtari, M. J., & Zadeh, M. B. (2020) “Providing a new method for link prediction in social networks based on the meta-heuristic algorithm”, Information Technology and Management, 1(1), 28-36.

[7] Zhang, L., Yang, L., Hu, G., Pan, Z., & Li, Z. (2016) “Link prediction via sparse gaussian graphical model”, Mathematical Problems in Engineering, 2016, 1-11.

[8] Gupta, A. K., & Sardana, N. (2018) “Prediction of missing links in a social networks: finn (feature integration with node neighbor)”, International Journal of Web Based Communities, 14(1), 1.

[9] Kolomvatsos, K. (2016) “Effective problem solving through fuzzy logic knowledge bases aggregation”, Soft Computing, 20(3), 1-22.

[10] Rezaeipanah, A., & Mojarad, M. (2019) “Link prediction in social networks using the extraction of graph topological features”, International Journal of Communication Networks and Information Security, 7(5), 1-7.

[11] Zhang, W. J. (2018) “Link prediction: node-similarity-based methods”, in Fundamentals of Network Biology, 345-359, doi:10.1142/q0149.

[12] Berahmand, K., & Bouyer, A. (2019) “A link-based similarity for improving community detection based on label propagation algorithm”, Journal of Systems Science and Complexity, 32(3), 737-758.
