Recommendation system for Twitch Social Network graph using Link Prediction Technique


S. A. Sadhana,

Faculty of Management Sciences, Anna University, Chennai, India.

sadhmba@gmail.com

S. Sabena

Department of CSE, Anna University Regional Centre, Tirunelveli, India

sabenazulficker@gmail.com

K. Selvakumar,

Department of Computer Applications, NIT Trichy, Trichy, India

kselvakumar@nitt.edu

L. SaiRamesh

DIST, CEG, Anna University, Chennai.

sairamesh.ist@gmail.com

Abstract—Social networks have become an important part of people's lives. A social network can be represented as a graph that changes over time as nodes and edges are added. In this graph, users are represented as nodes and the relationships between users as edges. The goal of link prediction is to find missing links and predict future relationships. The task of recommending new relationships (edges) to users (nodes) can therefore be framed as link prediction in a graph. We use supervised machine learning with a set of manually extracted features, together with node embeddings generated using node2vec, to increase performance. The trained model is then used to perform link prediction between nodes on the Twitch dataset collected from SNAP. The recommendation problem can be mapped to a binary classification problem in which the two classes are recommend and do not recommend. We propose solving this by generating link predictions and then using the predicted links to recommend streamers. Thus, we aim to train machine learning classification models to predict whether a link exists between any given pair of nodes and to use this prediction for recommendations. The performance of the models is assessed using prediction performance metrics.

Keywords— Twitch, Link Prediction, Graph, Social Network, Machine learning, recommendations, classifier, Supervised learning

I. INTRODUCTION

A social network is formed through interactions and relationships between people who belong to the same group or share similar interests. These relationships can be those of colleagues, friends, acquaintances, family members, followers or business partners, or can be derived from purchase history. A social network can be represented as a graph with individual users as nodes and relationships or associations as edges. In a social network graph, relationships are generally formed through mutual interests in a common group.

The social graph changes frequently over time as new links are established and old connections are broken. Because the connections between nodes change over time, predicting missing relationships and future links between nodes is a vital task in a social network.

In link prediction, the main objective is to predict links that either are not yet established at a given time $t$, or are unknown at time $t$ but known at time $t+1$. The problem can be stated as follows: given a snapshot of a social network graph $G_t$ at some time $t$, predict the edges $E_p$ that are present in graph $G_{t+1}$ but absent in $G_t$, where $G_{t+1}$ is the social network graph after some time interval, at time $t+1$.

Link prediction can be done using a feature vector in $\mathbb{R}^f$ extracted from node attributes through feature engineering on the network graph.

Link prediction has many applications, such as recommendation systems in social graphs and knowledge graphs, as well as in many other fields. A probability is associated with link formation between nodes, which is important because the structure of a social network varies over time. Link prediction can be done using various methods, which we classify into the following major groups.

The first method uses neighbourhood-based metrics, i.e., metrics that measure how similar two nodes are; if two nodes are similar, they are likely to form a link. Examples of neighbourhood-based metrics are the Jaccard coefficient, the Adamic-Adar coefficient and common neighbours.
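As a minimal sketch of the three neighbourhood-based metrics named above, the following standard-library Python computes them on a small undirected toy graph. The toy edge list and the function names are illustrative assumptions and are not taken from the Twitch dataset or from the paper's implementation.

```python
import math

def common_neighbours(adj, u, v):
    """Set of nodes adjacent to both u and v."""
    return adj[u] & adj[v]

def jaccard(adj, u, v):
    """|N(u) & N(v)| / |N(u) | N(v)|, or 0.0 if both neighbourhoods are empty."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def adamic_adar(adj, u, v):
    """Sum of 1 / log(degree(w)) over common neighbours w (degree-1 nodes skipped)."""
    return sum(1.0 / math.log(len(adj[w]))
               for w in adj[u] & adj[v] if len(adj[w]) > 1)

# Hypothetical toy graph, stored as an undirected adjacency dictionary.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
adj = {}
for x, y in toy_edges:
    adj.setdefault(x, set()).add(y)
    adj.setdefault(y, set()).add(x)

cn = common_neighbours(adj, "a", "d")   # {"b", "c"}
jc = jaccard(adj, "a", "d")             # 2 shared out of 3 distinct neighbours
aa = adamic_adar(adj, "a", "d")         # b and c both have degree 3
print(sorted(cn), round(jc, 4), round(aa, 4))
```

A higher score for an unlinked pair such as ("a", "d") suggests that the pair is a stronger candidate for a future link.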

The second method uses path-based metrics to predict links, since the sequence in which nodes are connected in a graph can be used to find similar nodes. These methods are also called kernel-based methods, as they rely on the structure of the graph.

In node-embedding-based methods, information about nodes and their neighbours is used to find the similarity among nodes. Node2vec [9] is one such algorithm used to generate node embeddings.

Node2vec improves on random walks with a flexible notion of network neighbourhood, which leads to richer node embeddings.
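The flexible neighbourhood in node2vec comes from a biased second-order random walk controlled by a return parameter p and an in-out parameter q. The sketch below illustrates only that walk, in standard-library Python; the toy graph, the function name `node2vec_walk` and the parameter values are assumptions for illustration. In node2vec, the resulting walks are then fed to a skip-gram model to learn the embeddings, which is not shown here.

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """One biased walk: weight 1/p to return to the previous node, 1 to move to a
    neighbour of the previous node, and 1/q to move further away from it."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)      # distance 0 from prev
            elif x in adj[prev]:
                weights.append(1.0)          # distance 1 from prev
            else:
                weights.append(1.0 / q)      # distance 2 from prev
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Hypothetical undirected toy graph.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
adj = {}
for x, y in toy_edges:
    adj.setdefault(x, set()).add(y)
    adj.setdefault(y, set()).add(x)

walk = node2vec_walk(adj, "a", length=6, p=0.5, q=2.0, rng=random.Random(7))
print(walk)
```

A small q pushes walks outward (closer to depth-first exploration), while a large q keeps them near the start node (closer to breadth-first exploration).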

Twitch is an online live-streaming site that allows streamers to broadcast live video game gameplay to millions of users; it also allows gamers to chat and connect with streamers. At any given time there can be 107,800 live Twitch broadcasts running, so choosing which streamer to watch can be difficult; hence there is a need for a recommendation system. We use the link prediction methods mentioned above to recommend streamers to users. As Twitch is a social network, we can use Twitch data to create a graph, as with other social networks. We use the Twitch dataset from SNAP, which was collected in May 2018 using the Twitch API. There are 7126 nodes (users) and 35,324 edges (followers). The edges are undirected.

II. LITERATURE SURVEY

In [1], a supervised binary classification model is proposed for recommendation based on link prediction in a social network. To treat the problem as supervised learning, the collected dataset is cast as a two-class classification task. The graph representation of the dataset shows that the number of missing links is much higher than the number of existing links. A shortest-path computation on random samples is applied to balance the two classes. The work in [1] uses similarity-based local features of nodes; these features do not capture the global graph structure.

The authors of [2] presented a new parameter-free measure of the closeness or similarity between a pair of nodes, which depends on the number of paths between the two nodes. This metric assigns a lower score to a node that has a greater number of connections, i.e., one that is more popular. The authors of [3] show that the feasibility of network link prediction is substantially higher than that of calibration strategies, and assess the impact of various components of common evaluation factors on predicting user node participation in a social network. Shen and Chung [4] proposed link sign prediction that preserves structural balance. Their model works with signed networks and focuses more on reconstructing the scarce negative links than the abundant positive links. It designs pairwise constraints to make positively connected nodes much closer than negatively connected nodes in the embedding space.

In [5], the authors proposed an embedding approach that learns the structure of evolution in evolving graphs and can predict unseen links using deep learning. The model learns the temporal changes in the graph using a deep architecture composed of dense and recurrent layers. It learns the evolution patterns of individual nodes and provides an embedding capable of predicting future links.

The paper by Wang Peng et al. [6] is one of the most cited survey papers; it discusses the idea behind link prediction, its applications and different ways to tackle it.

The authors give a general framework for solving the link prediction problem, which involves similarity-based prediction or learning-based prediction. In similarity-based prediction, every possible future link is assigned a score based on the similarity between its nodes, and edges with higher scores are more likely to appear in the future. In the learning-based approach, link prediction is seen as a binary classification problem. Another important paper that we considered was Network Growth and Link Prediction Through an Empirical Lens by Liu et al. [7]. The authors evaluated these algorithms on large, detailed network traces obtained from Facebook, YouTube and RenRen. They firmly established that SVM consistently outperforms all metric-based methods across all three networks. Based on their findings, we chose to implement only machine learning classifiers instead of single similarity metrics.

In [10], the authors designed predictive modelling to extract the real sentiment present in a Twitter stream by applying data analytic techniques. Similarly, article [11] applied semi-supervised modelling to mine customers' opinions about their purchases from online shopping. The same kind of problem is approached in [14], which uses intelligent rules fused with existing machine learning to mine customers' opinions about their online purchases. Articles [12] and [13] discuss grouping customers by extracting the relationships between users through the search keywords in their queries. They use ontology and machine learning techniques to implement the proposed approach. Ambika et al. [15] discuss multi-keyword search by ranking the search results using machine learning approaches.

III. PROPOSED METHODOLOGY

The aim of this paper is to solve the problem of personalized recommendation of streamers that a user can follow. This recommendation problem can be mapped to a binary classification problem in which the two classes are recommend and do not recommend, with features extracted from the dataset. In this paper, we propose solving this by generating link predictions and then using the predicted links to recommend streamers. Thus, we aim to train classification models using machine learning to predict whether a link exists between any given pair of nodes and use this prediction for recommendations.

Figure 1 Block diagram of the proposed system architecture

Link prediction can be done with different features extracted from the dataset in combination with node encodings generated with node2vec [9], followed by supervised machine learning algorithms. The trained model can then be used to recommend new streamers to a user.

The following steps are followed to build and train a supervised machine learning model: Data Collection, Data Pre-processing, Feature Engineering, Node Embedding, Test-Train Split, Building Classifiers and Evaluation.

A. Data Collection

A social network like Twitch is widely used by gamers to live-stream themselves while playing games. The nature of the platform is such that there are a few popular gamers with many followers. We obtained the dataset from SNAP, which contains 50 different datasets [5]. We chose the Twitch dataset as not much link prediction work has been done on it previously. The dataset is reasonably sized, with 7126 nodes. Figure 2 shows a visualization of a random sample of 500 edges from the dataset.

Figure 2 Visualization of a random sample of 500 edges

B. Data Pre-processing

The total number of possible directed edges (ordered node pairs) in the network is 7126 × 7125 = 50,772,750, of which 35,324 are present in the network.

Using all the missing edges from the graph would highly skew the dataset, so we randomly sampled 35,324 missing edges. While sampling these missing edges, we added a condition to consider an edge as missing only if the distance between the source and destination was more than 2, as closely connected users are likely to be mutual friends even if an edge does not already exist. This ensures that the model can properly distinguish between present and absent edges, thus improving its performance. We labelled the edges that are present as 1 and the missing ones as 0, and used this presence or absence of an edge as the target class variable for prediction.
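The sampling condition described above can be sketched as follows. This is an illustrative standard-library version, not the algorithm from Figure 3: it assumes an undirected adjacency dictionary, and the function name `sample_missing_edges` and the toy path graph are hypothetical. In an undirected graph, a distance greater than 2 is equivalent to the two nodes being non-adjacent and sharing no neighbour, which is the check used here.

```python
import random

def sample_missing_edges(adj, k, rng=None, max_tries=100_000):
    """Randomly sample up to k node pairs whose distance is greater than 2."""
    rng = rng or random.Random()
    nodes = sorted(adj)
    sampled = set()
    tries = 0
    while len(sampled) < k and tries < max_tries:
        tries += 1
        u, v = rng.sample(nodes, 2)          # two distinct nodes
        pair = (min(u, v), max(u, v))
        if pair in sampled:
            continue
        if v in adj[u]:                      # distance 1: edge already exists
            continue
        if adj[u] & adj[v]:                  # distance 2: common neighbour
            continue
        sampled.add(pair)
    return sorted(sampled)

# Hypothetical toy graph: the path 0 - 1 - 2 - 3 - 4 - 5.
adj = {i: set() for i in range(6)}
for a, b in zip(range(5), range(1, 6)):
    adj[a].add(b)
    adj[b].add(a)

negatives = sample_missing_edges(adj, k=3, rng=random.Random(42))
# Label present edges as 1 and sampled missing edges as 0.
positives = sorted((a, b) for a in adj for b in adj[a] if a < b)
labelled = [(e, 1) for e in positives] + [(e, 0) for e in negatives]
print(negatives)
```

The `max_tries` guard is an added safety measure so that the loop terminates even when fewer than k valid pairs exist.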

Figure 3 Algorithm to generate missing edges

Figure 4 Distribution of positive and negative class

C. Feature Engineering

We then used the 70,648 edges to extract various features, such as:

1. Page Rank : This is the technique used by Google to rank web pages in its search results. It is a good way to measure how popular a site/node is relative to other sites/nodes. It works by counting the number and quality of links to a page to determine a rough estimate of how important the site is. The underlying assumption is that more important sites are likely to receive more links from other sites [6]. Accordingly, we calculated the page ranks of both the source and destination nodes of each edge, and these formed two of the features we used. This suits our task, as a good streamer will have more links from other users.

2. Shortest Path : For an existing edge, we first delete the edge and then calculate the shortest path between its source and destination nodes. The intuition behind this feature is that nodes which are close to each other have shorter path lengths, indicating that they are likely to be good recommendation candidates.

3. Follows Back : This feature simply indicates whether a reversely directed edge exists in the network for each existing edge, i.e., whether a user is followed back by one of their followees.

4. Follower & Followee Counts : These features are the numbers of followers and followees of the source and destination nodes. The intuition here is that popular streamers have a large number of followers and are good choices for recommendation candidates.

5. Inter Followers & Followee Counts : These features are the numbers of common followers and followees between the source and destination nodes of an edge.
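The features listed above can be sketched in standard-library Python as shown below. This is a minimal illustration under stated assumptions: an edge (u, v) is read as "u follows v", the PageRank is a plain power iteration with a damping factor of 0.85, and all function names, feature names and the toy edge list are hypothetical rather than the paper's implementation.

```python
from collections import deque

def build_graph(edges):
    """Return (succ, pred): followees and followers of every node."""
    succ, pred = {}, {}
    for u, v in edges:
        for n in (u, v):
            succ.setdefault(n, set())
            pred.setdefault(n, set())
        succ[u].add(v)
        pred[v].add(u)
    return succ, pred

def pagerank(succ, damping=0.85, iters=50):
    """Power-iteration PageRank; rank of dangling nodes is spread uniformly."""
    n = len(succ)
    rank = {v: 1.0 / n for v in succ}
    for _ in range(iters):
        dangling = sum(rank[v] for v in succ if not succ[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in succ}
        for u in succ:
            for v in succ[u]:
                new[v] += damping * rank[u] / len(succ[u])
        rank = new
    return rank

def shortest_path_without_edge(succ, src, dst):
    """BFS distance from src to dst ignoring the direct edge; -1 if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in succ[node]:
            if node == src and nxt == dst:
                continue                      # skip the edge being predicted
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1

def edge_features(succ, pred, rank, u, v):
    return {
        "pagerank_src": rank[u],
        "pagerank_dst": rank[v],
        "shortest_path": shortest_path_without_edge(succ, u, v),
        "follows_back": int(u in succ[v]),
        "followers_src": len(pred[u]),
        "followees_src": len(succ[u]),
        "followers_dst": len(pred[v]),
        "followees_dst": len(succ[v]),
        "inter_followers": len(pred[u] & pred[v]),
        "inter_followees": len(succ[u] & succ[v]),
    }

# Hypothetical toy follower graph.
succ, pred = build_graph([(1, 2), (2, 1), (1, 3), (3, 2), (4, 2), (4, 1)])
rank = pagerank(succ)
feats = edge_features(succ, pred, rank, 1, 2)
print(feats)
```

Ignoring the direct edge during the breadth-first search has the same effect as deleting it first, without mutating the graph.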


Figure 5 Heatmap showing the feature correlation

D. Node Embedding

Node embedding is a technique used to represent a node as a vector such that two similar nodes are also close in the embedding space. Node2vec [9] is one such popular technique for graph node embedding. The encoded vector can capture both kernel and node features of the graph. We use node2vec to generate node embeddings for training our classifiers, along with our manually extracted features, to capture details missed by the extracted features.
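One way to feed node embeddings to an edge classifier is to combine the two endpoint vectors into a single edge vector and append the manual features. The paper does not state which combination was used, so the sketch below assumes the Hadamard (element-wise) product, one of the binary operators discussed in the node2vec paper; the embedding values and the function name `edge_vector` are made up for illustration.

```python
def edge_vector(embeddings, u, v, manual_features):
    """Hadamard product of the two node embeddings, followed by manual features."""
    hadamard = [a * b for a, b in zip(embeddings[u], embeddings[v])]
    return hadamard + list(manual_features)

# Hypothetical 2-dimensional embeddings and two manual features
# (for example, shortest path length and follows-back flag).
embeddings = {"a": [1.0, 2.0], "b": [3.0, -1.0]}
x = edge_vector(embeddings, "a", "b", manual_features=[2, 1])
print(x)
```

Concatenating the two embeddings instead would also work but doubles the number of embedding-derived inputs.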

E. Test-Train Split

In order to test the model on unseen data, it is necessary to split the data and this should be done randomly to avoid bias in the training or the testing phase. Random split ensures that the model is trained on edges which belong to both the classes (1 and 0). We have used 70% of the data for training and the remaining 30% for testing the model.
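A random 70/30 split of the labelled edges can be sketched with the standard library as follows. The function name, the fixed seed and the toy samples are illustrative assumptions.

```python
import random

def train_test_split(samples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the samples and split off the test fraction."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labelled edges: (edge id, label), 10 of each class.
samples = [(i, 1) for i in range(10)] + [(i, 0) for i in range(10, 20)]
train, test = train_test_split(samples)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, so that different models are compared on the same test edges.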

F. Build Models

For this classification problem, we have trained four models namely, Logistic Regression, Random Forest, Support Vector Machine and XGBoost. In this phase, we have used Grid Search on each model in order to estimate the best parameters. Using Grid Search with cross validation we train the model on several possible combinations of parameters and use the validation accuracy to determine the best parameters. We then use this best model to predict on the test data.
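The grid search with cross-validation described above can be sketched generically as follows. To stay self-contained, this sketch uses a deliberately trivial, hypothetical classifier (a threshold on a single feature, such as shortest path length) rather than the models named in the paper; the helper names, fold scheme and toy data are all assumptions.

```python
from itertools import product

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) for k interleaved folds over n samples."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def grid_search(fit, score, X, y, grid, k=3):
    """Try every parameter combination; keep the best mean validation score."""
    best_params, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        scores = []
        for train, val in k_fold_indices(len(X), k):
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            scores.append(score(model, [X[i] for i in val], [y[i] for i in val]))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score

# Toy "model": predict a link (1) when the feature is at most `threshold`.
def fit_threshold(X_train, y_train, threshold):
    return threshold          # nothing is learned; only the hyperparameter matters

def accuracy(model, X_val, y_val):
    preds = [int(x <= model) for x in X_val]
    return sum(p == t for p, t in zip(preds, y_val)) / len(y_val)

X = [1, 2, 2, 3, 5, 6, 7, 8, 2, 6, 1, 7]
y = [int(x <= 3) for x in X]
best_params, best_score = grid_search(fit_threshold, accuracy, X, y,
                                      grid={"threshold": [1, 3, 5]})
print(best_params, best_score)
```

The selected parameters would then be used to refit the model on the full training set before predicting on the held-out test data.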

IV. RESULTS AND DISCUSSION

We experimented with a total of 10 machine learning models, using two different input training datasets for these 10 algorithms. In the first set, we trained the models with only the manually extracted features. In the second, we added node embeddings to the manually extracted features. The output is 1 or 0, where 1 means that an edge (link) exists between the first and second node, whereas 0 represents the absence of a link. The results of these two sets of experiments are given below. The tables contain the F1 score, accuracy, false negative and false positive values of the 10 models. As our dataset by nature has more negative-class samples, we cannot rely on accuracy alone to compare classifiers: any classifier that classifies all inputs as negative would also have more than 70% accuracy.

Hence, we compare the F1 scores of the models. The model with the higher F1 score is considered better. If two models have the same F1 score, we compare false negatives, as it is crucial that we do not miss any links. We can tolerate comparatively more false positives, as users can choose not to take recommendations they are not interested in, but we should try to show all possible recommendations. So we focus on the F1 score and the false negative count to compare models.
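The metrics used for comparison can be computed from the confusion counts as in the sketch below. The labels and predictions are made-up illustrative values, not results from the paper; the second prediction vector shows why accuracy alone can look acceptable for a classifier that never predicts a link.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means a link exists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no positives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical labels: 4 existing links and 6 missing links.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_model = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]     # one missed link, one false alarm
y_all_negative = [0] * 10                    # never recommends anything

tp, fp, fn, tn = confusion_counts(y_true, y_model)
model_acc, model_f1 = accuracy(tp, fp, fn, tn), f1_score(tp, fp, fn)

tp0, fp0, fn0, tn0 = confusion_counts(y_true, y_all_negative)
base_acc, base_f1 = accuracy(tp0, fp0, fn0, tn0), f1_score(tp0, fp0, fn0)
print(model_acc, model_f1, fn, base_acc, base_f1, fn0)
```

In this toy case the all-negative baseline still reaches 60% accuracy, but its F1 score is 0 and it misses every link, which is why F1 and the false negative count are the more informative criteria here.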

Table 1 shows the results with only the manually extracted features. All algorithms have an accuracy above 95%. From these results it is clear that we get the best results with decision-tree-based classifiers such as XGBoost, LightGBM and ExtraTrees. LightGBM is the best model, as it has the lowest number of false negatives. Accuracy indicates how often the model's predictions are correct, while the false negative count indicates how many existing links the model misses.

Table 1 Performance results without node embeddings

Next, we add node embeddings to the extracted features as input for training the models. Adding node embeddings decreased the accuracy of the SVM, KNN and naïve Bayes models, as these models do not work well with a larger number of input variables. There is a significant improvement in the performance of the LightGBM classifier, with only 163 false negatives compared to 206 without node embeddings, as shown in Table 2.

Table 2 Performance results with node embeddings

V. CONCLUSION AND FUTURE WORK

We have presented a way of detecting whether a link will be formed in the future in a social network graph and used this information to recommend streamers on the Twitch social network. The LightGBM model performed best among all models tested. Adding node embeddings to the input dataset of manually extracted graph features increases the performance of the model. As node embeddings worked well, in future we can also try deep-learning-based graph neural networks (GNNs). Such prediction has several other applications, such as predicting a disease outbreak, suggesting alternate routes, or making recommendations on websites like Netflix and Amazon.

VI. REFERENCES

[1] Behera, D.K., Das, M., Swetanisha, S. et al. Follower Link Prediction Using the XGBoost Classification Model with Multiple Graph Features. Wireless Pers Commun (2021). https://doi.org/10.1007/s11277-021-08399-y

[2] Ayoub J, Lotf D, El Marraki M, Hammouch A (2020) Accurate link prediction method based on path length between a pair of unlinked nodes and their degree. Soc Netw Anal Min 10(1):9–22. https://doi.org/10.1007/s13278-019-0618-2

[3] Guanghui Wang, Yufei Wang, Jimei Li, Kaidi Liu, A multidimensional network link prediction algorithm and its application for predicting social relationships, Journal of Computational Science, Volume 53, 2021, https://doi.org/10.1016/j.jocs.2021.101358

[4] X. Shen and F. Chung, "Deep Network Embedding for Graph Representation Learning in Signed Networks," in IEEE Transactions on Cybernetics, vol. 50, no. 4, pp. 1556-1568, April 2020, doi: 10.1109/TCYB.2018.2871503.

[5] Palash Goyal, Sujit Rokka Chhetri, Arquimedes Canedo, “dyngraph2vec: Capturing network dynamics using dynamic graph representation learning” in Knowledge-Based Systems, Volume 187, 2020, https://doi.org/10.1016/j.knosys.2019.06.024.

[6] W. P. X. B. W. Y. Z. XiaoYu, “Link prediction in social networks: the state-of-the-art,” vol. 58, no. 1, pp. 1–38, 2015 [Online]. Available: http://lib.cqvip.com/qk/84009A/201501/663405989.html

[7] Q. Liu, S. Tang, X. Zhang, X. Zhao, B. Zhao, and H. Zheng, “Network Growth and Link Prediction Through an Empirical Lens,” 2016, pp. 1–15 [Online]. Available: http://dl.acm.org/citation.cfm?id=2987452

[8] M. Young, The Technical Writer’s Handbook. Mill Valley, CA: University Science, 1989.

[9] Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 855-864).

[10] Sulthana, A. Razia, A. K. Jaithunbi, and L. Sai Ramesh. "Sentiment analysis in twitter data using data analytic techniques for predictive modelling." In Journal of Physics: Conference Series, vol. 1000, no. 1, p. 012130. IOP Publishing, 2018.

[11] Sadhana, S. A., L. SaiRamesh, S. Sabena, S. Ganapathy, and A. Kannan. "Mining target opinions from online reviews using semi-supervised word alignment model." In 2017 Second International Conference on Recent Trends and Challenges in Computational Models (ICRTCCM), pp. 196-200. IEEE, 2017.

[12] Selvakumar, K., L. Sai Ramesh, and A. Kannan. "Enhanced K-means clustering algorithm for evolving user groups." Indian Journal of Science and Technology 8, no. 24 (2015): 1.

[13] Selvakumar, K., and L. Sairamesh. "User query-based automatic text summarization of web documents using ontology." In International Conference on Communication, Computing and Electronics Systems: Proceedings of ICCCES 2020, pp. 593-599. Springer Singapore, 2021.

[14] SA, Sadhana. "Customer’s opinion mining from online reviews using intelligent rules with machine learning techniques." Concurrent Engineering 30, no. 4 (2022): 344-352.

[15] Ambika, M., N. Mangayarkarasi, Raghuraman Gopalsamy, L. Sai Ramesh, and Kamalanathan Selvakumar. "Secure and Dynamic Multi-Keyword Ranked Search." International Journal of Operations Research and Information Systems (IJORIS) 12, no. 3 (2021): 1-10.