
International Journal of Recent Advances in Engineering & Technology (IJRAET)

_______________________________________________________________________________________________

“Designing a Framework To Detect Twitter Spammers Using Forensic Approach”

¹Ms. Monali Kakde, ²Amol Muley

1,2Computer Science and Engineering, J.D College of Engineering, Nagpur Email: ¹[email protected], ²[email protected]

Abstract- The Internet has become a significant medium today, and its use has increased widely. It reduces manual work and saves time, which is why so many people are connected to it. In recent years Online Social Networks (OSNs) have also developed well and play an equally important role. People use OSNs to share their views on life and to connect with friends and family, and OSNs have become an important part of many people's lives. OSNs should therefore be highly secure to protect each individual's privacy. Although they provide security measures, these measures are limited. Twitter, one of the most popular OSNs, is paying its dues as more and more spammers set their sights on this microblogging site.

Twitter spammers pursue malicious goals such as sending spam and spreading malware. Many researchers, along with the engineers at Twitter, have devoted themselves to keeping Twitter a spam-free online community, but Twitter spammers are evolving to evade existing detection features. In this paper, we first analyse a data set and detect the IP address, MAC address, and host name utilized by Twitter spammers. We also design new detection features to detect more Twitter spammers.

Index Terms- Spam, Twitter, Machine Learning, Social Networking site.

INTRODUCTION

The rapid growth of social networking sites, which are used for communication, sharing images and videos, and storing and managing significant information, attracts cybercriminals who misuse these sites and exploit vulnerabilities for their own benefit. Online accounts are compromised every day to achieve malicious goals such as sending spam and spreading malware.

Impersonators, phishers, scammers, and spammers crop up all the time in Online Social Networks (OSNs) and are hard to identify. Spammers are users who send unsolicited messages to a large number of people, with the intention of advertising a product or compromising the user's system, purely to make money. Along with the usual problems such as spamming, phishing attacks, malware infections, social bots, and viruses, the greater challenge that social networking sites present for users is keeping private data secure and confidential. The purpose of social networking sites is to make information easily available and accessible to others. Regrettably, cybercriminals use this publicly available information to mount targeted attacks. Once attackers gain access to one of a user's accounts, they can easily find more information and use it to access the user's other accounts and the accounts of the user's friends. A lot of research has been done to detect spam profiles in OSNs.

Among social networking sites, Twitter is one of the fastest growing. Its popularity attracts many spammers, who flood users' accounts with large amounts of spam messages. Detecting spam on Twitter has become a challenging task for researchers as well as for Twitter itself. Twitter spam detection covers both detecting spammers and detecting spam links posted by users. In this paper we review the existing techniques for detecting spam users in the Twitter social network. Features for detecting spammers can be user based, content based, or both. The current work aims to detect spam profiles through their IP address, MAC address, and host name.

RELATED WORK

Spam detection has an extensive scope of research, covering the identification of spammers or spam, the prevention of spamming, and the counterbalancing of its effects on media, society, and so on. Kwak et al. presented an exhaustive and qualitative study of the behaviour of Twitter user accounts, such as the variation in the number of followers and followings for normal users and spammers. Cha et al. designed alternative metrics to measure Twitter accounts. M. McCord and M. Chuah showed that user-based and content-based features, which are influenced by Twitter policies, can be used to distinguish between spammers and legitimate users on Twitter. They evaluated the usefulness of these features for spammer detection using traditional classifiers such as Random Forest, Naïve Bayesian, Support Vector Machine, and K-Nearest Neighbor schemes on a collected Twitter dataset. Benevenuto et al. investigated different trade-offs for a classification approach that detects spammers rather than tweets containing spam, along with the impact of different attribute sets, and showed that classifier performance changes with the feature set selected. Nikita Spirin showed the importance of URL-derived feature sets in detecting spammers. Yang et al. focused on analysing the evasion tactics utilized by current Twitter spammers and designed new machine learning features to detect Twitter spammers more effectively. In addition, the authors formalized the robustness of 24 detection features.

TWITTER AS AN OSN

Twitter is a social network service launched on March 21, 2006, with 500 million active users to date who share information. Twitter uses a chirping bird as its logo, hence the name Twitter. Users exchange frequent updates called 'tweets', which are messages of up to 140 characters that anyone can send or read. These tweets are public by default and visible to all those who follow the tweeter. Tweets may contain news, opinions, photos, videos, links, and messages. The following standard Twitter terminology is relevant to our work:

• Tweet: A message on Twitter with a maximum length of 140 characters.

• Followers & Followings: Followers are the users who follow a particular user, and followings are the users whom that user follows.

• Retweet: A tweet that has been reshared with all followers of a user.

• Hashtag: The # symbol is used to tag keywords or topics in a tweet to make it easily identifiable for search purposes.

• Mention: Tweets can include replies to and mentions of other users by preceding their usernames with the @ sign.

• Lists: Twitter provides a mechanism for organising the users you follow into groups.

• Direct Message: Also called a DM, this is Twitter's direct messaging system for private communication among users.
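The terminology above maps onto simple text patterns. The following is a minimal sketch, assuming tweets are available as plain strings; the regular expressions are simplifications chosen for illustration and are not Twitter's actual entity-extraction rules, and the `RT @` prefix check assumes the manual re-tweet convention.

```python
import re

# Simplified patterns (assumptions); real Twitter entity rules are more involved.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
URL_RE = re.compile(r"https?://\S+")


def parse_tweet(text):
    """Return the hashtags, mentions, and URLs found in one tweet."""
    urls = URL_RE.findall(text)
    # Remove URLs first so that '#' or '@' inside a link is not counted.
    stripped = URL_RE.sub(" ", text)
    return {
        "hashtags": HASHTAG_RE.findall(stripped),
        "mentions": MENTION_RE.findall(stripped),
        "urls": urls,
        # Assumes the manual "RT @user" re-tweet convention.
        "is_retweet": stripped.lstrip().startswith("RT @"),
    }


example = parse_tweet("RT @alice: Win a prize! http://example.com/x #free #win")
```

Counts derived from this kind of parsing (number of links, mentions, and hashtags per tweet) are the raw material for the content-based features discussed later.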

TWITTER SPAMMING TECHNIQUES

Twitter Spamming techniques can be divided into two categories:

A. Profile-Based Spamming Techniques

Follow Spam: Follow spam is the act of following a mass of people, not because a user is actually interested in their tweets, but simply to gain attention, to get views of the user's profile (and possibly clicks on URLs therein), or (ideally) to get followed back. Automated programs make this task easier, since they can follow thousands of users within a very short time. In extreme cases, these automated accounts follow so many people that they threaten the performance of the entire system. In less extreme cases, they simply annoy thousands of legitimate users, who receive a notification about the new follower only to find that its interest may not be entirely sincere. These types of accounts can be identified by checking the tweets posted by the users and examining their behaviour. Figure 1 shows an instance of follow spam.

Figure 1: Instance of Follow Spam Technique

Mention Spam: Spammers prefix a tweet with the username of a targeted user, which grabs that user's attention.

B. Content-Based Spamming Techniques:

Trend Abuse: Twitter's API provides a list of the top trends per hour. Spammers include these trending topics in their tweets, so the tweets appear in the timeline and annoy all users, because public accounts can be seen by anyone on Twitter. Figure 2 shows instances of typical (a) trend abuse and (b) multi-trend abuse.

(a) Trend Abuse Scenario (b) Multi-Trend Abuse Scenario

Figure 2: Instances of trend abuse spamming techniques
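As a rough illustration of the two scenarios, a tweet can be checked against the current list of trending topics. This is a sketch under stated assumptions: the trending list is passed in as plain strings (in practice it would come from Twitter's API), the function names and the threshold of two topics are hypothetical, and a match only marks a tweet as a candidate for inspection, since legitimate users also tweet about trends.

```python
def trending_terms_used(tweet_text, trending_topics):
    """Return the trending topics that appear as tokens in a tweet (case-insensitive)."""
    words = {w.strip(".,!?:;").lower() for w in tweet_text.split()}
    return [t for t in trending_topics if t.lower() in words]


def classify_trend_use(tweet_text, trending_topics, multi_threshold=2):
    """Label a tweet as 'multi-trend', 'single-trend', or 'none'.

    multi_threshold is an illustrative assumption, not a value from the literature.
    """
    hits = trending_terms_used(tweet_text, trending_topics)
    if len(hits) >= multi_threshold:
        return "multi-trend"
    if hits:
        return "single-trend"
    return "none"


trends = ["#worldcup", "#oscars", "#news"]
label = classify_trend_use("Cheap pills http://x.example #WorldCup #Oscars", trends)
```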

Trend Setting: Here spammers post a large number of tweets containing a specific word or hashtag, making it a new trending topic.

Fake Re-tweets: In this technique spammers take advantage of Twitter's Re-Tweet convention to make it appear that a spammer's tweet was originally published by another user. These can be identified using Twitter's search capabilities, where re-tweets can be distinguished from original tweets.

Embedding Popular Search Terms: In this technique spammers include popular search terms in their tweets. When a user searches for those terms, these tweets are displayed in the result set, which is again an annoying experience for a legitimate user who does not get the expected results.


Direct Message: This is a traditional spamming technique in which spammers send a personal message directly to the targeted user.

New spamming techniques are still emerging today; the techniques explained above are the most popular ones used by spammers.

TWITTER SPAM DETECTION

Twitter itself offers users several options for reporting spam messages or spammers. Some of them are: reporting a user as a spammer by clicking the “Report @username as Spam” button under the Actions section of a profile's sidebar, reporting a tweet link as spam, and blocking a suspicious user. Twitter also provides guidelines for recognising a spammer and a set of “DON'T” rules for users.

Detection techniques for Twitter spam can be classified into two categories:

A. Detecting Spammers(nodes)

Spammer detection is carried out by applying machine learning algorithms to data extracted through various data mining techniques, using the features specified in feature sets.

This approach to Twitter spam detection proceeds in three steps:

a) Crawling twitter data and Building labelled collection:

Data about the twitter users can be crawled using different approaches. Twitter provides different APIs like REST API, STREAMING API, and SEARCH API.

The data is crawled according to the selected feature set. The collected data is then manually labelled as spam or non-spam by examining the recent tweets and timeline of each user. The links given in each user's tweets are examined and checked manually. As this is a time-consuming process, labelling is done only for a small set of data. Labelling is done on the basis of the selected feature set.

b) Construction of feature sets: One of the crucial and time-consuming tasks in web spam detection systems is feature extraction, which is usually carried out after crawling and during the indexing phase. If fewer features are used to detect spam pages, computational cost is saved and the performance of the system therefore increases. Automated data mining feature selection techniques provide an effective method for selecting the most predictive features from the many features available.

After features are selected by feature selection methods, their effectiveness can be investigated by comparing the accuracy of classification algorithms applied to only the selected features against the accuracy obtained with all features. Ideally, the smaller feature set reaches the same or a higher level of performance.
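One common way to rank features is by information gain. The text does not name a specific selection method, so the following is only a sketch of one plausible choice: each numeric feature is split at its mean (a simplification) and scored by how much the split reduces label entropy. The feature names and toy values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting one numeric feature at `threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (left, right) if part
    )
    return entropy(labels) - remainder


def rank_features(rows, labels):
    """rows: list of dicts mapping feature name -> number. Best feature first."""
    scores = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        threshold = sum(values) / len(values)  # split at the mean (simplification)
        scores[name] = information_gain(values, labels, threshold)
    return sorted(scores, key=scores.get, reverse=True)


# Toy labelled collection (hypothetical values).
rows = [
    {"url_ratio": 0.9, "tweet_count": 10},
    {"url_ratio": 0.8, "tweet_count": 50},
    {"url_ratio": 0.1, "tweet_count": 12},
    {"url_ratio": 0.2, "tweet_count": 48},
]
labels = ["spam", "spam", "legit", "legit"]
ranking = rank_features(rows, labels)
```

In this toy data `url_ratio` separates the two classes perfectly while `tweet_count` carries no information, so a classifier trained on `url_ratio` alone would lose nothing.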

Classification of feature sets will be discussed in the next section. Types of features:

1) Graph-based Features:

• # friends

• # followers

• Reputation score (#friends/#followers)

• Users within a certain distance in the social graph

2) Content-based Features:

• # Duplicate Tweets

• # HTTP Links

• # Replies and Mentions

• # Trending Topics

3) URL-based Features:

• # tweets containing “spammy” URLs

• fraction of tweets containing “spammy” URLs

• # “spammy” URLs

• # unique URLs

Various other types of feature sets, e.g. user-based features, tweet-based features, timeline-based features, and neighbourhood-based features, are used depending on the requirements of the detection system.
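The URL-based features listed above can be computed directly once the URLs of each tweet are known. The sketch below assumes a hypothetical blacklist of known “spammy” URLs given as a set of strings; how such a blacklist is obtained is outside the scope of this example. The count of “spammy” URLs is taken here as the number of occurrences, which is one possible reading of the feature.

```python
def url_features(tweet_urls, blacklist):
    """Compute URL-based features for one account.

    tweet_urls: one list of URLs per tweet (an empty list for tweets without links).
    blacklist:  set of URLs considered "spammy" (hypothetical input).
    """
    n_tweets = len(tweet_urls)
    all_urls = [u for urls in tweet_urls for u in urls]
    spammy_tweets = sum(
        1 for urls in tweet_urls if any(u in blacklist for u in urls)
    )
    return {
        "spammy_tweets": spammy_tweets,
        "spammy_tweet_fraction": spammy_tweets / n_tweets if n_tweets else 0.0,
        # Counts occurrences, not distinct URLs (an assumption).
        "spammy_urls": sum(1 for u in all_urls if u in blacklist),
        "unique_urls": len(set(all_urls)),
    }


blacklist = {"http://bad.example/a"}
timeline = [
    ["http://bad.example/a"],
    [],
    ["http://ok.example", "http://bad.example/a"],
    ["http://ok.example"],
]
features = url_features(timeline, blacklist)
```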

c) Classify spammers and non-spammers using machine learning algorithms: After selecting features, we apply various classification algorithms to the dataset to measure their performance. The results are used to compare the effectiveness of different feature selection methods. For the classification tasks, we use various algorithms such as neural networks, Support Vector Machine (SVM), Naïve Bayesian classifier, decision trees, and logistic regression. The performance of the detection process depends on the right combination of feature sets and machine learning algorithm.
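To make the classification step concrete, the following is a minimal from-scratch sketch of one of the listed algorithms, a Gaussian Naïve Bayes classifier, together with precision and recall for the spam class. It is an illustration on hypothetical toy data, not the implementation used in any of the surveyed work; in practice a library implementation would normally be preferred.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Minimal Gaussian Naive Bayes for numeric feature vectors."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.priors, self.stats = {}, {}
        for label, rows in by_class.items():
            self.priors[label] = len(rows) / len(X)
            cols = list(zip(*rows))
            means = [sum(c) / len(c) for c in cols]
            # A small constant keeps every variance strictly positive.
            variances = [
                sum((v - m) ** 2 for v in c) / len(c) + 1e-9
                for c, m in zip(cols, means)
            ]
            self.stats[label] = (means, variances)
        return self

    def predict(self, row):
        best_label, best_score = None, -math.inf
        for label, (means, variances) in self.stats.items():
            # Log prior plus the sum of per-feature Gaussian log-likelihoods.
            score = math.log(self.priors[label])
            for v, m, var in zip(row, means, variances):
                score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def precision_recall(y_true, y_pred, positive="spam"):
    """Precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical feature vectors: [url_ratio, hashtag_ratio].
X = [[0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
     [0.10, 0.90], [0.20, 0.80], [0.15, 0.85]]
y = ["spam", "spam", "spam", "legit", "legit", "legit"]
model = GaussianNaiveBayes().fit(X, y)
predictions = [model.predict(row) for row in [[0.88, 0.12], [0.12, 0.90]]]
```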

B. Detecting Spam (tweets, links)

In this paper we focus on detecting spammers; detecting spam links falls under the class of web spam detection.

EVASION TACTICS

In spite of the efforts of many researchers and of Twitter to detect and avoid spam (as discussed in Section V), spammers adopt evasion tactics to get around these detection methods. This section discusses the classification of these tactics and the methods adopted for evasion.

A. Evasion Techniques

The main evasion tactics, utilized by the spammers to evade existing detection approaches, can be categorized into the following two types:

a) Profile-based Feature Evasion Tactics:

A common intuition for discovering Twitter spam accounts can originate from accounts’ basic profile information such as number of followers and number of tweets, since these indicators usually reflect Twitter accounts’ reputation. To evade such profile-based detection features, spammers mainly utilize tactics including gaining more followers and posting more tweets.

Gaining More Followers: In general, the popularity of a user can be measured by the number of followers of that account. A higher number of followers commonly implies that more users trust the account and would like to receive information from it. Thus, many profile-based detection features, such as number of followers, fofo ratio (the ratio of the number of an account's followings to its followers), and reputation score, are built on this number. To evade these features or break through Twitter's 2,000 Following Limit Policy, spammers mainly adopt the following strategies to gain more followers. The first approach is to purchase followers from websites. These websites charge a fee and then use a group of Twitter accounts to follow their customers. The specific methods of providing these accounts may differ from site to site. The second approach is to exchange followers with other users. This method is usually assisted by a third-party website, which uses existing customers' accounts to follow new customers' accounts. Since this method only requires Twitter accounts to follow several other accounts in order to gain followers without paying any fee, Twitter spammers can get around the referral clause by creating more fake accounts. In addition, Twitter spammers can gain followers for their accounts by using fake accounts that they create themselves: spammers create a batch of fake accounts and then follow their spam accounts with them. Figure 3 shows an existing website from which users can directly buy followers.

Figure 3: Example of Twitter followers' online trading websites
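The effect of this tactic on the fofo ratio can be shown with a small calculation. The sketch below uses the definition given in the text (followings divided by followers); the detection threshold of 3.0 and the account numbers are hypothetical values chosen only for illustration.

```python
def fofo_ratio(following, followers):
    """Ratio of the number of followings to the number of followers."""
    return following / followers if followers else float("inf")


def exceeds_fofo_threshold(following, followers, threshold=3.0):
    """Hypothetical rule: flag accounts whose fofo ratio is above `threshold`."""
    return fofo_ratio(following, followers) > threshold


# A spam account that follows many users but has few followers is flagged...
flagged_before = exceeds_fofo_threshold(following=1900, followers=50)
# ...but after purchasing 2,000 followers the same rule no longer fires.
flagged_after = exceeds_fofo_threshold(following=1900, followers=50 + 2000)
```

This is why features built only on the follower count are considered easy to evade.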

Posting More Tweets: Tweet-based features are also widely used in existing Twitter spammer detection approaches. To evade them, spammers can post more tweets at regular intervals to behave more like legitimate accounts, in particular by continuing to utilize public tweeting tools or software.

b) Content-based Feature Evasion Tactics:

The percentage of tweets containing URLs is an effective indicator of spam accounts and is utilized in existing work. Many existing approaches design content-based features such as tweet similarity (the number of tweets posted that have similar semantic meaning) and duplicate tweet count (the number of duplicate tweets posted) to detect spam accounts. To evade such content-based detection features, spammers mainly utilize tactics including mixing normal tweets and posting heterogeneous tweets.

Mixing Normal Tweets: Spammers can utilize this tactic to evade content-based features such as URL ratio (the ratio of the number of tweets that contain a link to the total number of tweets posted), unique URL ratio (the ratio of the number of unique URLs posted to the total number of URLs posted), and hashtag ratio (the ratio of tweets containing hashtags to the total number of tweets posted). By using this tactic, spammers are able to dilute their spam tweets and make their accounts more difficult to distinguish from legitimate accounts.
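The three ratios defined above, and the diluting effect of mixing in normal tweets, can be sketched as follows. The regular expressions are simplified assumptions for illustration, and the tweets are made-up examples.

```python
import re

# Simplified patterns (assumptions), not Twitter's actual entity rules.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def content_ratios(tweets):
    """Compute URL ratio, unique URL ratio, and hashtag ratio for a list of tweets."""
    n = len(tweets)
    urls = [u for t in tweets for u in URL_RE.findall(t)]
    with_url = sum(1 for t in tweets if URL_RE.search(t))
    # Strip URLs before looking for hashtags so '#' fragments in links are ignored.
    with_tag = sum(1 for t in tweets if HASHTAG_RE.search(URL_RE.sub(" ", t)))
    return {
        "url_ratio": with_url / n if n else 0.0,
        "unique_url_ratio": len(set(urls)) / len(urls) if urls else 0.0,
        "hashtag_ratio": with_tag / n if n else 0.0,
    }


spam_only = ["Buy now http://spam.example/a #deal"] * 2
mixed = spam_only + ["Good morning", "Lunch time"]

pure = content_ratios(spam_only)
diluted = content_ratios(mixed)
```

Adding two ordinary tweets halves the URL ratio and the hashtag ratio, while the unique URL ratio is unaffected because the normal tweets carry no links.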

Posting Heterogeneous Tweets: Spammers can post heterogeneous tweets to evade content-based features such as tweet similarity and duplicate tweet count. Spammers can utilize public tools to convert a few different spam tweets into hundreds of variant tweets that have the same semantic meaning but use different words.
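A small sketch shows why rewording defeats these two features. The text defines tweet similarity in terms of semantic meaning; the example below substitutes a much simpler lexical measure (Jaccard similarity over word sets) as a stand-in, which is an assumption made for illustration. Exact duplicates are counted after normalising case and whitespace.

```python
from collections import Counter
from itertools import combinations


def normalise(text):
    """Lower-case a tweet and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def duplicate_tweet_count(tweets):
    """Number of tweets that repeat an earlier tweet exactly (after normalising)."""
    counts = Counter(normalise(t) for t in tweets)
    return sum(c - 1 for c in counts.values() if c > 1)


def jaccard(a, b):
    """Word-set overlap between two tweets; a lexical stand-in for semantic similarity."""
    wa, wb = set(normalise(a).split()), set(normalise(b).split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def mean_pairwise_similarity(tweets):
    """Average Jaccard similarity over all pairs of tweets."""
    pairs = list(combinations(tweets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


same_words = jaccard("cheap watches here", "cheap watches today")
reworded = jaccard("cheap watches here", "discount timepieces available")
```

The reworded pair carries the same message yet scores zero, so both the duplicate count and a word-overlap similarity would miss it.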

CONCLUSIONS AND FUTURE WORK

In this paper we have categorised and discussed various types of spamming techniques. The general approach is to detect Twitter spammers and to distinguish between spammers and non-spammers with better performance in terms of accuracy and precision, using the user-based and content-based features employed by detectors. With the spamming techniques and detection methods explained in the previous sections, one should be able to:

1. Identify instances of spam,

2. Prevent spammers, and

3. Counterbalance the effect of spamming.

Spam detection is a never-ending effort. What matters is how quickly and accurately spam can be detected; removing spam completely is a myth. This work can be further extended by building an artificial intelligence system to detect spammers on Twitter using minimal and robust feature sets that impose minimal cost.

CONCLUSION

Many methods have been developed and used by various researchers to find spammers in different social networks. Detection has been based on user-based features, content-based features, or a combination of both. A few authors have also introduced new features for detection. All the approaches have been validated on very small datasets and have not been tested with different combinations of spammers and non-spammers. Combining features for spammer detection has shown better performance in terms of accuracy, precision, recall, etc. than using only user-based or only content-based features.

REFERENCES

[1] C. Yang, R. Harkreader, and G. Gu, “Empirical evaluation and new design for fighting evolving twitter spammers,” in IEEE Transactions on Information Forensics and Security, Vol. 8, No. 8, August 2013.

[2] V. Sridharan, V. Shankar, and M. Gupta, “Twitter games: How successful spammers pick targets,” in Proc. 28th ACSAC, Orlando, FL, USA, 2012.

[3] G. Stringhini, M. Egele, C. Kruegel, and G. Vigna, “Poultry Markets: On the underground economy of Twitter followers,” in Proc. Workshop on Online Social Networks, Helsinki, Finland, 2012.

[4] K. Thomas, C. Grier, V. Paxson, and D. Song, “Suspended accounts in retrospect: An analysis of twitter spam,” in Proc. Internet Measurement Conf. (IMC’11), Berlin, Germany, 2011.

[5] C. Yang, R. Harkreader, and G. Gu, “Die free or live hard? Empirical evaluation and new design for fighting evolving twitter spammers,” in Proc. 14th Int. Symp. Recent Advances in Intrusion Detection (RAID’11), Menlo Park, CA, USA, Sep. 2011.

[6] Z. Chu, S. Gianvecchio, H. Wang, and S. Jajodia, “Who is Tweeting on Twitter: Human, Bot, or Cyborg?,” in Proc. Ann. Computer Security Applications Conf. (ACSAC’10), Orlando, FL, USA, 2010.

[7] C. Yang, R. Harkreader, J. Zhang, S. Shin, and G. Gu, “Analyzing spammers’ social networks for fun and profit—A case study of cyber criminal ecosystem on twitter,” in Proc. 21st Int. World Wide Web Conference, Lyon, France, Apr. 2012.

