文章基本信息

标题：Spam/twitter.
作者：Cosoi, Alexandru Catalin ; Cosoi, Carmen Maria ; Sgarciu, Valentin 等
期刊名称：Annals of DAAAM & Proceedings
印刷版ISSN：1726-9679
出版年度：2009
期号：January
语种：English
出版社：DAAAM International Vienna
摘要：Web 2.0 refers to a perceived second generation of web development and design that facilitates communication, secure information sharing, interoperability, and collaboration on the World Wide Web. Web 2.0 concepts have led to the development and evolution of web-based communities, hosted services, and applications; such as social-networking sites, video-sharing sites, wikis, blogs, and folksonomies.
关键词：Online social networks;Spam (Junk email)

Spam/twitter.

Cosoi, Alexandru Catalin ; Cosoi, Carmen Maria ; Sgarciu, Valentin 等

1. INTRODUCTION

Web 2.0 refers to a perceived second generation of web development and design that facilitates communication, secure information sharing, interoperability, and collaboration on the World Wide Web. Web 2.0 concepts have led to the development and evolution of web-based communities, hosted services, and applications; such as social-networking sites, video-sharing sites, wikis, blogs, and folksonomies.

With the rise of Web 2.0 capabilities, consumers have at their disposal a soapbox of unprecedented reach and power by which to share their brand experiences and opinions, positive or negative, regarding any product or service.

Micro-blogging is a form of multimedia blogging that allows users to send brief text updates or micro-media such as photos or audio clips and publish them, either to be viewed by anyone or by a restricted group, which can be chosen by the user. These messages can be submitted by a variety of means, including text messaging, instant messaging, email, digital audio or the web.

The content of a micro-blog differs from a traditional blog in that it is typically more topical, smaller in aggregate file size (e.g. text, audio or video) but is the same in that people utilize it for both business and individual reasons. Many micro-blogs provide this short commentary on a person-to-person level, or share news about a company's products and services.

Nielsen Online, in the Media Alert study published in October 2008, reported that in September, Twitter was the fastest growing social network, with an increase of 343%.

We present, as an example, the Amsterdam incident:

"The Boeing 737-800, which originated from Istanbul, Turkey, was trying to land at Schiphol when it crashed at about 1040 local time. The plane was carrying about 135 people. The first report on Twitter reportedly came from @nipp, who posted the message "Airplane crash @ Schiphol Airport Amsterdam!!" at 10:42, only 2 minutes after the crash. Barnett said that when CNN saw the image it moved quickly to confirm with Dutch officials that a crash had happened".

At this moment, Twitter spam is still at the early stage of its life cycle similar to the level of email-spam in 2002. We can identify two types of twitter spam: status spam, by adding tweets with content similar to the one found in email spam, and follow spam--by which twitter bots add themselves as followers to real profiles. While the reason to be behind the first type is obvious, the latter starts from the presumption that Twitter users will start following back this bots.

A study made by Cosoi and Petre in 2008 shows that 7 out of 10 users will respond to the "followers" with a follow back action. Once a bot will be followed by a large number of legit profiles, they will start to broadcast spammy tweets. This method succeeds only if this bot is from an entire sub-network of automatic generated profiles, because once it will start generate spammy content, it can easily be blacklisted by the users.

A common approach in dealing with email spam is using disposable email addresses. It can be observed that developing a network of legit followers and starting broadcasting spammy information, will eventually lead to blacklisting, which stand for more or less for "spammers using disposable twitter accounts for spam purposes".

By generating large networks of automated twitter profiles, spammers can create a small community of both automatic and legit profiles. Each dot from this network will try to acquire as many legit profiles as possible. Of course, bot-networks of different spammers might overlap, which will lead to spammers spamming spammers, but in most cases a large number of legit users will add them to the "follow" list. When this network will start spamming, each legit user will have a spammy message on their account. This is annoying for both the user, but also for any twitter memetracker that performs analysis on twitter since a large number of users will link to the spammy URL.

As stated earlier in this paper, the spam difficulty level on twitter right now is similar to that of email spam in 2002--in most cases, each post in a spammy network will be identical, which is highly unlikely to happen in case of legit tweet, even though they link to the same subject. For a memetracker, spammy tweets are rather easy to abolish using a daily report, while in case of the end user this is not such an easy task.

Based on a memetracker technology, we propose a system that will both classify tweets as spam or legit, but also provide a sort description with endorsements about profiles that are about to be followed by the user.

While the first task using memetracker can be easily completed by simply searching and counting identical posts (although this technique might seem to work only for the current state of spam when all spam tweets are identical, we can further extend it by continuously following profiles that once had tweeted a spammy content, even though temporary like a disposable twitter account, and link it to other account from the same network), the latter has to resort to more than just content analysis.

Further on, we will consider the introduction of fractal networks in order to prove the theory that Twitter network acts as a fractal network. Having this approach in mind, our goal would be to identify anomalies in the twitter network, and possibly identify spamming twitter accounts, even though they might tweet the same content or not.

2. PROPOSED METHOD

Scale-free graphs represent a relatively recent investigation topic in the field of complex networks. The concept was introduced by Albert and Barabasi in order to describe the network topologies in which the node connections follow a power law distribution. Common examples of such networks are the living cell (network of chemical substances connected by physical links). Although traditionally large systems were being modeled using the random graph theory developed by Erdos and Renyi [14], during the last few years research has lead to the conclusion that a real network's evolution is governed by other laws: regardless of the network's size, the probability P(k) that a node has k connections to other nodes is a power law:

P(k) = [ck.sup.-[gamma]] (1)

This implies that large networks follow a set of rules in order to organize themselves in a scale-free topology. Barabasi and Albert show the two mechanisms that lead to this property of scale invariance: growth (continuously adding new nodes) and preferential attachment (the likelihood of connecting to existing nodes which already have a large number of links). Therefore, scale-free networks are dominated by a small number of highly connected hubs, which on one hand gives them tolerance to accidental failures, but on the other hand makes them extremely vulnerable to coordinated attacks.

As opposed to a random graph, in which all nodes have approximately the same degree, a scale-free graph contains a few so-called hubs (nodes with a great number of links, like the Britney Spears Twitter Profile with 867333 followers), while de majority of the nodes only have a few connections (50% of the twitter users have an average of 10 connections): this is a power law distribution. In a random network the nodes follow a Poisson distribution with a bell shape, and it is extremely rare to find nodes that have significantly more or fewer links than the average. A power law does not have a peak, as a bell curve does, but it is instead described by a continuously decreasing function. When protted on a double-logarithmic scale, a power law is a straight line.

There are two major ways to compute the dimension of this network: box counting method and the cluster growing method.

For the box counting method, let [N.sub.B] be the number of boxes of linear size [l.sub.B], needed to cover the given network. The fractal dimension [d.sub.B] is then given by [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

This means that the average number of vertices <[M.sub.B] ([l.sub.B])> within a box of size [l.sub.B]

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (2)

By measuring the distribution of N for different box sizes or by measuring the distribution of <[M.sub.B] ([l.sub.B])> for different box sizes, the fractal dimension d- can be obtained by a power law fit of the distribution.

For the cluster growing method, one seed node is chosen randomly. If the minimum distance l is given, a cluster of nodes separated by at most l from the seed node can be formed. The procedure is repeated by choosing many seeds until the clusters cover the whole network. Then the dimension [d.sub.f] can be calculated by

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (3)

where <[M.sub.c]> is the average mass of the clusters, defined as the average number of nodes in a cluster. These methods are difficult to apply to networks since networks are generally not embedded in another space. In order to measure the fractal dimension of networks we need the concept of renormalization.

In order to investigate self-similarity in networks, we use the box-counting method and renormalization. For each size [l.sub.B], boxes are chosen randomly (as in the cluster growing method) until the network is covered, A box consists of nodes separated by a distance l < [l.sub.B]. Then each box is replaced by a node (renormalization). The renormalized nodes are connected if there is at least one link between the un-renormalized boxes. This procedure is repeated until the network collapses to one node. Each of these boxes has an effective mass (the number of nodes in it) which can be used as shown above to measure the fractal dimension of the network.

In order to establish whether these networks are indeed scale-free, we determined the degree-distribution P(k), which is the probability of finding a node with a degree k. As a connection between two nodes, we considered only the links determined by A following B. The obtained distribution is indeed scale-free and satisfies the power law with the exponential: [gamma] = 2.71 which satisfies our condition to be between 2 and 3.

P(k) [approximately equal to] [ck.sup.-[gamma]]

log(P(k)) = ([-[gamma]) log (k) + log(c) y = (-[gamma])x + c (4)

The box number distribution was computed and we obtained the dimension [d.sub.B] = 2.68. This means that this network is indeed a fractal network.

3. CONCLUSIONS

Spam on Twitter is at the beginning of its life cycle. Although comforting, this is not exactly good news, since spammers have an impressive background in writing the perfect spam messages. It is only a matter of months until this communication system will become as spammed as the email system. There is a high chance that right now large networks of future spam content providers are being set up in silence. Current spam samples are easy to identify when analyzed from a broader perspective, but from the user point of view is the same or even harder than in case of email spam. Twitter introduced some limits for the "following" and "followers" section, limits that are not public, in order to stop following spam, but we believe that this method would be more successful if they would also analyze connections between these profiles. Also, this could be a downside in case of VIP profiles like Britney Spears profile on Twitter. Although not a method on its own, by using an overview analysis of this phenomenon and with the help of fractal networks, large bot-nets can be identified and prevented from convincing more legit users to follow them.

4. REFERENCES

Albert, R., Barabasi A., Statistical mechanics of complex networks. Review of modern phishics 47-97

Bausch, S., McGiboney, M. Media Alert. Nielsen-Online (Available from: www.nielsen-online.comi.

Bachman, M. Connecting the dots. Nielsen Online (Available from: www.nielsen-online.com)

Cosoi, A. C., Petre L.G. Workshop on digital social networks. SpamConference 2008, Boston, MIT

Erdos, P., Renyi A., On random graphs. Publ Math. Inst. Hung. Acad. Sci, 290-297

MacManus, R. The Fractal Blogosphere. (Available from: http://www.readwriteweb.com/about_readwriteweb.php Accessed on:2009-03-14)

Ursianu, R., Sandu A., Self-Similarity of scale-free graphs. Proceesings of CSCS 16, Bucharest, Romania, page 121.

***Aberden Group, Research Brief. February 2008, Nielsen Online (Available from: www.nielsen-online.com)

***BBC News, How the Schiphol crash happened. (Available from http://news.bbc.co.uk/2fhi/europe/7910215.stm. Accessed on: 2009-04-10)