Spam/twitter.
Cosoi, Alexandru Catalin ; Cosoi, Carmen Maria ; Sgarciu, Valentin 等
1. INTRODUCTION
Web 2.0 refers to a perceived second generation of web development
and design that facilitates communication, secure information sharing,
interoperability, and collaboration on the World Wide Web. Web 2.0
concepts have led to the development and evolution of web-based
communities, hosted services, and applications; such as
social-networking sites, video-sharing sites, wikis, blogs, and
folksonomies.
With the rise of Web 2.0 capabilities, consumers have at their
disposal a soapbox of unprecedented reach and power by which to share
their brand experiences and opinions, positive or negative, regarding
any product or service.
Micro-blogging is a form of multimedia blogging that allows users
to send brief text updates or micro-media such as photos or audio clips
and publish them, either to be viewed by anyone or by a restricted
group, which can be chosen by the user. These messages can be submitted
by a variety of means, including text messaging, instant messaging,
email, digital audio or the web.
The content of a micro-blog differs from a traditional blog in that
it is typically more topical, smaller in aggregate file size (e.g. text,
audio or video) but is the same in that people utilize it for both
business and individual reasons. Many micro-blogs provide this short
commentary on a person-to-person level, or share news about a
company's products and services.
Nielsen Online, in the Media Alert study published in October 2008,
reported that in September, Twitter was the fastest growing social
network, with an increase of 343%.
We present, as an example, the Amsterdam incident:
"The Boeing 737-800, which originated from Istanbul, Turkey,
was trying to land at Schiphol when it crashed at about 1040 local time.
The plane was carrying about 135 people. The first report on Twitter
reportedly came from @nipp, who posted the message "Airplane crash
@ Schiphol Airport Amsterdam!!" at 10:42, only 2 minutes after the
crash. Barnett said that when CNN saw the image it moved quickly to
confirm with Dutch officials that a crash had happened".
At this moment, Twitter spam is still at the early stage of its
life cycle similar to the level of email-spam in 2002. We can identify
two types of twitter spam: status spam, by adding tweets with content
similar to the one found in email spam, and follow spam--by which
twitter bots add themselves as followers to real profiles. While the
reason to be behind the first type is obvious, the latter starts from
the presumption that Twitter users will start following back this bots.
A study made by Cosoi and Petre in 2008 shows that 7 out of 10
users will respond to the "followers" with a follow back
action. Once a bot will be followed by a large number of legit profiles,
they will start to broadcast spammy tweets. This method succeeds only if
this bot is from an entire sub-network of automatic generated profiles,
because once it will start generate spammy content, it can easily be
blacklisted by the users.
A common approach in dealing with email spam is using disposable
email addresses. It can be observed that developing a network of legit
followers and starting broadcasting spammy information, will eventually
lead to blacklisting, which stand for more or less for "spammers
using disposable twitter accounts for spam purposes".
By generating large networks of automated twitter profiles,
spammers can create a small community of both automatic and legit
profiles. Each dot from this network will try to acquire as many legit
profiles as possible. Of course, bot-networks of different spammers
might overlap, which will lead to spammers spamming spammers, but in
most cases a large number of legit users will add them to the
"follow" list. When this network will start spamming, each
legit user will have a spammy message on their account. This is annoying
for both the user, but also for any twitter memetracker that performs
analysis on twitter since a large number of users will link to the
spammy URL.
As stated earlier in this paper, the spam difficulty level on
twitter right now is similar to that of email spam in 2002--in most
cases, each post in a spammy network will be identical, which is highly
unlikely to happen in case of legit tweet, even though they link to the
same subject. For a memetracker, spammy tweets are rather easy to
abolish using a daily report, while in case of the end user this is not
such an easy task.
Based on a memetracker technology, we propose a system that will
both classify tweets as spam or legit, but also provide a sort
description with endorsements about profiles that are about to be
followed by the user.
While the first task using memetracker can be easily completed by
simply searching and counting identical posts (although this technique
might seem to work only for the current state of spam when all spam
tweets are identical, we can further extend it by continuously following
profiles that once had tweeted a spammy content, even though temporary
like a disposable twitter account, and link it to other account from the
same network), the latter has to resort to more than just content
analysis.
Further on, we will consider the introduction of fractal networks
in order to prove the theory that Twitter network acts as a fractal
network. Having this approach in mind, our goal would be to identify
anomalies in the twitter network, and possibly identify spamming twitter
accounts, even though they might tweet the same content or not.
2. PROPOSED METHOD
Scale-free graphs represent a relatively recent investigation topic
in the field of complex networks. The concept was introduced by Albert
and Barabasi in order to describe the network topologies in which the
node connections follow a power law distribution. Common examples of
such networks are the living cell (network of chemical substances
connected by physical links). Although traditionally large systems were
being modeled using the random graph theory developed by Erdos and Renyi
[14], during the last few years research has lead to the conclusion that
a real network's evolution is governed by other laws: regardless of
the network's size, the probability P(k) that a node has k
connections to other nodes is a power law:
P(k) = [ck.sup.-[gamma]] (1)
This implies that large networks follow a set of rules in order to
organize themselves in a scale-free topology. Barabasi and Albert show
the two mechanisms that lead to this property of scale invariance:
growth (continuously adding new nodes) and preferential attachment (the
likelihood of connecting to existing nodes which already have a large
number of links). Therefore, scale-free networks are dominated by a
small number of highly connected hubs, which on one hand gives them
tolerance to accidental failures, but on the other hand makes them
extremely vulnerable to coordinated attacks.
As opposed to a random graph, in which all nodes have approximately
the same degree, a scale-free graph contains a few so-called hubs (nodes
with a great number of links, like the Britney Spears Twitter Profile
with 867333 followers), while de majority of the nodes only have a few
connections (50% of the twitter users have an average of 10
connections): this is a power law distribution. In a random network the
nodes follow a Poisson distribution with a bell shape, and it is
extremely rare to find nodes that have significantly more or fewer links
than the average. A power law does not have a peak, as a bell curve
does, but it is instead described by a continuously decreasing function.
When protted on a double-logarithmic scale, a power law is a straight
line.
There are two major ways to compute the dimension of this network:
box counting method and the cluster growing method.
For the box counting method, let [N.sub.B] be the number of boxes
of linear size [l.sub.B], needed to cover the given network. The fractal
dimension [d.sub.B] is then given by [MATHEMATICAL EXPRESSION NOT
REPRODUCIBLE IN ASCII]
This means that the average number of vertices <[M.sub.B]
([l.sub.B])> within a box of size [l.sub.B]
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (2)
By measuring the distribution of N for different box sizes or by
measuring the distribution of <[M.sub.B] ([l.sub.B])> for
different box sizes, the fractal dimension d- can be obtained by a power
law fit of the distribution.
For the cluster growing method, one seed node is chosen randomly.
If the minimum distance l is given, a cluster of nodes separated by at
most l from the seed node can be formed. The procedure is repeated by
choosing many seeds until the clusters cover the whole network. Then the
dimension [d.sub.f] can be calculated by
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (3)
where <[M.sub.c]> is the average mass of the clusters,
defined as the average number of nodes in a cluster. These methods are
difficult to apply to networks since networks are generally not embedded
in another space. In order to measure the fractal dimension of networks
we need the concept of renormalization.
In order to investigate self-similarity in networks, we use the
box-counting method and renormalization. For each size [l.sub.B], boxes
are chosen randomly (as in the cluster growing method) until the network
is covered, A box consists of nodes separated by a distance l <
[l.sub.B]. Then each box is replaced by a node (renormalization). The
renormalized nodes are connected if there is at least one link between
the un-renormalized boxes. This procedure is repeated until the network
collapses to one node. Each of these boxes has an effective mass (the
number of nodes in it) which can be used as shown above to measure the
fractal dimension of the network.
In order to establish whether these networks are indeed scale-free,
we determined the degree-distribution P(k), which is the probability of
finding a node with a degree k. As a connection between two nodes, we
considered only the links determined by A following B. The obtained
distribution is indeed scale-free and satisfies the power law with the
exponential: [gamma] = 2.71 which satisfies our condition to be between
2 and 3.
P(k) [approximately equal to] [ck.sup.-[gamma]]
log(P(k)) = ([-[gamma]) log (k) + log(c) y = (-[gamma])x + c (4)
The box number distribution was computed and we obtained the
dimension [d.sub.B] = 2.68. This means that this network is indeed a
fractal network.
3. CONCLUSIONS
Spam on Twitter is at the beginning of its life cycle. Although
comforting, this is not exactly good news, since spammers have an
impressive background in writing the perfect spam messages. It is only a
matter of months until this communication system will become as spammed
as the email system. There is a high chance that right now large
networks of future spam content providers are being set up in silence.
Current spam samples are easy to identify when analyzed from a broader
perspective, but from the user point of view is the same or even harder
than in case of email spam. Twitter introduced some limits for the
"following" and "followers" section, limits that are
not public, in order to stop following spam, but we believe that this
method would be more successful if they would also analyze connections
between these profiles. Also, this could be a downside in case of VIP
profiles like Britney Spears profile on Twitter. Although not a method
on its own, by using an overview analysis of this phenomenon and with
the help of fractal networks, large bot-nets can be identified and
prevented from convincing more legit users to follow them.
4. REFERENCES
Albert, R., Barabasi A., Statistical mechanics of complex networks.
Review of modern phishics 47-97
Bausch, S., McGiboney, M. Media Alert. Nielsen-Online (Available
from: www.nielsen-online.comi.
Bachman, M. Connecting the dots. Nielsen Online (Available from:
www.nielsen-online.com)
Cosoi, A. C., Petre L.G. Workshop on digital social networks.
SpamConference 2008, Boston, MIT
Erdos, P., Renyi A., On random graphs. Publ Math. Inst. Hung. Acad.
Sci, 290-297
MacManus, R. The Fractal Blogosphere. (Available from:
http://www.readwriteweb.com/about_readwriteweb.php Accessed
on:2009-03-14)
Ursianu, R., Sandu A., Self-Similarity of scale-free graphs.
Proceesings of CSCS 16, Bucharest, Romania, page 121.
***Aberden Group, Research Brief. February 2008, Nielsen Online
(Available from: www.nielsen-online.com)
***BBC News, How the Schiphol crash happened. (Available from
http://news.bbc.co.uk/2fhi/europe/7910215.stm. Accessed on: 2009-04-10)