Real-time Classification of Malicious URLs on
Twitter using Machine Activity Data
Pete Burnap, Amir Javed, Omer F. Rana, Malik S. Awan
School of Computer Science and Informatics
Cardiff University
Cardiff, UK
Email: burnapp@cardiff.ac.uk
Abstract—Massive online social networks with hundreds of
millions of active users are increasingly being used by Cyber
criminals to spread malicious software (malware) to exploit vul-
nerabilities on the machines of users for personal gain. Twitter is
particularly susceptible to such activity as, with its 140 character
limit, it is common for people to include URLs in their tweets
to link to more detailed information, evidence, news reports and
so on. URLs are often shortened so the endpoint is not obvious
before a person clicks the link. Cyber criminals can exploit this to
propagate malicious URLs on Twitter, for which the endpoint is a
malicious server that performs unwanted actions on the person’s
machine. This is known as a drive-by-download. In this paper
we develop a machine classification system to distinguish between
malicious and benign URLs within seconds of the URL being
clicked (i.e. ‘real-time’). We train the classifier using machine
activity logs created while interacting with URLs extracted from
Twitter data collected during a large global event – the Superbowl
– and test it using data from another large sporting event – the
Cricket World Cup. The results show that machine activity logs
produce precision performances of up to 0.975 on training data
from the first event and 0.747 on test data from the second event.
Furthermore, we examine the properties of the learned model to
explain the relationship between machine activity and malicious
software behaviour, and build a learning curve for the classifier
to illustrate that very small samples of training data can be used
with only a small detriment to performance.
I. INTRODUCTION
Online social networks (OSNs) (e.g. Twitter, Facebook,
Tumblr) are inherently vulnerable to the risk of collective
contagion and propagation of malicious viral material such
as spreading rumour [18] and antagonistic content [3] fol-
lowing widely publicized emotive events. Another misuse
case includes the spread of malicious software (malware) via
URLs, for which the endpoint is a Web page that contains a
malicious script. When the client device executes the script, it
attempts to exploit a vulnerability in the browser or a plugin
to perform malicious activity on the client device. This action
is commonly referred to as a drive-by-download [19]. The
2013 Security Intelligence Report from Microsoft reported
that malicious Web sites are now the top threat to enterprise
security [15].
Possibly the most prominent example of the injection
of malicious URLs into OSNs is the Koobface worm [25].
Koobface initially spread by using an infected machine to
send messages to Facebook ’friends’ of the infected user,
which included a link to a third-party website that infected
the machine of the user visiting it by installing malicious
software. The worm was effectively executed on a number of
OSNs due to the highly interconnected nature of their users.
Thomas and Nicol subsequently analysed Koobface to identify
the social network accounts used to distribute malicious URLs
and notably identified that current defences flagged only 27%
of threats and took 4 days to respond. During this period,
81% of vulnerable users clicked on Koobface links [25]. This
highlights the ineffectiveness of existing drive-by-download
detection methods and motivates further research into ways in
which malware on OSNs can be identified, and its propagation
monitored and managed.
Being an open, real-time, and highly interconnected micro-
blogging platform, Twitter is one of the most widely used
OSNs. It is becoming a crowd-sourced news reporting platform
[8] and go-to source for the latest information and reaction
following large-scale events, such as natural disasters [20],
political elections [26] and terrorist attacks [3]. Due to the
140 character limit on posts imposed by Twitter, users often
post URLs to provide additional sources of information to back
up claims, give a more complete report of events, or supply
evidence for a statement. Twitter therefore has the potential
to become an environment within which Cyber criminals can
piggyback on large-scale events and lure information seeking
users to malicious endpoints. Identifying which events are most
popular is made easier due to the openly accessible ’trending
topics’ list that can be produced by analysing the stream of
Tweets over time, and which summarises the most frequently
used terms and hashtags [10].
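As a toy illustration of this idea, a trending list can be derived by counting hashtag frequencies over a window of tweets. The sketch below is minimal and hypothetical; real trending algorithms also weight recency and velocity of terms.

```python
from collections import Counter

def trending_hashtags(tweets, top_n=10):
    """Count hashtag frequencies across a window of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(tok.lower() for tok in text.split() if tok.startswith("#"))
    return counts.most_common(top_n)

tweets = [
    "Touchdown! #SuperbowlXLIX",
    "Great defence #SuperbowlXLIX #NFL",
    "Watching the game #NFL",
]
print(trending_hashtags(tweets, top_n=2))
# [('#superbowlxlix', 2), ('#nfl', 2)]
```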
To date the majority of research into malicious content
posted to Twitter has investigated the social network properties
of user accounts, such as friends and followers, to identify
malicious accounts. In this paper we explore whether the
behavioural analysis of malware can lead to the identification
of machine activity-related features to support the near real-
time classification of malicious URLs. The aim of this paper
is (i) to develop a real-time machine classification system
to distinguish between malicious and benign URLs within
seconds of the URL being clicked, and thus possibly before the
full extent of the planned malicious activity can be successfully
executed, and (ii) to examine and interpret the learned model
to explicate the relationship between machine activity and
malicious behaviour exhibited by drive-by-downloads on Twit-
ter. We achieve this by training several machine classification
models using machine activity logs generated while interacting
with URLs extracted from Twitter data collected during two
large sporting events – the Superbowl and the Cricket World
Cup.
ASONAM '15, August 25-28, 2015, Paris, France
2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
ACM ISBN 978-1-4503-3854-7/15/08
DOI: http://dx.doi.org/10.1145/2808797.2809353
First, we present a log-based machine classifier for URLs
posted to Twitter, which can distinguish between malicious
and benign URLs within seconds of the interaction beginning
(i.e. ‘real-time’). This can be used as an early warning tool
to help identify and neutralise a drive-by-download during
large scale events, when it is known people are highly likely
to search for information and click on URLs. As far as we
are aware, this is the first study that provides a comparative
analysis of different types of modelling methods for the run-
time classification of URLs posted to Twitter. Second, the
volume of Twitter data generated around such events makes
it impossible to manually achieve this task, creating data
sampling, storage and analysis challenges. We refine the set
of measurable machine activity metrics to identify the most
predictive features by examining how they are used to make
decisions in the machine classification model. We explain
these features and furthermore, build a learning-curve chart
that demonstrates the sample required to train a model is
much smaller than 100%, suggesting in fact that using 1%
of the training data only has a small detrimental effect on
performance, reducing the overhead on data collection and
mitigating sampling concerns such as ‘how much data is
enough’ for this problem.
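The learning-curve idea can be illustrated with a toy sketch: a stand-in one-feature nearest-centroid classifier trained on growing fractions of a synthetic, well-separated training set. Everything here (the model, data and function names) is invented for illustration and is not the paper's classifier or logs; the point is only the mechanism of evaluating at increasing sample sizes.

```python
def nearest_centroid_accuracy(train, test):
    """Accuracy of a one-feature nearest-centroid classifier (a stand-in model)."""
    centroids = {}
    for label in ("benign", "malicious"):
        values = [x for x, l in train if l == label]
        centroids[label] = sum(values) / len(values)
    hits = sum(1 for x, l in test
               if min(centroids, key=lambda c: abs(x - centroids[c])) == l)
    return hits / len(test)

# Synthetic, well-separated scores: benign near 0-5, malicious near 10-15.
benign = [(i * 0.01, "benign") for i in range(500)]
malicious = [(10 + i * 0.01, "malicious") for i in range(500)]
test = benign[:200] + malicious[:200]
train = [pair for both in zip(benign[200:], malicious[200:]) for pair in both]

# On separable toy data even a 1% sample yields the same accuracy (1.0)
# as the full training set, mirroring the learning-curve observation.
for fraction in (0.01, 0.1, 1.0):
    sample = train[:max(2, int(len(train) * fraction))]
    print(fraction, nearest_centroid_accuracy(sample, test))
```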
II. RELATED WORK
A. Malware Propagation and Social Networks
It has been demonstrated that OSNs play a crucial role
in the spread of information between people, and the social
network structure can have a control effect on the speed of
propagation [13]. The literature contains a strong focus on
social network account features, such as number of friends
and followers. [21] analysed malware on Twitter and studied
the importance of social network connectivity between users,
also incorporating theory on epidemiology. They found that,
even with a low degree of connectivity and a low probability of
clicking links, the propagation of malware was high. [6] also
used infection models from the study of contagious diseases to
simulate user interactions and model the propagation of mal-
ware, finding that highly clustered networks and interactions
between ’friends’ rather than strangers slowed the propagation.
These findings suggest that users seeking information from
strangers on Twitter following a large-scale event would be an
ideal scenario for the spread of malicious content.
[29] analysed the ”cyber criminal ecosystem” on Twitter
using graph metrics including local clustering coefficient,
betweenness centrality, and bi-directional links ratio, and found
that malicious accounts tend to be socially connected forming
small networks, and such accounts are also more likely to be
connected via a ’friendship’ relationship.
B. Classifying Malicious Web Pages
To determine whether a Web page was likely to perform a
malicious activity, [4] used static analysis of scripts embedded
within a Web page to classify pages as malicious or benign.
[5] also studied static code to identify drive-by-downloads,
using a range of code features including sequences of method
calls, code obfuscation mechanisms and redirection techniques.
[7] provided an enhancement to static code analysis to handle
evasive malware that detects the presence of dynamic analysis
systems and tries to avoid detection. The enhancements include
a code similarity detection tool and an evasive behaviour
detector. [17] used client honeypots to interact with potentially
malicious servers and analyse the code using anti-virus and
HTML decomposition. [12] used URL redirect chains across
a corpus of Tweets to identify malicious web site reuse.
[16] developed a more interactive dynamic behaviour based
tool to identify malicious web pages, arguing that code-based
analysis is more manually intensive, and instead monitored the
run-time interactions between a web page and the client ma-
chine for several minutes. If there were persistent-state changes
on the client machine, such as new processes created, registry
files modified or executables created, this behaviour was used
to signal a malicious exploit. The study actually identified a
zero-day exploit of an unpatched java vulnerability that was
operating behind 25 malicious URLs, making a strong case
for dynamic behaviour modelling over static code analysis. [9]
subsequently reported that analysing run-time behaviours has
better performance in terms of detection accuracy over static
analysis. The authors propose a two-step process, using static
analysis to provide a first-pass analysis, before forwarding
potentially malicious web pages for run-time analysis. Of
course, this approach has the potential to overlook zero-day
exploits where new code-based attack signatures are as yet
unknown, but the two-step approach reduces the overhead of
running run-time analysis on all URLs.
To summarise, while the majority of research into the
propagation of malicious content on OSNs has investigated
the social network properties of user accounts, such as friends
and followers, to identify malicious accounts, we are more
concerned with the active analysis of malware behaviour on
Twitter, and understanding the features of machine behaviour
that enable the ’real time’ classification of malicious URLs
posted to Twitter, while reducing false positives. Thus, we are
interested in identifying malware via its behaviour, which is
logged within 5 seconds of the first interaction with the URL,
and at regular intervals for a 5 minute period of interaction.
As far as we are aware, this is the first study on the run-time
classification of URLs posted to Twitter surrounding large-
scale events.
III. DATA COLLECTION AND ANNOTATION
A. Data Collection
One of the key concerns with data-driven models is whether
the results can be generalized to other datasets (and events).
To build a model of malicious behaviour from machine log
data, and avoid a model built solely on the data characteristics
from a single event we collected data from two large-scale
events. Data from one event was used to train a classifier and
build a model, with data from another event being used to test
the model’s generalisability beyond a single event. Sporting
events have been reported to generate large volumes of Twitter
traffic so we collected data at the time of two world events
- the American Football Superbowl, and the cricket World
Cup. Data for our study were collected from Twitter using
its programmatically accessible streaming API using Tweepy
(tweepy.org). We chose Twitter as the OSN for the study
because it supports the collection of 1% of all daily tweets,
which was assumed to be of sufficient bandwidth to collect all
Tweets explicitly mentioning the two events. Large sporting
events were chosen as it has been reported that they have
produced the largest ever volume of Twitter traffic1.
Super Bowl and cricket World Cup tweets were collected
using event related search hashtags (i.e. #superbowlXLIX,
#CWC15). Tweets were also required to contain a URL to
be relevant to our study. On 1st February 2015, the day of the
Super Bowl final, we collected 122,542 unique URLs posted
in tweets containing the event-specific hashtags. As the cricket
World Cup final generated less traffic we also collected at the
time of the semi-finals. In total 7,961 URLs were collected
from the two semi-finals (24th and 26th March) and the final
(29th March).
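The collection step can be sketched as follows. The `extract_urls` and `collect` helpers are hypothetical illustrations of filtering a stream of statuses (such as those yielded by Tweepy's `filter(track=[...])`, whose exact call names vary by library version) down to tweets carrying URLs, using the Twitter API's `entities.urls.expanded_url` field.

```python
from collections import OrderedDict

def extract_urls(tweet):
    """Pull expanded URLs from a tweet's 'entities' payload (Twitter API JSON)."""
    return [u.get("expanded_url") or u.get("url")
            for u in tweet.get("entities", {}).get("urls", [])]

def collect(stream, seen=None):
    """Hypothetical handler: keep the first tweet id per unique URL."""
    seen = seen if seen is not None else OrderedDict()
    for tweet in stream:  # e.g. statuses from a hashtag-filtered stream
        for url in extract_urls(tweet):
            seen.setdefault(url, tweet.get("id"))
    return list(seen)

tweets = [
    {"id": 1, "entities": {"urls": [{"expanded_url": "http://example.com/a"}]}},
    {"id": 2, "entities": {"urls": []}},
    {"id": 3, "entities": {"urls": [{"expanded_url": "http://example.com/a"}]}},
]
print(collect(tweets))  # ['http://example.com/a']
```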
B. Identifying Malicious URLs
The second stage of the experiment involved building up a
system to analyse the activity of all URLs collected from the
Twitter stream and determine whether to annotate the URL
as malicious or benign. To carry out stage two we utilised a
client-side honeypot system [19]. Client honeypots are active
defence systems that visit potentially malicious URLs and log
the system state during the interaction. They can perform static
analysis on Web page code, looking for evidence of malicious
scripts (e.g. [5], [7]). These are generally referred to as low
interaction honeypots. Alternatively, they can perform dynamic
analysis of the interaction behaviour between the Web page
and the client system, looking for evidence of malicious actions
(e.g. [16] [9]). These are known as high interaction honeypots.
Given our intention to build a model of malicious behaviour
and to incorporate potential zero day attacks for which the
code may be previously unseen, but the behaviour is clearly
malicious, we chose to use a high interaction client honeypot.
The Capture HPC toolkit [22] is an open source high inter-
action client honeypot that can be used to undertake run-time
behavioural analysis of malicious pages [1]. With a general
purpose operating system, it is a challenging task to deduce
behaviour from a log file as a large number of events can be
generated (many even when the system is idle). Capture HPC,
on the other hand, restricts logging to events of particular
interest to a user. Once a server has been initiated, it can start
up multiple clients, each within its own virtual machine (via
VMWare) enabling multiple concurrent clients to co-exist in
their own virtual machine. Capture HPC uses an exclusion list
of malicious behaviours and analyses the log files of a client
honeypot after visiting suspicious URLs. Changes to machine
state such as registry, file system or running processes can
lead to the URL being flagged as malicious [19]. Exclusion
lists have limitations in that they require constant updating as
malicious behaviour changes over time. New vulnerabilities are
identified and subsequently attack vectors change. However,
there are rigorous methods for determining the exclusion lists
as discussed in [19].
The methodology for linking streamed Twitter data to a
high interaction honeypot is as follows: (1) connect to the
Twitter Streaming API, send the search term to collect on
(#superbowlXLIX), and specify that only posts containing
URLs should be returned. This returns Tweets as they are
1http://mashable.com/2014/07/09/brazil-germany-world-cup-most-tweeted/
posted. Write the details of the Tweets to a database; (2)
expand shortened URLs and remove duplicates; (3) for every
500 (new) URLs, upload a text file using a (clean) Capture
HPC virtual machine; (4) Capture HPC iterates through the
list, visiting each URL, and keeping the connection open for
5 minutes, logging machine activity on the client machine
and identifying whether the URL is malicious or benign via
reference to its exclusion list. The 5 minute interval is currently
a heuristic to ensure that a large number of sites can be
visited – it makes the significant assumption that any malicious
activity will be triggered within the first 5 minutes of the visit.
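Steps (2) and (3) above, deduplicating URLs and grouping the unseen ones into 500-URL batches for a clean Capture HPC client, can be sketched as below. The `batch_new_urls` helper is hypothetical, and expansion of shortened URLs (following redirect chains) is omitted here.

```python
def batch_new_urls(urls, already_seen, batch_size=500):
    """Deduplicate incoming (already expanded) URLs and group the unseen
    ones into fixed-size batches, ready to be written out as text files
    for a Capture HPC client to visit."""
    fresh = []
    for url in urls:
        if url not in already_seen:
            already_seen.add(url)
            fresh.append(url)
    return [fresh[i:i + batch_size] for i in range(0, len(fresh), batch_size)]

seen = set()
print(batch_new_urls(["u1", "u2", "u1", "u3"], seen, batch_size=2))
# [['u1', 'u2'], ['u3']]
print(batch_new_urls(["u2", "u4"], seen, batch_size=2))
# [['u4']]
```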
C. Architecture for suspicious URL annotation
The Capture HPC client honeypot system is designed to
operate in a Virtual Machine (VM) environment, to ensure
any side effects due to interacting with suspicious URLs
and logging interactions do not carry over to other activities.
Once the orchestration machine was bootstrapped, the Capture
HPC clients established a communication channel between
the machine on which it was running and the endpoint to
which it was directed by the next suspicious URL in the list.
The client honeypot then logged changes to the state of the
system, using file, registry and process monitors running on
the client machine. Based on the logged activity, Capture HPC
determines whether the URL should be classified as benign or
malicious.
Whether visiting a URL, opening an application, or being
idle, the operating system generates hundreds of system events,
many of which would be benign. These events should be
omitted from the log files generated by various monitors. To
implement exceptions, exclusion lists were created that can
be populated based on probable malicious behaviour, and are
portable to different operating systems. By default the Capture
HPC client monitors log all activity; however, this might
result in a large volume of data being recorded, much of
which may not be relevant for flagging malware-based activity.
Therefore, a user can add their own omission or inclusion
rules to the existing list, to enable recording of the events
that are likely to be of interest in the context of a particular
malware analysis; additional detail can be found in Seifert et
al. [22] and Puttaroo et al. [19].
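A minimal sketch of how such an exclusion list can be applied when filtering monitor events follows. The patterns and the `keep_event` helper are invented for illustration; Capture HPC's actual list format and matching rules differ.

```python
import fnmatch

# Hypothetical exclusion patterns in the spirit of Capture HPC's lists:
# routinely observed benign events are dropped; everything else is logged.
EXCLUSIONS = [
    ("file", "C:\\Windows\\Prefetch\\*"),
    ("registry", "HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Explorer\\*"),
]

def keep_event(monitor, target):
    """Return True if a (monitor, target) event should be kept in the log."""
    return not any(m == monitor and fnmatch.fnmatch(target, pattern)
                   for m, pattern in EXCLUSIONS)

print(keep_event("file", "C:\\Windows\\Prefetch\\APP.pf"))  # False (excluded)
print(keep_event("process", "C:\\Temp\\dropper.exe"))       # True (logged)
```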
Capture HPC relies on a malicious activity being executed
for it to assign the ’malicious’ label to a URL. However,
the activity log is collected from the point of first interaction
with the endpoint. This poses the question as to whether the
activities that precede a malicious event contain ’signals’ that
can be identified as being linked to an eventual exploit, and
can be used to flag this risk. Our paper essentially answers this
question by using the log data activities as predictive features
in the URL classification system.
D. Sampling and Feature Identification
From the pre-processed collection of tweets we randomly
sampled 2000 Tweets from each event. The training data,
sampled at regular intervals throughout the Super Bowl data
collection period, contained 1000 URLs identified by Capture
HPC as being malicious, and 1000 benign. This provided a
training set of 2000 tweets. The test set collected around the
time of the cricket World Cup consisted of 891 malicious
URLs and 1,100 benign (sampled 80% from the day of the final
and 10% from each of the semi-finals). To identify features that
were predictive of malicious behaviour we collected machine
log data during the period Capture HPC was interacting with
the URL. The metrics we measured were: i) CPU usage, ii)
Connection established/listening (yes/no), iii) Port Number, iv)
Process ID (number), v) Remote IP (established or not), vi)
Network Interface (type e.g. Wifi, Eth0), vii) Bytes Sent, viii)
Bytes received, ix) Packets Sent, x) Packets Received, and xi)
Time since start of interaction. In total there were 5.5 million
observations recorded from interacting with 2000 tweets (1000
malicious and 1000 benign). Each observation represents a
feature vector containing metrics and whether the URL was
annotated by Capture HPC as malicious or benign.
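The eleven metrics can be flattened into fixed-order feature vectors, one per observation. The sketch below is hypothetical (not the authors' code); the names mirror the attribute labels used in Table I, with `elapsed` standing in for time since the start of the interaction.

```python
# Fixed feature order for the eleven measured machine-activity metrics.
FEATURES = ["cpu", "connection", "portnumber", "processid", "remoteip",
            "network", "bytessent", "bytesrecd", "packetssent",
            "packetsrecd", "elapsed"]

def to_vector(observation):
    """Map one machine-activity log observation (a dict) onto the fixed
    feature order, defaulting missing metrics to 0; the class label is
    kept separately."""
    return [observation.get(name, 0) for name in FEATURES]

obs = {"cpu": 1.3, "connection": 1, "portnumber": 49478,
       "bytessent": 1024, "elapsed": 60}
vec = to_vector(obs)
print(len(vec), vec[0], vec[-1])  # 11 1.3 60
```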
IV. DATA MODELLING
A. Baseline Model Selection
Our data modelling activity was intended: (i) to identify
a model suitable for extracting features from machine activity
logs that would be predictive of malicious behaviour occurring
during an interaction with a URL endpoint, and (ii) to explicate
the relationship between machine activity and malicious be-
haviour. There are a number of methodological considerations
to take into account. First was whether to use a generative
or discriminative model. Our data contains logs of machine
activity, which occurs even when the system is idle, so it
is likely that any log will contain a great deal of ‘noise’ as
well as malicious behaviour. Table I presents a comparison
between the training and testing datasets with respect to the
mean and standard deviation of recorded machine activity.
It illustrates the high variance in the mean recorded values
of CPU usage, bytes/packets sent/received, and ports used
between the two datasets, which suggests identifying similar
measurements between datasets for prediction purposes would
be challenging. The standard deviation in both datasets is
very similar, which suggests the variance is common to both
datasets, but the deviation is high, which suggests a large
amount of ’noise’ in the data.
TABLE I. DESCRIPTIVE STATISTICS FOR TRAIN AND TEST DATASETS
AT T=60
Attribute Mean Std. Dev
Train Test Train Test
cpu 1.255354 6.26 2.144828 2.31
connection .86 .88 .34 .32
portnumber 49478.7 38284.41 18023.01 23180.74
processid 3343.329 3480.35 1947.882 1624
remoteip .86 .88 .34 .32
network 4 4 2 2
bytessent 1.01e+08 3.59e+08 2.06e+08 9.50e+08
bytesrecd 2.87e+08 3.12e+08 8.47e+08 8.90e+08
packetssent 470821.5 2442275 1472258 6659166
packetsrecd 539358 2849133 1843365 7742467
In addition to the ‘noise’ in the data – although the
training and testing datasets contain a well balanced number of
malicious and benign activity logs – the behaviours in both logs
are largely benign, creating a large skew in log activity towards
the benign type. The noise and skew could have an impact on
the effectiveness of a discriminative classifier in identifying
decision boundaries in the space of inputs (i.e. the inputs
may not be linearly separable, which could cause problems
when using a perceptron-type classifier) even after a large number
of iterations (for instance if using a multilayer perceptron
developed using multiple layers of logistic regression). It could
be argued that for more complex relationships, such as multiple
sequential activities leading to a malicious machine exploit, a
generative model would be more appropriate to generate a full
probabilistic model of all variables (possible behaviours) given
a training dataset of machine logs. For example, a Bayesian
approach could be effective at capturing dependencies be-
tween variables over time. Or a Naive approach to Bayesian
modelling may be more suitable by assuming there are no
dependencies, but that the probabilistic value of individual
variables will be enough to determine likely behaviour.
The first phase of data modelling was therefore to conduct
a number of baseline experiments to determine which model
would be most appropriate based on prediction accuracy. We
used the Weka toolkit to compare the predictive accuracy of
(i) generative models that consider conditional dependencies
in the dataset (BayesNet) or assume conditional independence
(Naive Bayes), and (ii) discriminative models that aim to
maximise information gain (J48 Decision Tree) and build
multiple models to map input to output via a number of
connected nodes, even if the feature space is hard to linearly
separate (Multi-layer Perceptron).
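To make the J48 criterion concrete, the sketch below computes information gain for a single threshold split on toy machine-activity rows. The values and helper names are illustrative only, not drawn from the paper's data; the entropy and gain formulas themselves are standard.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature, threshold):
    """Reduction in label entropy from splitting rows on feature <= threshold."""
    left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder

rows = [{"bytesrecd": 10}, {"bytesrecd": 12},
        {"bytesrecd": 900}, {"bytesrecd": 950}]
labels = ["benign", "benign", "malicious", "malicious"]
print(information_gain(rows, labels, "bytesrecd", 100))  # 1.0 (a perfect split)
```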
TABLE II. BAYESNET RESULTS
BayesNet Precision Recall F-Measure
Time Train Test Train Test Train Test
0 0.829 0.660 0.825 0.663 0.819 0.657
30 0.916 0.642 0.915 0.640 0.916 0.641
60 0.916 0.636 0.916 0.636 0.916 0.636
90 0.915 0.667 0.915 0.657 0.915 0.655
120 0.910 0.671 0.910 0.665 0.910 0.664
150 0.903 0.678 0.903 0.664 0.903 0.661
180 0.903 0.661 0.903 0.651 0.903 0.645
210 0.913 0.670 0.910 0.666 0.912 0.664
240 0.860 0.687 0.840 0.680 0.848 0.678
270 0.865 0.684 0.857 0.681 0.858 0.68
TABLE III. NAIVE BAYES RESULTS
Naive Precision Recall F-Measure
Time Train Test Train Test Train Test
0 0.506 0.503 0.510 0.448 0.422 0.373
30 0.592 0.513 0.617 0.500 0.590 0.486
60 0.595 0.545 0.619 0.524 0.594 0.498
90 0.591 0.595 0.616 0.542 0.589 0.490
120 0.585 0.605 0.615 0.560 0.573 0.526
150 0.575 0.620 0.610 0.565 0.559 0.528
180 0.584 0.624 0.615 0.556 0.566 0.503
210 0.574 0.632 0.593 0.579 0.539 0.537
240 0.533 0.504 0.547 0.506 0.470 0.449
270 0.554 0.575 0.558 0.564 0.494 0.552
TABLE IV. J48 DECISION TREE RESULTS
J48 Precision Recall F-Measure
Time Train Test Train Test Train Test
0 0.833 0.721 0.830 0.633 0.830 0.617
30 0.975 0.681 0.975 0.680 0.975 0.680
60 0.973 0.670 0.973 0.670 0.973 0.670
90 0.971 0.686 0.971 0.685 0.971 0.685
120 0.969 0.673 0.968 0.671 0.968 0.671
150 0.964 0.692 0.964 0.688 0.964 0.687
180 0.962 0.700 0.962 0.697 0.962 0.696
210 0.956 0.698 0.956 0.696 0.956 0.696
240 0.872 0.693 0.870 0.683 0.870 0.678
270 0.872 0.695 0.870 0.685 0.870 0.552
TABLE V. MULTI LAYER PERCEPTRON RESULTS
Multi Precision Recall F-Measure
Time Train Test Train Test Train Test
0 0.710 0.655 0.709 0.648 0.709 0.620
30 0.786 0.720 0.788 0.690 0.853 0.680
60 0.971 0.690 0.971 0.668 0.971 0.662
90 0.816 0.726 0.815 0.704 0.809 0.699
120 0.920 0.745 0.900 0.723 0.920 0.719
150 0.950 0.747 0.915 0.725 0.950 0.721
180 0.925 0.743 0.902 0.720 0.925 0.716
210 0.893 0.746 0.873 0.727 0.863 0.723
240 0.912 0.737 0.892 0.721 0.912 0.717
270 0.850 0.747 0.825 0.729 0.850 0.724
B. Baseline Model Results
Results are provided using standard classification metrics,
Precision (a measure of false positives), Recall (a measure
of false negatives) and F-measure (a harmonized mean of P
and R). We present results tables for two classifiers belonging
to each type of model (generative and discriminative), and
each table presents the results when training and testing the
model using machine activity logs split into incremental time
windows (0-270 seconds) with aggregated log results. Note
that t=0 is not actually 0 seconds but 5 seconds after the
connection is opened. The classifiers used are BayesNet (BN),
Naive Bayes (NB), J48 Decision Tree (DT) and Multi Layer
Perceptron (MLP).
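For reference, the three metrics reported in the tables derive from confusion counts as follows; the numbers below are toy values, not the paper's results.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from confusion counts
    (true positives, false positives, false negatives)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Toy counts: 90 malicious URLs flagged correctly, 10 benign flagged
# in error, 30 malicious missed.
p, r, f = prf(tp=90, fp=10, fn=30)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.9 0.75 0.818
```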
We can see from the training performance data (see Tables
II–V) that each model exhibits optimal performance between
t=30 and t=60, with very little improvement after this time.
The low error rates at t=60 in models that consider conditional
dependencies in the training phase suggests three things (i)
the features we are using to build the models are predictive
of malicious behaviour, (ii) malicious activity is probably
occurring within the first 60 seconds of the interaction, (iii) there
are conditional dependencies between the measured variables.
This is a logical result as we would expect certain machine
activity to have some conditional dependencies, for example
CPU usage and Bytes sent/received.
With reference to Figure 1, a chart of the correctly clas-
sified instances over time for the test dataset, the first point
of note is that the model that does not consider dependencies
between input variables (the NB model) performs much worse
than the other models. The model improves over time, but takes
until t=210 to reach peak performance, while other models
begin to plateau or decrease in performance around t=120 (see
Table III). The second point is that the discriminative models
outperform the generative models, suggesting that there are
distinct malicious activities that are linearly separable from
benign behaviour. This means that over time and, most im-
portantly, across different events, we can monitor and measure
specific machine behaviours that can be indicative of malicious
activity when clicking URLs. We cannot make strong claims
about this given that our optimal F-Measure performance was
only 0.724 at time t=270 using the MLP model (see Table
V). However, it is encouraging to see that the MLP model
exhibited a precision performance of 0.720, only slightly below
its optimum level, at time t=30. This demonstrates the model’s
ability to reduce false positives fairly early on in the interaction
(i.e. in real time), though it still takes some time to improve on
false negatives, i.e. missed instances of malicious behaviour.
Fig. 1. Classifier performance over time
V. RESULTS
A. Model Analysis
TABLE VI. NODE WEIGHTS BY CLASS
Inputs Weights (Benign) Weights (Malicious)
Threshold -9.13 9.13
Node 1 -9.49 9.49
Node 2 9.61 -9.61
Node 3 17.66 -17.66
Node 4 26.81 -26.81
Node 5 9.45 -9.45
Node 6 1.96 -1.96
Node 7 8.73 -8.73
Node 8 4.34 -4.34
Node 9 -15.85 15.85
Table VII presents weightings assigned to each attribute
used as a predictive feature in the MLP model. These results
are extracted from the best performing iteration of the MLP
model during the training phase (t=60) to examine how the
model is representing a learned weighting between features.
The model produced 9 hidden nodes and Table VI shows
the weighting given to each node for each class (malicious
or benign), with 6 of the 9 nodes having values above the
threshold value, and nodes 3, 4 and 9 having high weightings
towards a particular class. Node 9 stands out as the most
discriminative positively weighted node for malicious URLs. If we look further into this (see Table VII), we can identify that the Bytes Received variable has the highest weighting in that node. If we look
at Node 3 in comparison, which is more heavily weighted
towards the benign class, we can see that bytes sent/received have similar weightings, but that packets sent/received are negatively weighted in Node 3 and positively weighted in Node 9. This is an interesting finding, as Web endpoints will almost
always send data to the machine visiting them. This model
demonstrates that there are measurable ‘norms’ for the inflow
of packets from Web pages and that there are measurable
deviations from this that can act as predictors of malicious
behaviour, as is happening in Nodes 3 and 9.
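The kind of weight inspection behind Tables VI and VII can be reproduced for any trained MLP. The sketch below assumes a scikit-learn MLPClassifier rather than the authors' tooling, and illustrative feature names; it reports, for each hidden node, its weight towards the output and the input feature that feeds it most strongly.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def node_report(clf, feature_names):
    """Summarise a single-hidden-layer MLP's learned weights.

    coefs_ holds two matrices: input->hidden (n_features x n_hidden)
    and hidden->output (n_hidden x 1 for binary classification)."""
    in2hid, hid2out = clf.coefs_
    report = []
    for node in range(in2hid.shape[1]):
        # The dominant input feature is the one with the largest
        # absolute input->hidden weight for this node.
        dominant = int(np.argmax(np.abs(in2hid[:, node])))
        report.append((node, float(hid2out[node, 0]), feature_names[dominant]))
    return report  # (node index, output weight, dominant feature)
```

A node with a strongly positive output weight whose dominant feature is, say, bytes received would correspond to the Node 9 pattern discussed above.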
CPU has weightings higher than most other variables, and higher than the node threshold, in Nodes 5 and 6, which suggests it is also a predictive feature. It is interesting to note that the CPU weighting is at its highest when the network traffic weights are at their lowest, suggesting that CPU is used as a secondary feature when network traffic is less useful in making a classification decision.
2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
TABLE VII. MLP ANALYSIS
Attribute Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Node 8 Node 9
Threshold 85.07 42.21 -5.97 1.20 1.39 2.96 -109.11 50.20 12.76
CPU 0.46 0.86 1.18 0.33 3.85 123.46 0.32 -9.13 1.26
Connection 78.79 46.15 -0.22 0.01 -0.02 -14.29 -17.17 11.51 -0.13
PortNumber 5.17 0.23 0.28 0.31 0.41 0.68 0.14 -0.56 -0.52
ProcessID -1.10 -0.23 -0.25 -0.18 -0.73 33.42 -0.27 10.33 0.56
RemoteIP 78.77 46.12 -0.17 0.03 -0.03 -14.28 -17.13 11.44 -0.08
NIC=1 -8.72 -14.07 5.97 1.55 -1.40 -11.06 78.26 -30.34 -13.12
NIC=2 -73.87 -35.44 6.54 -1.31 -0.96 6.28 68.90 -48.84 -13.17
NIC=3 -74.14 -35.31 6.13 -1.23 -1.31 6.68 68.95 -49.39 -13.12
NIC=4 -7.43 -13.27 5.86 1.72 -1.56 -10.79 77.76 -29.36 -12.96
NIC=5 -180.28 -69.52 -6.72 -7.22 0.78 -1.13 105.03 -14.35 15.05
NIC=6 -6.48 -13.41 5.93 1.61 -1.46 -11.12 77.80 -29.73 -13.06
NIC=7 -74.16 -30.01 6.26 -1.36 -1.18 6.31 69.09 -49.04 -13.42
BytesSent 894.02 516.07 27.17 35.32 75.38 -13.97 -866.89 392.79 13.09
BytesReceived -29.16 -68.52 125.22 -4.11 150.17 -2.69 98.62 -47.79 88.93
PacketsSent -52.06 -49.15 -11.49 -2.73 32.62 -3.06 104.37 -50.02 18.38
PacketsReceived -59.21 -45.36 -10.36 -1.34 20.60 -2.07 106.07 -49.84 14.90
Furthermore, ProcessID, which records the maximum process ID on the machine, has
its highest weighting in the same Node as CPU (Node 6). This
can be interpreted as malware creating more processes on the
machine and pushing up CPU usage. The Connection attribute, which measures the presence of a remote endpoint connection, is highly weighted in Node 1, which is weighted towards the malicious class. At the same time, the identification of a
remote endpoint address (RemoteIP) is at its highest, and the
BytesSent attribute is extremely high, suggestive of an attack
based on data exfiltration (possibly advanced persistent threat)
across the Ethernet network interface (NIC=5).
B. Sampled Learning
Finally, we investigate how much training data is required
to train an MLP model. Storing Twitter data around ongoing
real-world events is an issue given that events can last several
weeks. Furthermore, if the system were deployed it could run
on a daily basis to monitor malware and retrain learned models.
Less data means less storage space and less computational
time required to extract model features and run models. In
addition, interacting with URLs is a time-intensive process.
Questions could be asked as to whether the training set is missing a significant proportion of malicious activity, given that not all URLs can be visited in real time with the relatively low level of compute resources available to academic
researchers. Malicious endpoints are frequently taken down
and are not accessible after short periods of time following
their deployment in drive-by-downloads. Demonstrating that a
small training sample achieves similar performance to the full sample alleviates this to some degree, as it demonstrates that the most explanatory features are present in the smaller sample.
We retained the full test dataset and sampled (without replacement) from the training data at 1%, 5%, 10%, and increments of 10% up to 100%. Figure 2 illustrates the percentage of correctly classified instances from a 100% sample down to 1%, and shows that samples of 20%, 30% and 40% yield a performance of 68%, only 1% lower than the optimal performance of 69%. The performance using a 1% sample is 63% - a drop of only 5% on the optimal
performance with a complete sample. These results are based
on the mean of two runs for each sample, but it is worth noting
that both runs yielded almost identical results (e.g. for the 1%
sample, run 1 was 63.3% and run 2 was 63.4%).
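The sampled-learning experiment, drawing training subsets without replacement at increasing fractions and testing each retrained model against the full, fixed test set, could be sketched as follows. The fractions, seed and MLP settings here are assumptions rather than the paper's exact configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def learning_curve(X_train, y_train, X_test, y_test,
                   fractions=(0.01, 0.05, 0.1, 0.2, 0.5, 1.0), seed=0):
    """Train on random subsets of the training data (no replacement)
    and score each retrained model against the full test set."""
    rng = np.random.default_rng(seed)
    scores = {}
    for frac in fractions:
        n = max(2, int(len(y_train) * frac))
        idx = rng.choice(len(y_train), size=n, replace=False)  # no replacement
        clf = MLPClassifier(max_iter=1000, random_state=seed)
        clf.fit(X_train[idx], y_train[idx])
        # score() returns the fraction of correctly classified instances,
        # the measure plotted in Fig. 2.
        scores[frac] = clf.score(X_test, y_test)
    return scores
```

Averaging the scores over two seeds would mirror the two-run mean reported above.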
Fig. 2. Correctly classified instances with sampled training data
VI. DISCUSSION AND CONCLUSIONS
We collect data from Twitter, using event-specific hashtags
to sample tweets posted at the time of two real-world sporting
events. Sporting events have been shown to produce large
volumes of microblog posts and within these it is common for
people to include URLs that point to additional information,
such as news articles, images and other sources of evidence to
support the short (max 140 character) post. Given the limited
space in a tweet, Twitter automatically shortens the URL so
the endpoint is not clear from the text. This makes tweets
particularly susceptible to drive-by-downloads – Cyber attacks
that direct users to a malicious endpoint that proceeds to attack
users’ machines. Given the surreptitious nature of these attacks,
the information seeking nature of people at the time of large
events, and the growth of Twitter as a news reporting and
citizen journalism platform, we sought to develop a machine
classifier that could classify a URL as malicious or benign
in ‘real-time’, i.e. within seconds of clicking the link. We
attempted to achieve this using machine activity log data,
such as CPU usage, network traffic and network connection
statistics, as features that could discriminate between malicious
and benign URLs. We also aimed to explicate the relationship
between machine activity and malicious system behaviour
surrounding Twitter event coverage. Finally, given the large
volumes of tweets surrounding events and potential issues with
sampling from all malicious sites due to their short lifespan
and the time intensiveness of interacting with all of them, we
aimed to understand the impact on classification performance
when using much smaller samples.
We built a number of machine classifiers and identified
that a Bayesian model, a Decision Tree (DT) and a Multi
Layer Perceptron (MLP) approach all worked extremely well
during training, achieving over 90% accuracy, up to 97% for
the DT and MLP. However, discriminative models performed
better than generative models in testing, and the MLP model
performed best overall with accuracy of up to 72% using
previously unseen data. The Bayesian approach performed best
in the early stages of the interaction (within 5 seconds of
clicking the URL), achieving 66% accuracy when the model
had the least information available. The high training scores suggest the features used are indeed predictive of malicious
behaviour for a single event. The drop in performance on a
new event suggests attack vectors are slightly different across
events, but with a reasonably high degree of accuracy we
can claim some independence between predictive features and
events, though this should be tested in future with additional
events beyond sports and within everyday ’mundane’ blog
posts to add further weight to this claim.
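The cross-event comparison summarised above, training on one event and testing on another with a generative model against discriminative ones, can be sketched as below. The scikit-learn estimators stand in for the authors' toolchain and are assumptions, as are the feature matrices holding the machine activity attributes.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

def compare_models(X_event1, y_event1, X_event2, y_event2):
    """Fit each model on the first event and report accuracy on the
    previously unseen second event."""
    models = {
        "bayes (generative)": GaussianNB(),
        "decision tree (discriminative)": DecisionTreeClassifier(random_state=0),
        "mlp (discriminative)": MLPClassifier(max_iter=1000, random_state=0),
    }
    return {name: model.fit(X_event1, y_event1).score(X_event2, y_event2)
            for name, model in models.items()}
```

A gap between near-perfect training-event accuracy and lower test-event accuracy in such a comparison would reflect the event-to-event drop in attack vectors discussed above.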
Upon inspecting the decision-making process within the
MLP model, we found evidence to suggest that the key predictive machine activity metric was network activity - particularly
packets sent and received. CPU use and process IDs also had
a clear raised and correlated weighting in the model, as did
the bytes sent from the network when correlating with new
connections to remote endpoints, suggesting data exfiltration
exercises can be distinguished from general data transfer.
A learning curve produced using a range of samples from
the training dataset, while still testing on the full testing
dataset, revealed only a small drop in classification performance when compared to using the full training sample. This
suggests machine log data can be predictive of malicious
system behaviour even with very small samples, alleviating
some concerns over appropriate sampling mechanisms, lack
of a complete log of all Twitter URL activity, and the require-
ment for large amounts of data storage. However, this was a
binary classification task and if we aimed to further classify
malware into different types or families based on its behaviour,
which we intend to in future, we anticipate that the sampling
requirements may be different.
Twitter have recently introduced new policies to protect
their users against harm and it would appear there is a need
for an automated and reliable alert or warning system to help detect malicious URLs in the waves of micro
posts that surround real-world events. This work presents such
an approach and provides some insight into using machine
activity data to predict malicious behaviour within seconds of
interacting with a URL.
ACKNOWLEDGMENT
This work was supported by the Engineering and Physical
Sciences Research Council (EPSRC) Global Uncertainties
Consortia for Exploratory Research in Security (CEReS) programme, grant number: EP/K03345X/1.
... This provides an opportunity for cybercriminals to take advantage of shortened URLs and harm the user's system or get access to their private data [15]. Drive-by download attacks and luring the users into providing private information are considered as growing dilemmas of cyber-attacks especially when cybercriminals target large-scale trending events on different online social platforms [16]. The trending topics and their associated terms or hashtags are the easiest approaches for cybercriminals to lure information-seeking users [17]. ...
... Attacks such as drive-by download attacks can be prevented by studying the behavioral analysis of malware which leads to the identification of malicious URLs [16,18]. Researchers have emphasized the detection of malicious and benign URLs and worked on building machine learning models for the classification of URLs. ...
... Researchers have emphasized the detection of malicious and benign URLs and worked on building machine learning models for the classification of URLs. The classification works on different kinds of features which include machine activity features [16], tweet features [19], URL based features [14,20], lexical features [10], etc. These features help in detecting whether a particular URL is harmful to a user and some works [21] include protecting the users from clicking those URLs. ...
Article
Full-text available
Malicious Uniform Resource Locators (URLs) embedded in emails or Twitter posts have been used as weapons for luring susceptible Internet users into executing malicious content leading to compromised systems, scams, and a multitude of cyber-attacks. These attacks can potentially might cause damages ranging from fraud to massive data breaches resulting in huge financial losses. This paper proposes a hybrid deep-learning approach named URLdeepDetect for time-of-click URL analysis and classification to detect malicious URLs. URLdeepDetect analyzes semantic and lexical features of a URL by applying various techniques, including semantic vector models and URL encryption to determine a given URL as either malicious or benign. URLdeepDetect uses supervised and unsupervised mechanisms in the form of LSTM (Long Short-Term Memory) and k-means clustering for URL classification. URLdeepDetect achieves accuracy of 98.3% and 99.7% with LSTM and k-means clustering, respectively.
... This provides an opportunity for cybercriminals to take advantage of shortened URLs and harm the user's system or get access to their private data [15]. Drive-by download attacks and luring the users into providing private information are considered as growing dilemmas of cyber-attacks especially when cybercriminals target large-scale trending events on different online social platforms [16]. The trending topics and their associated terms or hashtags are the easiest approaches for cybercriminals to lure information-seeking users [17]. ...
... Attacks such as drive-by download attacks can be prevented by studying the behavioral analysis of malware which leads to the identification of malicious URLs [16,18]. Researchers have emphasized the detection of malicious and benign URLs and worked on building machine learning models for the classification of URLs. ...
... Researchers have emphasized the detection of malicious and benign URLs and worked on building machine learning models for the classification of URLs. The classification works on different kinds of features which include machine activity features [16], tweet features [19], URL based features [14,20], lexical features [10], etc. These features help in detecting whether a particular URL is harmful to a user and some works [21] include protecting the users from clicking those URLs. ...
Article
Full-text available
Malicious Uniform Resource Locators (URLs) embedded in emails or Twitter posts have been used as weapons for luring susceptible Internet users into executing malicious content leading to compromised systems, scams, and a multitude of cyber-attacks. These attacks can potentially might cause damages ranging from fraud to massive data breaches resulting in huge financial losses. This paper proposes a hybrid deep-learning approach named URLdeepDetect for time-of-click URL analysis and classification to detect malicious URLs. URLdeepDetect analyzes semantic and lexical features of a URL by applying various techniques, including semantic vector models and URL encryption to determine a given URL as either malicious or benign. URLdeepDetect uses supervised and unsupervised mechanisms in the form of LSTM (Long Short-Term Memory) and k-means clustering for URL classification. URLdeepDetect achieves accuracy of 98.3% and 99.7% with LSTM and k-means clustering, respectively.
... In (Burnap, P., 2015), the authors developed a real-time system to determine the malicious URLs on Twitter based on ML. Machine activity log data features such as CPU usage, network traffic, and network connection statistics after linking streamed Twitter data to a high interaction honeypotNB. ...
Article
In recent years, online social networks (OSNs) have become a huge used platform for sharing activities, opinions, and advertisements. Spam content is considered one of the biggest threats in social networks. Spammers exploit OSNs for falsifying content as part of phishing, such as sharing forged advertisements, selling forged products, or sharing sexual words. Therefore, machine learning (ML) and deep learning (DL) techniques are the best methods for detecting phishing attacks and minimize their risk. This paper provides an overview of prior studies of OSNs spam detection modeling based on ML and DL techniques. The research papers are classified into three categories: the features used for prediction, the dataset size corresponding language used, real-time based applications, and machine learning or deep learning techniques. Challenges and opportunities in phishing attacks prediction using ML and DL techniques are also concluded in our study.
... The dataset contains 2456 instances and 30 attributes. The 30 attributes distributed over four features categories; Address bar, abnormal, HTML and JavaScript, and Domain [17][27] [28]. ...
Article
Full-text available
Maximizing user protection from Phishing website is a primary objective in the design of these networks. Intelligent phishing detection management models can assist designers to achieve this objective. Our proposed model aims to reduce the computational time and increase the security against the phishing websites by applying the intelligent detection model. In this paper, we employed Multilayer Perceptron (MLP) to achieve the highest accuracy and optimal training ratio to maximize internet security. The simulation results show the selection of the most significant features minimize the computational time. The optimal training percentage is 70% as it minimizes the time complexity and it increases the model accuracy.
... The usefulness and quality of this system's output is based on its input data. There are various studies that have used tweets as a data source for security analysis [10], [11], [12] and even more that have used tweets as a general social predictor [13], [14], [15]. In these studies the tweets that are processed are mined based on c 2020 Information Processing Society of Japan tweet hashtags. ...
Article
The Internet is constantly evolving, producing many new data sources that can be used to help us gain insights into the cyber threat landscape and in turn, allow us to better prepare for cyberattacks. With this in mind, we present an end-to-end real-time cyber situational awareness system which aims to retrieve security-relevant information from the social networking site Twitter.com. This system classifies and aggregates the data extracted and provides real-time cyber situational awareness information based on sentiment analysis and data analytics techniques. This research will assist security analysts in rapidly and efficiently evaluating the level of cyber risk in their organization and allow them to proactively take actions to plan and prepare for potential attacks before they happen.
Chapter
Cyber security is a very important requirement for users. With the rise in Internet usage in recent years, cyber security has become a serious concern for computer systems. When a user accesses a malicious Web site, it initiates a malicious behavior that has been pre-programmed. As a result, there are numerous methods for locating potentially hazardous URLs on the Internet. Traditionally, detection was based heavily on the usage of blacklists. Blacklists, on the other hand, are not exhaustive and cannot detect newly created harmful URLs. Recently, machine learning methods have received a lot of importance as a way to improve the majority of malicious URL detectors. The main goal of this research is to compile a list of significant features that can be utilized to detect and classify the majority of malicious URLs. To increase the effectiveness of classifiers for detecting malicious URLs, this study recommends utilizing host-based and lexical aspects of the URLs. Malicious and benign URLs were classified using machine learning classifiers such as AdaBoost and Random Forest algorithms. The experiment shows that Random Forest performs really well when checked using voting classifier on AdaBoost and Random Forest Algorithms. The Random Forest achieves about 99% accuracy.KeywordsCyber securityMalicious URL detectionLexical featuresCount featuresBinary featuresRandom forestAdaBoostMachine learningVoting classifier
Chapter
Digital marketing has become an essential element of higher education institution activities. Accordingly, higher education institutions need to adapt their marketing communications to modern realities. The authors' intention was to highlight strategic and tactical aspects relevant to digital marketing communication processes based on literature review, the research of marketing activity of higher education institutions from Poland and Ukraine on the Internet, and own experience. The authors suggested the conceptual model for digital marketing communication of higher education institution that is universal, applicable to any type of higher education institution regardless of its profile, form of ownership and country.KeywordsDigital marketingCustomer data platformHigher education institutionsMarketing communication modelOmnichannel marketingWeb 2.0
Article
Full-text available
This paper tests disruption strategies in Twitter networks containing malicious URLs used in drive-by download attacks. Cybercriminals use popular events that attract a large number of Twitter users to infect and propagate malware by using trending hashtags and creating misleading tweets to lure users to malicious webpages. Due to Twitter’s 280 character restriction and automatic shortening of URLs, it is particularly susceptible to the propagation of malware involved in drive-by download attacks. Considering the number of online users and the network formed by retweeting a tweet, a cybercriminal can infect millions of users in a short period. Policymakers and researchers have struggled to develop an efficient network disruption strategy to stop malware propagation effectively. We define an efficient strategy as one that considers network topology and dependency on network resilience, where resilience is the ability of the network to continue to disseminate information even when users are removed from it. One of the challenges faced while curbing malware propagation on online social platforms is understanding the cybercriminal network spreading the malware. Combining computational modelling and social network analysis, we identify the most effective strategy for disrupting networks of malicious URLs. Our results emphasise the importance of specific network disruption parameters such as network and emotion features, which have proved to be more effective in disrupting malicious networks compared to random strategies. In conclusion, disruption strategies force cybercriminal networks to become more vulnerable by strategically removing malicious users, which causes successful network disruption to become a long-term effort.
Article
Social networks have generated immense amounts of data that have been successfully utilized for research and business purposes. The approachability and immediacy of social media have also allowed ill-intentioned users to perform several harmful activities that include spamming, promoting, and phishing. These activities generate massive amounts of low-quality content that often exhibits duplicate, automated, inappropriate, or irrelevant content that subsequently affects users’ satisfaction and imposes a significant challenge for other social media-based systems. Several real-time systems were developed to tackle this problem by focusing on filtering a specific kind of low-quality content. In this paper, we present a fine-grained real-time classification approach to identify several types of low-quality tweets (i.e., phishing, promoting, and spam tweets) written in Arabic. The system automatically extracts textual features using deep learning techniques without relying on hand-crafted features that are often time-consuming to be obtained and are tailored for a single type of low-quality content. This paper also proposes a lightweight model that utilizes a subset of the textual features to identify spamming Twitter accounts in a real-time setting. The proposed methods are evaluated on a real-world dataset (40, 000 tweets and 1, 000 accounts), showing superior performance in both models with accuracy and F1-scores of 0.98. The proposed system classifies a tweet in less than five milliseconds and an account in less than a second.
Conference Paper
Full-text available
In this paper we discuss the challenges faced whilst developing exclusion lists for the high-interaction client honeypot, Capture-HPC. Exclusion lists are Capture client system behaviours which are used in the decision making process when determining if a particular behaviour is malicious or benign. As exclusion lists are the main decision making method used by Capture-HPC to classify a given webpage as benign or malicious, we identify a number of issues with current research which are often overlooked. Exclusion lists by nature require constant updating as they are developed to meet the specific requirements of a particular operating system, web browser and application system environment. Any changes to these would mean the possibility of a given client to display different benign behaviour which consequently means new exclusions required. As a result of their specific version requirements, exclusion lists are not transferable from clients. We propose a set of recommendations to aid in the creation of exclusion lists. We also present and discuss some common drive-by-download attacks which we have captured using our Windows 7 compatible exclusion lists.
Article
Full-text available
For social scientists, the widespread adoption of social media presents both an opportunity and a challenge. Data that can shed light on people’s habits, opinions and behaviour is available now on a scale never seen before, but this also means that it is impossible to analyse using conventional methodologies and tools. This article represents an experiment in applying a computationally assisted methodology to the analysis of a large corpus of tweets sent during the August 2011 riots in England.
Article
Full-text available
Little is currently known about the factors that promote the propagation of information in online social networks following terrorist events. In this paper we took the case of the terrorist event in Woolwich, London in 2013 and built models to predict information flow size and sur-vival using data derived from the popular social networking site Twitter. We define information flows as the propaga-tion over time of information posted to Twitter via the action of retweeting. Following a comparison with differ-ent predictive methods, and due to the distribution exhib-ited by our dependent size measure, we used the zero-truncated negative binomial (ZTNB) regression method. To model survival, the Cox regression technique was used because it estimates proportional hazard rates for inde-pendent measures. Following a principal component ana-lysis to reduce the dimensionality of the data, social, temporal and content factors of the tweet were used as predictors in both models. Given the likely emotive reaction caused by the event, we emphasize the influence of emotive content on propagation in the discussion sec-tion. From a sample of Twitter data collected following the event (N = 427,330) we report novel findings that identify that the sentiment expressed in the tweet is statistically significantly predictive of both size and survival of infor-mation flows of this nature. Furthermore, the number of offline press reports relating to the event published on the day the tweet was posted was a significant predictor of size, as was the tension expressed in a tweet in relation to sur-vival. Furthermore, time lags between retweets and the co-occurrence of URLS and hashtags also emerged as significant.
Conference Paper
Full-text available
In recent years, attacks targeting web browsers and their plugins have become a prevalent threat. Attackers deploy web pages that contain exploit code, typically written in HTML and JavaScript, and use them to compromise unsuspecting victims. Initially, static techniques, such as signature-based detection, were adequate to identify such attacks. The response from the attackers was to heavily obfuscate the attack code, rendering static techniques insufficient. This led to dynamic analysis systems that execute the JavaScript code included in web pages in order to expose malicious behavior. However, today we are facing a new reaction from the attackers: evasions. The latest attacks found in the wild incorporate code that detects the presence of dynamic analysis systems and try to avoid analysis and/or detection. In this paper, we present Revolver, a novel approach to automatically detect evasive behavior in malicious JavaScript. Revolver uses efficient techniques to identify similarities between a large number of JavaScript programs (despite their use of obfuscation techniques, such as packing, polymorphism, and dynamic code generation), and to automatically interpret their differences to detect evasions. More precisely, Revolver leverages the observation that two scripts that are similar should be classified in the same way by web malware detectors (either both scripts are malicious or both scripts are benign); differences in the classification may indicate that one of the two scripts contains code designed to evade a detector tool. Using large-scale experiments, we show that Revolver is effective at automatically detecting evasion attempts in JavaScript, and its integration with existing web malware analysis systems can support the continuous improvement of detection techniques.
Article
Twitter is a microblogging website where users read and write millions of short messages on a variety of topics every day. This study uses the context of the German federal election to investigate whether Twitter is used as a forum for political deliberation and whether online messages on Twitter validly mirror offline political sentiment. Using LIWC text analysis software, we conducted a content-analysis of over 100,000 messages containing a reference to either a political party or a politician. Our results show that Twitter is indeed used extensively for political deliberation. We find that the mere number of messages mentioning a party reflects the election result. Moreover, joint mentions of two parties are in line with real world political ties and coalitions. An analysis of the tweets’ political sentiment demonstrates close correspondence to the parties' and politicians’ political positions indicating that the content of Twitter messages plausibly reflects the offline political landscape. We discuss the use of microblogging message content as a valid indicator of political sentiment and derive suggestions for further research.
Conference Paper
In this paper we discuss the challenges faced whilst developing exclusion lists for the high-interaction client honeypot Capture-HPC. Exclusion lists describe expected benign client system behaviours and are used in the decision-making process when determining whether a particular behaviour is malicious or benign. As exclusion lists are the main mechanism Capture-HPC uses to classify a given webpage as benign or malicious, we identify a number of issues that current research often overlooks. Exclusion lists by nature require constant updating, as they are developed to meet the specific requirements of a particular operating system, web browser and application environment. Any change to these can cause a given client to display different benign behaviour, which in turn requires new exclusions. As a result of these specific version requirements, exclusion lists are not transferable between clients. We propose a set of recommendations to aid the creation of exclusion lists. We also present and discuss some common drive-by-download attacks which we have captured using our Windows 7-compatible exclusion lists.
Conference Paper
The rapidly growing online social networking sites have been infiltrated by a large amount of spam. In this paper, I focus on one of the most popular sites, Twitter, as an example to study spam behaviours. To facilitate spam detection, a directed social graph model is proposed to explore the “follower” and “friend” relationships among users. Based on Twitter’s spam policy, novel content-based features and graph-based features are also proposed. A Web crawler is developed relying on Twitter’s API methods. A spam detection prototype system is proposed to identify suspicious users on Twitter. I analyze the data set and evaluate the performance of the detection system. Classic evaluation metrics are used to compare the performance of various traditional classification methods. Experiment results show that the Bayesian classifier has the best overall performance in terms of F-measure. The trained Bayesian classifier is also applied to the entire data set to distinguish suspicious behaviours from normal ones. The result shows that the spam detection system can achieve 89% precision.
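The pipeline described above — content- and graph-based features per account, a Bayesian classifier, precision as the evaluation metric — can be sketched in a few lines. Everything here is illustrative: the two features (URL ratio per tweet, follower/friend ratio) and the synthetic data stand in for the paper's actual feature set, and `GaussianNB` stands in for whichever Bayesian variant the authors used:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_score

rng = np.random.default_rng(0)

# Hypothetical per-account features: [URL ratio in tweets, follower/friend ratio].
# Spammers tend to post many URLs and follow far more accounts than follow them.
spam   = rng.normal(loc=[0.9, 0.1], scale=0.05, size=(100, 2))
benign = rng.normal(loc=[0.1, 1.0], scale=0.05, size=(100, 2))

X = np.vstack([spam, benign])
y = np.array([1] * 100 + [0] * 100)   # 1 = spam, 0 = benign

clf = GaussianNB().fit(X, y)
precision = precision_score(y, clf.predict(X))
```

On real Twitter data the classes overlap far more than in this toy setup, which is why the reported precision is 89% rather than near-perfect; precision is the right headline metric here because a false positive means flagging a legitimate account as spam.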
Conference Paper
Microblogging communities have grown rapidly over the past few years, with Twitter being one of the most popular. Such microblogging services enable users to 'vote for' and highlight the interesting or useful posts of other users by forwarding or 'retweeting' to their social neighbours. Studying retweeting enables researchers to understand how information is disseminated around Twitter's social structure and shows which type of information is likely to propagate further. This paper represents preliminary studies of some properties of message-forwarding in Twitter, paying close attention to the dynamics and patterns represented by retweets. These are illustrated through empirical data highlighting properties of retweets, relating these to the friend-follower graph and to message propagation speeds.
Article
The growing number of people using social media to communicate with their peers and document their everyday feelings and views is creating ‘data on an epic scale’, providing an opportunity for social scientists to conduct research such as ethnography, discourse analysis and content analysis of social interactions, and offering additional insight into today’s society. However, the tools and methods required to conduct such analysis are often isolated and/or proprietary. The Cardiff Online Social Media Observatory (COSMOS) provides an integrated virtual research environment supporting the collection, analysis and visualization of social media data, giving researchers an innovative facility on which to conduct hypothesis-driven experiments that lead to defensible results. This study presents a methodology for Digital Social Research and explains how the features of COSMOS aim to underpin it.