SONAR: Automatic Detection of Cyber Security Events over the Twitter Stream
Quentin Le Sceller
Security Research Centre
Concordia University
Montreal, Quebec, Canada
q_lescel@encs.concordia.ca
ElMouatez Billah Karbab
Security Research Centre
Concordia University
Montreal, Quebec, Canada
e_karbab@encs.concordia.ca
Mourad Debbabi
Security Research Centre
Concordia University
Montreal, Quebec, Canada
debbabi@encs.concordia.ca
Farkhund Iqbal
College of Technological Innovation
Zayed University
Abu Dhabi, U.A.E
farkhund.Iqbal@zu.ac.ae
ABSTRACT
Every day, security experts face a growing number of security events that affect people's well-being, their information systems and, sometimes, critical infrastructure. The sooner they can detect and understand these threats, the better they can mitigate and forensically investigate them. Therefore, they need situation awareness of the existing security events and their possible effects. However, given the large number of events, it can be difficult for security analysts and researchers to handle this flow of information in an adequate manner and answer the following questions in near real time: what are the current security events? How long do they last? In this paper, we address these issues by leveraging social networks, which contain a massive amount of valuable information on many topics. However, because of the very high volume, extracting meaningful information can be challenging. For this reason, we propose SONAR: an automatic, self-learned framework that can detect, geolocate and categorize cyber security events in near real time over the Twitter stream. SONAR is based on a taxonomy of cyber security events and a set of seed keywords describing the types of events that we want to follow in order to start detecting events. Using these seed keywords, it automatically discovers new relevant keywords, such as malware names, to enhance the range of detection while staying in the same domain. Using a custom taxonomy describing all types of cyber threats, we demonstrate the capabilities of SONAR on a dataset of approximately 47.8 million tweets related to cyber security over the last 9 months. SONAR can efficiently and effectively detect, categorize and monitor cyber security related events, often before they reach the security news, and it can automatically discover new security terminologies along with the corresponding events. Additionally, SONAR is highly scalable and customizable by design; we could therefore adapt the SONAR framework to virtually any type of event that experts are interested in.
CCS CONCEPTS
• Information systems → Clustering; Nearest-neighbor search; Data stream mining; • Security and privacy → Usability in security and privacy; • Networks → Online social networks;
KEYWORDS
Cyber security event detection, security awareness, Twitter, framework, social media, word embedding
ACM Reference format:
Quentin Le Sceller, ElMouatez Billah Karbab, Mourad Debbabi, and Farkhund
Iqbal. 2017. SONAR: Automatic Detection of Cyber Security Events over the
Twitter Stream. In Proceedings of ARES ’17, Reggio Calabria, Italy, August
29-September 01, 2017, 11 pages.
DOI: 10.1145/3098954.3098992
1 INTRODUCTION
In 2015, on average, a zero-day vulnerability was found every week [35], and this number is expected to rise to one per day by 2021 [36]. From the perspective of a security practitioner, this can be hard to cope with without proper security awareness tools. Moreover, in the case of a newly discovered vulnerability, the delay between the disclosure and the moment when the security practitioner becomes aware of it is crucial. For example, attacks such as the MongoDB ransomware [38] can be mitigated as soon as the user is aware of the flaw and of the possible patch.
In order to learn such information, the analyst has a wide range of sources available: specialized press, tech forums and even specialized communication protocols for the dissemination of cyber threat information [16]. However, social media are the largest and most powerful sources, providing a massive amount of information about every possible topic. The characteristics of Twitter [19] make it the foremost choice for real-time event detection. Every day, approximately 500 million tweets are sent by 100 million active users [6]. Major events are discussed on social media, and often the first reference to such an event appears on those social networks. In our case, an event is a phenomenon happening during a certain time period which stimulates people to post about it on social media. For example, denial of service attacks are often first reported by users of the website or service under attack. Users of online social networks witnessing or participating in an event are naturally incentivized to discuss it on social media (e.g. tweeting "I cannot reach X website"). Moreover, Twitter acts as a platform where not only cybersecurity media but also mainstream media tweet about ongoing events, and where people can react to such events.
Nonetheless, dealing with documents such as tweets for the purpose of detecting cyber security related events poses a wide variety of challenges. First, in order to have a general overview of the current situation, the framework must be able to process a very high volume of data in real time and in a timely manner. Hence, all the algorithms used should be fast, scalable and possibly distributed. Secondly, given the nature of tweets, the retrieved documents are often unusable: there are many duplicates and off-topic, badly written or incomplete documents. While we can get very valuable information (e.g. information on a data breach), we can also get a lot of irrelevant information (e.g. how I feel today). One of the main challenges lies in correctly filtering the noise from the retrieved content. Finally, the lexical field of cyber security events is growing and constantly changing, which makes the tracking of new terminology a hard and time-consuming task. Having an autonomous system that leverages social network data for security event detection would be extremely useful for a security analyst. Moreover, the ability to have an evolving scope of detected cyber security related events represents a substantial advantage.
To address these challenges, we introduce SONAR, a tool that detects, geolocates, categorizes and monitors cyber security related events over the Twitter stream. The tool itself only needs as input a taxonomy along with a list of keywords for each taxon. While other event detection systems exist, SONAR is unique in its real-time aspect, its highly scalable architecture and its automatic keyword detection: using an algorithm based on word embeddings, it automatically discovers new keywords in order to detect new relevant events. Events, which are clusters of tweets, are detected using a sliding window and are described by the initial tweet (also called the first story) and the number of similar documents found. Events are then categorized and displayed on a user interface. SONAR detects events in constant time using locality-sensitive hashing (LSH) and can technically process millions of documents per day.
The contributions of this paper are the following:
- The design and development of SONAR, a novel, automatic, self-learned cyber security event detector for social media feeds.
- A self-learned keyword mechanism based on word embeddings to discover new relevant keywords over a corpus of short documents.
- The evaluation of SONAR on 9 months of Twitter data, demonstrating the effectiveness of the tool.
SONAR can be used for many different purposes. As a cyber security situation awareness tool, it provides experts with a global view of what is currently happening in cyberspace and helps them prioritize their actions toward new threats. In addition, it is a tool for forensics, allowing investigators to observe the chronology of cyber threats on social media and identify potential suspects (for example, the first person talking about a new data breach or malware). Finally, it is a very convenient tool for security teams, as it can help them protect their organization from newly discovered vulnerabilities. Although, in this paper, SONAR is tuned to work only with tweets written in English, none of the techniques used is language-specific, so every language is technically supported.
The remainder of the paper is structured as follows. In Section 2,
we describe the approach used in SONAR. In Section 3, we detail the
architecture and design. Finally, in Sections 4 and 5 we discuss and
evaluate the implementation before presenting concluding remarks.
2 APPROACH
Figure 1 presents a high-level overview of the framework architecture. The inner workings of SONAR can be broken down into three phases. In phase 1, the stream generator makes use of the Twitter API and a list of keywords to retrieve a continuous stream of tweets (see Section 3.2). This stream of tweets is preprocessed and then stored in a database. Then, in phase 2, the event detection algorithm detects events over the Twitter stream (see Section 3.4.3). Events are composed of a set of tweets discussing the same topic during the same timeframe. These events are geolocated, classified and displayed on an intuitive user interface. As we use predefined keywords to create the stream in phase 1, the system is unaware of new keywords, such as new malware names or new types of DDoS attacks. Therefore, in order to get those documents, we created a component that tracks those new security keywords automatically and adds them to the previous list of keywords: the keyword finder (see Section 3.5). This component discovers new relevant keywords among the previous tweets and adds them to the list of keywords used by the stream generator (phase 3). Subsequently, the stream is enriched with tweets containing the discovered keywords. The next section introduces each of these components in detail.
3 ARCHITECTURE AND DESIGN
3.1 Taxonomy
There have been several attempts to create cyber security related taxonomies. There are taxonomies for cyber attacks in general [14, 34], on SCADA systems [41] and 3G networks [18], and even a cyber conflict taxonomy [10]. However, in our case, we were looking for a more general cyber attack taxonomy: a classification closer to what is discussed on social media, from general account hacking to specific vulnerabilities such as SQL injection. Table 1 represents the taxonomy created for SONAR. We subdivided it into 5 levels. Level 1 is the main level, the subject of our taxonomy: cyber security. Level 2 contains the six main categories: DoS, Botnet, Malware, Vulnerability, Social Engineering and Data Breach. The taxonomy in its current state can go as deep as level 5 (e.g. "SSDP" flood).
Each taxon, independently of its level, is associated with a set of keywords, ranging from unigrams to n-grams. For example, the taxon "APDoS", which stands for "Advanced Persistent Denial of Service", is associated with 4 keywords: "Permanent DoS", "Permanent Denial of Service", "PDoS" and "Phlashing" (a denial of service attack that exploits a vulnerability in network-based firmware updates). The taxon "SQL Injection" is associated with 3 keywords: "SQL Attack", "SQLi" and "SQL Injection". Every tweet containing one or more of these keywords will be retrieved. These keywords must be carefully selected in order to avoid false positives: analysts must avoid loosely related (e.g. "virus") or imprecise keywords. The initial keywords provided by the analyst are called the seed keywords. Using keyword stemming, we deal with cases such as plural forms and "hashtag" forms (without spaces, e.g. "denialofservice"). We will see later in the paper how SONAR deals with these keywords.
Figure 1: System Overview of SONAR (Phase 1: the Stream Generator feeds keyword-matched tweets into the database; Phase 2: Event Detection followed by Event Classification and Geolocation produces enhanced events for the analyst UI; Phase 3: the Keyword Finder feeds newly discovered keywords back to the Stream Generator).
Denial of Service
  - Permanent DoS
  - Distributed DoS
    - APDoS
    - Flooding/Amplification/Reflection: BitTorrent, CharGen, DNS, ICMP, Kad, Netbios, NTP, QOTD, Quake Network, SNMP, SSDP, Steam
    - Slow Read
    - SYN Flood
Botnet
Malware
  - Adware
  - Ransomware
  - Rootkit
  - Trojan Horse
  - Virus
  - Worm
Vulnerability
  - Backdoor
  - Buffer Overflow
  - Cross Site Scripting: DOM-Based XSS, Reflected XSS, Stored XSS
  - SQL Injection
Social Engineering
  - Phishing
Data Breach

Table 1: Cyber Security Taxonomy
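To make the keyword handling concrete, the following minimal Python sketch shows one way the plural and "hashtag" forms described above could be matched. It is an illustration under our own assumptions (NLTK's PorterStemmer as the stemmer, substring matching), not SONAR's actual implementation.

# Illustrative sketch of seed-keyword matching with plural and "hashtag" forms.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def keyword_variants(keyword):
    """Generate simple variants of an n-gram keyword: a stemmed form and a
    space-less 'hashtag' form (e.g. 'denial of service' -> 'denialofservice')."""
    tokens = keyword.lower().split()
    stemmed = " ".join(stemmer.stem(t) for t in tokens)
    hashtag = "".join(tokens)
    return {keyword.lower(), stemmed, hashtag}

def matches(tweet_text, seed_keywords):
    """Return the seed keywords whose variants appear in the tweet."""
    text = tweet_text.lower()
    stemmed_text = " ".join(stemmer.stem(t) for t in text.split())
    hits = set()
    for kw in seed_keywords:
        if any(v in text or v in stemmed_text for v in keyword_variants(kw)):
            hits.add(kw)
    return hits

print(matches("Massive #DenialOfService attacks reported", ["denial of service"]))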
3.2 Data Collection
Using the Twitter API and the list of seed keywords (in our case, a list of 200 cyber security related keywords, from basic unigrams to more complex 6-grams), we query Twitter. The result is a stream of tweets containing at least one of these keywords. Using a list of keywords to query the Twitter stream has multiple benefits. First, we avoid a lot of noise: we receive only tweets that contain the relevant keywords and hence are highly likely to be relevant. Secondly, because of the API limitation, we can only get up to 1% of the Twitter stream, which is approximately 1 million tweets per day. With this method, we ensure that we get as many relevant tweets as possible. In our case, we get approximately 200,000 tweets per day, which is roughly 2 tweets per second. However, this number varies greatly depending on the events occurring: it can go as low as 67,000 tweets per day when nothing significant is happening, and reached almost 500,000 tweets when Wikileaks released the hacked DNC voicemails on July 27th, 2016.

Finally, the tool uses a list of blacklisted n-grams in order to avoid false positives. Tweets containing these n-grams are automatically discarded and not stored in the database. For example, a document containing the expression "hacked to death" is irrelevant despite the fact that it contains the keyword "hacked"; in this example, the n-gram to blacklist is "hacked to death". These blacklisted n-grams were added after evaluation of the tool. This basic keyword filtering is a quick and simple way to discard irrelevant tweets, as sketched below.
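As a minimal illustration, the blacklist check can be a simple substring test over lowercased text; the Python sketch below (with an illustrative one-entry blacklist) mirrors the logic just described.

# Sketch of the blacklist n-gram filter: tweets containing any blacklisted
# n-gram are discarded before storage.
BLACKLIST = {"hacked to death"}  # example n-gram from the text

def keep_tweet(text: str) -> bool:
    """Return False if the tweet contains a blacklisted n-gram."""
    lowered = text.lower()
    return not any(ngram in lowered for ngram in BLACKLIST)

assert keep_tweet("New botnet hacked thousands of routers")
assert not keep_tweet("Man hacked to death in dispute")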
3.3 Data Preprocessing
Before being processed by the event detection algorithm, documents need to be preprocessed. Preprocessing is done in 5 steps:

(1) Stop words are removed from the document.
(2) Non-ASCII characters, the hashtag symbol and one-character words are removed.
(3) Punctuation, except English contractions, is removed.
(4) Tweets are then tokenized using Twokenizer [26].
(5) Tweets are automatically classified using the previous taxonomy: each tweet is mapped to the category of the keyword used to retrieve it. For example, the tweet "Backdoor in x software" is retrieved using the unigram "backdoor"; hence the tweet will be classified as "Cyber Security/Vulnerability/Backdoor".

These preprocessed documents are then the input of the event detection algorithm.
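The sketch below approximates these five steps in Python. Twokenizer [26] is a Java tool, so NLTK's TweetTokenizer is substituted as a stand-in, and the keyword-to-category mapping is a toy example (running it requires nltk and its "stopwords" corpus).

# A runnable approximation of the five preprocessing steps.
import string
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

STOPWORDS = set(stopwords.words("english"))        # step 1
tokenizer = TweetTokenizer()                       # step 4 (stand-in for Twokenizer)

KEYWORD_TO_CATEGORY = {                            # step 5 (toy mapping)
    "backdoor": "Cyber Security/Vulnerability/Backdoor",
}

def preprocess(tweet: str):
    text = "".join(c for c in tweet if ord(c) < 128)   # step 2: drop non-ASCII
    text = text.replace("#", "")                       # step 2: drop hashtag symbol
    # step 3: drop punctuation but keep apostrophes (English contractions)
    text = text.translate(str.maketrans("", "", string.punctuation.replace("'", "")))
    tokens = [t for t in tokenizer.tokenize(text.lower())
              if t not in STOPWORDS and len(t) > 1]    # steps 1, 2 (1-char words), 4
    categories = {cat for kw, cat in KEYWORD_TO_CATEGORY.items() if kw in tokens}
    return tokens, categories

print(preprocess("Backdoor found in #X software!"))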
3.4 Event Detection
Event detection is a clustering task: a system receives a stream of documents and must organize it into the most appropriate event-based clusters. In our case, we chose the approach of Petrović et al. [28], which is considered one of the state-of-the-art techniques and is used in multiple papers [23, 25].
3.4.1 Problem Statement.
Traditionally, documents are represented as vectors where each coordinate is the term frequency (possibly term frequency-inverse document frequency weighted [29]) of a particular term. Similarity between two documents can be computed using various metrics.
However, according to the work of Allan et al. in the UMass system [9], the cosine similarity outperforms the weighted sum and KL divergence when working on first story detection. The cosine similarity is a well-known metric in information retrieval and is the inner product of two normalized vectors. Considering documents A and B, the formula is the following:
similarity = \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}    (1)
We wish to build event-related clusters from a constant stream of documents. The classic approach is to compare each new document with all previous ones; if the distance to its nearest neighbor exceeds a chosen threshold, the new document is considered a first story. Algorithm 1 gives the pseudo-code of the UMass system, which performs a nearest neighbor search.
foreach document d in corpus do
    foreach term t in d do
        foreach document d' that contains t do
            update distance(d, d');
        end
    end
    dis_min(d) = min_{d'} distance(d, d');
end
Algorithm 1: Classic First Story Detection Approach
However, using this approach, the algorithm has a running time of O(dN), where N is the number of documents in the corpus and d is the dimensionality of the documents. Hence, in the case of a large corpus with high dimensionality, finding the first story (in our case, the first tweet) in real time is not feasible.
3.4.2 Locality-Sensitive Hashing.
The problem of the complexity of the nearest neighbor search has been widely studied, and locality-sensitive hashing (LSH) is a method to reduce it. Petrović et al. used the hashing scheme proposed by Charikar [15]. Under this hashing scheme, the probability of two points p and q colliding, given a random hyperplane vector u, is:
P_{coll} = P_u[h_u(p) = h_u(q)] = 1 - \frac{\theta(p, q)}{\pi}    (2)

where h_u(p) = \mathrm{sign}(u \cdot p).
Increasing the number of hyperplanes decreases the probability of colliding with a non-similar point, while using multiple hash tables increases the probability that the candidate set contains the true nearest neighbor. With LSH, each new document can be clustered in O(1) time.
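A compact sketch of this hashing scheme, using the paper's parameters (13 hyperplanes, 70 tables), is given below in Python/NumPy; the bucket bookkeeping is simplified, and the candidate set would still be re-ranked with exact cosine distance as in Algorithm 2.

# Random-hyperplane hashing (Charikar [15]): each document vector is reduced
# to a bit signature sign(u . p) over K hyperplanes, and documents are
# bucketed by signature in each of L tables.
import numpy as np

rng = np.random.default_rng(0)
DIM, K, L = 100, 13, 70  # dimensionality, hyperplanes, hash tables

hyperplanes = [rng.standard_normal((K, DIM)) for _ in range(L)]
tables = [dict() for _ in range(L)]
docs = []  # document vectors, indexed by insertion order

def signature(vec, planes):
    # sign(u . p) for each hyperplane u, packed into a hashable tuple of bits
    return tuple((planes @ vec > 0).astype(int).tolist())

def add_and_query(vec):
    """Insert vec and return indices of documents colliding in any table."""
    idx, candidates = len(docs), set()
    docs.append(vec)
    for planes, table in zip(hyperplanes, tables):
        bucket = table.setdefault(signature(vec, planes), [])
        candidates.update(bucket)
        bucket.append(idx)
    return candidates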
3.4.3 First Story Detection.
Algorithm 2 presents the pseudo-code of Petrović et al.'s First Story Detection algorithm using LSH. We chose to run the algorithm with 13 hyperplanes and 70 hash tables with a maximum distance of 0.80, parameters close to the ones used by Petrović et al.; however, instead of running the algorithm every 100,000 tweets, we run it every hour. In order to keep only the clusters that are significant, SONAR keeps the clusters that satisfy the three following conditions. First, the Shannon entropy (used to measure the amount of information) of the cluster must exceed a fixed value (e.g. 2.7); this discards clusters composed of the same repeated document. Second, clusters must contain more than 10 documents; otherwise, we would keep clusters of only one document. Finally, clusters must contain documents written by different authors; this ensures that a small group of people cannot create a cluster considered as an event. However, this approach is not resilient towards Sybil attacks. With all these conditions, we try to dodge spam and keep only the significant events (see the limitations in Section 6).
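A sketch of these three significance tests follows (Python); the entropy is computed over the cluster's word distribution, and the minimum of two distinct authors is our own reading of the "different authors" condition.

# Cluster-significance filter: entropy, minimum size, distinct authors.
import math
from collections import Counter

def shannon_entropy(texts):
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def is_significant(cluster, min_entropy=2.7, min_docs=10, min_authors=2):
    """cluster: list of dicts with 'text' and 'author' keys (assumed layout)."""
    texts = [tweet["text"] for tweet in cluster]
    authors = {tweet["author"] for tweet in cluster}
    return (shannon_entropy(texts) > min_entropy
            and len(texts) > min_docs
            and len(authors) >= min_authors)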
input: threshold t
foreach document d in corpus do
    add d to LSH;
    S ← set of points that collide with d in LSH;
    dis_min(d) ← 1;
    foreach document d' in S do
        c ← distance(d, d');
        if c < dis_min(d) then
            dis_min(d) ← c;
        end
    end
    if dis_min(d) ≥ t then
        compare d to a fixed number of most recent documents as in Algorithm 1 and update dis_min(d) if necessary;
    end
    assign score dis_min(d) to d;
    add d to inverted index;
end
Algorithm 2: Petrović et al.'s LSH-based approach
Moreover, SONAR implements an algorithm that merges similar events happening more than an hour apart. This algorithm retrieves the first story (the first tweet of a cluster) of each event in the last hours and compares it with the latest events found. This allows us to have events distributed over several days instead of multiple separate events (for example, if the event is not discussed for an hour and then reappears in the stream).
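The merge step can be sketched as follows (Python); the 0.5 similarity threshold and the event dictionary layout are illustrative assumptions, and vectorize/cosine stand for the same document-vector and cosine-similarity helpers used for first story detection.

# Cross-window merge: compare each new event's first story with the first
# stories of recent events and merge when similar enough.
def merge_events(recent_events, new_event, vectorize, cosine, threshold=0.5):
    """recent_events/new_event: dicts with a 'first_story' text and a
    'tweets' list. Returns the updated list of events."""
    new_vec = vectorize(new_event["first_story"])
    for event in recent_events:
        if cosine(vectorize(event["first_story"]), new_vec) >= threshold:
            event["tweets"].extend(new_event["tweets"])  # same story, merge
            return recent_events
    return recent_events + [new_event]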
3.5 Keywords Finder
Earlier in this paper, we discussed the advantages of using a list of keywords to query Twitter instead of directly listening to the raw stream. However, this approach has some limitations. First of all, there is no way for an analyst to be sure that he did not forget a keyword at initialization, thereby missing all the events containing the specific word. Next, an analyst will insert into SONAR seed keywords based on his previously acquired knowledge, which might fall short within a few months. For example, suppose that ransomware is a new type of malware, that the analyst is not aware of it, and that he consequently does not insert it in the list of seed keywords. SONAR is capable of retrieving tweets containing the keyword "malware" but not the ones containing only "ransomware". However, in multiple tweets the word "ransomware" appears together with the keyword "malware": our keyword finder algorithm will discover this relation and suggest the keyword "ransomware" to the analyst. In this subsection, we describe the algorithm that SONAR uses to automatically discover new keywords.
3.5.1 Word Embeddings.
The term "word embeddings" was coined in 2003 by Bengio et al. [13], but was eventually popularized by Mikolov et al. with word2vec [24], a toolkit that "provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words". Word embeddings are dense representations of words in a low-dimensional vector space. Unsupervised word embedding models have been shown to be highly effective in natural language processing (e.g. discovering semantic relations between word embeddings).

In our case, we would like to discover new keywords that are semantically close to the seed keywords. For example, if we are interested in the category "Ransomware", we would like to automatically get new names of ransomware, such as TeslaCrypt or CryptoLocker, from the initial tweets.

While word2vec is historically the first widely used word embedding model, we chose to use GloVe [27] by Pennington et al. (2014), as it was reported to outperform the former in terms of accuracy and running time [30, 33].
3.5.2 GloVe.
Let us consider that X is the matrix of word-word co-occurrence counts. We denote by X_{ij} the number of times word i occurs with the context word j, and by P_{ij} = P(j|i) = X_{ij} / X_i the probability that the word j appears in the context of word i. Pennington et al. started with the following idea: ratios of co-occurrence probabilities (i.e. P_{ik} / P_{jk}) are more accurate for separating relevant words from irrelevant words. Hence the following starting point:

F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}    (3)

where w \in \mathbb{R}^d are word vectors and \tilde{w} \in \mathbb{R}^d are context word vectors of dimension d.

Finally, the model is defined by the following cost function:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2    (4)

where b and \tilde{b} are bias terms and f is a weighting function with the following properties:

(1) f(0) = 0;
(2) f is non-decreasing: for all (x, y) \in \mathbb{R}_+^2 with x < y, f(x) \leq f(y);
(3) f(x) must be "relatively small for large values of x".
Figure 2: Architecture of the keyword finder component (tweets from the database are used to train a GloVe model; the keyword finder combines the trained model with the original list of keywords to produce an enriched list of keywords).
In that case, Pennington et al. found that the following function works well:

f(x) = \begin{cases} (x / x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}    (5)
3.5.3 Algorithm.
Ultimately, our goal is to find keywords that have a close semantic relationship with the seed keywords of a given category and, at the same time, represent solely this category. For example, the word "Linux" may have a close semantic relationship with the "Malware" category but does not represent it well. To achieve this, we designed an algorithm that uses language processing techniques and the previous word embedding model.
Figure 2 represents the high-level architecture of our keyword finder component. We train a GloVe model on a large corpus of preprocessed tweets. This gives us a trained model that will be one of the inputs of our keyword finder algorithm. Let C = {C_1, C_2, ..., C_n} be a set of top-level categories (e.g. "Denial of Service", "Malware", etc.) from the taxonomy, where n is the total number of categories. Let K_n = {K_{n,1}, ..., K_{n,i}} be the set of i unigram seed keywords that belong to category n and its respective subcategories (for example, the category "Denial of Service" and the subcategories "DDoS", "APDoS", etc.). In this case, K_{1,1} could be the keyword "DDoS". Consequently, each top-level category is represented by a set of seed keywords K_n. Using the previously trained GloVe model, we look for the nearest neighbors of each keyword of each category. The nearest neighbors are found using the model and the cosine similarity between the candidate keyword vector and the seed keyword vector. To restrict the number of candidates, we define a threshold α for the cosine similarity (i.e. only keywords that are significantly close are selected). Each discovered nearest neighbor is assigned to the category of the seed keyword used to retrieve it. Hence we form M_n = {M_{n,1}, ..., M_{n,j}}, a set of j candidate discovered unigrams for category n. Each M_n is then considered as a document. Finally, to compute the list of keywords to add to each category, we leverage the so-called term frequency-inverse document frequency (tf-idf), a famous technique in the field of natural language processing.
Tf-idf computes vectors of input text documents by considering both the frequency of terms in the individual documents and in the whole set. Let D = {d_1, d_2, ..., d_n} be a set of text documents, where n is the number of documents, and let d = {w_1, w_2, ..., w_m} be a document, where m is the number of words in d. The tf-idf of a word w and a document d is the product of the term frequency of w in d and the inverse document frequency of w, as shown in Formula 6. The term frequency (Formula 7) is the number of occurrences of w in d. Finally, the inverse document frequency of w (Formula 8) is the logarithm of the number of documents n divided by the number of documents that contain w. The computation of tf-idf is very scalable, which suits our needs (Section 5.2).

\text{tf-idf}(w, d) = \text{tf}(w, d) \times \text{idf}(w)    (6)

\text{tf}(w, d) = |\{w_i \in d : w_i = w\}|    (7)

\text{idf}(w) = \log \frac{|D|}{1 + |\{d \in D : w \in d\}|}    (8)
Using tf-idf and our documents M_n (which are the sets of candidate discovered keywords for each category n), we keep only the keywords that have a high term frequency in M_n and a low document frequency across all the M_n. In other words, we keep only candidate keywords that are category-specific. We also define a threshold β in order to keep only highly relevant keywords. Algorithm 3 describes the inner workings of the KeywordFinder component. Results of this algorithm, using a model trained on a month of tweets, are available in Section 5.2.
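A tiny worked example of Formulas 6-8 over toy candidate sets M_n (Python) shows why shared words such as "iot" are suppressed while category-specific words such as "locky" survive:

# Worked toy example: one pseudo-document of discovered candidates per
# category; the data is made up for illustration.
import math

M = {
    "Malware":           ["locky", "locky", "spyware", "iot"],
    "Botnet":            ["mirai", "iot", "iotsecurity"],
    "Denial of Service": ["ddosattack", "dyn"],
}

def tf(w, doc):                      # Formula 7
    return doc.count(w)

def idf(w):                          # Formula 8
    containing = sum(1 for doc in M.values() if w in doc)
    return math.log(len(M) / (1 + containing))

for cat, doc in M.items():
    scores = {w: tf(w, doc) * idf(w) for w in set(doc)}   # Formula 6
    print(cat, scores)
# "locky" scores 2*log(3/2) ~ 0.81; "iot" appears in two categories and
# scores log(3/3) = 0, so it is filtered out by any positive threshold.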
input: threshold α, threshold β, trained GloVe model, list of seed keywords per category
output: a set of keywords per category, L
foreach category i in the set of categories C do
    foreach seed keyword s in category i do
        T_i ← T_i ∪ {nearest neighbors of s with similarity > α}
    end
end
Vectors ← TF-IDF(T)
foreach category i in C do
    foreach keyword k in T_i do
        L_i ← L_i ∪ {k} if tf-idf(k) > β
    end
end
Algorithm 3: KeywordFinder Algorithm
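The sketch below strings Algorithm 3 together end to end in Python. It assumes GloVe vectors exported as a text file and loaded through gensim (the paper itself uses the Stanford GloVe toolkit and a Python/R script whose code is not published); the file name, seed lists and thresholds are illustrative.

# End-to-end sketch of Algorithm 3 with gensim as the embedding backend.
import math
from gensim.models import KeyedVectors

# GloVe text vectors trained on the tweet corpus (path is illustrative);
# gensim >= 4.0 loads header-less GloVe files with no_header=True.
kv = KeyedVectors.load_word2vec_format("glove.tweets.100d.txt",
                                       binary=False, no_header=True)

SEEDS = {"Malware": ["malware", "trojan"], "Botnet": ["botnet"]}
ALPHA, BETA = 0.6, 0.1  # cosine and tf-idf thresholds (illustrative values)

# Step 1: nearest neighbors of each seed, gathered per category (M_n)
M = {cat: [] for cat in SEEDS}
for cat, seeds in SEEDS.items():
    for seed in seeds:
        if seed in kv:
            M[cat] += [w for w, sim in kv.most_similar(seed, topn=50) if sim > ALPHA]

# Step 2: keep candidates that are frequent in their category's document
# and rare across the others (tf-idf, Formulas 6-8)
def idf(w):
    return math.log(len(M) / (1 + sum(1 for doc in M.values() if w in doc)))

discovered = {cat: sorted(w for w in set(doc) if doc.count(w) * idf(w) > BETA)
              for cat, doc in M.items()}
print(discovered)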
4 IMPLEMENTATION
This section presents the implementation of SONAR in detail. Figure 3 displays a technical and simplified overview. Scalability and performance are two important objectives in the design of SONAR, as the framework must be well equipped to handle any type of event, from the least popular (a few documents per second) to the most popular (hundreds of documents per second). SONAR is built upon open source frameworks:
- Apache Zookeeper [4] for coordination between the components.
- Apache Kafka [2] for the streaming platform.
- Apache Spark [3] for the computation.
- Cassandra [1] for the storage and querying of data.
- ELK Stack [5] (Elasticsearch, Logstash and Kibana) for the storage and consultation of the data.
Each choice of framework is the result of a thoughtful analysis:
- Apache Kafka is a highly scalable distributed streaming platform capable of handling millions of messages per second. We designed SONAR with the idea in mind that we might have access to more than 1% of the Twitter feed, hence handling millions of tweets per day.
- Apache Spark is a fast, highly scalable and efficient engine for big data processing, with several machine learning libraries integrated.
- Apache Cassandra is one of the fastest databases. As SONAR works in real time, we need a robust solution that can handle a high number of queries per second.
Data Collection. Tweets are collected using the Java Twitter4J library (http://twitter4j.org), directly linked with Kafka. The Spark cluster receives the stream of tweets and, with Spark Streaming, tweets are categorized and stored in an Elasticsearch index and simultaneously in the Cassandra database.
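As an illustration of the first hop of this pipeline, a producer pushing tweets into Kafka might look as follows (Python with kafka-python rather than the paper's Java/Twitter4J stack; the topic and broker names are made up):

# Minimal sketch: forward raw tweets into a Kafka topic for the Spark
# consumers to pick up.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish(tweet: dict):
    """Forward one tweet (as delivered by the streaming API) to Kafka."""
    producer.send("raw-tweets", tweet)

publish({"id": 1, "text": "Yahoo confirms huge data breach", "user": "ap"})
producer.flush()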
Event Detection, or first story detection (FSD), is done with a Java implementation of the algorithm of Petrović et al. on Spark. Events are categorized according to the taxonomy and geo-mapped using the Google Maps Geocoding API. Each tweet is geolocated using its GPS coordinates when available, or the location declared by the user otherwise. This allows us to locate geographically where an event is most discussed.
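The geolocation fallback can be sketched as follows (Python, using the public Google Maps Geocoding endpoint; the tweet-field names follow the Twitter v1.1 object layout, and error handling is omitted):

# GPS coordinates when present, otherwise geocode the user's free-text
# profile location.
import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def locate(tweet: dict, api_key: str):
    if tweet.get("coordinates"):                      # GPS metadata available
        lng, lat = tweet["coordinates"]["coordinates"]
        return lat, lng
    location = tweet.get("user", {}).get("location")  # fall back to profile text
    if not location:
        return None
    resp = requests.get(GEOCODE_URL,
                        params={"address": location, "key": api_key}).json()
    if resp.get("status") == "OK":
        loc = resp["results"][0]["geometry"]["location"]
        return loc["lat"], loc["lng"]
    return None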
Keyword Renement
is done using the GloVe implementation
available on GitHub
3
. The model is trained using the previous
algorithm and a python/R script regularly looks on a regular basis
for new relevant keywords.
SONAR is designed to be distributed over multiple servers for maximum efficiency. When a particular component is overloaded, the maintainer does not have to change the entire system in order to speed up the framework; he just needs to add another node to the slow component.

Figure 3: Technical overview of SONAR (tweets flow from Twitter through a Kafka cluster into Spark clusters performing preprocessing and categorisation, FSD and geo-mapping, and GloVe-based keyword extension; Cassandra and the ELK Stack provide storage and consultation).
5 EVALUATION
This section evaluates three different components of SONAR. First, we analyze the speed of our implementation of Petrović et al.'s First Story Detection algorithm; then, we present the results of the keyword refinement component. Finally, we present a few events automatically discovered by SONAR and show how the tool summarizes the information.
5.1 First Story Detection
We ran the tests on a Linux virtual machine with 8 GB of RAM and 4 cores of an Intel i7-4790. The parameters are the following: 13 hyperplanes and 70 hash tables with a maximum distance of 0.80. In this evaluation, we are not directly interested in raw performance but in the fact that, according to Petrović et al., the algorithm must run in constant time. Figure 4 shows that our Spark implementation does run in constant time, at approximately 60 milliseconds per 100 documents (we can expect even faster results on a real Spark cluster).

Figure 4: Processing time per 100 documents (in milliseconds) against the number of documents processed.

Consequently, we are able to detect, in near real time, 759 events per month on average. This number went up to 1,328 in December 2016 (see Figure 5).

Figure 5: Number of events found from June 26, 2016 to March 31, 2017.
5.2 Keywords Finder
In this section, we present the results of the Keyword Refinement component trained on different periods. As it is the most time-consuming operation of SONAR, we run it on a bi-weekly basis (representing approximately 3 million tweets), but it could be run every week. Results for October 2016 and December 2016 are available in Table 2. We chose high values of α (semantically close nearest neighbors) and β (discovered keywords which are unique to each category) in order to avoid false positives. Before integrating these new keywords as seed keywords, the analyst must verify them manually. Keywords discovered in October 2016, such as "locky" for the "Malware" category, "mirai" for the "Botnet" category and "ddosattack" for the "Denial of Service" category, can be safely incorporated into the seed keywords. However, keywords such as "dyn", "scam" and "iot" should be dismissed, as they could retrieve extraneous tweets in the future. The "scripting" and "spyware" keywords were discovered in December 2016 and are highly relevant to their categories.

Category            Keywords
Social Engineering  scam
Malware             cybercrime, locky, spyware
Botnet              mirai, iot, iotsecurity
Denial of Service   ddosattack, dyn, cyberattack
Vulnerability       scripting

Table 2: Sample of discovered keywords for October 2016 and December 2016.
5.3 Event Evaluation
In this section, we review the relevance of the discovered events and analyze two significant events that happened between September 2016 and January 2017.
5.3.1 Relevance of the discovered events.
Not all the events detected by SONAR are relevant for a security practitioner. Numerous events concern hacked accounts of public figures, and an important part of the discovered events were part of misinformation campaigns (especially during the US presidential election). Table 3 presents the notable events from January 25 to February 1. During this week, 100 events were automatically discovered, of which 23 were considered highly relevant, i.e. approximately a quarter of the events.
5.3.2 The Yahoo data breach - September 22, 2016.
The "Yahoo data breach" (see https://en.wikipedia.org/wiki/Yahoo!_data_breaches) reported in September 2016 is considered one of the largest data breaches in the history of the internet. 500 million Yahoo accounts were compromised, and data such as names, email addresses, telephone numbers, encrypted or unencrypted security questions and answers, dates of birth, and encrypted passwords were leaked. On September 22, 2016, SONAR captured this event, all its different steps and its propagation on the network. Figure 6 shows what the user interface displays when looking at the Yahoo data breach event. We were able to capture all the milestones of this event on September 22 at the times displayed:

(1) Around 1 EST, people started talking about the tweet from RECODE: "Yahoo is expected to confirm massive data breach, impacting hundreds of millions of users...".
(2) A few hours later, at 4 EST, another step in the event is reached: "Yahoo may confirm massive data breach".
(3) At 2:42, a tweet from Associated Press is highly retweeted: "BREAKING: Yahoo confirms huge data breach affecting 500 million accounts."
(4) We then learn that Yahoo blames a "State-Sponsored Actor" and, two days later, that the company is being sued.

SONAR successfully captured the different states of the event. Because of the number of people talking about the event at the same time, an analyst could be aware of the situation as soon as the news propagated on Twitter.
5.3.3 Dyn Denial of Service Attack - October 21, 2016.
The 2016 Dyn denial of service attack (see https://en.wikipedia.org/wiki/2016_Dyn_cyberattack) is a "cyberattack that took place on October 21, 2016, and involved multiple distributed denial-of-service attacks (DDoS attacks) targeting systems operated by Domain Name System (DNS) provider Dyn, which caused major Internet platforms and services to be unavailable to large swathes of users in Europe and North America." This attack can be broken down into three attacks: the first lasted from 7 am to 9:20 am EST, a second attack started at 11:52 am, and a third attack was reported at 4 pm.

At 8 am EST, we started receiving an abnormal number of tweets labeled with the "Denial of Service" category (see Figure 7). Thirty minutes later, we started receiving the first event: people reported having trouble accessing services such as GitHub and Spotify. Using SONAR, we can directly see the three attacks against Dyn on that day in the total number of tweets.
6 LIMITATIONS
Despite promising results, SONAR still has drawbacks. The technique of using keywords dramatically reduces the number of false positives; however, we still get events that we can qualify as "non-interesting", e.g. "someone hacked my account". This could be solved by a security assessment algorithm applied to the event (see Section 8). Despite the fact that we try to dodge spam, SONAR can still be abused by an organized group of users and can capture events that are not occurring in the real world (e.g. "fake news" during the US election). An interesting perspective would be to identify bots on Twitter.

The keyword finder algorithm will, most of the time, only find durable trends and not short-lived keywords (such as random malware names). While this can be an advantage as it yields simpler results, it could be interesting for an analyst to detect such short-lived keywords. Furthermore, analysts must choose the seed keywords very carefully in order to avoid events that are out of the scope of interest. Other research, such as ReDites [25], used a classifier for this task; however, we are looking for a generic solution that can work for every taxonomy. It is also still possible to miss events because of the way we retrieve tweets (using predefined keywords).

Finally, the algorithm that we use will sometimes fail to detect the true first mention of an event on social media. For example, if the first tweet about an event is published at time t but the story is only discussed after t plus the length of the sliding window, then we will probably miss the first document. However, we consider this phenomenon marginal, since the sliding window is long enough and Twitter users are fast enough to relay the event within that time window.
7 RELATED WORK
Along with opinion mining, event detection is one of the most studied tasks on Twitter [7, 8, 11, 12, 20, 22, 28, 32, 37, 40]. In 2013, Abdelhaq et al. [7] presented a framework to detect localized events using bursty keywords and the spatial distribution of the documents. Similarly, TwitterMonitor by Mathioudakis et al. [21] can detect trends in the Twitter stream using those bursty keywords.

Aggarwal and Subbian [8] tackle this problem by using a fixed number of clusters and cluster summaries in order to minimize the number of comparisons. Clusters that have a high growth rate are considered events.

Becker et al. [11] use a clustering technique proposed by Yang et al. [39] and a classifier, with features such as hashtags and retweets, to detect events. Also, Li et al. [20] used a classifier with hashtags, links and unigrams as features in order to find crime- and disaster-related events.
Figure 6: SONAR user interface displaying all the events linked to the Yahoo data breach.
General Events
- Top Russian Cybercrimes Agent Reportedly Involved in U.S. Election Hacking Arrested in Moscow on Charges of Treason
- Data Breach Database Site "LeakedSource" Goes Offline After Alleged Police Raid
- WWE Social Media Accounts Hacked By OurMine
- Cardinals fined $2 million, must send two draft picks to Astros as hacking penalty
- Spanish Police Claim to Have Arrested Phineas Fisher - Hacking Team Hacker
- Uber Pays Hacker US $9,000 for Partner Firm's Bug

Denial of Service Events
- Sonic Customers Offline As Local Internet Provider Hit With DDoS Attack

Malware Events
- Disk-nuking malware takes out Saudi Arabian gear
- Data-Stealing Ransomware App Found in Official Google Play Store
- Rootnik Android malware variant designed to frustrate researchers
- The Nuke HTTP bot Malware offered for sale on a Dark Web forum
- 38% of Android VPN Apps on Google Play Store Plagued with Malware
- Ransomware Hacks Lock Hotel Guests in Their Rooms
- Ransomware Killed 70% of Washington DC CCTV Ahead of Inauguration
- Police Department Loses Years' Worth of Evidence in Ransomware Incident
- Newly Discovered Banking Malware Creates Fresh Threat to Users
- Malicious Office files using fileless UAC bypass to drop KEYBASE malware

Vulnerability Events
- CVE-2017-3797 Vulnerability in Cisco WebEx Meetings Server
- WordPress 4.7.2 Update Fixes XSS, SQL Injection Bugs
- CVE-2017-3422 Vulnerability in the Oracle One-to-One Fulfillment component of Oracle E-Business Suite
- Netgear Exploit Found in 31 Models Lets Hackers Turn Your Router Into a Botnet

Social Engineering Events
- Phishers' new social engineering trick: PDF attachments with malicious links

Data Breach Events
- 180,000 Members of an Underground Adult Website Have Been Leaked Online

Table 3: Notable events discovered from January 25 to February 1.
Figure 7: Screenshot of SONAR UI on October 21, 2016.
In 2010, Sakaki et al. [32] introduced a real-time system to detect earthquakes and warn endangered people to seek shelter, an idea also used by Crooks et al. [17] in 2013. While systems such as ReDites [25] and TwitterMonitor [21], as well as the approach designed by Ritter et al. [31], tackle the problem of real-time clustering on Twitter, SONAR adds multiple enhancements: the framework is generic (the only changes needed in the system are the taxonomy and the list of seed keywords), highly scalable, and dynamic (it discovers new keywords over time). These features make SONAR very useful for real-time security awareness.
8 CONCLUSION
We have presented SONAR: an automatic, self-learned framework that can detect, geolocate and categorize cyber security events in near real time over the Twitter stream. As highlighted in the limitations section, improvements can be made, and the implementation can still be optimized. However, we have collected 9 months of cyber security events. During this time period, major events such as the Yahoo data breach, the Dyn denial of service attack and the U.S. election happened, which makes our dataset a great resource for future work.

As future work, we would like to identify the target and the source of an event (e.g. Anonymous claiming responsibility for a DDoS attack on a company). Moreover, we are currently working on an event grading system to more efficiently identify the important events that need the analyst's attention. Finally, this database of events could help us build a model to predict future events.
ACKNOWLEDGMENTS
This research is supported by: the Natural Sciences and Engineering Research Council (NSERC) of Canada Discovery Grant, the National Cyber-Forensics and Training Alliance (NCFTA) Canada, and the Research Incentive Fund R15048, Research Office, Zayed University, United Arab Emirates.
REFERENCES
[1] Apache Cassandra: https://cassandra.apache.org.
[2] Apache Kafka: https://kafka.apache.org.
[3] Apache Spark: https://spark.apache.org.
[4] Apache Zookeeper: https://zookeeper.apache.org.
[5] ELK Stack: http://www.elastic.co.
[6] 2013. New Tweets per second record, and how! Twitter Official Blog, https://blog.twitter.com/2013/new-tweets-per-second-record-and-how (August 2013).
[7] Hamed Abdelhaq, Christian Sengstock, and Michael Gertz. 2013. EvenTweet: Online Localized Event Detection from Twitter. Proc. VLDB Endow. 6, 12 (Aug. 2013), 1326–1329.
[8] Charu C. Aggarwal and Karthik Subbian. 2012. Event Detection in Social Streams. In Proceedings of the 2012 SIAM International Conference on Data Mining. 624–635.
[9] James Allan, Victor Lavrenko, Daniella Malin, and Russell Swan. 2000. Detections, Bounds, and Timelines: UMass and TDT-3. In Proceedings of the Topic Detection and Tracking Workshop. 164–174.
[10] S. D. Applegate and A. Stavrou. 2013. Towards a Cyber Conflict Taxonomy. In 2013 5th International Conference on Cyber Conflict (CYCON 2013). 1–18.
[11] Hila Becker, Dan Iter, Mor Naaman, and Luis Gravano. 2012. Identifying Content for Planned Events Across Social Media Sites. In WSDM '12. 533–542.
[12] Hila Becker, Mor Naaman, and Luis Gravano. 2011. Beyond Trending Topics: Real-World Event Identification on Twitter. In ICWSM '11. 438–441.
[13] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2001. A neural probabilistic language model. In Advances in Neural Information Processing Systems 13 (NIPS '00). 933–938.
[14] James J. Cebula and Lisa R. Young. 2010. A Taxonomy of Operational Cyber Security Risks. Technical Report. Carnegie Mellon University.
[15] Moses S. Charikar. 2002. Similarity Estimation Techniques from Rounding Algorithms. In Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing (STOC '02). ACM, New York, NY, USA, 380–388.
[16] Julie Connolly, Mark Davidson, and Charles Schmidt. 2014. The Trusted Automated eXchange of Indicator Information. Technical Report. The MITRE Corporation.
[17] Andrew Crooks, Arie Croitoru, Anthony Stefanidis, and Jacek Radzikowski. 2013. #Earthquake: Twitter as a Distributed Sensor System. Transactions in GIS 17, 124–147.
[18] Kameswari Kotapati, Peng Liu, Yan Sun, and Thomas F. LaPorta. 2005. A Taxonomy of Cyber Attacks on 3G Networks. Springer Berlin Heidelberg, Berlin, Heidelberg, 631–633.
[19] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a Social Network or a News Media?. In Proceedings of the 19th International Conference on World Wide Web (WWW '10). 591–600.
[20] R. Li, K. H. Lei, R. Khadiwala, and K. C. C. Chang. 2012. TEDAS: A Twitter-based Event Detection and Analysis System. In 2012 IEEE 28th International Conference on Data Engineering. 1273–1276.
[21] Michael Mathioudakis and Nick Koudas. 2010. TwitterMonitor: Trend Detection over the Twitter Stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10). ACM, New York, NY, USA, 1155–1158.
[22] Richard McCreadie, Craig Macdonald, Iadh Ounis, Miles Osborne, and Sasa Petrovic. 2013. Scalable Distributed Event Detection for Twitter. In Proceedings of the IEEE International Conference on Big Data.
[23] Andrew J. McMinn, Yashar Moshfeghi, and Joemon M. Jose. 2013. Building a Large-scale Corpus for Evaluating Event Detection on Twitter. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 409–418.
[24] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3111–3119.
[25] Miles Osborne, Sean Moran, Richard McCreadie, Alexander Von Lunen, Martin Sykora, Elizabeth Cano, Neil Ireson, Craig Macdonald, Iadh Ounis, Yulan He, Tom Jackson, Fabio Ciravegna, and Ann O'Brien. 2014. Real-time detection, tracking, and monitoring of automatically discovered events in social media. In 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 37–42.
[26] Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters. In Human Language Technologies: Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2013). 380–390.
[27] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[28] Saša Petrović, Miles Osborne, and Victor Lavrenko. 2010. Streaming First Story Detection with Application to Twitter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 181–189.
[29] Anand Rajaraman and Jeffrey David Ullman. 2011. Mining of Massive Datasets. Cambridge University Press, New York, NY, USA.
[30] Radim Řehůřek. 2014. Making sense of word2vec. https://rare-technologies.com/making-sense-of-word2vec/.
[31] Alan Ritter, Evan Wright, William Casey, and Tom Mitchell. 2015. Weakly Supervised Extraction of Computer Security Events from Twitter. In Proceedings of the 24th International Conference on World Wide Web (WWW '15). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 896–905.
[32] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors. In Proceedings of the 19th International Conference on World Wide Web (WWW '10). ACM, New York, NY, USA, 851–860.
[33] Dmitriy Selivanov. 2015. GloVe vs word2vec revisited. http://dsnotes.com/post/glove-enwiki/.
[34] Chris Simmons, Charles Ellis, Sajjan Shiva, Dipankar Dasgupta, and Qishi Wu. 2009. AVOIDIT: A Cyber Attack Taxonomy. Technical Report. University of Memphis.
[35] Symantec. 2016. Internet Security Threat Report. https://www.symantec.com/content/dam/symantec/docs/reports/istr-21-2016-en.pdf.
[36] Cybersecurity Ventures. 2016. Zero Day Report. http://cybersecurityventures.com/zero-day-vulnerabilities-attacks-exploits-report-2017/.
[37] Jianshu Weng, Yuxia Yao, Erwin Leonardi, and Francis Lee. 2011. Event Detection in Twitter. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media. 401–408.
[38] Computerworld. 2016. MongoDB ransomware attacks and lessons learned. http://www.computerworld.com/article/3157766/linux/mongodb-ransomware-attacks-and-lessons-learned.html.
[39] Yiming Yang, Tom Pierce, and Jaime Carbonell. 1998. A Study on Retrospective and On-Line Event Detection. In SIGIR '98. 28–36.
[40] Qiankun Zhao, Prasenjit Mitra, and Bi Chen. 2007. Temporal and information flow based event detection from social text streams. In Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI '07). 1501–1506.
[41] Bonnie Zhu, Anthony Joseph, and Shankar Sastry. 2011. A taxonomy of cyber attacks on SCADA Systems. In Proceedings of the 2011 IEEE International Conference on Internet of Things (iThings '11). 380–388.
... In addition, each data source (and there are a lot of possible venuese.g., more than 40 major blogs and newspapers are reported by Feedspot (2021) ) has its own targeted fields and technologies of main interest, as well as its subjectivity in expressing the breath and impact of the vulnerability. Manual searches in multiple sources is an overall tedious process, and the wide scattering of news feeds makes the dayby-day surveillance a daunting task; as such, an automated detection of zero-day vulnerabilities from OSINT data becomes a necessity (Le Sceller et al., 2017). ...
... Several papers (Attarwala et al., 2017;Dionísio et al., 2019;Le Sceller et al., 2017) introduced Twitter-based approaches to design a pipeline for threat detection and, more generally, semantic analysis. Dionísio et al. (2019) start from a set of customers, whose experts chose the Twitter cybersecurity accounts to monitor. ...
... Le Sceller et al. (2017) query tweets based on keywords. Their aim is to detect and characterize cybersecurity events using only texts from tweets. ...
... Cyber threat detection is generally known as the process of automatic scraping of the webspace and Open Source Intelligence (OSINT) to detect possible cybersecurity vulnerabilities (Sabottke et al., 2015;Riebe et al., 2021b;Le Sceller et al., 2017). Social Media platforms, like Twitter, are part of OSINT and propose a great space to share and discuss possible cybersecurity vulnerabilities. ...
... There are some automated systems and research that already scrape Twitter and other OSINT sources to detect cyber threats. Some examples are the CySecAlert system from Riebe et al. (2021b) or SONAR from Le Sceller et al. (2017), which collect cyber threat relevant tweets from Twitter, filter them, and present them in a manageable dashboard. ...
... Our pipeline reaches a F1-score of 80.63 on a specialized cyber threat dataset, which is 21.93 points above the score of a classical learning scheme. Other work, such as the cyber threat detection systems of Riebe et al. (2021b) or Le Sceller et al. (2017), allow for coarse-grained information gathering. To the best of our knowledge, our system is the first to provide rapid detection of specialized cyber threat information. ...
Preprint
Gathering cyber threat intelligence from open sources is becoming increasingly important for maintaining and achieving a high level of security as systems become larger and more complex. However, these open sources are often subject to information overload. It is therefore useful to apply machine learning models that condense the amount of information to what is necessary. Yet, previous studies and applications have shown that existing classifiers are not able to extract specific information about emerging cybersecurity events due to their low generalization ability. Therefore, we propose a system to overcome this problem by training a new classifier for each new incident. Since this requires a lot of labelled data using standard training methods, we combine three different low-data regime techniques - transfer learning, data augmentation, and few-shot learning - to train a high-quality classifier from very few labelled instances. We evaluated our approach using a novel dataset derived from the Microsoft Exchange Server data breach of 2021 which was labelled by three experts. Our findings reveal an increase in F1 score of more than 21 points compared to standard training methods and more than 18 points compared to a state-of-the-art method in few-shot learning. Furthermore, the classifier trained with this method and 32 instances is only less than 5 F1 score points worse than a classifier trained with 1800 instances.
... Of course, RNNs have the benefit of reading a sentence sequentially, in the order a human would. However, as the network deepens, exploding and vanishing gradient problems occur, which can degrade performance [39]. To avoid this problem, the long short-term memory (LSTM) technique was proposed in [20]. ...
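For illustration, a generic LSTM-based tweet classifier in PyTorch might look as follows; this is a sketch under assumed hyperparameters, not the cited work's model.

```python
# Generic sketch of an LSTM text classifier; hyperparameters and the
# binary labelling scheme are illustrative assumptions.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size: int = 10_000, embed_dim: int = 128,
                 hidden_dim: int = 64, num_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # LSTM gating mitigates the vanishing-gradient problem of plain RNNs.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)            # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(x)           # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])              # logits: (batch, num_classes)

model = LSTMClassifier()
dummy = torch.randint(0, 10_000, (4, 20))   # 4 tweets of 20 token ids each
print(model(dummy).shape)                    # torch.Size([4, 2])
```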
... Weakly supervised learning on tweets containing "DDoS" [39]: an automatic, self-learned framework that can detect, geolocate, and categorize cybersecurity events in near-real time over the Twitter stream ...
Conference Paper
Full-text available
In these times of increasing cybersecurity threats, monitoring and analysing cybersecurity events in a timely and effective way is key to promoting social media security. Twitter is one of the world's most widely used social media platforms, where users can share their preferences, images, opinions, and events. The Twitter platform can promptly aggregate cyber-related events and provide a source of information about cyber threats. Likewise, deep learning can play a critical role in helping social media providers achieve a more accurate assessment of cybersecurity threats. In this paper, we have reviewed various threats and discussed deep learning techniques to detect cybersecurity threats on Twitter.
... Although closed sources offer a niche in the world of threat intelligence, since they provide distinct information, their volume is much smaller than the information available in open sources [11]. Social media sources like Twitter have a very active cybersecurity community, which has been the focus of several studies in the past [12][13][14][15][16][17][18][19]. Twitter has proven especially useful for the use cases of threat event detection [17,19], exploit prediction [14,15], and hacker demasking [20]. Other sources in the domain of cybersecurity used for information extraction are blogs [21], bug reports and security advisories [22], forums [7], the dark web [7,23], and official security information sources like vulnerability databases [24]. ...
Conference Paper
Full-text available
Security Operation Centers are tasked with collecting and analyzing cyber threat data from multiple sources to communicate warning messages and solutions. These tasks are extensive and resource-consuming, which makes supporting approaches valuable to experts. However, to implement such approaches, information about the challenges these experts face while performing these tasks is necessary. We therefore conducted semi-structured expert interviews to identify these challenges. By doing so, valuable insights into these challenges, based on expert knowledge, are acquired, which in turn can be leveraged to develop automated approaches that support experts and address these challenges.
... To detect attacks at the scale of multiple interconnected information systems, a dataset at the same scale is needed. We chose to use social networks for this purpose, since monitoring social networks to detect computer attacks has already proven effective (Khandpur et al., 2017; Ritter et al., 2015; Sabottke et al., 2015; Sceller et al., 2017). To detect attacks, we will therefore use event detection algorithms: Atefeh and Khreich (2015) provide a literature review of these. ...
... The SONAR system [30] is much closer to our design philosophy. It offers very rich information on ongoing threats, using sophisticated search criteria. ...
Article
Full-text available
Data from Online Social Networks, search engines, and the World Wide Web are forms of unstructured knowledge that are not regularly used in cybersecurity systems. The main reason for the reluctance to utilize them is the difficulty of processing them effectively and extracting valuable information. In this paper, we present the Systemic Analyzer In Network Threats (SAINT) Observatory Subsystem, or SAINToS for short, a novel platform for the acquisition and analysis of Open-Source Intelligence feeds. The proposed framework integrates different information pools to create a supplementary view of evolving cybercriminal activity. The aim of SAINToS is to provide additional models, methodologies, and mechanisms to enrich existing cybersecurity analysis. As a significant amount of related information is not standardized in the form of structured data tables or machine-processable formats (e.g., XML or JSON), secondary data sources, such as social networks and blogs, are expected to expand the scope and effectiveness of existing approaches. The emphasis of this work is placed on the harmonization and visualization of data from different sources. As a result, these sources can be better understood and reused. In addition, SAINToS, besides its standalone functionality and capabilities, can provide input, in standard formats, to additional major threat intelligence platforms.
... Events themselves can be placed in different categories, such as by type of event or by whether they were identified using supervised or unsupervised techniques [1]. In the supervised case, dictionaries or datasets of recent significant events can be used to help classifiers make their choice of labels from a pool of alternatives [2]. If event detection is supervised or based on a predefined list of events, keyword-based filtering can be used ([?], [4]). ...
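As a toy illustration of the dictionary-assisted choice of labels (the dictionaries and label pool are invented for the example), a candidate label can simply be scored by keyword overlap:

```python
# Toy sketch: choosing an event label from a pool of alternatives by
# keyword overlap with per-label dictionaries (dictionaries invented
# for illustration).
from typing import Optional

EVENT_DICTIONARIES = {
    "ddos": {"ddos", "denial of service", "flood", "botnet"},
    "data_breach": {"breach", "leak", "exfiltration", "dump"},
    "account_hijacking": {"hijack", "takeover", "compromised account"},
}

def pick_label(text: str) -> Optional[str]:
    lowered = text.lower()
    scores = {
        label: sum(kw in lowered for kw in keywords)
        for label, keywords in EVENT_DICTIONARIES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(pick_label("Massive botnet DDoS flood hits DNS provider"))  # ddos
```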
Preprint
Full-text available
The detection of events from online social networks is a recent, evolving field that attracts researchers from across a spectrum of disciplines and domains. Here we report a time-series analysis for predicting events. In particular, we evaluated the frequency distribution of top n-grams of terms over time, focusing on two indicators: high-frequency n-grams over both short and long periods of time. Both indicators can refer to certain aspects of events as they evolve. To evaluate the model's accuracy in detecting events, we built and used a Twitter dataset of the most popular hashtags surrounding the well-documented protests that occurred at the University of Missouri (Mizzou) in late 2015.
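A minimal sketch of the frequency indicator described above might look as follows; the bigram order, window structure, and burst threshold are assumptions, not the paper's exact parameters.

```python
# Sketch: count term bigrams per time window; a sudden jump in a
# bigram's count relative to its history is flagged as a possible
# event. Window structure and burst ratio are illustrative assumptions.
from collections import Counter, defaultdict

def bigrams(text: str):
    toks = text.lower().split()
    return zip(toks, toks[1:])

def detect_bursts(windows, burst_ratio: float = 3.0):
    """windows: list of lists of tweet texts, one list per time window."""
    history = defaultdict(list)
    for t, tweets in enumerate(windows):
        counts = Counter(bg for tw in tweets for bg in bigrams(tw))
        for bg, c in counts.items():
            past = history[bg]
            baseline = sum(past) / len(past) if past else 0.0
            if past and c >= burst_ratio * max(baseline, 1.0):
                print(f"window {t}: burst for '{' '.join(bg)}' (count={c})")
            past.append(c)

detect_bursts([
    ["protest at mizzou today"],
    ["protest at mizzou"] * 5,
])
```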
... As a result, social media is seen as a source for extracting timely and relevant security information, and numerous threat intelligence tools, such as SpiderFoot [21] and IntelMQ [19], collect open source intelligence from them. Scholars [25,60,63,64,77,84,87] have also proposed methods for identifying new vulnerabilities from social media and updating security databases. ...
Preprint
Full-text available
Recently, threat intelligence and security tools have been augmented to use the timely and relevant security information extracted from social media. However, both ordinary users and malicious actors may spread misinformation, which can misguide not only end-users but also the threat intelligence tools. In this work, for the first time, we study the prevalence of cybersecurity and privacy misinformation on social media, focusing on two different topics: phishing websites and Zoom's security & privacy. We collected Twitter posts that were warning users about phishing websites and tried to verify these claims. We found about 22% of these tweets to be invalid claims. We then investigated posts about Zoom's security and privacy on multiple platforms, including Instagram, Reddit, Twitter, and Facebook. To detect misinformation related to Zoom, we first created a ground-truth dataset and a taxonomy of misinformation, and identified the textual and contextual features to be used for training classifiers that detect posts discussing the security and privacy of Zoom and flag misinformation. Our classifiers showed strong performance: the Reddit and Facebook misinformation classifiers reached an accuracy of 99%, while the Twitter and Instagram classifiers reached 98%. Employing these classifiers on the posts from Instagram, Facebook, Reddit, and Twitter, we found that about 3%, 10%, 4%, and 0.4% of Zoom's security and privacy posts, respectively, constitute misinformation. This highlights the need for social media platforms to dedicate resources to curb the spread of misinformation, and for data-driven security tools to propose methods that minimize the impact of such misinformation on their performance.
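As a generic illustration of such a text-feature classifier (not the authors' feature set, data, or reported numbers), a TF-IDF plus logistic regression baseline can be set up in a few lines:

```python
# Generic sketch of a text-based misinformation classifier; the tiny
# training set and labels are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "Zoom sends all your calls to a foreign government",   # misinformation
    "Zoom added end-to-end encryption for paid accounts",  # valid
    "This phishing site steals your bank login",           # valid warning
    "Every Zoom meeting is secretly recorded and sold",    # misinformation
]
labels = [1, 0, 0, 1]  # 1 = misinformation

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(posts, labels)
print(clf.predict(["Zoom secretly sells your recordings"]))
```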
Conference Paper
Full-text available
Twitter contains a wealth of timely information; however, staying on top of breaking events requires that an information analyst constantly scan many sources, leading to information overload. For example, a user might wish to be made aware whenever an infectious disease outbreak takes place, when a new smartphone is announced, or when a distributed Denial of Service (DoS) attack might affect an organization's network connectivity. There are many possible event categories an analyst may wish to track, making it impossible to anticipate all those of interest in advance. We therefore propose a weakly supervised approach, in which extractors for new categories of events are easy to define and train by specifying a small number of seed examples. We cast seed-based event extraction as a learning problem where only positive and unlabeled data is available. Rather than assuming unlabeled instances are negative, as is common in previous work, we propose a learning objective which regularizes the label distribution towards a user-provided expectation. Our approach greatly outperforms heuristic negatives, used in most previous work, in experiments on real-world data. Significant performance gains are also demonstrated over two novel and competitive baselines: semi-supervised EM and one-class support-vector machines. We investigate three security-related events breaking on Twitter: DoS attacks, data breaches, and account hijacking. A demonstration of security events extracted by our system is available at:
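A strongly simplified sketch of that positive-unlabeled objective follows; the exact regularizer form, prior value, and synthetic data are assumptions, not the paper's formulation. The idea is to fit the positive seeds while pulling the average predicted positive rate on the unlabeled pool towards the user-provided expectation, instead of treating unlabeled instances as negatives.

```python
# Simplified positive-unlabeled (PU) sketch: binary cross-entropy on
# the positive seeds plus a penalty that pulls the mean predicted
# positive probability on unlabeled data towards a user-provided prior.
# Feature vectors, prior, and penalty weight are illustrative.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim = 16
pos = torch.randn(8, dim) + 1.0   # a few positive seed examples
unl = torch.randn(200, dim)       # large pool of unlabeled examples
prior = 0.1                       # expected fraction of positives in the pool
lam = 5.0                         # regularizer weight

w = torch.zeros(dim, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([w, b], lr=0.05)

for step in range(200):
    opt.zero_grad()
    p_pos = torch.sigmoid(pos @ w + b)
    p_unl = torch.sigmoid(unl @ w + b)
    loss_pos = F.binary_cross_entropy(p_pos, torch.ones_like(p_pos))
    # Expectation regularization instead of heuristic negatives.
    loss_reg = (p_unl.mean() - prior) ** 2
    loss = loss_pos + lam * loss_reg
    loss.backward()
    opt.step()

print(f"mean P(positive) on unlabeled pool: {p_unl.mean().item():.3f}")
```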
Article
Full-text available
Social media feeds are rapidly emerging as a novel avenue for the contribution and dissemination of information that is often geographic. Their content often includes references to events occurring at, or affecting specific locations. Within this article we analyze the spatial and temporal characteristics of the twitter feed activity responding to a 5.8 magnitude earthquake which occurred on the East Coast of the United States (US) on August 23, 2011. We argue that these feeds represent a hybrid form of a sensor system that allows for the identification and localization of the impact area of the event. By contrasting this with comparable content collected through the dedicated crowdsourcing ‘Did You Feel It?’ (DYFI) website of the U.S. Geological Survey we assess the potential of the use of harvested social media content for event monitoring. The experiments support the notion that people act as sensors to give us comparable results in a timely manner, and can complement other sources of data to enhance our situational awareness and improve our understanding and response to such events.
Conference Paper
Full-text available
We introduce ReDites, a system for realtime event detection, tracking, monitoring and visualisation. It is designed to assist Information Analysts in understanding and exploring complex events as they unfold in the world. Events are automatically detected from the Twitter stream. Then those that are categorised as being security-relevant are tracked, geolocated, summarised and visualised for the end-user. Furthermore, the system tracks changes in emotions over events, signalling possible flashpoints or abatement. We demonstrate the capabilities of ReDites using an extended use case from the September 2013 Westgate shooting incident. Through an evaluation of system latencies, we also show that enriched events are made available for users to explore within seconds of that event occurring.
Article
Twitter, as a form of social media, has been fast emerging in recent years. Users turn to Twitter to report real-life events. This paper focuses on detecting those events by analyzing the text stream in Twitter. Although event detection has long been a research topic, the characteristics of Twitter make it a non-trivial task. Tweets reporting such events are usually overwhelmed by a high flood of meaningless "babbles". Moreover, an event detection algorithm needs to be scalable given the sheer amount of tweets. This paper attempts to tackle these challenges with EDCoW (Event Detection with Clustering of Wavelet-based Signals). EDCoW builds signals for individual words by applying wavelet analysis to the frequency-based raw signals of the words. It then filters away the trivial words by looking at their corresponding signal auto-correlations. The remaining words are then clustered to form events with a modularity-based graph partitioning technique. Experimental studies show promising results for EDCoW. We also present the design of a proof-of-concept system, which was used to analyze netizens' online discussion about the Singapore General Election 2011.
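A strongly simplified sketch of the EDCoW idea follows, skipping the wavelet transform and the modularity-based graph partitioning; signals and thresholds are invented for illustration.

```python
# Strongly simplified EDCoW-style sketch: per-word frequency signals,
# an autocorrelation filter for trivial words, and grouping by signal
# correlation. The wavelet analysis and modularity-based graph
# partitioning of the real system are omitted; thresholds are invented.
import numpy as np

# Rows: words; columns: counts per time window (illustrative data).
words = ["election", "vote", "rally", "the"]
signals = np.array([
    [0, 1, 5, 9, 8, 2, 0, 0],
    [0, 0, 4, 8, 9, 3, 1, 0],
    [1, 0, 0, 0, 6, 7, 2, 0],
    [5, 6, 5, 6, 5, 6, 5, 6],   # flat "babble" word
], dtype=float)

def autocorr1(x: np.ndarray) -> float:
    """Lag-1 autocorrelation of a centered signal."""
    x = x - x.mean()
    denom = (x * x).sum()
    return float((x[:-1] * x[1:]).sum() / denom) if denom else 0.0

# Trivial words have near-zero (or negative) signal autocorrelation.
keep = [i for i in range(len(words)) if autocorr1(signals[i]) > 0.3]

# Group kept words whose signals correlate strongly (crude clustering).
corr = np.corrcoef(signals[keep])
for a in range(len(keep)):
    for b in range(a + 1, len(keep)):
        if corr[a, b] > 0.8:
            print("co-bursting:", words[keep[a]], words[keep[b]])
```

In the real system, wavelet analysis sharpens the per-word signals and modularity-based graph partitioning replaces the crude pairwise-correlation grouping used here.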
Book
The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. The PageRank idea and related tricks for organizing the Web are covered next. Other chapters cover the problems of finding frequent itemsets and clustering. The final chapters cover two applications: recommendation systems and Web advertising, each vital in e-commerce. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike.
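For a flavour of one of the book's core ideas (a textbook sketch, not code from the book), PageRank can be computed by power iteration on a tiny link graph:

```python
# Textbook sketch of PageRank by power iteration on a tiny 4-page web;
# the link graph and damping factor are the usual illustrative choices.
import numpy as np

# Column-stochastic transition matrix M: M[i, j] = 1/outdeg(j) if j links to i.
M = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [1/3, 0.0, 0.0, 0.5],
    [1/3, 0.0, 1.0, 0.5],
    [1/3, 0.5, 0.0, 0.0],
])
n = M.shape[0]
beta = 0.85                  # damping factor
v = np.full(n, 1 / n)        # start from the uniform distribution

for _ in range(50):
    v = beta * M @ v + (1 - beta) / n   # random surfer with teleportation

print(np.round(v, 3))        # steady-state importance of each page
```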
Article
Microblogging services such as Twitter, Facebook, and Foursquare have become major sources of information about real-world events. Most approaches that aim at extracting event information from such sources typically use the temporal context of messages. However, exploiting the location information of georeferenced messages, too, is important to detect localized events, such as public events or emergency situations. Users posting messages close to the location of an event serve as human sensors to describe it. In this demonstration, we present a novel framework to detect localized events in real time from a Twitter stream and to track the evolution of such events over time. For this, spatio-temporal characteristics of keywords are continuously extracted to identify meaningful candidates for event descriptions. Then, localized event information is extracted by clustering keywords according to their spatial similarity. To determine the most important events in a (recent) time frame, we introduce a scoring scheme for events. We demonstrate the functionality of our system, called EvenTweet, using a stream of tweets from Europe during the 2012 UEFA European Football Championship.
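As a rough sketch of the spatial clustering step (a generic DBSCAN over keyword coordinates; this is not the EvenTweet algorithm, and the coordinates and parameters are invented):

```python
# Rough sketch: cluster georeferenced occurrences of one keyword with
# DBSCAN to localize a candidate event. Coordinates, eps, and
# min_samples are invented for illustration; this is not EvenTweet.
import numpy as np
from sklearn.cluster import DBSCAN

# (latitude, longitude) of tweets mentioning the keyword "goal".
coords = np.array([
    [50.08, 14.43], [50.09, 14.44], [50.08, 14.44],   # dense cluster: stadium
    [48.85, 2.35],                                     # lone mention elsewhere
])

labels = DBSCAN(eps=0.05, min_samples=3).fit_predict(coords)
for cluster in set(labels) - {-1}:                     # -1 marks noise points
    center = coords[labels == cluster].mean(axis=0)
    print(f"localized event candidate near {center}")
```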