1st International Workshop on Search and Mining Terrorist Online Content & Advances in Data Science for Cyber Security and Risk on the Web

Abstract

The deliberate misuse of technical infrastructure (including the Web and social media) for cyber deviant and cybercriminal behaviour, ranging from the spreading of extremist and terrorism-related material to online fraud and cyber security attacks, is on the rise. This workshop aims to better understand such phenomena and develop methods for tackling them in an effective and efficient manner. The workshop brings together interdisciplinary researchers and experts in Web search, security informatics, social media analysis, machine learning, and digital forensics, with particular interests in cyber security. The workshop programme includes refereed papers, invited talks and a panel discussion for better understanding the current landscape, as well as the future of data mining for detecting cyber deviance.
Theodora Tsikrika
Centre for Research and Technology Hellas
theodora.tsikrika@iti.gr
Stefanos Vrochidis
Centre for Research and Technology Hellas
stefanos@iti.gr
Babak Akhgar
Sheffield Hallam University
B.Akhgar@shu.ac.uk
Pete Burnap
Cardiff University
burnapp@cardiff.ac.uk
Vasilis Katos
Bournemouth University
vkatos@bournemouth.ac.uk
Matthew L. Williams
Cardiff University
williamsm7@cardiff.ac.uk
Keywords
cybercrime; cyber security; terrorist and extremist content; data
mining; security informatics
1. OVERVIEW AND MOTIVATION
Cyber deviance refers to the deliberate misuse of technical
infrastructure for subversive purposes and includes (but is not
limited to): the spreading of extremist propaganda [1],
antagonistic or hateful commentary [2], the distribution of
malware [3], online fraud, and denial of service attacks. Better
understanding of such phenomena on the Web and social media
allows for their early detection and underpins the development of
effective models for predicting cyber security threats.
To this end, the 1st International Workshop on Cyber Deviance
Detection (CyberDD), held at WSDM 2017 in
Cambridge, UK, focusses on two research tracks: (i) Detecting
and Mining Terrorist Online Content and (ii) Advances in Data
Science for Cyber Security and Risk on the Web. The efforts by
major Web search engines and social media platforms
(independently and in partnership) [4][5][6] towards addressing
terrorist and violent extremist content that may appear on their
services acutely demonstrate both the important challenges faced
by Web Search and Data Mining practitioners and the
pressing need for research towards developing effective and
efficient solutions. Moreover, the exploitation of the recent
advances in Data Science for understanding, detecting, and
forecasting cybercrime requires interdisciplinary approaches that
blend them with Criminology research so as to gain true insights
based on theories, methods, and data.
This workshop targets researchers and practitioners in Web
search, data mining, security informatics, multimedia
understanding, social media analysis, machine learning, and
digital forensics, with particular interests in cyber security. It also
targets industry representatives from search engines and social
media platforms that aim to tackle the challenges of terrorist and
violent extremist content appearing on their services, as well as
criminologists and law enforcement representatives interested in
recent advances in cyber deviance detection and understanding.
2. OBJECTIVES
The main goals of this workshop are: (i) to present original
research on Web search and data mining methods for the
detection, extraction, and analysis of Web content related to
terrorism and violent extremism, as well as methods of
quantitative analysis for the purposes of better understanding and
forecasting threats to cyber security emanating from the Web, (ii)
to bring together researchers from the WSDM, (cyber) security
informatics, and criminology communities, as well as industry
representatives from search engines and social media platforms, to
share ideas and experiences in designing and implementing such
methods, (iii) to evaluate the effectiveness, efficiency, and
maturity of such techniques, and (iv) to raise awareness of the
privacy, legal, and ethical implications of the proposed methods
and techniques.
Permission to make digital or hard copies of part or all of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. Copyrights for third-
party components of this work must be honored. For all other uses, contact
the Owner/Author. Copyright is held by the owner/author(s).
WSDM 2017, February 06-10, 2017, Cambridge, United Kingdom
ACM 978-1-4503-4675-7/17/02.
http://dx.doi.org/10.1145/3018661.3022760
3. TOPICS OF INTEREST
The two tracks of the workshop focus on different, yet
complementary topics.
The track on “Detecting and Mining Terrorist Online Content”
welcomes original research on detection, search, and mining
methods that focus on the particularly challenging and
idiosyncratic terrorist and violent extremist content on the Web
(including social media and the dark Web). Such content
appears in multiple languages and media (e.g. text, images, video,
and audio), is highly volatile, often with short longevity, and
may be covert, even when it is publicly available. The proposed
methods and techniques should aim to allow for the development
of effective and efficient systems and tools that are of particular
interest to major search engines and social media platforms in
their efforts to detect terrorist and violent extremist online content
that may appear on their services, whilst striving to protect
fundamental citizens' rights. The topics include, but are not
limited to, the following areas:
• Discovery and detection of terrorist online content: crawling
  the Web, social media, and darknets;
• Data, entity, and relationship extraction from terrorism-
  related multimedia content;
• Multilingual and multimedia search, mining, classification,
  and clustering of Web terrorist content;
• User profiling, persona modelling, and activity mining;
• Social network analysis for terrorism communities detection
  and key player identification;
• Search interaction log analysis for terrorism detection;
• Experiments and evaluation in Web search and data mining
  of terrorist online content;
• Credibility of discovered terrorist online information; and
• Predictive modelling and early warning for terrorist threats.
The track on “Advances in Data Science for Cyber Security and
Risk on the Web” aims to discuss advances in Data Science and
associated methods of quantitative analysis for the purposes of
better understanding and forecasting threats to cyber security
emanating from the Web. Here, the Web essentially means any interactive
socio-technical system that is underpinned by networked
protocols including, but not limited to, social networking sites,
command and control Web servers, SMS, Web-linked sensor
devices, email, Web logs and wikis etc.; such systems have been
theorised as Social Machines [7]. This track also aims to blend
Criminology with Data Science for the study of cybercrime and
cyber security [8]. Topics of interest include Web mining,
machine learning and statistical modelling for:
• Malware classification and clustering;
• Cybercrime in distributed systems;
• Understanding interactive social systems and implications for
  cyber security;
• Web, IoT and cyber security;
• Modelling deviant behaviour on the Web;
• Criminological theory adapted to the Web;
• Understanding motivations to commit cybercrime; and
• Cybercrime and global politics.
4. REVIEW PROCESS
The call for papers solicited submissions in the form of long (8
pages) and short (4 pages) papers. Each submission was reviewed
by at least three Programme Committee members and final
decisions were made by the workshop chairs. The Programme
Committee members include several experts in the field who
worked diligently in reviewing the submitted papers and
providing constructive feedback to the authors.
5. PROGRAMME
The programme will be presented in the form of a half-day workshop
that will include presentations of the accepted papers, two invited
talks and a small panel. The accepted papers propose methods for
discovering, identifying, and managing terrorist and extremist
content on the Web. The invited talks from leading experts
(researchers, industry representatives of social media platforms
and search engines, or representatives of law enforcement and
public services tackling cyber deviance) will focus on the current
landscape of the detection and analysis of online content related to
terrorism, violent extremism, and hate crime, as well as on the
human and social aspects of cyber deviant and cybercriminal
behaviour from a criminology perspective. For the panel
discussion, we intend to focus on the open challenges identified
during the workshop and in particular on the privacy, ethical, and
legal implications of the proposed methods and approaches.
6. ACKNOWLEDGMENTS
This workshop is partially supported by the EC H2020 project
TENSOR (700024). We are also very grateful to the members of
the Programme Committee.
7. REFERENCES
[1] Chatfield, A. T., Reddick, C. G., and Brajawidagda, U. 2015.
Tweeting propaganda, radicalization and recruitment: Islamic
state supporters multi-sided Twitter networks. In Proceedings of
the 16th Annual International Conference on Digital
Government Research. ACM. New York, NY, USA, 239-249.
DOI: http://dx.doi.org/10.1145/2757401.2757408.
[2] Burnap, P. and Williams, M. L. 2016. Us and them: identifying
cyber hate on Twitter across multiple protected
characteristics. EPJ Data Science, Volume 5, Issue 11. DOI:
http://dx.doi.org/10.1140/epjds/s13688-016-0072-6.
[3] Burnap, P., Javed, A., Rana, O. F. and Awan, M.S. 2015. Real-
time Classification of Malicious URLs on Twitter using
Machine Activity Data. In Proceedings of the 2015 IEEE/ACM
International Conference on Advances in Social Networks
Analysis and Mining (ASONAM 2015), ACM, New York, NY,
USA, 970-977. DOI:
http://dx.doi.org/10.1145/2808797.2809281.
[4] Microsoft’s approach to terrorist content online. Retrieved
December 12, 2016, from Microsoft’s official blog:
https://blogs.microsoft.com/on-the-issues/2016/05/20/microsofts-approach-terrorist-content-online.
[5] Combating Violent Extremism. Retrieved December 12, 2016,
from Twitter’s official blog:
https://blog.twitter.com/2016/combating-violent-extremism.
[6] Partnering to Help Curb Spread of Online Terrorist Content.
Retrieved December 12, 2016, from Facebook’s newsroom.
http://newsroom.fb.com/news/2016/12/partnering-to-help-curb-spread-of-online-terrorist-content/.
[7] Hendler, J. and Berners-Lee, T. 2010. From the Semantic Web
to social machines: A research challenge for AI on the World
Wide Web, Artificial Intelligence, Volume 174, Issue 2, Pages
156-161, ISSN 0004-3702, DOI:
http://dx.doi.org/10.1016/j.artint.2009.11.010.
[8] Choo, KKR. 2011. The cyber threat landscape: Challenges and
future research directions, Computers & Security, Volume 30,
Issue 8, Pages 719-731, ISSN 0167-4048, DOI:
http://dx.doi.org/10.1016/j.cose.2011.08.004.