All content in this area was uploaded by Ryan Scrivens on Aug 11, 2019
CHAPTER 11
SEARCHING FOR EXTREMIST
CONTENT ONLINE USING THE
DARK CRAWLER AND SENTIMENT
ANALYSIS
Ryan Scrivens, Tiana Gaudette, Garth Davies and
Richard Frank
ABSTRACT
Purpose – This chapter examines how sentiment analysis and web-crawling
technology can be used to conduct large-scale data analyses of extremist
content online.
Methods/approach – The authors describe a customized web-crawler that was
developed for the purpose of collecting, classifying, and interpreting extremist
content online and on a large scale, followed by an overview of a relatively novel
machine learning tool, sentiment analysis, which has sparked the interest of
some researchers in the field of terrorism and extremism studies. The authors
conclude with a discussion of what they believe is the future applicability of
sentiment analysis within the online political violence research domain.
Findings – In order to gain a broader understanding of online extremism,
or to improve the means by which researchers and practitioners “search for a
needle in a haystack,” the authors recommend that social scientists continue
to collaborate with computer scientists, combining sentiment analysis software
with other classification tools and research methods, as well as validate
sentiment analysis programs and adapt sentiment analysis software to new
and evolving radical online spaces.
Methods of Criminology and Criminal Justice Research
Sociology of Crime, Law and Deviance, Volume 24, 179–194
Copyright © 2019 by Emerald Publishing Limited
All rights of reproduction in any form reserved
ISSN: 1521-6136/doi:10.1108/S1521-613620190000024016
180 RYAN SCRIVENS ET AL.
Originality/value – This chapter provides researchers and practitioners who are
faced with new challenges in detecting extremist content online with insights
regarding the applicability of a specific set of machine learning techniques and
research methods to conduct large-scale data analyses in the field of terrorism
and extremism studies.
Keywords: Sentiment analysis; web-crawler; machine learning;
terrorism; extremism; internet
INTRODUCTION
Violent extremists and those who subscribe to radical beliefs have left their digital
footprints online since the inception of the World Wide Web. Notable examples
include Anders Breivik, the Norwegian far-right terrorist convicted of killing 77
people in 2011, who was a registered member of a white supremacy web forum
(Southern Poverty Law Center, 2014) and had ties to a far-right wing social media
site (Bartlett & Littler, 2011); Dylann Roof, the 21 year old who murdered nine
Black parishioners in Charleston, South Carolina, in 2015, and who allegedly
posted messages on a white power website (Hankes, 2015); and Aaron Driver, the
Canadian suspected of planning a terrorist attack in 2016, who showed explicit
support for the so-called “Islamic State” (IS) on several social media platforms
(Amarasingam, 2016).
It should come as little surprise that, in an increasingly digital world, iden-
tifying signs of extremism online sits at the top of the priority list for counter-
extremist agencies (Cohen, Johansson, Kaati, & Mork, 2014), with the current
focus of government-funded research on the development of advanced informa-
tion technologies and risk assessment tools to identify and counter the threat of
violent extremism on the Internet (Sageman, 2014). Within this context, criminologists
have argued that successfully identifying radical content online (i.e., behaviors,
patterns, or processes), on a large scale, is the first step in reacting to it (e.g.,
Bouchard, Joffres, & Frank, 2014; Davies, Bouchard, Wu, Joffres, & Frank, 2015;
Frank, Bouchard, Davies, & Mei, 2015; Mei & Frank, 2015; Williams & Burnap,
2015). Yet in little more than a decade, it is estimated that the number of individuals
with access to the Internet has increased threefold (Internet World Stats, 2019),
from over 1 billion users in 2005 to more than 3.8 billion as of 2019 (Internet Live
Stats, 2019). With all of these new users, more information has been generated,
leading to a flood of data.
It is becoming increasingly difficult, nearly impossible really, to manually
search for violent extremists, potentially violent extremists, or even users who
post radical content online because the Internet contains an overwhelming
amount of information. These new conditions have necessitated guided data filtering
methods, those that can side-step – and perhaps one day even replace – the
laborious manual methods that traditionally have been used to identify relevant
Searching for Extremist Content Online 181
information online (Brynielsson et al., 2013; Cohen et al., 2014). As a result of
this changing landscape, governments around the globe have engaged researchers
to develop advanced information technologies, machine learning algorithms, and
risk assessment tools to identify and counter extremism through the collection
and analysis of large-scale data made available online (see Chen, Mao, Zhang, &
Leung, 2014). Whether this work involves finding radical users of interest (e.g.,
Klausen, Marks, & Zaman, 2018), measuring digital pathways of radicalization
(e.g., Hung, Jayasumana, & Bandara, 2016), or detecting virtual indicators that
may prevent future terrorist attacks (e.g., Johansson, Kaati, & Sahlgren, 2016),
the urgent need to pinpoint extremist content online is one of the most significant
challenges faced by law enforcement agencies and security officials worldwide
(Sageman, 2014).
We have been part of this growing field of research at the International
CyberCrime Research Centre (ICCRC), situated in Simon Fraser University’s
School of Criminology.1 Our work has ranged from identifying radical users in
online discussion forums (e.g., Scrivens, Davies, & Frank, 2017) to understand-
ing terrorist organizations’ online recruitment efforts on various online platforms
(e.g., Davies et al., 2015), to evaluating linguistic patterns presented in the online
magazines of terrorist groups (e.g., Macnair & Frank, 2018a, 2018b). These experiences
have provided us with insights regarding the applicability of a specific
set of machine learning techniques and research methods to conduct large-scale
data analyses of extremist content online.2 In what follows, we will first describe
a customized web-crawler that was developed at the ICCRC for the purpose of
collecting, classifying, and interpreting extremist content online and on a large
scale. Second, we will provide an overview of a relatively novel machine learning
tool, sentiment analysis, which has sparked the interest of some researchers in
the field of terrorism and extremism studies who are faced with new challenges in
detecting extremist content online. Third, we conclude with a discussion of what
we believe is the future applicability of sentiment analysis within the online politi-
cal violence research domain.
Before proceeding, however, it is necessary to outline how we conceptualize
extremist content online. We define it as text-, audio-, and/or video-based online
material containing radical views – counter to mainstream opinion – that may or
may not promote violence in the name of a radical belief. At the ICCRC, we focus
primarily on text-based extremist content that has radical right-wing or jihadi
leanings. For the former, radical right-wing material is characterized by racially,
ethnically and sexually defined nationalism, which is typically framed in terms
of white power and grounded in xenophobic and exclusionary understandings
of the perceived threats posed by such groups as non-whites, Jews, immigrants,
homosexuals, and feminists (see Perry & Scrivens, 2016). For the latter, we define
jihadi material as supportive of the creation of an expansionist Islamic state or
khalifa, the imposition of sharia law with violent jihad as a central component,
and the use of local, national, and international grievances affecting Muslims (see
Moghadam, 2008).
EXTRACTING EXTREMIST CONTENT ONLINE:
THE DARK CRAWLER
In recent years, researchers have shown a keen interest in developing web-crawler
tools to collect large volumes of content on the Internet. This interest has
made its way into terrorism and extremism studies over the past decade or more (e.g.,
Abbasi & Chen, 2005; Bouchard et al., 2014; Chen, 2012; Fu, Abbasi, & Chen,
2010; Zhang et al., 2010; Zhou, Qin, Lai, Reid, & Chen, 2005). Some research-
ers have used standard, off-the-shelf web-crawler tools that are readily available
online for a fee, while others have developed custom-written computer programs
to collect high volumes of information online. The Dark Crawler (TDC) is one
example of this latter approach.
Web-crawlers, also known as “crawlers,” “data scrapers,” and “data parsers,”
are the tools used by all search engines to automatically map and navigate the
Internet as well as collect information about each website and webpage that a
web-crawler visits. There are many off-the-shelf web-crawling and page-parsing
tools available on the Internet, some sold commercially and others free, such as
Win Web Crawler,3 WebSPHINX,4 Black Widow,5 or BeautifulSoup.6 Once an end-user
decides on a website from which
parsing will begin, the crawler recursively follows the links from that webpage
until some user-specified condition is met (discussed below), capturing all content
along the way. During this process, the software tracks all the links between it and
other websites and, if an end-user so chooses, the software will follow and retrieve
those links as well. The content, as it is retrieved, is then saved to the hard-drive of
the user for later analysis. In short, the purpose of most web-crawlers is to simply
save the retrieved content onto a hard-drive, essentially “ripping” a webpage or
website because it contains content desired by the end-user. More advanced ana-
lytic capabilities usually are not part of the package.
Similar in spirit, TDC browses the World Wide Web, but since it is a custom-
written computer program, it is much more flexible than the abovementioned
process. In particular, TDC is capable of seeking out extremist content online,
among other types of content, based on user-defined keywords and other
parameters (discussed below). As TDC visits each page, it captures all of the
content on that page for later analysis, while simultaneously collecting informa-
tion about the content and making decisions as to whether the page includes
extremist content. The idea of this approach is based on a combination of the
work associated with the Dark Web project at the University of Arizona (see
Chen, 2012) and a previous project at the ICCRC that identified and explored
online child exploitation websites (e.g., Allsup, Thomas, Monk, Frank, &
Bouchard, 2015; Joffres, Bouchard, Frank, & Westlake, 2011; Frank, Westlake,
& Bouchard, 2010; Monk, Allsup, & Frank, 2015; Westlake & Bouchard, 2015;
Westlake, Bouchard, & Frank, 2011). TDC has since demonstrated its benefit
in investigating online networks and communities in general (e.g., Frank,
Macdonald, & Monk, 2016; Macdonald & Frank, 2016, 2017; Macdonald,
Frank, Mei, & Monk, 2015; Mikhaylov & Frank, 2016, 2018; Zulkarnine, Frank,
Monk, Mitchell, & Davies, 2016) and extremist content online in particular
(e.g., Bouchard et al., 2014; Davies et al., 2015; Frank et al., 2015; Levey,
Bouchard, Hashimi, Monk, & Frank, 2016; Mei & Frank, 2015; Scrivens et al.,
2017; Scrivens, Davies, & Frank, 2018; Scrivens & Frank, 2016; Wong, Frank,
& Allsup, 2015).
TDC is a system that can be distributed across multiple virtual machines,
depending on the number of machines that are available. The first of four steps,
as expressed in Fig. 1, is to define a task along with its parameters. TDC can
handle multiple tasks simultaneously, each of which is given a priority specified
by the end-user. The priority of each task determines the share of
machines allocated to it. For example, if tasks I, II, and III are given priorities of 50,
80, and 70, respectively, then task I will receive 25% of the available resources
(50/(50+80+70) = 0.25 = 25%). If more machines are added to TDC, then the
absolute number of resources available for each task will increase but the relative
amount of resources available for each task will remain the same.
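The proportional allocation described above can be illustrated with a minimal sketch. It is not TDC's actual scheduler: the function names `resource_shares` and `machines_per_task` are hypothetical, and fractional machine counts are left unrounded for simplicity.

```python
def resource_shares(priorities):
    """Return each task's relative share of resources, proportional to its priority."""
    total = sum(priorities.values())
    if total <= 0:
        raise ValueError("priorities must sum to a positive number")
    return {task: p / total for task, p in priorities.items()}


def machines_per_task(priorities, n_machines):
    """Absolute allocation: the relative share multiplied by the machines available."""
    shares = resource_shares(priorities)
    return {task: share * n_machines for task, share in shares.items()}


# Priorities taken from the example in the text.
priorities = {"I": 50, "II": 80, "III": 70}
shares = resource_shares(priorities)
print(shares["I"])  # 0.25, i.e., 50 / (50 + 80 + 70)

# Doubling the (assumed) pool from 20 to 40 machines doubles the absolute
# allocation of every task while leaving the relative shares unchanged.
small_pool = machines_per_task(priorities, 20)
large_pool = machines_per_task(priorities, 40)
```

Under these assumptions, task I would receive 5 of 20 machines and 10 of 40, which is 25% in both cases.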
Each task consists of four parameters to prevent it from perpetually crawling
the Internet and wandering into websites and webpages unrelated to extremism.
The parameters are as follows:
Fig. 1. Overview of The Dark Crawler.
Number of Webpages
For practical purposes, since the number of webpages on the Internet is infinite,
restrictions must be placed on the number of webpages that are retrieved by the
web-crawler. Theoretically, any web-crawler could crawl for a very long time and
store the entire collection of webpages on the Internet. For our purposes at the
ICCRC, however, this is infeasible for several reasons. First, the amount of stor-
age that would be required to warehouse the extracted data is beyond the scope of
any sensible research project. Second, webpages are created at a much higher rate
than what can be extracted with TDC. Lastly, a copy of the “full Internet” is not
required to draw meaningful conclusions about a particular topic under investi-
gation; extracting large scale, representative samples, is more than adequate.
Number of Domains
The number of Internet domains that TDC will collect data on must be specified.
When limiting a crawl to n pages, the crawler will attempt to distribute the
sampling equally across all websites that it has encountered, meaning that TDC
will sample similar numbers of pages from each of the sites it visits. As a result,
at the end of the task, if w websites are sampled, all sites will have approximately
the same number of pages retrieved (=n/w) and analyzed.
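As a rough illustration of the n/w rule, the sketch below splits a page budget as evenly as possible across the sites encountered. The function name `per_site_quota` and the example site names are hypothetical, and how TDC handles a remainder when n is not divisible by w is not stated in the text; here the leftover pages are simply given to the first sites in the list.

```python
def per_site_quota(n_pages, sites):
    """Split a budget of n pages as evenly as possible across w sites (about n/w each)."""
    w = len(sites)
    if w == 0:
        return {}
    base, extra = divmod(n_pages, w)
    # Assumption: the first `extra` sites receive one additional page,
    # so that the quotas add up to exactly n_pages.
    return {site: base + (1 if i < extra else 0) for i, site in enumerate(sites)}


quota = per_site_quota(1000, ["site-a.example", "site-b.example", "site-c.example"])
```

With n = 1000 and w = 3, each site is assigned 333 or 334 pages, which is approximately n/w.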
Trusted Domains
A set of trusted domains is then specified by the end-user, which tells the crawler
that all contents on those domains should not be extracted. As an example, it can
be assumed, with a high level of certainty, that a website such as www.microsoft.com
does not include extremist material. Accordingly, TDC is trained to
assume that the website does not contain extremist content, and as such, it would
not retrieve any pages from that site. Without having this mechanism in place,
TDC could wander into a search engine, directing it completely off topic and
making the resulting extracted network irrelevant to the specified topic and task.
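One simple way to express a trusted-domain check is sketched below. It is an illustration only, not TDC's implementation: the set `TRUSTED_DOMAINS` and the functions `is_trusted` and `links_to_follow` are hypothetical names, and the sketch assumes that subdomains of a trusted domain are also trusted.

```python
from urllib.parse import urlparse

# Hypothetical trusted list; the text gives www.microsoft.com as an example.
TRUSTED_DOMAINS = {"microsoft.com"}


def is_trusted(url, trusted=TRUSTED_DOMAINS):
    """True if the URL's host is a trusted domain or one of its subdomains."""
    # URLs without a scheme (e.g., "www.microsoft.com") have no hostname
    # according to urlparse and are therefore treated as not trusted here.
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in trusted)


def links_to_follow(links, trusted=TRUSTED_DOMAINS):
    """Drop links into trusted domains so the crawler never retrieves them."""
    return [link for link in links if not is_trusted(link, trusted)]


remaining = links_to_follow([
    "https://www.microsoft.com/en-us",
    "https://forum.example.org/thread/1",
])
```

Matching on the full host name or a dot-prefixed suffix avoids treating an unrelated host such as notmicrosoft.com as trusted.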
Keywords
The purpose of TDC is to find, analyze, and map out the websites and webpages
that include extremist content. To achieve this, TDC recursively retrieves
all webpages that are linked from the webpage it is currently reviewing. However,
since extremist webpages consist of a very small subset of all the webpages on
the Internet (Frank et al., 2015), it would be expected that, unconstrained, TDC
would very quickly start to retrieve webpages that are completely unrelated to
extremism. As a result, some mechanism must be built into TDC that controls
which webpages it uses in its exploration process. This is done through the use
of keywords, which are user-specified words that have been found to be indicative
of extremist content (e.g., Bouchard et al., 2014; Davies et al., 2015; Wong et
al., 2015) and thus indicate to TDC that the pages being retrieved are on-topic.
Within the extremist domain, such keywords could include gun, weapon, or terror.
To make this mechanism more robust, though, a word counter can be included
in TDC, which sets a minimum threshold on the number of keywords that
must exist on a page before TDC considers it on topic.
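A keyword threshold of this kind might look like the following sketch. The function names and the default threshold of three are hypothetical, and the sketch assumes that the counter tallies every occurrence of a keyword rather than the number of distinct keywords; the text does not specify which is used.

```python
import re


def keyword_hits(text, keywords):
    """Count occurrences of any keyword in the text (case-insensitive, whole words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    wanted = {k.lower() for k in keywords}
    return sum(1 for token in tokens if token in wanted)


def is_on_topic(text, keywords, min_hits=3):
    """A page is on topic once it contains at least `min_hits` keyword occurrences."""
    return keyword_hits(text, keywords) >= min_hits


# Example keywords named in the text; the page text is invented for illustration.
KEYWORDS = ["gun", "weapon", "terror"]
page = "The gun and the weapon; a gun."
hits = keyword_hits(page, KEYWORDS)
```

Here the invented page contains three keyword occurrences, so it passes a threshold of three but would fail a threshold of four.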
Once a task has been decided, each webpage is downloaded by TDC (Fig. 1–
Step 2). If the downloaded page meets the parameters laid out above, then the
page is considered “on topic,” all page content is saved and all page links are fol-
lowed out of it recursively (Step 3). The webpage contents are then stored in the
database, and various reports and analyses can be performed (Step 4).
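The download, check, save, and follow cycle can be sketched as a small crawl loop. This is a simplified illustration under several assumptions: it uses an in-memory dictionary (`TOY_WEB`) instead of real HTTP requests, a queue rather than literal recursion, and a single keyword test in place of the full set of task parameters. All names are hypothetical.

```python
from collections import deque


def crawl(seed, fetch, is_on_topic, max_pages):
    """Breadth-first crawl that saves on-topic pages and only follows their links."""
    queue, seen, saved = deque([seed]), {seed}, {}
    downloaded = 0
    while queue and downloaded < max_pages:
        url = queue.popleft()
        page = fetch(url)              # Step 2: download the page
        downloaded += 1
        if page is None:
            continue                   # page could not be retrieved
        text, links = page
        if not is_on_topic(text):
            continue                   # off topic: neither saved nor expanded
        saved[url] = text              # Step 3: save the content ...
        for link in links:             # ... and follow the page's links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return saved                       # Step 4: hand over for storage and analysis


# Toy "web": URL -> (page text, outgoing links). Entirely invented.
TOY_WEB = {
    "a": ("terror weapon gun", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("gun weapon terror forum", ["a", "d"]),
    "d": ("weapon gun terror", []),
}

result = crawl("a", TOY_WEB.get, lambda text: "terror" in text, max_pages=10)
```

In this toy run, page "b" is downloaded but judged off topic, so it is not saved and its links are not followed; page "d" is still reached through "c".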
DATA MINING EXTREMIST CONTENT ONLINE:
SENTIMENT ANALYSIS
The use of keywords presents a useful first step in identifying large-scale patterns
in extremist content online (e.g., Chalothorn & Ellman, 2012; Bouchard et al.,
2014; Davies et al., 2015; Wong et al., 2015). However, the use of single keywords
may lead to misleading interpretations of content (Mei & Frank, 2015; Scrivens
& Frank, 2016). If, for example, on a particular webpage, the words gun and con-
trol are found within close proximity of each other, it might be concluded that the
page is discussing gun control. This would most likely not indicate an extremist
page but more likely that the page was written by a proponent or opponent of
gun ownership. This page, of course, is not relevant to TDC’s data collection. On
the other hand, a page containing the words gun and control could be discussing
“controlling someone with a gun” within the context of a kidnapping, for example,
in which case TDC should continue with its analysis. In other words, keywords
can give an indication of the content within a webpage but cannot be used
to determine exactly what that content is about. For a more complete understanding
of the content of a specific piece of text, more powerful computational tools,
such as sentiment analysis, are required.
Sentiment analysis, also known as “opinion mining,” is a category of comput-
ing science that specializes in evaluating the opinions found in a piece of text by
organizing data into distinct classes and sections, and then assigning a piece of
text with a positive, negative, or neutral polarity value (Abbasi & Chen, 2005).
Sentiment analysis also provides a more targeted view of textual data by allowing
for the demarcation between cases that are sought after and those without any
notable relevance. Sentiment analysis has been used in a wide variety of settings,
including customer review analysis for products (e.g., Feldman, 2013), assess-
ments of attitudes toward events or products on social media platforms (e.g.,
Ghiassi, Skinner, & Zimbra, 2013), and for various analyses of extremist content
online (e.g., Bermingham, Conway, McInerney, O’Hare, & Smeaton, 2009; Chen,
2008; Williams & Burnap, 2015). Sentiment analysis has become increasingly
popular in terrorism and extremism studies because, as the amount of “opinion-
ated data” online grows exponentially, sentiment analysis software offers a wide
range of applications that can help address previously untapped and challenging
research problems (see Liu, 2012). Based on the notion that an author’s opinion
toward a particular topic is reflected in the choice and intensity of words he or
she chooses to communicate, sentiment analysis software allows for identification
and classification of opinions found in a piece of text (Abbasi, Chen, & Salem,
2008). Typically, this occurs through a two-step process that produces a
“polarity value”:
(1) a body of text is split into sections (sentences) to determine subjective and
objective content and
(2) subjective content is classified by the software as being either positive, neutral,
or negative, where positive scores reflect positive attitudes and negative
scores reflect negative attitudes (see Feldman, 2013).
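The two steps listed above can be sketched with a toy lexicon-based classifier. This is a deliberately simplified illustration, not any particular sentiment analysis program: the word list `LEXICON` and its scores are invented, and a sentence is treated as subjective simply because it contains at least one lexicon word.

```python
import re

# Invented toy lexicon: word -> polarity score.
LEXICON = {"love": 2, "great": 2, "good": 1, "bad": -1, "hate": -2, "awful": -2}


def words(sentence):
    return re.findall(r"[a-z']+", sentence.lower())


def split_sentences(text):
    """Step 1a: split a body of text into sentences."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_subjective(sentence):
    """Step 1b (assumption): a sentence is subjective if it has any lexicon word."""
    return any(w in LEXICON for w in words(sentence))


def classify(text):
    """Step 2: label each subjective sentence as positive, neutral, or negative."""
    results = []
    for sentence in split_sentences(text):
        if not is_subjective(sentence):
            continue  # objective content is set aside
        score = sum(LEXICON.get(w, 0) for w in words(sentence))
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        results.append((sentence, label))
    return results


labels = classify("The forum has 200 members. I love this forum! I hate the moderators.")
```

In this example the first sentence contains no lexicon words and is skipped as objective, while the remaining two are labeled positive and negative, respectively.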
Worth highlighting, though, is the fact that sentiment analysis is not without
its limitations. It is estimated, for example, that 21% of the time humans cannot
agree among themselves about the sentiment within a given piece of text (Ogneva,
2010), with some individuals unable to understand subtle context or irony.
Understandably, sentiment analysis systems cannot be expected to have 100%
accuracy when compared to the opinions of humans. Sentiment analysis does,
however, provide insight into authors’ expressions and reactions toward certain
events or actions, for example, and one sentiment analysis program that has been
widely used by criminologists in terrorism and extremism studies is SentiStrength
(e.g., Frank et al., 2015; Levey et al., 2016; Macnair & Frank, 2018a, 2018b; Mei &
Frank, 2015; Scrivens et al., 2017, 2018; Scrivens & Frank, 2016).
SentiStrength uses a “lexical approach” – which maintains that an essential
part of understanding a language rests on the ability to understand the patterns
of and meanings associated with language (see Lewis, 1993) – and is based on
human coded data as well as various lexicons (i.e., dictionaries of phrases and
words). Although the program is designed for short informal online texts, there
are features in place that allow longer texts to be analyzed (Thelwall & Buckley,
2013). SentiStrength analyzes a text by attributing positive, neutral, or negative
values to words within the text, and these values are augmented by “booster
words” that can influence the values assigned to the text as well as negating words,
punctuation, and other features that are uniquely suited for studying an online
context (Thelwall & Buckley, 2013). One of the features of SentiStrength is its
ability to analyze the sentiment around any given keyword. For example, the
phrase “I love apples but hate oranges” can be analyzed for the sentiment around
apples (resulting in a positive outcome) as well as oranges (resulting in a negative
outcome). The words around a given set of keywords are compared to the senti-
ment dictionary and their resulting values make up the total sentiment score for
any given text. Logically, a negative value implies negative sentiment for the ideas
expressed in the text while a positive value implies overall support.
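The idea of scoring the sentiment around a keyword can be illustrated with a toy example. The sketch below is not SentiStrength and does not reproduce its algorithm or lexicon: the word scores are invented, and "around" is assumed to mean one word on either side of the keyword, a window chosen only so that the example sentence from the text separates cleanly.

```python
import re

# Invented toy lexicon; real sentiment dictionaries are far larger.
LEXICON = {"love": 2, "like": 1, "dislike": -1, "hate": -2}


def keyword_sentiment(text, keyword, window=1):
    """Sum the lexicon scores of words within `window` tokens of each keyword hit."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0
    for i, token in enumerate(tokens):
        if token != keyword.lower():
            continue
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        total += sum(LEXICON.get(t, 0) for t in tokens[start:end])
    return total


sentence = "I love apples but hate oranges"
apples = keyword_sentiment(sentence, "apples")    # 2: "love" is adjacent to "apples"
oranges = keyword_sentiment(sentence, "oranges")  # -2: "hate" is adjacent to "oranges"
```

A wider window would mix the two opinions in this sentence, which illustrates why the choice of context around a keyword matters.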
To analyze a given piece of text for multiple keywords, multiple iterations must
be done, with each iteration consisting of an analysis of the same piece of text but
with a different keyword (resulting in the sentiment toward that keyword). Due
to the very specific nature of this procedure, it is necessary to input each form
of a particular word being analyzed. For example, to analyze sentiment around
the word kill, it is necessary to also analyze the words kills, killing, killed, and
all other derivatives of the word. Multiple iterations of SentiStrength are then
applied, each one returning the detected sentiment around the specific keyword
and its derivatives. At the end of this process, each text is linked to multiple senti-
ment values, one for each keyword. Those multiple values are then averaged to
produce a single sentiment score for any given piece of text.
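The iteration and averaging procedure can be sketched as follows. The scorer `score_around` is an invented stand-in for a real sentiment program, the lexicon is a toy, and the sketch assumes that scores for the different forms of a keyword are summed before the per-keyword values are averaged; the text does not specify how forms are pooled.

```python
import re
from statistics import mean

# Invented toy lexicon.
LEXICON = {"love": 2, "good": 1, "bad": -1, "hate": -2}


def score_around(text, keyword, window=1):
    """Stand-in scorer: sum lexicon values within `window` tokens of the keyword."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0
    for i, token in enumerate(tokens):
        if token == keyword:
            total += sum(LEXICON.get(t, 0) for t in tokens[max(0, i - window): i + window + 1])
    return total


def text_sentiment(text, keyword_forms):
    """One pass per word form, one value per keyword, then a single average."""
    per_keyword = {
        base: sum(score_around(text, form) for form in forms)
        for base, forms in keyword_forms.items()
    }
    return per_keyword, mean(list(per_keyword.values()))


# Each keyword is listed with the derivatives that must be entered separately.
KEYWORD_FORMS = {
    "kill": ["kill", "kills", "killing", "killed"],
    "peace": ["peace"],
}

per_keyword, overall = text_sentiment("They hate killing. A good peace lasts.", KEYWORD_FORMS)
```

For this invented text the keyword kill receives -2, the keyword peace receives 1, and the averaged score for the text is -0.5.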
THE FUTURE OF DETECTING EXTREMIST
SENTIMENT ONLINE
There has been a shift in recent years in how researchers investigate online communities,
ranging from the study of how extremists communicate through social
media (e.g., Bermingham et al., 2009) to the analysis of users connecting through
online health forums (e.g., Wang, Kraut, & Levine, 2012), for example. In short,
researchers in this area are shifting from manual identification of specific online
content to algorithmic techniques to do similar yet larger-scale tasks. The use of
such analytical approaches is becoming increasingly apparent in criminology and
criminal justice research (see Hannah-Moffat, 2018). This is a symptom of what
some have described as the “big data” phenomenon – that is, a massive increase
in the amount of data that is readily available, particularly online (see Chen et
al., 2014).
Logically, a number of researchers who study how terrorists and extremists use
the Internet have turned to sentiment analysis and other machine learning tech-
niques to identify and, by extension, analyze content of interest on a large scale.
In what follows, we will discuss what we believe is the future of the applicability of
sentiment analysis to explore extremist content online, drawing from the recom-
mendations of the below-listed studies in combination with our own experience
with sentiment analysis. In short, we suggest that the future of sentiment analysis
in exploring extremist content online should: (1) encourage social scientists and
computer scientists to collaborate with one another; (2) consider a combination
of analyses or more features to increase classifiers’ effectiveness; (3) continue to
validate sentiment analysis programs; and (4) apply and adapt sentiment analysis
to new and evolving radical online spaces.
Collaborations
In order to gain a broader understanding of online extremism, or to improve the
means by which researchers and practitioners “search for a needle in a haystack,”
social scientists and computer scientists must collaborate with one another.
Historically, large-scale data analyses have been conducted by computer scientists
and technical experts, which can be problematic in complex fields such as
terrorism and extremism research. Computer and technical experts tend to take a
high-level methodological perspective, measuring levels of – or propensity toward –
radicalization, or ways of identifying violent extremists, or predicting the next
terrorist attack. But searching for radical material online without a fundamen-
tal understanding of the radicalization process or how terrorists and extremists
use the Internet can be counterproductive. Social scientists, on the other hand,
may be well-versed in terrorism and extremism research, but most tend to be
ill-equipped to manage large-scale data – from collecting to formatting to archiv-
ing large volumes of information. Bridging the computer science and social
science approaches to build on the strengths of each discipline offers perhaps
the best chance to construct a useful framework for detecting extremist content
online, as well as assisting authorities in addressing the threat of violent extrem-
ism as it evolves in the online milieu.
Combinations
A myriad of research shows that combining sentiment analysis with other meth-
ods and/or semantic-oriented approaches improves the detection of extremist
content online, on three fronts. This, we argue, is the future in detecting extremist
content online. First, research suggests that sentiment analysis software, in combination
with classification software, is an effective method to detect extremist
content online and on a large scale. In particular, combining sentiment analysis
with classification software – which identifies similarities and differences in datasets
and makes determinations about the data in a decision tree format – can
pinpoint extremist websites (see Mei & Frank, 2015; see also Scrivens & Frank,
2016). In addition, combining sentiment analysis with affect analysis – a machine
learning technique that measures the emotional content of communications – can
detect and measure the intensity levels associated with a broad range of emotions
in text found in online forums (see Chen, 2008; see also Figea, Kaati, & Scrivens,
2016). Research similarly suggests that the effectiveness of the sentiment analysis
in detecting extremist content can be significantly boosted with additional classification
feature sets, such as syntactic, stylistic, content-specific, and lexicon features
(Abbasi & Chen, 2005; Abbasi et al., 2008; Yang et al., 2011).
Second, research suggests that combining sentiment analysis with commonly
used research methods and frameworks in criminology and criminal justice
research can aid in the detection of extremist content online. Social network anal-
ysis, for example, in combination with sentiment analysis can be used to identify
users on YouTube who may have had a radicalizing agenda (see Bermingham et
al., 2009), detect extremist content on Twitter (Wei, Sing, & Martin, 2016), and
model online propaganda (see Burnap et al., 2014) and cyberhate (see Williams &
Burnap, 2015) on Twitter following the Woolwich terrorism incident. In addition,
combining sentiment analysis with geolocation software can be used to organ-
ize the opinions found on Twitter accounts using hashtags associated with IS
(Mirani & Sasi, 2016). Lastly, sentiment analysis, in combination with an algo-
rithm that incorporates criminal career measures (i.e., volume, severity, and dura-
tion) developed by Blumstein, Cohen, Roth, and Visher (1986), has been used to
account for unique components of a user’s online posting behavior and detect
the most radical authors in a large scale sample of online postings (see Scrivens
et al., 2017).
Third, in online political violence research, a growing emphasis has been placed
on the integration of a temporal component with sentiment analysis – a combi-
nation that we believe is key to providing insight into the patterns and trends
that characterize the development of extremist communities over time online.
For example, combining sentiment analysis with survival analysis has been a use-
ful way to measure propaganda surges (see Burnap et al., 2014) and levels of
cyberhate (Williams & Burnap, 2015) on Twitter over time after a terrorist attack.
Other approaches that have proven to be effective in understanding temporal pat-
terns in radical online communities include: (1) mapping users’ sentiment and
affect changes in extremist forums (see Figea et al., 2016); (2) identifying significant
temporal spikes in extremist forums that coincide with real-world events (see
Park, Beck, Fletche, Lam, & Tsang, 2016); and (3) measuring the evolution of sentiment
found in IS-produced online propaganda magazines (see Macnair & Frank,
2018b; see also Vergani & Bluic, 2015). Most recently, combining sentiment
analysis with semi-parametric group-based modeling to measure the evolution of
radical posting trajectories has been shown to be an effective way to detect large-scale
temporal patterns in radical online communities (see Scrivens et al., 2018).
Validation
Another common thread that binds together the aforementioned studies is the
need to assess and potentially improve the classification accuracy and content
identification offered by sentiment analysis software. Researchers, for example,
have proposed that future work includes a “comparative human evaluation”
component to validate a sentiment program’s classications (e.g., Chalothorn &
Ellman, 2012; Figea et al., 2016). Macnair and Frank (2018a) further added that:
computer-assisted techniques such as sentiment analysis, in conjunction with human oversight,
can aid in the overall process of locating, identifying, and eventually, countering the narratives
that exist within extremist media. (p. 452)
This technique would have humans rate opinions in sentences and compare the
results to a sentiment analysis program. By comparing how a human might clas-
sify a piece of text to a sentiment analysis program, researchers can gain insight
into the accuracy of sentiment analysis’ classifications. Future studies should also
integrate a qualitative understanding of how machine learning tools in general
and sentiment analysis software in particular make decisions about the content
that the tools analyze. Doing so may increase the reliability of the results and
increase the likelihood of identifying radical content online (Scrivens & Frank,
2016).
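A comparative human evaluation of this kind can be summarized with simple agreement statistics. The sketch below computes raw percent agreement and Cohen's kappa, a chance-corrected agreement measure that is our choice for illustration rather than one named in the studies cited above. The labels are invented.

```python
from collections import Counter


def percent_agreement(human, program):
    """Share of texts on which the human coder and the program assign the same label."""
    if len(human) != len(program) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == p for h, p in zip(human, program)) / len(human)


def cohens_kappa(human, program):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = percent_agreement(human, program)
    human_counts, program_counts = Counter(human), Counter(program)
    labels = set(human) | set(program)
    expected = sum(human_counts[l] * program_counts[l] for l in labels) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for six texts.
human = ["pos", "neg", "neg", "neu", "pos", "neg"]
program = ["pos", "neg", "neu", "neu", "pos", "pos"]

agreement = percent_agreement(human, program)
kappa = cohens_kappa(human, program)
```

For these invented labels the two sources agree on four of six texts (about 67%), and kappa is 0.52 once chance agreement is taken into account.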
Also, it is not yet clear which sentiment analysis program is the most accu-
rate or effective overall in detecting extremist content online. Some research
does, however, draw comparisons between the performance of several sentiment
methods (i.e., SentiWordNet, SASA, PANAS-t, Emoticons, SentiStrength, LIWC,
SenticNet, and Happiness Index) (see Gonçalves, Benevenuto, Araújo, & Cha,
2013), but comparisons of this sort have yet to be explored within the online
political violence domain. One notable exception is the exploration of the linguistic
patterns on Twitter following the Manchester attack and the Las Vegas shooting
(see Kostakos, Nykänen, Martinviita, Pandya, & Oussalah,
2018). Comparing the results of NLTK+SentiWordNet and SentiStrength soft-
ware, the authors concluded that SentiStrength performed at a higher level than
the other applications. Building from that study, future work should continue to
explore and test the wide variety of programs currently available to determine if
there is indeed one ‘superior’ method, or if the appropriate methodology is
context-specific. A combination of sentiment analysis tools could also be integrated
into an analysis so that the tools cross-validate one another (Scrivens et al., 2017).
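A minimal sketch of how several tools might cross-validate one another follows. It assumes each tool has already produced one polarity score per text on a common scale from -1 to 1; the tool names (`tool_a`, `tool_b`, `tool_c`), the scores, the `threshold` value, and the helper functions are all hypothetical and do not reproduce the output or interface of any real sentiment program. Texts on which every tool agrees receive a consensus label, and the rest are flagged for human review.

```python
def to_label(score, threshold=0.1):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


def cross_validate(scores_by_tool, threshold=0.1):
    """Split text indices into unanimous labels and disputed cases.

    scores_by_tool maps a tool name to a list of scores, one per text,
    with all lists in the same text order.
    """
    if not scores_by_tool:
        raise ValueError("At least one tool is required.")

    lengths = {len(scores) for scores in scores_by_tool.values()}
    if len(lengths) != 1:
        raise ValueError("Every tool must score the same number of texts.")

    consensus = {}   # text index -> label shared by every tool
    disputed = []    # text indices on which the tools disagree
    for index in range(lengths.pop()):
        labels = {
            to_label(scores[index], threshold)
            for scores in scores_by_tool.values()
        }
        if len(labels) == 1:
            consensus[index] = labels.pop()
        else:
            disputed.append(index)
    return consensus, disputed


# Hypothetical scores for four texts from three unnamed tools.
scores_by_tool = {
    "tool_a": [-0.8, 0.4, 0.0, -0.3],
    "tool_b": [-0.6, 0.5, 0.05, 0.2],
    "tool_c": [-0.9, 0.3, -0.05, -0.4],
}

consensus, disputed = cross_validate(scores_by_tool)
print("consensus:", consensus)
print("needs human review:", disputed)
```

Requiring unanimity is a deliberately conservative design choice; a majority vote or an average of scores would be reasonable alternatives, depending on how costly false positives are in a given context.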
Adaptation
The ways in which violent extremists communicate online will continue to
evolve, shifting from traditional discussion forums and social media
platforms to lesser known spaces on the Internet. For example, in addition to
the use of dedicated extreme right forums and all major social media platforms,
a diversity of more general online forums or forum-like online spaces are also
hosting increasing amounts of extreme right content. These include the popu-
lar social news aggregation, web content rating, and discussion site Reddit and
image-based bulletin board and comment site 4chan (Scrivens & Conway, in
press). Sites such as these, which, contrary to most mainstream social media
platforms, do not have clear-cut anti-hate speech policies (see Gaudette, Davies,
& Scrivens, 2018), may provide unique insight into the expressions and
manifestations of virtual hate, especially on a large scale using the sentiment
analysis tools outlined in this chapter. In addition, a new generation of right-wing
extremists is moving to more overtly hateful, yet to some extent more hidden,
platforms, including the likes of 8chan, Voat, Gab, and Discord (see Davey &
Ebner, 2017). Similarly, between 2013 and 2016, IS’s media production out-
lets developed content that was largely distributed via major (and some minor)
social media and other online platforms. These included prominent IS presences
not only on Facebook, Twitter, and YouTube, but also on Ask.fm, JustPaste.it,
and the Internet Archive (Scrivens & Conway, in press). Having said that, as the
ways in which extremists communicate online will undoubtedly evolve, so too
must the ways in which researchers detect extremist content online. Sentiment
analysis software, in combination with other means of analyzing data, should
be applied to these increasingly popular platforms.
CONCLUSION
Since the advent of the Internet, violent extremists and those who subscribe
to radical views from across the globe have exploited online resources to build
transnational “virtual communities.” The Internet is a fundamental medium that
facilitates these radical communities, not only in “traditional” hate sites, web
forums, and commonly used social media sites, but in lesser known, oftentimes
more hidden spaces online as well (Scrivens & Conway, in press). Researchers and
practitioners have attempted to identify and monitor extremist content online but
increasingly have been overwhelmed by the sheer volume of data in these growing
spaces. Simply put, the manual analysis of online content has become increas-
ingly less feasible.
As a result, researchers and practitioners have sought to develop different
methods of extracting data, especially through the use of web-crawlers, as well
as different methods of managing this large-scale phenomenon in order to sift
through and detect extremist content. A relatively novel machine learning tool,
sentiment analysis, has sparked the interest of some researchers in the field of
terrorism and extremism studies who face new challenges in detecting the spread
of extremist content online. Though this area of research is in its infancy, senti-
ment analysis is showing signs of success and may represent the future of how
researchers and practitioners study extremism online – particularly on a large
scale. This, however, will require that social scientists continue to collaborate
with computer scientists, combining sentiment analysis software with
other classification tools and research methods, validating sentiment analysis
programs, and adapting sentiment analysis software to new and evolving radical
online spaces.
NOTES
1. For more information, see https://www.sfu.ca/iccrc.html.
2. An array of machine learning techniques used in the online political violence
research terrain is not discussed in detail in this chapter. For a list of some of these
techniques, see Chen (2012).
3. For more information, see http://www.winwebcrawler.com.
4. For more information, see http://www.cs.cmu.edu/~rcm/websphinx.
5. For more information, see http://sbl.net.
6. For more information, see https://pypi.org/project/beautifulsoup4/.
REFERENCES
Abbasi, A., & Chen, H. (2005). Applying authorship analysis to extremist-group web forum messages.
Intelligent Systems, 20(5), 67–75.
Abbasi, A., Chen, H., & Salem, A. (2008). Sentiment analysis in multiple languages: Feature
selection for opinion classification in web forums. ACM Transactions on Information Systems, 26(3),
1–34.
Allsup, R., Thomas, E., Monk, B., Frank, R., & Bouchard, M. (2015). Networking in child exploi-
tation – Assessing disruption strategies using registrant information. In Proceedings of the
2015 IEEE/ACM international conference on advances in social networks analysis and mining
(ASONAM), Paris, France (pp. 400–407).
Amarasingam, A. (2016). What Aaron told me: An expert on extremism shares his conversations with
the terror suspect. National Post, August 11. Retrieved from https://nationalpost.com/news/
canada/what-aaron-told-me-an-expert-on-extremism-shares-his-conversations-with-the-terror-
suspect. Accessed on January 23, 2019.
Bartlett, J., & Littler, M. (2011). Inside the EDL: Populist politics in a digital age. London: Demos.
Bermingham, A., Conway, M., McInerney, L., O’Hare, N., & Smeaton, A. F. (2009). Combining social
network analysis and sentiment analysis to explore the potential for online radicalisation. In
Proceedings of the 2009 international conference on advances in social network analysis and mining
(ASONAM), Athens, Greece (pp. 231–236).
Blumstein, A., Cohen, J., Roth, J. A., & Visher, C. A. (1986). Criminal careers and ‘career criminals.’
Washington, DC: National Academy Press.
Bouchard, M., Joffres, K., & Frank, R. (2014). Preliminary analytical considerations in designing
a terrorism and extremism online network extractor. In V. Mago & V. Dabbaghian (Eds.),
Computational models of complex systems (pp. 171–184). New York, NY: Springer.
Brynielsson, J., Horndahl, A., Johansson, F., Kaati, L., Martenson, C., & Svenson, P. (2013). Analysis
of weak signals for detecting lone wolf terrorists. Security Informatics, 2(11), 1–15.
Burnap, P., Williams, M. L., Sloan, L., Rana, O., Housley, W., Edwards, A., …, Voss, A. (2014).
Tweeting the terror: Modelling the social media reaction to the Woolwich terrorist attack.
Social Network Analysis and Mining, 4, 1–14.
Chalothorn, T., & Ellman, J. (2012). Using SentiWordNet and sentiment analysis for detecting radical
content on web forums. In Proceedings of the 6th conference on software, knowledge, information
management and application (SKIMA), Chengdu, China (pp. 9–11).
Chen, H. (2008). Sentiment and affect analysis of dark web forums: Measuring radicalization on the
Internet. In Proceedings of the 2008 IEEE international conference on intelligence and security
informatics (ISI), Taipei, Taiwan (pp. 104–109).
Chen, H. (2012). Dark web: Exploring and data mining the dark side of the web. New York, NY: Springer.
Chen, M., Mao, S., Zhang, Y., & Leung, V. C. (2014). Big data: Related technologies, challenges and
future prospects. New York, NY: Springer.
Cohen, K., Johansson, F., Kaati, L., & Mork, J. (2014). Detecting linguistic markers for radical vio-
lence in social media. Terrorism and Political Violence, 26(1), 246–256.
Davey, J., & Ebner, J. (2017). The fringe insurgency: Connectivity, convergence and mainstreaming of the
extreme right. London: Institute for Strategic Dialogue.
Davies, G., Bouchard, M., Wu, E., Joffres, K., & Frank, R. (2015). Terrorist and extremist organiza-
tions’ use of the Internet for recruitment. In M. Bouchard (Ed.), Social networks, terrorism and
counter-terrorism: Radical and connected (pp. 105–127). New York, NY: Routledge.
Feldman, R. (2013). Techniques and applications for sentiment analysis. Communications of the ACM,
56(4), 82–89.
Figea, L., Kaati, L., & Scrivens, R. (2016). Measuring online affects in a white supremacy forum. In
Proceedings of the 2016 IEEE international conference on intelligence and security informatics
(ISI), Tucson, Arizona, USA (pp. 85–90).
Frank, R., Bouchard, M., Davies, G., & Mei, J. (2015). Spreading the message digitally: A look into
extremist content on the Internet. In R. G. Smith, R. C.-C. Cheung, & L. Y.-C. Lau (Eds.),
Cybercrime risks and responses: Eastern and western perspectives (pp. 130–145). London:
Palgrave Macmillan.
Frank, R., Macdonald, M., & Monk, B. (2016). Location, location, location: Mapping potential
Canadian targets in online hacker discussion forums. In Proceedings of the 2016 European intel-
ligence and security informatics conference (EISIC), Uppsala, Sweden (pp. 16–23).
Frank, R., Westlake, B. G., & Bouchard, M. (2010). The structure and content of online child exploita-
tion networks. In Proceedings of the 10th ACM SIGKDD workshop on intelligence and security
informatics (ISI-KDD), Washington, DC, USA, Article 3.
Fu, T., Abbasi, A., & Chen, H. (2010). A focused crawler for dark web forums. Journal of the American
Society for Information Science and Technology, 61(6), 1213–1231.
Gaudette, T., Davies, G., & Scrivens, R. (2018). Upvoting extremism, part I: An assessment of extreme
right discourse on Reddit. VOX-Pol Network of Excellence Blog. Retrieved from https://www.
voxpol.eu/upvoting-extremism-part-i-an-assessment-of-extreme-right-discourse-on-reddit.
Accessed on January 23, 2019.
Ghiassi, M., Skinner, J., & Zimbra, D. (2013). Twitter brand sentiment analysis: A hybrid system using
n-gram analysis and dynamic artificial neural network. Expert Systems with Applications,
40(16), 6266–6282.
Gonçalves, P., Benevenuto, M., Araújo, F., & Cha, M. (2013). Comparing and combining sentiment
analysis methods. In Proceedings of the 1st ACM conference on online social networks, Boston,
MA, USA (pp. 27–38).
Hankes, K. (2015). Dylann Roof may have been a regular commenter at neo-Nazi website The
Daily Stormer. Southern Poverty Law Center. Retrieved from http://www.splcenter.org/
blog/2015/06/22/dylann-roof-may-have-been-a-regular-commenter-at-neo-nazi-website-the-
daily-stormer. Accessed on January 23, 2019.
Hannah-Moffat, K. (2018). Algorithmic risk governance: Big data analytics, race and information
activism in criminal justice debates. Theoretical Criminology.
Hung, B. W. K., Jayasumana, A. P., & Bandara, V. W. (2016). Detecting radicalization trajectories using
graph pattern matching algorithms. In Proceedings of the 2016 IEEE international conference on
intelligence and security informatics (ISI), Tucson, Arizona, USA (pp. 313–315).
Internet Live Stats. (2019). Total number of websites. Retrieved from http://www.internetlivestats.com/
total-number-of-websites. Accessed on January 23, 2019.
Internet World Stats. (2019). Internet growth statistics. Retrieved from http://www.internetworldstats.
com/emarketing.htm. Accessed on January 23, 2019.
Joffres, K., Bouchard, M., Frank, R., & Westlake, B. G. (2011). Strategies to disrupt online child por-
nography networks. In Proceedings of the European intelligence and security informatics confer-
ence (EISIC), Athens, Greece (pp. 163–170).
Johansson, J., Kaati, L., & Sahlgren, M. (2016). Detecting linguistic markers of violent extremism
in online environments. In M. Khader, L. S. Neo, G. Ong, E. T. Mingyi, & J. Chin (Eds.),
Combating violent extremism and radicalization in the digital era (pp. 374–390). Hershey, PA:
Information Science Reference.
Klausen, J., Marks, C. E., & Zaman, T. (2018). Finding extremists in online social networks. Operations
Research, 66(4), 957–976.
Kostakos, P., Nykänen, M., Martinviita, M., Pandya, A., & Oussalah, M. (2018). Meta-terrorism:
Identifying linguistic patterns in public discourse after an attack. In Proceedings of the 2018 IEEE/
ACM international conference on advances in social networks analysis and mining (ASONAM),
Barcelona, Spain (pp. 1079–1083).
Levey, P., Bouchard, M., Hashimi, S., Monk, B., & Frank, R. (2016). The emergence of violent narra-
tives in the life-course trajectories of online forum participants. Canadian Network for Research
on Terrorism, Security and Society Report, Waterloo, ON, Canada.
Lewis, M. (1993). The lexical approach: The state of ELT and the way forward. Hove: Language
Teaching Publications.
Liu, B. (2012). Sentiment analysis and opinion mining. San Rafael, CA: Morgan and Claypool.
Macdonald, M., & Frank, R. (2016). The network structure of malware development, deployment and
distribution. Global Crime, 18(1), 49–69.
Macdonald, M., & Frank, R. (2017). Shuffle up and deal: Use of a capture–recapture method to
estimate the size of stolen data markets. American Behavioral Scientist, 61(11), 1313–1340.
Macdonald, M., Frank, R., Mei, J., & Monk, B. (2015). Identifying digital threats in a hacker web
forum. In Proceedings of the 2015 international symposium on foundations of open source intel-
ligence and security informatics (FOSINT), Paris, France (pp. 926–933).
Macnair, L., & Frank, R. (2018a). The mediums and the messages: Exploring the language of Islamic
State media through sentiment analysis. Critical Studies on Terrorism, 11(3), 438–457.
Macnair, L., & Frank, R. (2018b). Changes and stabilities in the language of Islamic State magazines:
A sentiment analysis. Dynamics of Asymmetric Conflict, 11(2), 109–120.
Mei, J., & Frank, R. (2015). Sentiment crawling: Extremist content collection through a sentiment
analysis guided web-crawler. In Proceedings of the international symposium on foundations of
open source intelligence and security informatics (FOSINT), Paris, France (pp. 1024–1027).
Mikhaylov, A., & Frank, R. (2016). Cards, money and two hacking forums: An analysis of online
money laundering schemes. In Proceedings of the 2016 European intelligence and security infor-
matics conference (EISIC), Uppsala, Sweden (pp. 80–83).
Mikhaylov, A., & Frank, R. (2018). Illicit payments for illicit goods: Noncontact drug distribution on
Russian online drug marketplaces. Global Crime, 19(2), 146–170.
Mirani, T. B., & Sasi, S. (2016). Sentiment analysis of ISIS related tweets using absolute location. In
Proceedings of the 2016 international conference on computational science and computational
intelligence (CSCI), Las Vegas, NV, USA (pp. 1140–1145).
Monk, B., Allsup, R., & Frank, R. (2015). LECENing places to hide: Geo-mapping child exploitation
material. In Proceedings of the 2015 IEEE intelligence and security informatics (ISI), Baltimore,
MD, USA (pp. 73–78).
Moghadam, A. (2008). The Salafi-jihad as a religious ideology. CTC Sentinel, 1(3), 14–16.
Ogneva, M. (2010). How companies can use sentiment analysis to improve their business. Retrieved
from http://mashable.com/2010/04/19/sentiment-analysis. Accessed on January 23, 2019.
Park, A. J., Beck, B., Fletche, D., Lam, P., & Tsang, H. H. (2016). Temporal analysis of radical dark
web forum users. In Proceedings of the 2016 IEEE/ACM international conference on advances
in social networks analysis and mining (ASONAM), San Francisco, CA, USA (pp. 880–883).
Perry, B., & Scrivens, R. (2016). Uneasy alliances: A look at the right-wing extremist movement in
Canada. Studies in Conflict and Terrorism, 39(9), 819–841.
Sageman, M. (2014). The stagnation in terrorism research. Terrorism and Political Violence, 26(4),
565–580.
Scrivens, R., & Conway, M. (in press). The roles of ‘old’ and ‘new’ media tools and technologies in
the facilitation of violent extremism and terrorism. In R. Leukfeldt & T. J. Holt (Eds.),
Cybercrime: The Human Factor. New York, NY: Routledge.
Scrivens, R., Davies, G., & Frank, R. (2017). Searching for signs of extremism on the web: An
introduction to sentiment-based identification of radical authors. Behavioral Sciences of Terrorism and
Political Aggression, 10(1), 39–59.
Scrivens, R., Davies, G., & Frank, R. (2018). Measuring the evolution of radical right-wing posting
behaviors online. Deviant Behavior.
Scrivens, R., & Frank, R. (2016). Sentiment-based classification of radical text on the web. In
Proceedings of the 2016 European intelligence and security informatics conference (EISIC),
Uppsala, Sweden (pp. 104–107).
Southern Poverty Law Center. (2014). White homicide worldwide. Retrieved from https://www.spl-
center.org/20140331/white-homicide-worldwide. Accessed on January 23, 2019.
Thelwall, M., & Buckley, K. (2013). Topic-based sentiment analysis for the social web: The role of
mood and issue-related words. Journal of the American Society for Information Science and
Technology, 64(8), 1608–1617.
Vergani, M., & Bliuc, A. M. (2015). The evolution of the ISIS’ language: A quantitative analysis of the
language of the first year of Dabiq magazine. Security, Terrorism, and Society, 2, 7–20.
Wang, Y-C., Kraut, R., & Levine, J. M. (2012). To stay or leave? The relationship of emotional and
informational support to commitment in online health support groups. In Proceedings of the
ACM 2012 conference on computer supported cooperative work, Seattle, WA, USA (pp. 833–842).
Wei, Y., Singh, L., & Martin, S. (2016). Identification of extremism on Twitter. In Proceedings of the
2016 IEEE/ACM international conference on advances in social networks analysis and mining
(ASONAM), San Francisco, CA, USA (pp. 1251–1255).
Westlake, B. G., & Bouchard, M. (2015). Criminal careers in cyberspace: Examining website failure
within child exploitation networks. Justice Quarterly, 33(7), 1154–1181.
Westlake, B. G., Bouchard, M., & Frank, R. (2011). Finding the key players in online child exploitation
networks. Policy and Internet, 3(2), 1–25.
Williams, M. L., & Burnap, P. (2015). Cyberhate on social media in the aftermath of Woolwich: A
case study in computational criminology and big data. British Journal of Criminology, 56(2),
211–238.
Wong, M., Frank, R., & Allsup, R. (2015). The supremacy of online white supremacists – An analysis
of online discussions of white supremacists. Information and Communications Technology Law,
24(1), 41–73.
Yang, M., Kiang, M., Ku, Y., Chiu, C., & Li, Y. (2011). Social media analytics for radical opinion
mining in hate group web forums. Journal of Homeland Security and Emergency Management,
8(1), 1547–7355.
Zhang, Y., Zeng, S., Huang, C.-N., Fan, L., Yu, X., Dang, Y., …, Chen, H. (2010). Developing a dark
web collection and infrastructure for computational and social sciences. In Proceedings of the
2010 IEEE international conference on intelligence and security informatics (ISI), Atlanta, GA,
USA (pp. 59–64).
Zhou, Y., Qin, J., Lai, G., Reid, E., & Chen, H. (2005). Building knowledge management system for
researching terrorist groups on the web. In Proceedings of the 11th Americas conference on infor-
mation systems (AMCIS), Omaha, NE, USA (pp. 2524–2536).
Zulkarnine, A., Frank, R., Monk, B., Mitchell, J., & Davies, G. (2016). Surfacing collaborated networks
in dark web to find illicit and criminal content. In Proceedings of the 2016 IEEE international
conference on intelligence and security informatics (ISI), Tucson, AZ, USA (pp. 109–114).