Conference Paper

Sentiment-Based Identification of Radical Authors (SIRA)

Authors:
Ryan Scrivens, Garth Davies, Richard Frank, and Joseph Mei
International CyberCrime Research Centre
Simon Fraser University, School of Criminology
Burnaby, Canada
{rscriven, garthd, rfrank, jmmei}@sfu
Abstract - As violent extremists continue to surface in online
discussion forums, counter-extremism agencies search for new
and innovative ways of uncovering their digital indicators.
Using a sample of approximately 1 million posts and 26,000
unique users across four Islamic-based discussion forums, this
study proposed a method of identifying the most radical users
on the Dark Web. Several characteristics of each user’s
postings were analyzed using Parts of Speech (POS) tagging, a
custom OpenNLP-based tagger, sentiment analysis, and a
novel algorithm called “Sentiment-based Identification of
Radical Authors” (SIRA). POS tagging was used to develop a
list of the 400 most frequently cited nouns across the
discussion forums. With this list, sentiment analysis provided
the context surrounding users’ posts, and each post was
assigned a polarity value. Radical scores were calculated using
SIRA, which is an algorithm that accounts for a user’s
percentile score for average sentiment score, volume of
negative posts, severity of negative posts, and duration of
negative posts. Results did not suggest that a simple typology
best described the most radical users on the Dark Web;
however, the findings indicated that SIRA was flexible
enough to evaluate several combinations of online activity that
could identify the most radical users in the discussion forums.
In addition, SIRA identified the same user across two separate
discussion forums as the most radical, thus providing
validation for the algorithm. This particular user was linked to
an extremist website that supported terrorists. Lastly, the
results revealed that the Gawaher and Islamic Awakening web
forums hosted the highest volume of most radical users in the
sample.
Keywords - Sentiment Analysis; Extremism; Discussion Forums
I. INTRODUCTION
It is widely acknowledged that people around the world are
increasingly using computer technologies and computer-
mediated communications to connect with each other. The
Internet’s seamless accessibility and user-friendly platform
have revolutionized the sharing of information and
communications, facilitating an international web of virtual
communities. Violent extremists and those who adhere to
radical beliefs have embraced this changing digital landscape,
and we have seen a rapid growth in their online presence [17].
A growing body of evidence suggests that violent
extremists are a tech-savvy group who continue to invest in,
and exploit the power of, this ever-growing medium, one that
spans international boundaries and has low barriers to
entry [4]. Online discussion forums, also understood as “virtual
communities,” are ideal venues in which supporters,
sympathizers, or full-fledged fanatics can interact with one
another, free from the geographic barriers that once made it
difficult to communicate [1]. Furthermore, the rise of the
Internet has facilitated the violent “leaderless resistance”
movement, a decentralized and diffuse tactic that has become
increasingly difficult for law enforcement officials to detect
[17]. Identifying these online cues prior to catastrophic events
could produce a much different outcome.
Uncovering the indicators of violent extremism, or the
“most extreme individuals,” has been one of the most
significant policy issues faced by law enforcement agencies
and government officials worldwide [18], and the current focus
of government-funded research has been on the development
of advanced information technologies to identify and counter
the threat of violent extremism on the Internet. Here, scholars
have argued that successfully identifying extremists, especially
on a large scale, is the first step in reacting to them [2, 15].
Criminologists have begun to explore this critical point of
departure via customized web-crawlers, extracting large bodies
of text from websites featuring extremist material and then
using text-based analysis techniques to assess the content [2,
15]. Overlooked in this wave of research, however, has been an
assessment of extremist content on the alternative side of the
web, better known as the “Dark Web.”
Researchers have emphasized the importance of collecting
and analyzing extremist content on Dark Web discussion
forums. The “Dark Web Project” specifically has generated a
number of studies on the content and structure of extremist-
based websites [1, 3, 9, 24, 25]. While these studies have
showcased a variety of textual analysis techniques to detect the
indicators of extremism on the Dark Web, more systematic
research is needed to detect sentiment that may identify
violent Islamists online. Much is still unknown about this
hidden population [2, 5, 18].
II. METHOD
The purpose of this research project was to add to the
extant literature by using a sentiment analysis technique and an
algorithm to identify the most radical members within four
Dark Web discussion forums (Chapter II.A): Gawaher, Islamic
Awakening, Islamic Network, and Turn to Islam. To do so, we
used Parts of Speech (POS) analysis to develop a list of
keywords (Chapter II.B). Sentiment of the posts was then
calculated based on the POS keywords (Chapter II.C) detailing
the relative sentiment scores for each post for each of the 400
keywords. Finally, by applying a novel algorithm (Chapter
II.D) to calculate a radical score based on the users’ percentile
score, volume of negative posts, severity of negative posts, and
duration of negative posts, we were able to pick out the most
radical authors (Chapter III).
A. Forum Data
Members of the Intelligence and Security Informatics (ISI)
research team provided data on the four Dark Web Islamic
discussion forums, and all of the forums were in English.
Gawaher1 was created in September of 2000. It defines itself as
a “friendly Islamic international community,” and is designed
to facilitate various discussions pertaining to the Islamic world
and Islam. According to site administrators, the purpose of the
site is to: (1) promote a stronger understanding of Islam as a
moderate religion and way of life; and, (2) connect Muslims
with other Muslims worldwide [23]. The current study
analyzed 372,499 of its posts, 53,235 threads, and
approximately 9,260 members, dating from October 24, 2004
to June 7, 2012.
Islamic Awakening2 has members based in the UK and
surrounding countries, and identifies itself as a site “dedicated
to the blessed global Islamic awakening.” The current study
analyzed 201,287 of its posts, 32,879 threads, and 3,964
members, dating from April 28, 2004 to May 22, 2012.
Islamic Network3 has the overall goal of bringing dedicated
individuals together to practice their Muslim faith [14]. The
discussion forum includes a number of topics of interest to
Muslims, ranging from theology to current world events. The
current study analyzed 91,874 of its posts, 13,995 threads, and
2,082 members, dating from June 9, 2004 to November 10,
2010.
Lastly, Turn to Islam4 is a social networking website for
Muslims that is committed to disseminating “the true
understanding of Islam based upon the Noble Qur’an and the
authentic Sunnah” [19] and “correcting the common
misconceptions about Islam.” It defines itself as “the best
place to learn about Islam” [13], with the aim of strengthening
and uniting the Ummah, promoting Islamic values, and
providing support for Reverts and non-Muslims [19]. The
current study analyzed 335,338 of its posts, 41,654 threads, and
10,858 members, dating from June 2, 2006 to May 20, 2013.
B. Parts of Speech (POS) Tagging
The first step in analyzing the forum data was to determine
users’ topic of discussion. To do this we began by isolating the
particular nouns that had the highest rate of occurrence within
the data, under the assumption that the most frequently
discussed topics would most likely be the ones in which
extremist content was likely to be detected. This was done with
Parts of Speech (POS) analysis, a data analysis method that
collects and divides supplied text into word groupings.
Traditionally, parts of speech analysis has been used for the
analysis and organization of different languages [6],
identifying specific words based on their word grouping.
Additionally, POS tagging has been used to arrange and
classify large databases, such as patient data from medical
institutions [8]. This method of data reorganization applied
specific tags and made these databases easier to query.
1 gawaher.com
2 forums.islamicawakening.com
3 Forum could not be found.
4 turntoislam.com
Using OpenNLP as a base, a POS tagger was created to
analyze the texts and divide them into meaningful groupings
[20]. The tagger first removed the HTML content from the
post, then scanned through the posts of all four web forums and
produced frequency distributions for each word. Each word
was then added into its appropriate grouping (i.e., noun, verb,
etc.) and split into sub-groupings (i.e., plural, singular, etc.).
Once this process was completed, all of the noun groups (i.e.,
plural, proper, and standard) were separated from the rest
of the results. Nouns were chosen because they are the words
most likely to be surrounded by relevant sentiment terms [12];
names and places are often described or denoted by the
adjectives linked to them. Furthermore, adjectives are often the
words that have sentiment values attached to them [22]. By
specifying adjectives as keywords, the sentiment of the word
itself would be lost.
From each of the four forums, we selected the top 100
nouns based on frequency5. Domain experts may replace or
extend the list of keywords with relevant domain-specific
words; however, our lack of domain knowledge required that
we rely on the data to identify the relevant keywords. After
removing terms that were not in fact words, including symbols,
incorrectly translated characters as well as redundant words,
the remaining terms formed the keyword list for the sentiment
analysis.
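The keyword-selection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' OpenNLP-based tagger: it assumes posts have already been tagged with Penn Treebank-style tags, and the function names (`strip_html`, `top_nouns`, `build_keyword_list`), the lowercasing, and the alphabetic filter are all assumptions made for the example.

```python
import re
from collections import Counter
from html import unescape

# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
# (Assumed tag set; the paper only says nouns were grouped by sub-type.)
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def strip_html(post: str) -> str:
    """Crude HTML cleanup: drop tags with a regex, then decode entities."""
    return unescape(re.sub(r"<[^>]+>", " ", post))


def top_nouns(tagged_posts, n=100):
    """Return the n most frequent nouns in one forum.

    tagged_posts is an iterable of posts, each a list of (token, tag) pairs
    as a POS tagger would emit them.
    """
    counts = Counter(
        token.lower()  # lowercasing is an assumption of this sketch
        for post in tagged_posts
        for token, tag in post
        if tag in NOUN_TAGS and token.isalpha()  # skip symbols and debris
    )
    return [word for word, _ in counts.most_common(n)]


def build_keyword_list(forums, n=100):
    """Merge each forum's top-n nouns, dropping redundant words, keeping order."""
    merged = []
    for tagged_posts in forums.values():
        for word in top_nouns(tagged_posts, n):
            if word not in merged:
                merged.append(word)
    return merged
```

Because redundant words are dropped when the per-forum lists are merged, the combined list can be shorter than the number of forums times `n`.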
C. Sentiment Analysis
After the keyword list was developed, it was necessary to
identify and evaluate the context surrounding the keywords. To
allow a proper analysis of the user’s discussion and to aid in
the automatic discovery of which users could be considered
“the most radical,” sentiment analysis was used to highlight
relevant text. Sentiment analysis is a data collection and
analysis method that allows for the application of subjective
labels and classifications [7]. It can evaluate the opinions of
individuals by organizing data into distinct classes and
sections, and assigning an individual’s sentiment with a
negative or positive polarity value [1]. It also allows for a more
targeted view of a data set by allowing for the demarcation
between cases that are sought after and those without any
notable relevance.
Sentiment analysis has been used in a wide variety of
contexts, including customer review analysis for products [7],
and an assessment of attitudes towards events or products on
social media platforms [11]. Such a technique, however, has
not yet been used to assess which online users exhibit the most
radical sentiment values, taking into account the number,
severity, and duration of negative messages that each user
posted on Dark Web discussion forums. By analyzing the
sentiment surrounding the most frequent nouns that occur
within the data, it was possible to obtain an understanding of
the discussions and attitudes that were present on these forums.
Since the purpose of the current study was not to push the
boundaries of sentiment analysis algorithms, an off-the-shelf
algorithm, SentiStrength [21], was used.
5 Sentiment scores were assigned to each post, based on the keyword list of
400 nouns. However, two nouns and their variations were removed from the
list because their values were extreme outliers (i.e., admin, admins, and
passport, passports), and scores for the remaining 398 keywords were
averaged across each post. Each post was assigned a sentiment score that was
the average across all of the keywords found in that particular post.
SentiStrength is Java-based software that uses a specific
algorithm to run through large volumes of text and create
sentiment scores for the supplied documents. While there are
several configurations of the software, we utilized a keyword-
focused method, as a central feature of SentiStrength is its
ability to evaluate sentiment around any given keyword [21].
This process involves the utilization of a dictionary of
catalogued terms and Harvard’s General Inquirer database to
determine sentiment values [21]. It locates words that
correspond with its dictionary and database, and then it uses a
stemming method to evaluate a text by assigning polarity
values of either positive or negative to words within the text.
Values are augmented by characters that can influence the
values assigned to the text, such as booster words, negative
words, repeated letters, repeated negative terms, antagonistic
words, punctuation, and other distinctive characters suited for
studying an online context [21].
Specific words within SentiStrength’s dictionaries are
labelled with appropriate sentiment values. For example, the
phrase “I hate Americans and love Syrians” can be analyzed
for the sentiment around “Americans” (yielding a negative
value) as well as “Syrians” (yielding a positive value). After a
specific range of words has been set to be analyzed (e.g., four
words before and after “Syrians”), the software references its
dictionary. Values are applied to the specified range of words
before/after the keyword, and a sentiment score is assigned to
the given text. This method returns overall positive and
negative scores for the sentence.
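The idea of scoring only the words near a keyword can be illustrated with a toy example. The sketch below is not SentiStrength and does not reproduce its dictionaries, stemming, or booster rules. The two-word lexicon, its strengths, the function name `keyword_sentiment`, and the neutral baselines of +1 and -1 are assumptions chosen to mirror a dual positive/negative scale.

```python
# Toy lexicon with hypothetical strengths; NOT SentiStrength's dictionary.
LEXICON = {"hate": -4, "love": 3}


def keyword_sentiment(text, keyword, window=1, lexicon=LEXICON):
    """Return (positive, negative) strengths for the words around a keyword.

    Only the `window` tokens before and after each occurrence of the keyword
    are scored. With no sentiment words nearby the result is (1, -1).
    """
    tokens = [t.strip(".,!?;:\"'").lower() for t in text.split()]
    positive, negative = 1, -1
    for i, token in enumerate(tokens):
        if token != keyword.lower():
            continue
        lo, hi = max(0, i - window), i + window + 1
        # Neighbours on both sides of the keyword, excluding the keyword itself.
        for neighbour in tokens[lo:i] + tokens[i + 1:hi]:
            score = lexicon.get(neighbour, 0)
            positive = max(positive, score)
            negative = min(negative, score)
    return positive, negative


phrase = "I hate Americans and love Syrians"
print(keyword_sentiment(phrase, "Americans"))  # (1, -4)
print(keyword_sentiment(phrase, "Syrians"))    # (3, -1)
```

The window size matters: with a window of four words, both "hate" and "love" would fall within range of either keyword in this short phrase, so the two keywords would no longer be separated.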
D. Sentiment-based Identification of Radical Authors (SIRA)
There are various definitions that could be used to define
someone as extremist. For example, an author could be a long-
time member of a discussion forum and post somewhat
extremist material over that time. However, another author
could be on a discussion forum only for a short amount of time,
but during that time espouse very radical views in their
postings. Are long-term authors with somewhat extreme posts
considered more extreme than new authors with very extreme
posts?
Determining which aspect of an author’s online activity
should be considered when classifying them as more extreme
depends on the definition used, or the goals of the
identification exercise. Rather than focus on answering this
question, it was decided that a measure should be developed
that is capable of being adjusted to measure either type of
author6. Thus, an overall “radical score” was created based on
the following components of an author’s online activity:
1) Relative Average Sentiment Score (AS)
An author could be defined as extreme if they posted very
negative comments to the forum. To measure how extreme an
author was, with respect to the forum average, the percentile
score for average sentiment score was created. It was
calculated by accounting for the average sentiment score for all
posts in a given forum. The scores for each individual were
converted into percentile scores, and percentile scores were
divided by 10 to obtain a score out of 10 points.
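A minimal sketch of this component, assuming the percentile score is the percentage of forum members whose value is equal to or less than the user's (the reading used in the worked example in Section II.E). The function names are illustrative, and the scores are assumed to have been sign-flipped so that larger values mean more negative sentiment.

```python
def percentile_rank(values, x):
    """Percentage of values that are less than or equal to x."""
    return 100.0 * sum(v <= x for v in values) / len(values)


def average_sentiment_component(user_avg, forum_avgs):
    """AS component (0 to 10): the user's percentile rank divided by 10.

    forum_avgs holds the average sentiment score of every member of the
    forum, including the user; larger values are assumed to be more negative.
    """
    return percentile_rank(forum_avgs, user_avg) / 10.0
```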
2) Volume of Negative Posts (VN)
An author could also be considered very extreme if they
were a prolific poster and many of their posts were considered
somewhat extreme. Such authors were classified as extreme not
because of very extreme post content, but because of their
long-term dedication to posting extremist material. To measure
this, the volume of negative posts was developed in two parts:
(1) the number of negative posts for a given member, and (2) the
proportion of posts for a given member that were negative. For
the first part, we counted the number of negative posts for a
given member and converted these counts into percentile scores.
Percentile scores were then divided by 20 to obtain a score out
of 5 points. For the second part, the counts from the first part
were divided by the member’s total number of posts to get the
proportion. These proportions were converted into percentile
scores, and the percentile scores were divided by 20 to obtain a
score out of 5 points. Finally, the two scores were summed,
creating a composite measure (out of 10) of negative volume.
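The two-part volume measure can be sketched as below. The data layout (one `(negative_posts, total_posts)` pair per member) and the function names are assumptions for illustration, and every member is assumed to have at least one post.

```python
def percentile_rank(values, x):
    """Percentage of values that are less than or equal to x."""
    return 100.0 * sum(v <= x for v in values) / len(values)


def volume_component(user, forum):
    """VN component (0 to 10) from (negative_posts, total_posts) pairs.

    `user` is one pair; `forum` is the list of pairs for every member,
    including the user.
    """
    neg, total = user
    counts = [n for n, _ in forum]
    shares = [n / t for n, t in forum]
    count_points = percentile_rank(counts, neg) / 20.0          # out of 5
    share_points = percentile_rank(shares, neg / total) / 20.0  # out of 5
    return count_points + share_points
```

Splitting the 10 points evenly between the raw count and the proportion means a member with a handful of posts, all negative, and a prolific member with a modest negative share can both score well on this component.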
3) Severity of Negative Posts (SN)
There is a clear distinction between an author who posts
moderately extreme messages and another who posts very
extreme messages. One author may discuss a civil war in the
Middle East while the other celebrates the beheading of
Western militants. As such, a measure was needed to discern
authors’ level of extremism. To do this, a measure for the
severity of negative posts was developed consisting of two
parts: (1) the number of very negative posts for a given
member and (2) the proportion of posts for a given member
that were very negative. “Very negative” was calculated by
standardizing the count variable; all posts with a standardized
value greater than 3 were considered to be “very” negative.
After the standardization process, the severity calculation was
developed in the same manner as the volume calculation
above. To calculate the number of very negative posts for a
given member, we counted the number of very negative posts
for a given member and converted these counts into percentile
scores. Percentile scores were then divided by 20 to obtain a
score out of 5 points. To calculate the proportion of posts for a
given member that were very negative, the counts from the
previous step were divided by the total number of posts to
obtain the proportion. These proportions were converted into
percentile scores, and the percentile scores were divided by 20
to obtain a score out of 5 points. Finally, the score of the
number of very negative posts for a given member and the
score of the proportion of posts for a given member that were
very negative were tallied, thus creating a composite measure
(out of 10) of severity.
6 We did not attempt to match the identification number of each member
across each dataset.
Fig. 1. Process of Text Analysis and the Creation of the Radical Score
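One possible reading of the “very negative” rule in the severity measure is sketched below. The text does not specify exactly which variable was standardized or which standard deviation was used, so this example makes two labeled assumptions: it standardizes per-post negativity (sign-flipped sentiment, larger meaning more negative) across the forum, and it uses the population standard deviation. The function name is illustrative.

```python
from statistics import mean, pstdev


def very_negative_flags(negativity, threshold=3.0):
    """Flag posts whose standardized negativity exceeds the threshold.

    negativity: one value per post, where larger means more negative.
    Returns a list of booleans in the same order as the input.
    """
    mu, sigma = mean(negativity), pstdev(negativity)
    if sigma == 0:
        # All posts are identical, so none can be an outlier.
        return [False] * len(negativity)
    return [(x - mu) / sigma > threshold for x in negativity]
```

The per-member counts of flagged posts would then pass through the same two-part percentile scheme used for volume.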
4) Duration of Negative Posts (DN)
An author who posted extreme messages over an extensive
period of time should be classified as more extreme than an
author who posted equally extreme messages over a shorter
period of time. SIRA accounted for this important component,
and the duration of negative posting was developed by
calculating the first and last dates on which individual
members made negative posts. We calculated the difference
between these dates, which is the duration of negative posting,
and converted these durations into percentile scores. Percentile
scores were divided by 10 to obtain a score out of 10 points.
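The duration component can be sketched in the same style. Measuring the duration in whole days and the function names used here are assumptions of this example; the percentile step is unaffected by the choice of unit.

```python
from datetime import date


def percentile_rank(values, x):
    """Percentage of values that are less than or equal to x."""
    return 100.0 * sum(v <= x for v in values) / len(values)


def negative_duration_days(negative_post_dates):
    """Days between a member's first and last negative post."""
    return (max(negative_post_dates) - min(negative_post_dates)).days


def duration_component(user_dates, forum_dates):
    """DN component (0 to 10): the user's percentile rank divided by 10.

    forum_dates is a list with one list of negative-post dates per member,
    including the user.
    """
    durations = [negative_duration_days(d) for d in forum_dates]
    return percentile_rank(durations, negative_duration_days(user_dates)) / 10.0
```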
E. Radical Score
Together, the percentile score for average sentiment score,
volume of negative posts, severity of negative posts, and
duration of negative posts were tallied to produce an overall
score out of 40 points. This “radical score” quantified four
unique dimensions of “seriousness” to identify the most radical
individuals within the Dark Web discussion forums. Consider
the following example of how the radical score of “TX7,” a
hypothetical user, would be calculated, accounting for all of his
online activity within a discussion forum.
To attain TX7’s percentile score for average sentiment
score (AS), his average sentiment score was first calculated
(e.g., TX7’s sentiment score for the word gun was 0.4, while
the score for knife was 0.16, giving an average sentiment score
of 0.28). His average sentiment score was converted into a
percentile score (e.g., percentile score = 83.19, meaning that
83.19 percent of all users in the forum maintained an average
score that was equal to or less than TX7’s sentiment value of
0.28)7. The percentile score was divided by a value of 10 to
obtain a score out of 10 points (e.g., 83.19/10 = 8.319 of 10
points).
7 The average sentiment score was originally a negative value (e.g., the 0.28
above was -0.28). We multiplied this number by negative 1 to make the
direction more intuitive, as well as for mathematical convenience.
Volume of negative posts (VN) was calculated in two parts.
First, TX7’s total number of negative posts was counted (e.g.,
61 negative posts) and converted into a percentile score (e.g.,
percentile score = 77.56, meaning that 77.56 percent of all users
in the forum had a number of negative posts that was equal to or
less than 61). The percentile score was
divided by a value of 20 to obtain a score out of 5 points (e.g.,
77.56/20 = 3.878 of 5 points). Second, the original count of
TX7’s negative posts (e.g., 61 negative posts) was divided by
the total number of his posts (e.g., 127 total posts) to obtain a
proportion of negative posts (e.g., 61/127 = 0.48) and
converted into a percentile score (e.g., percentile score = 95.82,
meaning that 95.82 percent of all users in the forum had a
proportion of negative posts that was equal to or less than 0.48
of all of their posts). The percentile score was divided by a
value of 20 to obtain a score out of 5 points (e.g., 95.82/20 =
4.791 of 5 points). Lastly, the sum scores of the number of
negative posts (part 1) and the proportion of posts that were
negative (part 2) were tallied to obtain a composite measure of
negative volume for TX7 (e.g., 3.878 + 4.791 = 8.669 of a
maximum of 10 points).
A measure of the severity of negative posts (SN) was also
calculated in two parts. First, TX7’s total number of very
negative posts was counted (e.g., 1 very negative post) and
converted into a percentile score (e.g., percentile score = 99.14,
meaning that 99.14 percent of all users in the forum
displayed 0 very negative posts). The percentile score was
divided by a value of 20 to obtain a score out of 5 points (e.g.,
99.14/20 = 4.957 of 5 points). Second, the original count of
very negative posts (e.g., 1 very negative post) was divided by the
total number of the user’s posts (e.g., 127 total posts) to obtain
a proportion of very negative posts (e.g., 1/127 = 0.008) and
converted into a percentile score (e.g., percentile score = 98.54,
meaning that 98.54 percent of users in the forum had a
proportion of very negative posts that was equal to or less than
0.008 of all of their posts). The percentile score was divided by
a value of 20 to obtain a score out of 5 points (e.g., 98.54/20 =
4.927 of 5 points). Lastly, the sum scores of the number of very
negative posts (part 1) and the proportion of posts that were
very negative (part 2) were tallied to obtain TX7’s composite
measure of negative severity (e.g., 4.957 + 4.927 = 9.884 of 10
points).
The duration of negative posts was calculated by
identifying the first and last date that TX7 posted negative
messages (e.g., the user posted his first negative message on
July 19, 2006, and posted negative messages until May 11,
2011). To obtain the duration of negative posts, the difference
between the aforementioned dates was calculated (e.g., May
11, 2011 - July 19, 2006 = 4.84 years), and the duration of the
user’s negative posts was converted into a percentile score
(e.g., percentile score = 99.02, meaning that 99.02 percent of
all users in the forum had a duration of negative posts that was
equal to or less than 4.84 years). The percentile score was
divided by a value of 10 to obtain a score out of 10 points (e.g.,
99.02/10 = 9.902 of 10 points).
The final step to obtaining TX7’s radical score was to
combine his percentile score for average sentiment score (AS)
with the volume of negative posts (VN), severity of negative
posts (SN), and duration of negative posts (DN).
Radical score = AS + VN + SN + DN
= 8.319 + 8.669 + 9.884 + 9.902
= 36.774 of 40 points
The four percentile components, in combination, were
strong indicators of the most radical users in an online
discussion forum, whereby the higher a user’s radical score, the
more likely they were to be discussing extremist content in
their posts. Given a different definition of what is extremist
content, various weights could be attached to the different
components of the radical score.
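The final combination, including the optional weighting mentioned above, can be sketched as a weighted sum. The `weights` parameter and its default of equal weights are assumptions for illustration; with the defaults the function reproduces the TX7 example.

```python
def radical_score(as_, vn, sn, dn, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four components, each scored out of 10.

    With the default weights the result is out of 40 points.
    """
    return sum(w * c for w, c in zip(weights, (as_, vn, sn, dn)))


tx7 = radical_score(8.319, 8.669, 9.884, 9.902)
print(round(tx7, 3))  # 36.774
```

An analyst more interested in short bursts of very extreme posting could, for example, raise the weight on severity and lower the weight on duration. Note that unequal weights change the maximum attainable score.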
III. RESULTS
The results were interpreted in two ways. First, the 26,171
unique forum users were categorized by discussion forum.
They were then ranked according to their radical score, and
those with the highest scores were assessed. Second, users
from the four Dark Web discussion forums were analyzed as a
single entity and ranked according to their radical score. Both
the users with the highest scores and the forums with the
highest number of radical users were evaluated.
A. Gawaher
According to the algorithm, the most radical user in the
Gawaher discussion forum was tom. The author posted
negative messages over a consistent period of over 6.5 years.
Of the 2,874 messages posted, the user’s overall average
posting score was -1.02, and 2,074 (72.4%) of the total number
of messages were negative while 16 (0.6%) of the messages
were “very negative.” What separated this user from the other
users who received the highest radical score in the forum was
the author’s total number of radical messages posted, as well as
the amount of time the author was active in the forum.
Specifically, tom was amongst the most prolific posters (i.e.,
2,874 posts) over one of the longest periods of time on the
forum (i.e., approximately 6.5 years), and roughly three-
quarters of the user’s messages were classified as negative. For
example, as tom wrote:
...the us [United States] loves to display its donkey
talent in poking its nosey nose into others' affairs
[...] merely to satisfy the us' [United States] lust for
revenge…[sic] (MessageID: 1,194,113).
As such, tom received a radical score of 38.934 of 40 and was
the most radical user on the Gawaher discussion forum
according to SIRA.
The second most radical member in the discussion forum,
Barayev, posted negative messages over a consistent period of
3.35 years. Of the 524 messages posted, the user’s overall
average posting score was -1.29, and 392 (i.e., 75%) of the
total number of messages were negative while 6 (i.e., 1.1%) of
the messages were “very negative.” An example of the author’s
online sentiment included the following posting:
This is about the invasions and bombs that the US
military machine has unleashed over Muslim lands.
Most people realise that the entire US military
machine has to be destroyed or stopped and this is
what many people pray for (MessageID: 180,208).
While this user was not one of the most prolific users on
the forum (i.e., 524 posts), three-quarters of the user’s
messages were classified as negative (75%) over a fairly
extensive period of time (i.e., 3.35 years). Furthermore, both
the user’s average sentiment score (i.e., -1.29) and the negative
severity of their messages (i.e., 1.1%) were slightly higher than
those of the majority of users who received the highest radical score in
the forum. As such, Barayev received a radical score of 38.823
of 40 and was the second most radical user on the Gawaher
discussion forum.
Another important user was identified as the third most
radical member in the discussion forum. Moishe3rd posted
negative messages over a consistent period of almost 4 years.
Of the 28 messages posted, the user’s overall average posting
score was -2.21, and 24 (85.7%) of the total number of
messages were negative while 1 (3.6%) message was “very
negative.” For example, as the author wrote:
Yes, power does corrupt and the downtrodden, the
oppressed and deprived peoples of the world,
Muslim or otherwise, will fight back until they or
others overthrow such corrupt regimes […] and put
in place more just and equitable governments which
then slowly become corrupt again (MessageID:
323,170).
Notably, Moishe3rd was not amongst the most prolific
users on the discussion forum (i.e., 28 posts) nor did the user
post radical material over the longest periods of time (i.e., 3.94
years), but the percentage of the user’s posts that were
classified as negative was amongst the highest in the forum
(i.e., 85.7%). Unlike other forum users who also received the
highest “radical scores” on the forum, Moishe3rd’s average
sentiment score was highest amongst the most radical users in
the discussion forum (i.e., -2.21), and the author was amongst
the few users who posted a small number of consistently
negative messages over a substantial amount of time (3.94
years). As such, Moishe3rd received a radical score of 38.790
of 40 and was the third most radical user on the Gawaher
discussion forum.
B. Islamic Awakening
The most radical member in the Islamic Awakening
discussion forum was user Cageprisoners.com. This author
posted negative messages over a consistent period of 2.87
years. Of the 139 messages posted, the user’s overall average
posting score was -4.47, and 134 (97.1%) of the total number
of messages were negative while 33 (23.9%) of the messages
were “very negative.” What distinguished Cageprisoners.com
from the other radical users in the forum was the remarkably
high number of posts that were classified as negative (i.e.,
97.1% of all posts), the high percent of posts that were
identified as “very negative” (i.e., 23.9% of all posts were very
negative) and the relatively long duration of time that the user
posted negative messages on the forum (i.e., 2.87 years). For
example, as Cageprisoners.com explains,
Muslims are the current object of our society's fear
and contempt […] The aggressive American war
machine continues to endeavour tirelessly to gag
every free, independent and self respecting media
organization in the world that tries to convey a
message of truth to others without distortion or
perversion… (MessageID: 1,497).
The author was not amongst the most prolific posters in the
forum (i.e., 139 messages), but the average sentiment value of
the author’s content was the most negative in the forum (i.e.,
-4.47). As such, Cageprisoners.com received a noteworthy
radical score of 39.028 of 40 and was the most radical user on
the Islamic Awakening discussion forum.
The second most radical member in the discussion forum
was hussain. The user posted negative messages over 5.72
years, and messages were identified as negative during 5.69 of
those years. Of the 425 messages posted, the user’s overall
average posting score was -1.05, and 294 (70.3%) of the total
number of messages were negative while 5 (1.2%) of the
messages were “very negative.” Unlike the other users who
received the highest radical scores in the discussion forum,
hussain was the second most radical user in Islamic Awakening
for two key reasons. First, although the user was not amongst
the most prolific posters on the forum (i.e., 425 messages),
nearly three-quarters of the messages posted by hussain were
classified as negative (i.e., 70.3%) and were posted over a
substantial period of time (i.e., 5.69 years). Second, hussain’s
proportion of very negative messages was slightly higher
than that of other users who received high radical
scores in the forum (i.e., 1.2% of messages were very
negative). An example of such messages includes:
Amrika [America] […] How can a government that
tortures our brothers and sisters, that imprisons our
brothers and sisters (often without trial), that
invades our lands, that has killed tens of thousands
of our brothers and sisters, that struts around like
the arrogant cowardly bully it is, be our friend…
(MessageID: 116,594).
Overall, the author received a radical score of 38.876 out of 40
and was the second most radical user on the Islamic Awakening
discussion forum.
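The per-user figures reported in this section (average posting score, share of negative and very negative posts, and negative posting period) can all be derived from a list of dated polarity values. The following is a minimal sketch of that bookkeeping, not the authors' implementation. The function name `summarize_user`, the input layout, and both cut-off constants are assumptions made for illustration; in particular, the threshold for a “very negative” post is hypothetical.

```python
from datetime import date

# Hypothetical cut-offs (assumptions): a post is "negative" below 0 and
# "very negative" below -10. The study's actual thresholds may differ.
NEGATIVE_BELOW = 0.0
VERY_NEGATIVE_BELOW = -10.0


def summarize_user(posts):
    """Summarize one user's posts, given (date, polarity score) pairs."""
    scores = [score for _, score in posts]
    negative_dates = [d for d, score in posts if score < NEGATIVE_BELOW]
    very_negative = [s for s in scores if s < VERY_NEGATIVE_BELOW]

    # Negative posting period: time between first and last negative post.
    if negative_dates:
        span_days = (max(negative_dates) - min(negative_dates)).days
        period_years = span_days / 365.25
    else:
        period_years = 0.0

    return {
        "n_posts": len(scores),
        "average_score": sum(scores) / len(scores),
        "pct_negative": 100.0 * len(negative_dates) / len(scores),
        "pct_very_negative": 100.0 * len(very_negative) / len(scores),
        "negative_period_years": period_years,
    }


# Toy data, not taken from the forums.
example_posts = [
    (date(2008, 1, 1), -2.0),
    (date(2009, 1, 1), 1.0),
    (date(2010, 1, 1), -12.0),
    (date(2010, 7, 1), 3.0),
]
summary = summarize_user(example_posts)
```

For the toy user, half of the posts are negative, one in four is very negative, and the negative posting period is about two years.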
C. Islamic Network
The most radical member in the Islamic Network discussion
forum was cageprisoners. The user posted negative messages
over 2.28 years, and messages were identified as negative
during 2.11 of those years. Of the 207 messages posted, the
user’s overall average posting score was -2.36, and 155
(74.9%) of the total number of messages were negative while 9
(4.3%) of the messages were “very negative.” What separated
cageprisoners from the other radical users in Islamic Network
was a combination of the severity of the user’s negative posts
and the user’s duration of negative activity on the discussion
forum. Specifically, the author was not amongst the most
prolific posters on the forum (i.e., 207 posts), but the duration
of the negative posts (i.e., 2.11 years) was longer than the
majority of the users who scored the highest radical scores in
the forum. Moreover, a relatively high proportion of this user’s
messages were very negative (i.e., 4.3%). For example, as
cageprisoners wrote in one discussion thread:
…inshaa'allaah [God-willing] I hope to keep the
brothers and sisters updated about what is
happening to unjustly detained Muslims around the
world and what we can do about it […] The true
believer is the one whom other Muslims are safe
from his tongue and hands! And all success comes
from Allah alone… (MessageID: 23,138).
The percent of very negative messages for this particular
user was higher than the majority of those who received the
highest radical scores in the discussion forum. As such,
cageprisoners received a radical score of 38.601 of 40 and was
the most radical user on the Islamic Network discussion forum.
Interestingly, this user was also identified as the most radical
user on the Islamic Awakening discussion forum.
The second most radical member in the discussion forum,
Abu_Sayyaf, posted negative messages over 3.16 years, and
messages were identified as negative during 2.46 of those
years. Of the 279 messages posted, the user’s overall average
sentiment score was -1.28, and 213 (76.3%) of the total number
of messages were negative while 3 (1.1%) of the messages
were “very negative.” The factors that distinguished this user
from those who received high radical scores in the same forum
were similar to those of cageprisoners. Abu_Sayyaf was not
amongst the most prolific posters on the forum (i.e., 279 posts),
and the proportion of messages classified as very negative was
amongst the lowest in the group of the most radical users in the
forum (i.e., a mere 1.1% of all messages were very negative).
However, slightly over three-quarters of the user’s messages
were classified as negative (i.e., 76.3%), and the duration of
time that the user posted negative messages was considerably
longer than the amount of time that the majority of the most
radical users were active on the forum (i.e., 2.46 years). For
example, in an online discussion about the perpetrator who
filmed himself throwing a bullet-riddled Quran at a mosque in
the U.S., Abu_Sayyaf noted:
The guy in the video is such an obvious piece of
human garbage. He claims that he considers as a
hero anyone that kills Muslims. Of course if this
little piece of crap is such a coward he would never
step foot in Iraq or Afghanistan (MessageID:
88,541).
Overall, Abu_Sayyaf received a radical score of 38.555 of 40
and was the second most radical user on the Islamic Network
discussion forum.
D. Turn to Islam
The most radical member in the Turn to Islam discussion
forum was sister harb, who posted negative messages over a
consistent period of approximately 5.5 years. Of the 1,524
messages posted, the user’s overall average posting score was
-0.7, and 699 (50.9%) of the total number of messages were
negative while 12 (0.9%) of the messages were “very
negative.” What distinguished this user from the other users
who received the highest radical scores in the forum was a
combination of the user’s total number of radical messages
posted, the amount of time the author was active on the forum,
and the user’s average sentiment score. While sister harb’s
volume of negative posts and severity of posts were not
overly high (i.e., 50.9% of the author’s posts were negative,
and only 0.9% of the posts were very negative), the user posted
negative material over one of the longest periods of time on
the forum (i.e., 5.52 years) and was amongst the most prolific
posters on the forum (i.e., 1,524 posts). For example, in one
thread the user wrote:
…US army used chemical weapons agains civilians
[…] In this world only Allah can judge the
president ot the USA [sic] (MessageID: 148,800).
As such, sister harb received a radical score of 38.288 out of
40 and was the most radical user on the Turn to Islam
discussion forum.
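The radical scores reported above are out of 40 and combine four components: average sentiment score, volume of negative posts, severity of negative posts, and duration of negative posts. The sketch below shows one way such a percentile-based score could be assembled. It assumes that each component contributes up to 10 points in proportion to the user's percentile rank within the forum; that weighting, the function names, and the input layout are assumptions for illustration, not the published SIRA formula.

```python
def percentile_rank(value, population):
    """Fraction (0 to 1) of the population with a value <= `value`."""
    return sum(1 for v in population if v <= value) / len(population)


def radical_scores(users):
    """Score each user out of 40 from four percentile-ranked components.

    `users` maps a user name to a dict with the keys:
      'avg'      - average sentiment score (more negative = more extreme)
      'volume'   - percentage of posts that are negative
      'severity' - percentage of posts that are very negative
      'duration' - years over which negative posts were made
    """
    components = {
        # Negate the average so that a more negative average ranks higher.
        "avg": {u: -s["avg"] for u, s in users.items()},
        "volume": {u: s["volume"] for u, s in users.items()},
        "severity": {u: s["severity"] for u, s in users.items()},
        "duration": {u: s["duration"] for u, s in users.items()},
    }
    scores = {}
    for name in users:
        total = 0.0
        for values in components.values():
            # Assumed weighting: each component is worth up to 10 points.
            total += 10.0 * percentile_rank(values[name], list(values.values()))
        scores[name] = total
    return scores


# Toy forum of four users, not taken from the data.
toy_users = {
    "a": {"avg": -4.0, "volume": 90.0, "severity": 20.0, "duration": 3.0},
    "b": {"avg": -1.0, "volume": 70.0, "severity": 1.0, "duration": 6.0},
    "c": {"avg": 0.5, "volume": 10.0, "severity": 0.0, "duration": 0.5},
    "d": {"avg": 1.0, "volume": 5.0, "severity": 0.0, "duration": 0.2},
}
scores = radical_scores(toy_users)
```

In the toy forum, user "a" ranks first on three components but only third on duration, while user "b" ranks first on duration alone. This illustrates how different combinations of activity can each yield a high score under a percentile-based scheme.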
E. Radical Users Across Forums
The five most radical users across all four Dark Web
discussion forums were active users on Gawaher and Islamic
Awakening. Cageprisoners.com was the most radical user
on Islamic Awakening, and was also the most radical user
across all forums. Again, this author posted very few messages
on the forum (i.e., 139 messages); however, the content found
within these messages was amongst the most extreme across all
forums (i.e., the average sentiment score was -4.47, 97.1% of
all posts were negative, and 23.9% of all posts were very
negative). Furthermore, this user did not post a handful of
negative messages over a short period of time. Instead,
Cageprisoners.com posted a small number of extremely
negative messages that were spread out over approximately
three years (i.e., 2.87 years). The common theme of these
messages was the brutality of imprisonment at Guantanamo
Bay, with an emphasis on Islam being under attack by
Western nations. As the user noted,
...everyone knows 9/11 was one of the biggest
attacks to be attributed to the Al-Qaida network
against America and it started the […] Global war
against Islam […] The beginning of that was
marked by detaining Muslims, detaining those who
practice their religion, have large amounts of money
in their bank accounts, or who have travelled to
certain Muslim countries (MessageID: 12,789).
This finding suggests that the extreme nature of the messages
was not short lived, nor did it display unusual sentiment for
this particular user; rather, this user showed a consistent pattern
of extremely radical online discourse and a high level of
dedication to extremist beliefs (see Table 1).
The second most radical user across all discussion forums,
tom, was also the most radical user on Gawaher. Unlike
Cageprisoners.com, tom’s overall average posting score
was not the most negative amongst those who also received the
highest radical scores across forums (-1.02), nor was the
severity of the user’s negative posts noticeably high (i.e., 0.6%
of all posts were very negative). What differentiated tom from
all other members across all forums, as well as those who
received the highest radical scores across forums, was the
volume of negative messages and the length of time in which
these messages were posted (i.e., 72.4% of 2,874 posts were
negative over a 6.69 year period). This user was amongst the
most prolific posters across all discussion forums, and although
these messages were not classified as the most negative across
all forums, the volume of mildly negative messages was
constant. For example, tom typically posted messages that were
classified as moderately negative, such as: “it dont matter
which non muslim country gonna respond. the chain reaction
gonna lead to doomsday [sic]” (MessageID: 1,193,392) or
“...whats the 'hidden' agenda for superpower usa [...] invade
irag followed by afghanistan and now possibly eyeing at iran...
[sic]” (MessageID: 1,237,860).
Focusing on the top 0.1% and 1% of the most radical users
in the sample of 26,171 individuals across the four web forums
(i.e., roughly 27 users and 262 users, respectively), the results
indicated that two of the four discussion forums hosted
substantially more radical users. Users who scored the highest
radical score were active users in Gawaher and Islamic
Awakening (see Table 2).
While each forum’s share of the most radical users differed
between the top 0.1% and the top 1% of the sample, the results
suggested that Gawaher and Islamic Awakening hosted the
highest volume of the most radical users in the sample, and at
considerably higher rates than the Islamic Network and Turn
to Islam forums.
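The forum comparison above amounts to ranking all users by radical score, keeping the top fraction, and counting how many of those users belong to each forum. A minimal sketch follows. The function names and the tuple layout are assumptions, and rounding the cut-off up to a whole user is an assumption that happens to reproduce the 27 and 262 users mentioned in the text.

```python
import math
from collections import Counter


def top_k_size(n_users, fraction):
    """Number of users in the top `fraction`, rounded up (assumed rounding)."""
    return math.ceil(n_users * fraction)


def forum_shares(users, fraction):
    """Percentage of the top `fraction` of users hosted by each forum.

    `users` is a list of (name, forum, radical_score) tuples.
    """
    k = top_k_size(len(users), fraction)
    top = sorted(users, key=lambda u: u[2], reverse=True)[:k]
    counts = Counter(forum for _, forum, _ in top)
    return {forum: 100.0 * n / k for forum, n in counts.items()}


# Toy sample, not the study's data.
toy_sample = [
    ("u1", "Gawaher", 38.9),
    ("u2", "Islamic Awakening", 39.0),
    ("u3", "Turn to Islam", 20.0),
    ("u4", "Islamic Network", 25.0),
    ("u5", "Gawaher", 10.0),
]
shares = forum_shares(toy_sample, 0.4)
```

With 26,171 users, `top_k_size(26171, 0.001)` gives 27 and `top_k_size(26171, 0.01)` gives 262.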
IV. CONCLUSIONS
This study proposed a Sentiment-based Identification of
Radical Authors (SIRA) method to identify the most radical
users across approximately 1 million posts found on four Dark
Web discussion forums. A few notable results were found.
First, the findings did not highlight a particular typology that
can be used to define the most radical users within a web
forum, nor did we uncover specific patterns that categorize
specific activity as most radical. Although this research is in its
preliminary stages, the results indicate that the process of
identifying the most radical users within a discussion forum
must encompass a variety of key elements. The SIRA
algorithm for “radical score” is flexible enough to evaluate
several combinations of online sentiment activity that may be
used to detect the most radical users in discussion forums. It
also revealed that Gawaher and Islamic Awakening contained
the most radical users across all four forums.

| User | Forum | Negative Posting Period (Years) | Average Posting Score | Negative Posts (%) | Very Negative Posts (%) | Radical Score |
|---|---|---|---|---|---|---|
| Cageprisoners.com | Islamic Awakening | 2.87 | -4.47 | 97.1 | 23.9 | 39.028 |
| tom | Gawaher | 6.69 | -1.02 | 72.4 | 0.6 | 38.934 |
| hussain | Islamic Awakening | 5.69 | -1.05 | 70.3 | 1.2 | 38.876 |
| Barayev | Gawaher | 3.35 | -1.29 | 75 | 1.1 | 38.823 |
| Moishe3rd | Gawaher | 3.94 | -2.21 | 85.7 | 3.6 | 38.790 |
| iftikharalam | Gawaher | 6.28 | -0.88 | 64.0 | 0.3 | 38.652 |
| Hajjaj | Islamic Awakening | 2.76 | -1.27 | 74.1 | 0.7 | 38.625 |
| Daniel | Islamic Awakening | 2.26 | -1.47 | 76.8 | 0.7 | 38.620 |
| cageprisoners | Islamic Network | 2.11 | -2.36 | 74.9 | 4.3 | 38.601 |

Table 1. Most Radical Users Across Four Islamic Discussion Forums
Second, the same user was identified across two separate
discussion forums as the most radical. Cageprisoners.com of
Islamic Awakening and cageprisoners of Islamic Network were
deemed the most radical users within their given discussion
forum, thus providing validation for the algorithm.
Furthermore, this user represented the now defunct Cage
Prisoners (found at the website Cageprisoners.com), currently
operating under the domain name Cage (cageuk.org). Although
the site describes itself as an advocacy group campaigning
against the war on terror, it is a very controversial Islamic
rights group. It is headed by former Guantanamo Bay detainee
Moazzam Begg, and the site has a history of supporting Islamic
extremists [16]. Cage has also been described as a fraudulent
attempt to promote human rights, and a front for Taliban
enthusiasts and Al-Qaida devotees [10].
Combining sentiment analysis techniques and the SIRA
algorithm proved to be a very useful way of identifying the
most radical users in Dark Web discussion forums. However,
future research should take into account whether a user is active
at the time that the data are captured. Focusing on
"currently active" users may highlight how our strategy can be
used to identify specific users who are of interest to anti-
extremist groups. Future research should incorporate a
temporal analysis, examining how users’ radical scores change
during their time on a forum. This strategy could involve
splitting the entire dataset into clusters of months and
calculating each user’s radical score for that month. From here,
sharp changes in scores, especially ones that are most current,
could be interpreted as possible changes in motivation or level
of extremism within the author. This could help in minimizing
the possible threat posed by violent extremists.
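The proposed temporal analysis can be sketched as follows: group a user's posts by calendar month, compute a score for each month, and flag months in which the score shifts sharply from the previous observed month. For simplicity, the sketch uses a user's mean monthly polarity as a stand-in for a monthly radical score, since the latter would require percentile ranks across all users. The function names and the change threshold are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


def monthly_average(posts):
    """Mean polarity per (year, month), given (date, polarity) pairs."""
    buckets = defaultdict(list)
    for d, score in posts:
        buckets[(d.year, d.month)].append(score)
    return {month: sum(v) / len(v) for month, v in sorted(buckets.items())}


def sharp_changes(monthly, threshold):
    """Months whose mean differs from the previous observed month by more
    than `threshold` (a hypothetical cut-off chosen by the analyst)."""
    months = sorted(monthly)
    flagged = []
    for prev, cur in zip(months, months[1:]):
        if abs(monthly[cur] - monthly[prev]) > threshold:
            flagged.append(cur)
    return flagged


# Toy posting history, not taken from the forums.
history = [
    (date(2010, 1, 5), -1.0),
    (date(2010, 1, 20), -1.0),
    (date(2010, 2, 11), -1.5),
    (date(2010, 3, 2), -6.0),
    (date(2010, 3, 28), -4.0),
]
monthly = monthly_average(history)
flagged = sharp_changes(monthly, threshold=2.0)
```

Here only March 2010 is flagged, because the user's mean polarity drops from -1.5 to -5.0 in a single month.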
V. REFERENCES
1) Abbasi, A., & Chen, H., “Applying Authorship Analysis to
Extremist-group Web Forum Messages,” in Intelligent Systems, 20(5),
(2005), pp. 67-75.
2) Bouchard, M., Joffres, K., & Frank, R., “Preliminary Analytical
Considerations in Designing a Terrorism and Extremism Online
Network Extractor,” in Computational Models of Complex Systems,
Intelligent Systems Reference Library, 53, (2014), pp. 171-184.
3) Chen, H., Dark Web: Exploring and Data Mining the Dark Side of the
Web. New York: Springer. (2012).
4) Conway, M., “Terrorism and the Internet: New Media–New Threat?,” in
Parliamentary Affairs, 59(2), (2006), pp. 283-298.
5) Davies, G., & Dawson, S., “A Framework for Estimating the Number of
Extremists in Canada,” in Canadian Network for Research on Terrorism,
Security and Society, No. 14-08, (2014).
6) Denis, P., & Sagot, B., “Coupling an annotated corpus and a lexicon for
state-of-the-art POS tagging,” in Language Resources & Evaluation, 46,
(2012), pp. 721-736.
7) Feldman, R., “Techniques and Applications for Sentiment Analysis,” in
Communications of the ACM, 56(4), (2013), pp. 82-88.
8) Ferraro, J., Daumé III, H., Duvall, S., Chapman, W., Harkema, H., &
Haug, P., “Improving Performance of Natural Language Processing
Part-of-speech Tagging on Clinical Narratives through Domain
Adaptation,” in Journal of the American Medical Informatics
Association, 20, (2013), pp. 931-939.
9) Fu, T., Abbasi, A., & Chen, H., “A Focused Crawler for Dark Web
Forums,” in Journal of American Society for Information Science and
Technology, 61(6), (2010), pp. 1213-1231.
10) Glavin, T., “Amnesty International Doubles Down on Appeasement,”
The National Post, (February 8, 2010). Retrieved from
http://oped.ca/National-Post/terry-glavin-amnesty-international-doubles-down-on-appeasement/
11) Ghiassi, M., Skinner, J., & Zimbra, D., “Twitter Brand Sentiment
Analysis: A Hybrid System using n-gram Analysis and Dynamic
Artificial Neural Network,” in Expert Systems with Applications, 40,
(2013), pp. 6266-6282.
12) Hu, M., & Liu, B., “Mining and Summarizing Customer Reviews,” in
Proceedings of ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD-2004), (2004).
13) ICDM., “IEEE ICDM Workshop on Intelligence and Security
Informatics Challenge - Turn to Islam README file,” (2015). Retrieved
from http://cci.drexel.edu/isi/isi-icdm2015/challenge.html
14) Islamic Network., “Our Mission,” (2015). Retrieved from
http://www.islamicnetwork.net/our-mission/
15) Mei, J., & Frank, R., “Sentiment Crawling: Extremist Content Collection
through a Sentiment Analysis Guided Web-crawler,” in Proceedings of
the International Conference on Advances in Social Networks Analysis
and Mining, Paris, France. Springer (forthcoming, 2015).
16) Murray, D., & Simcox, R., “The Evidence Shows that Cage is a
Pro-Terrorist Group,” The Telegraph, (July 21, 2014). Retrieved from
http://www.telegraph.co.uk/news/uknews/law-and-order/10981050/The-evidence-shows-that-Cage-is-a-pro-terrorist-group.html
17) Sageman, M., Leaderless Jihad: Terror Networks in the Twenty-First
Century. Philadelphia, PA: University of Pennsylvania Press. (2008).
18) Sageman, M., “The Stagnation in Terrorism Research,” in Terrorism and
Political Violence, 26(4), (2014), pp. 565-580.
19) Salamyou., “About Us - Salamyou Muslim social Network - Turn to
Islam,” (n.d.). Retrieved from http://turntoislam.com/about/
20) The Apache Software Foundation. (2015). Retrieved from
https://opennlp.apache.org/
21) Thelwall, M., & Buckley, K., “Topic-based Sentiment Analysis for the
Social Web: The Role of Mood and Issue-related Words,” in Journal of
the American Society for Information Science and Technology, 64(8),
(2013), pp. 1608-1617.
22) Thet, T., Na, J., & Khoo, C., “Aspect-based Sentiment Analysis of
Movie Reviews on Discussion Boards,” in Journal of Information
Science, 36(6), (2010), pp. 823-848.
23) website.informer.com., “www.gawaher.com,” (2015). Retrieved from
http://website.informer.com/visit?domain=web.informer.com
24) Zhang, Y., Zeng, S., Huang, C-N., Fan, L., Yu, X., Dang, Y., Larson, C.
A., Denning, D., Roberts, N., & Chen, H., “Developing a Dark Web
Collection and Infrastructure for Computational and Social Sciences,”
in IEEE, (2010).
25) Zhou, Y., Qin, J., Lai, G., Reid, E., & Chen, H., “Building Knowledge
Management System for Researching Terrorist Groups on the Web,” in
Proceedings of the 11th Americas Conference on Information Systems,
Omaha, NE, August 11th to 14th, 2005.
| Forum | 0.1% of Most Radical Users in the Sample | 1% of Most Radical Users in the Sample |
|---|---|---|
| Gawaher | 48.15% | 38.93% |
| Islamic Awakening | 37.04% | 45.8% |
| Islamic Network | 11.11% | 6.11% |
| Turn to Islam | 3.7% | 9.16% |
| Total | 100% | 100% |

Table 2. 0.1% and 1.0% of the Most Radical Users Across Four Islamic
Discussion Forums
... Regarding sentiment features, they are not usually used as a single feature to detect extremist content, specially concerning political radicalization (Scrivens et al. 2015). While these features do not perform bad either and they, in fact, perform better than other less complex features (Ahmad et al. 2019), classification models trained with more features usually perform better than those who use only sentiment features (Weir et al. 2016;Hartung et al. 2017;Saif et al. 2017;Weir 2018, 2019;Araque and Iglesias 2020). ...
... SentiStrength can report binary (positive vs negative), trinary (positive/negative/neutral) and single scale ( −4 to +4 ) sentiment results. From the reviewed articles, it was the most commonly used tool to determine sentiment (Weir et al. 2016;Scrivens and Frank 2016;Wei et al. 2016;Saif et al. 2017;Owoeye and Weir 2019;Scrivens et al. 2015;Macnair and Frank 2018;Scrivens et al. 2020Scrivens et al. , 2018. -Linguistic inquiry word count 7 : this tool, also known as LIWC (Pennebaker et al. 2001), was created in 2007 with the purpose of studying the language through a psychological perspective. ...
... It supports different NLP tasks, providing several options to analyze texts. Four reviewed articles adopted OpenNLP in their experiments (Scrivens et al. 2018(Scrivens et al. , 2015Scrivens and Frank 2016;Weir et al. 2016). -IBM Watson natural language understanding 10 : this software, developed by IBM, includes several packages at their core, which allow conducting NLP analyzes from different perspectives (for example, open analysis versus questions and answers). ...
Article
Full-text available
Extremism has grown as a global problem for society in recent years, especially after the apparition of movements such as jihadism. This and other extremist groups have taken advantage of different approaches, such as the use of Social Media, to spread their ideology, promote their acts and recruit followers. The extremist discourse, therefore, is reflected on the language used by these groups. Natural language processing (NLP) provides a way of detecting this type of content, and several authors make use of it to describe and discriminate the discourse held by these groups, with the final objective of detecting and preventing its spread. Following this approach, this survey aims to review the contributions of NLP to the field of extremism research, providing the reader with a comprehensive picture of the state of the art of this research area. The content includes a first conceptualization of the term extremism, the elements that compose an extremist discourse and the differences with other terms. After that, a review description and comparison of the frequently used NLP techniques is presented, including how they were applied, the insights they provided, the most frequently used NLP software tools, descriptive and classification applications, and the availability of datasets and data sources for research. Finally, research questions are approached and answered with highlights from the review, while future trends, challenges and directions derived from these highlights are suggested towards stimulating further research in this exciting research area.
... The Islamic Network forum aims to unite dedicated individuals to practice their Muslim faith. The discussion forum, which is in the English language, includes several topics of interest to Muslims, ranging from theology to current world events [47]. The The most active users and those with the highest radical scores [47], who are thus likely to represent potential threats, were extracted, taking into account the periods of highest interest [8], topics of interest [48], and sentiment analysis [49,50]. ...
... The discussion forum, which is in the English language, includes several topics of interest to Muslims, ranging from theology to current world events [47]. The The most active users and those with the highest radical scores [47], who are thus likely to represent potential threats, were extracted, taking into account the periods of highest interest [8], topics of interest [48], and sentiment analysis [49,50]. Therefore, following the directions in [8,47], the most radical and followed forum members were selected. ...
... The The most active users and those with the highest radical scores [47], who are thus likely to represent potential threats, were extracted, taking into account the periods of highest interest [8], topics of interest [48], and sentiment analysis [49,50]. Therefore, following the directions in [8,47], the most radical and followed forum members were selected. After selecting the most radical and followed forum members, a second sorting was done. ...
Article
Full-text available
Online users tend to hide their real identities by adopting different names on the Internet. On Facebook or LinkedIn, for example, people usually appear with their real names. On other standard websites, such as forums, people often use nicknames to protect their real identities. Aliases are used when users are trying to protect their anonymity. This can be a challenge to law enforcement trying to identify users who often change nicknames. In unmonitored contexts, such as the dark web, users expect strong identity protection. Thus, without censorship, these users may create parallel social networks where they can engage in potentially malicious activities that could pose security threats. In this paper, we propose a solution to the need to recognize people who anonymize themselves behind nicknames—the authorship attribution (AA) task—in the challenging context of the dark web: specifically, an English-language Islamic forum dedicated to discussions of issues related to the Islamic world and Islam, in which members of radical Islamic groups are present. We provide extensive analysis by testing models based on transformers, styles, and syntactic features. Downstream of the experiments, we show how models that analyze syntax and style perform better than pre-trained universal language models.
... Regarding sentiment features, they are not usually used as a single feature to detect extremist content, specially concerning political radicalisation [104]. While the type of features do not perform bad either and they, in fact, perfform better than other less complex features [55], usually classification models trained with more features perform better than those who use only sentiment features [111,57,61,98,103,75]. ...
... Therefore, it becomes a hard task to find open datasets online. [148] 335.338 posts -10.858 users English Dark web forum [104,110] Much of the articles included on the review use their owns datasets. The reader is encouraged to contact with the authors of the different articles to ask for their data. ...
... SentiStreght can report binary (positive vs negative), trinary (positive/negative/neutral) and single scale (-4 to +4) sentiment results. From the reviewed articles, it was the most commonly used tool to determine sentiment [111,109,105,61,103,104,102,89,110]. ...
Preprint
Full-text available
Extremism research has grown as an open problem for several countries during recent years, especially due to the apparition of movements such as jihadism. This and other extremist groups have taken advantage of different approaches, such as the use of Social Media, to spread their ideology, promote their acts and recruit followers. Natural Language Processing (NLP) represents a way of detecting this type of content, and several authors make use of it to describe and discriminate the discourse held by this groups, with the final objective of detecting and preventing its spread. This survey aims to review the contributions of NLP to the field of extremism research, providing the reader with a comprehensive picture of the state of the art of this research area. The content includes a description and comparison of the frequently used NLP techniques, how they were applied, the insights they provided, the most frequently used NLP software tools and the availability of datasets and data sources for research. Finally, research questions are approached and answered with highlights from the review, while future trends, challenges and directions derived from these highlights are suggested.
... These include illicit fundraising, unlawful purchase of explosives, discussion forums, and covert communications channels (Weimann, 2016). Scrivens et al. created a framework to find the most radical users on dark web jihadist forums and found that the most radical user is associated with a terroristic website (Scrivens et al., 2016). Islamic terroristic organizations have encouraged their members to use the internet for their operations, naming it "Electronic Jihad" (Weimann, 2014). ...
... Identifying users and remote communities spreading radical content can help reduce unrest among the public by exposing the offenders behind it. Scrivens et al. proposed a system named SIRA (Sentiment-based identification of radical authors) to find the most radical authors on Dark Web forums by identifying extreme content using SentiStrength and classify users as radical based on average sentiment score, the volume of negative posts and duration of posts published (Scrivens et al., 2016). The most radical users were identified to be associated with terroristic websites. ...
Article
Full-text available
In this contemporary era, where a large part of the world population is deluged by extensive use of the internet and social media, terrorists have found it a potential opportunity to execute their vicious plans. They have got a befitting medium to reach out to their targets to spread propaganda, disseminate training content, operate virtually, and further their goals. To restrain such activities, information over the internet in context of terrorism needs to be analyzed to channel it to appropriate measures in combating terrorism. Open Source Intelligence (OSINT) accounts for a felicitous solution to this problem, which is an emerging discipline of leveraging publicly accessible sources of information over the internet by effectively utilizing it to extract intelligence. The process of OSINT extraction is broadly observed to be in three phases (i) Data Acquisition, (ii) Data Enrichment, and (iii) Knowledge Inference. In the context of terrorism, researchers have given noticeable contributions in compliance with these three phases. However, a comprehensive review that delineates these research contributions into an integrated workflow of intelligence extraction has not been found. The paper presents the most current review in OSINT, reflecting how the various state‐of‐the‐art tools and techniques can be applied in extracting terrorism‐related textual information from publicly accessible sources. Various data mining and text analysis‐based techniques, that is, natural language processing, machine learning, and deep learning have been reviewed to extract and evaluate textual data. Additionally, towards the end of the paper, we discuss challenges and gaps observed in different phases of OSINT extraction. This article is categorized under: Application Areas > Government and Public Sector Commercial, Legal, and Ethical Issues > Social Considerations Fundamental Concepts of Data and Knowledge > Motivation and Emergence of Data Mining
... Some of the goals include: (a) Explore goals, ideologies, and online fundraising and propaganda drives of terrorist groups [8]. (b) Identifying the hidden authors of posts by extracting common stylistic features [9,10]. (c) Quantifying the radicalization level toward different events. ...
... (c) Quantifying the radicalization level toward different events. This is based mainly on typical sentiment analysis research [10,11]. (d) Analysis of behavioral patterns of targeted users and their interactions [12,13]. ...
Article
Social networks have recently evolved from a means of connecting people to becoming a tool for social engineering, radicalization, dissemination of propaganda, and recruitment of terrorists. It is no secret that the majority of the Islamic State in Iraq and Syria (ISIS) members are Arabic speakers, and even the non-Arabs adopt Arabic nicknames. However, the majority of the literature researching the subject deals with non-Arabic languages. Moreover, the features involved in identifying radical Islamic content are shallow and the search or classification terms are common in daily chatter among people of the region. We aim at distinguishing normal conversation, influenced by the role religion plays in daily life, from terror-related content. This article presents our experience and the results of collecting, analyzing, and classifying Twitter data from affiliated members of ISIS, as well as sympathizers. We used artificial intelligence and machine learning classification algorithms to categorize the tweets, as terror-related, generic religious, and unrelated. We report the classification accuracy of the K-Nearest Neighbor, Bernoulli Naïve Bayes, and Support Vector Machine (One-Against-All and All-Against-All) algorithms. We achieved a high classification F1 score of 83%. The work in this paper will hopefully aid more accurate classification of radical content.
... The limitation of this work is that many of the features depend on the dataset. Authors in [25] also have used Parts of Speech (POS) tagging, openNLP, sentiment analysis and Sentiment-based Identification of Radical Authors (SIRA) algorithm to analyze characters of dataset consisting of one million posts and 26,000 unique users posted in 4 Islamic-based blogs to identify the most radical users on the Dark Web. ...
Conference Paper
Full-text available
In the last ten years the popularity and usage of social media has exploded. User interests, opinions, activities and relationships on social media generated behavioral data. There has been significant research effort in mining these activities and behaviors on social media to understand how these users' interests evolve. This has gained importance in diverse fields; one of these fields is detecting terrorist activities. Terrorist groups are using social media platforms increasingly to recruit new supporters and to spread their message and propaganda to accomplish their goals. Social Network Analysis is a data mining technique, is used to identify the relations and associations that belong to the members of terrorist networks. The goal of this work is to review the issues for mining terrorists' networks in social media.
... Scrivens et al. has developed a sentiment based identifier of radical authors (SIRA). The authors use text from identified radical users to validate and evaluate their identifier [87]. Subsequently, in their later works, Scrivens et al. employ their sentiment based identifier to identify and explore posting behaviors of right wing extremists [53] where they identified large proportion of out-group construction and attacks in the community. ...
Preprint
Full-text available
The algorithms and the interactions facilitated by online platforms have been used by radical groups to recruit vulnerable individuals to their cause. This has resulted in the sharp growth of violent events and deteriorating online discourse. The Manosphere, a collection of radical anti-feminist communities, is one such group which has attracted attention due to their rapid growth and increasingly violent real world outbursts. In this paper, we examine the social engagements between Reddit users who have participated in feminist discourse and the Manosphere communities on Reddit to understand the process of development of traits associated with the adoption of extremist ideologies. By using existing research on the psychology of radicalization we track how specific types of social engagement with the Manosphere influence the development of traits associated with radicalization. Our findings show that: (1) participation, even by the simple act of joining the Manosphere, has a significant influence on the language and outlook traits of a user, (2) Manosphere elites are extremely effective propagators of radical traits and cause their increase even outside the Manosphere, and (3) community perception can heavily influence a user's behavior. Finally, we examine how our findings can help draft community and platform moderation policies to help mitigate the problem of online radicalization.
Conference Paper
As the data generated on the internet increases exponentially, developing guided data collection methods becomes more and more essential to the research process. This paper proposes an approach to building a self-guiding web-crawler to collect data specifically from extremist websites. The guidance component of the web-crawler is achieved through the use of sentiment-based classification rules, which allow the crawler to make decisions based on the content of the webpages it downloads. First, content from 2,500 webpages was collected for each of four sentiment-based classes: pro-extremist websites, anti-extremist websites, neutral news sites discussing extremism, and sites with no discussion of extremism. Then part-of-speech tagging was used to find the most frequent keywords in these pages. Using sentiment software in conjunction with classification software, the authors generated a decision tree that could effectively discern which class a particular page would fall into. The resulting tree showed an 80% success rate at differentiating between the four classes and a 92% success rate at classifying specifically extremist pages. This decision tree was then applied to a randomly selected sample of pages for each class. This secondary test produced results similar to those of the primary test, which holds promise for future studies using this framework.
Article
This paper investigates how best to couple hand-annotated data with information extracted from an external lexical resource to improve part-of-speech tagging performance. Focusing mostly on French tagging, we introduce a maximum entropy Markov model-based tagging system that is enriched with information extracted from a morphological resource. This system gives 97.75% accuracy on the French Treebank, an error reduction of 25% (38% on unknown words) over the same tagger without lexical information. We perform a series of experiments that help explain how this lexical information improves tagging accuracy. We also conduct experiments on datasets and lexicons of varying sizes in order to assess the best trade-off between annotating data and developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half.
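The accuracy and error-reduction figures in the abstract above can be related with a short worked calculation. It assumes that "error reduction" means a relative reduction in token-level error rate and that the reported figures are rounded; the symbols $e_{\text{lex}}$, $e_{\text{base}}$ and $r$ are introduced here for illustration and do not appear in the abstract.

```latex
\begin{align*}
e_{\text{lex}}  &= 1 - 0.9775 = 0.0225
  && \text{(error rate with the lexicon)} \\
r &= \frac{e_{\text{base}} - e_{\text{lex}}}{e_{\text{base}}} = 0.25
  && \text{(relative error reduction)} \\
e_{\text{base}} &= \frac{e_{\text{lex}}}{1 - r} = \frac{0.0225}{0.75} = 0.03
  && \text{(implied error rate without the lexicon)}
\end{align*}
```

Under these assumptions, the same tagger without lexical information would score roughly 97.0% accuracy, so a 25% error reduction corresponds to a gain of about 0.75 percentage points.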
Article
Objective: Natural language processing (NLP) tasks are commonly decomposed into subtasks, chained together to form processing pipelines. The residual error produced in these subtasks propagates, adversely affecting the end objectives. Limited availability of annotated clinical data remains a barrier to reaching state-of-the-art operating characteristics using statistically based NLP tools in the clinical domain. Here we explore the unique linguistic constructions of clinical texts and demonstrate the loss in operating characteristics when out-of-the-box part-of-speech (POS) tagging tools are applied to the clinical domain. We test a domain adaptation approach integrating a novel lexical-generation probability rule used in a transformation-based learner to boost POS performance on clinical narratives. Methods: Two target corpora from independent healthcare institutions were constructed from high frequency clinical narratives. Four leading POS taggers with their out-of-the-box models trained from general English and biomedical abstracts were evaluated against these clinical corpora. A high performing domain adaptation method, Easy Adapt, was compared to our newly proposed method ClinAdapt. Results: The evaluated POS taggers drop in accuracy by 8.5-15% when tested on clinical narratives. The highest performing tagger reports an accuracy of 88.6%. Domain adaptation with Easy Adapt reports accuracies of 88.3-91.0% on clinical texts. ClinAdapt reports 93.2-93.9%. Conclusions: ClinAdapt successfully boosts POS tagging performance through domain adaptation requiring a modest amount of annotated clinical data. Improving the performance of critical NLP subtasks is expected to reduce pipeline error propagation leading to better overall results on complex processing tasks.
Book
In the post-September 11 world, Al Qaeda is no longer the central organizing force that aids or authorizes terrorist attacks or recruits terrorists. It is now more a source of inspiration for terrorist acts carried out by independent local groups that have branded themselves with the Al Qaeda name. Building on his previous groundbreaking work on the Al Qaeda network, forensic psychiatrist Marc Sageman has greatly expanded his research to explain how Islamic terrorism emerges and operates in the twenty-first century. In Leaderless Jihad, Sageman rejects the views that place responsibility for terrorism on society or a flawed, predisposed individual. Instead, he argues, the individual, outside influence, and group dynamics come together in a four-step process through which Muslim youth become radicalized. First, traumatic events either experienced personally or learned about indirectly spark moral outrage. Individuals interpret this outrage through a specific ideology, more felt and understood than based on doctrine. Usually in a chat room or other Internet-based venues, adherents share this moral outrage, which resonates with the personal experiences of others. The outrage is acted on by a group, either online or offline. Leaderless Jihad offers a ray of hope. Drawing on historical analogies, Sageman argues that the zeal of jihadism is self-terminating; eventually its followers will turn away from violence as a means of expressing their discontent. The book concludes with Sageman's recommendations for the application of his research to counterterrorism law enforcement efforts.
Chapter
It is now widely understood that extremists use the Internet in attempts to accomplish many of their objectives. In this chapter we present a web-crawler called the Terrorism and Extremism Network Extractor (TENE), designed to gather information about extremist activities on the Internet. In particular, this chapter will focus on how TENE may help differentiate terrorist websites from anti-terrorist websites by analyzing the context around the use of predetermined keywords found within the text of the webpage. We illustrate our strategy through a content analysis of four types of websites: a popular white supremacist website, a jihadist website, a terrorism-related news website, and an official counterterrorist website. To explore differences between these websites, the presence of, and context around, 33 keywords were examined on each website. It was found that certain words appear more often on one type of website than on others, and this may potentially serve as a good method for differentiating between terrorist websites and ones that simply refer to terrorist activities. For example, words such as “terrorist,” “security,” “mission,” “intelligence,” and “report” all appeared with much greater frequency on the counterterrorist website than on the white supremacist or the jihadist websites. In addition, the white supremacist and the jihadist websites used words such as “destroy,” “kill,” and “attack” in a specific context: not to describe their activities or their members, but to portray themselves as victims. The future developments of TENE are discussed.
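The keyword-and-context idea described in the chapter abstract above can be illustrated with a minimal sketch. This is not TENE's implementation: the function names `tokenize` and `keyword_contexts`, the regular-expression tokenizer, and the three-word context window are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplistic, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_contexts(text, keywords, window=3):
    """Count each keyword and collect the words surrounding every occurrence.

    Returns (counts, contexts), where contexts maps each keyword to a list of
    (left_words, right_words) tuples, each side holding up to `window` tokens.
    """
    tokens = tokenize(text)
    counts = Counter()
    contexts = {keyword: [] for keyword in keywords}
    for i, token in enumerate(tokens):
        if token in contexts:
            counts[token] += 1
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts[token].append((left, right))
    return counts, contexts
```

Comparing the resulting counts and context lists across pages from different site types is one simple way to look for the frequency and usage differences the abstract describes.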
Article
Despite over a decade of government funding and thousands of newcomers to the field of terrorist research, we are no closer to answering the simple question of “What leads a person to turn to political violence?” The state of stagnation with respect to this issue is partly due to the government strategy of funding research without sharing the necessary primary source information with academia, which has created an unbridgeable gap between academia and the intelligence community. This has led to an explosion of speculations with little empirical grounding in academia, which has the methodological skills but lacks data for a major breakthrough. Most of the advances in the field have come from historical archival research and analysis of a few field interviews. Nor has the intelligence community been able to achieve any breakthrough because of the structure and dynamic of this community and its lack of methodological rigor. This prevents creative analysis of terrorism protected from political concerns. The solution to this stagnation is to make non-sensitive data available to academia and to structure more effective discourse between the academic and intelligence communities in order to benefit from the complementary strengths in these two communities.
Article
General sentiment analysis for the social web has become increasingly useful for shedding light on the role of emotion in online communication and offline events in both academic research and data journalism. Nevertheless, existing general-purpose social web sentiment analysis algorithms may not be optimal for texts focussed around specific topics. This article introduces 2 new methods, mood setting and lexicon extension, to improve the accuracy of topic-specific lexical sentiment strength detection for the social web. Mood setting allows the topic mood to determine the default polarity for ostensibly neutral expressive text. Topic-specific lexicon extension involves adding topic-specific words to the default general sentiment lexicon. Experiments with 8 data sets show that both methods can improve sentiment analysis performance in corpora and are recommended when the topic focus is tightest.
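A toy sketch can make the two methods named in the abstract above concrete. It is not the article's implementation: the lexicon entries and their strengths, the single signed score, the `is_expressive` heuristic (an exclamation mark), and all function names are hypothetical choices made for illustration.

```python
import re

# Hypothetical general-purpose lexicon with signed sentiment strengths.
BASE_LEXICON = {"good": 2, "love": 3, "bad": -2, "hate": -3}


def extend_lexicon(base, topic_terms):
    """Lexicon extension: return a copy of the base lexicon plus topic-specific terms."""
    extended = dict(base)
    extended.update(topic_terms)
    return extended


def is_expressive(text):
    """Crude stand-in for detecting expressive text (assumption: contains '!')."""
    return "!" in text


def score(text, lexicon, topic_mood=0):
    """Sum the strengths of known words.

    Mood setting: when no lexicon word matches but the text is expressive,
    fall back to the topic's default polarity instead of treating it as neutral.
    """
    words = re.findall(r"[a-z']+", text.lower())
    matched = [lexicon[word] for word in words if word in lexicon]
    if not matched and is_expressive(text):
        return topic_mood
    return sum(matched)
```

For example, adding a topic term such as `{"riot": -2}` through `extend_lexicon` lets `score` register "The riot spread" as negative, while a `topic_mood` of -1 makes an otherwise unscored exclamation inherit the topic's negative default.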
Article
The main applications and challenges of one of the hottest research areas in computer science.
Article
Twitter messages are increasingly used to determine consumer sentiment towards a brand. The existing literature on Twitter sentiment analysis uses various feature sets and methods, many of which are adapted from more traditional text classification problems. In this research, we introduce an approach to supervised feature reduction using n-grams and statistical analysis to develop a Twitter-specific lexicon for sentiment analysis. We augment this reduced Twitter-specific lexicon with brand-specific terms for brand-related tweets. We show that the reduced lexicon set, while significantly smaller (only 187 features), reduces modeling complexity, maintains a high degree of coverage over our Twitter corpus, and yields improved sentiment classification accuracy. To demonstrate the effectiveness of the devised Twitter-specific lexicon compared to a traditional sentiment lexicon, we develop comparable sentiment classification models using SVM. We show that the Twitter-specific lexicon is significantly more effective in terms of classification recall and accuracy metrics. We then develop sentiment classification models using the Twitter-specific lexicon and the DAN2 machine learning approach, which has demonstrated success in other text classification problems. We show that DAN2 produces more accurate sentiment classification results than SVM while using the same Twitter-specific lexicon.