
Not So Cute but Fuzzy: Estimating Risk of Sexual Predation in Online Conversations



Tatiana R. Ringenberg*, Kanishka Misra* and Julia Taylor Rayz
Department of Computer and Information Technology, Purdue University, USA
Abstract: The sexual exploitation of minors is a known
and persistent problem for law enforcement. Assistance in
prioritizing potentially risky conversations is crucial. While
automatic triage of conversations for the risk of sexual
exploitation of minors has been attempted in the past,
most computational models use
features which are not representative of the grooming process
that is used by investigators. Accurately annotating an offender
corpus for use with machine learning algorithms is difficult
because the stages of the grooming process feed into one another
and are non-linear. In this paper we propose a method for
labeling risk, tied to stages and themes of the grooming process,
using fuzzy sets. We develop a neural network model that uses
these fuzzy membership functions of each line in a chat as input
and predicts the risk of interaction.
The sexual exploitation of online youths is a known and
persistent problem. The National Center for Missing and
Exploited Children (NCMEC) received 10.2 million reports
of suspected child exploitation in 2017 [1]. Researchers
have suggested the exploitation of minors online is also
likely under-reported, due in part to the offender encouraging
the minor to keep the interaction a secret and the minor
acquiescing to the request [2]. Furthermore, handling the
known cases has placed additional strain on
law enforcement agencies [3], and any help with prioritizing
high risk conversations is a step forward.
Since the early 2000s, researchers have devoted a consid-
erable amount of time and resources towards understanding
and modeling the online grooming process which occurs
between adults and minors [4], [5]. Studies related to online
grooming theory are focused on the techniques and themes
used within the various stages of the process [4], [6], offender
and victim characteristics [7], and offender identification [8],
[9]. While qualitative analyses have identified many of the
themes and patterns occurring within each of the grooming
stages [4], [10], little research on offender identification
uses this information. The majority of research on offender
identification uses low-level features (n-grams, word cate-
gories, basic chat statistics) which have been found to be less
effective than the use of high-level features directly related to
grooming [11]. There is a growing need for research which
focuses on the use of grooming-derived features to improve
the identification of online solicitors of minors.
* First two authors contributed equally
While the use of high-level features is ideal, expert
annotation is required due to the complexity of the task
[12]. Even for a researcher in online grooming, the task of
identification of various aspects of the grooming process is
difficult because the stages are neither linear nor perfectly
discrete [4], [6]. From previous research, we know within
the first 20% of a conversation, an offender may engage in
several stages of the grooming process including friendship
forming, risk assessment, and the sexual stage. We also know
the progression through the grooming stages can be gradual
both within and between stages [6]. For instance, within
the sexual stage offenders will often start with innocuous
questions related to a minor’s previous romantic relationships
but will progress to questions about sexual history, hypothet-
icals about future encounters, and at times graphic sexual
descriptions of fantasy [10], [13]. Furthermore, identification
of code span is a known problem within Natural Language
Processing literature and is further compounded by the use
of chat logs in which subsequent lines may or may not
refer to the line prior [14], [15]. As the bounds between
stages are not discrete and the code span of a given topic
is problematic due to the nature of the conversation struc-
ture, fuzzy annotation of grooming-related content is ideal.
Through the use of fuzzy annotation, researchers can model
the gradual progression and oscillation of various aspects of
the grooming process.
We address the gap in offender literature by creating fuzzy
annotations for a small corpus of grooming conversations,
where each message is mapped to three levels of risk asso-
ciated with the grooming process, namely low, medium and
high. The novelty of our approach is in modeling imprecise
boundaries of annotations of risk in conversations of offenders
and minors. This increases the speed of annotation, making it
possible to more quickly process time-sensitive chats that may be
used as training or development sets. A chat message
can belong to multiple categories with varying degrees of
membership. We employ fuzzy sets [16] to approximate the
Identification of predatory intent has been studied from
both a qualitative and computational perspective. Within
qualitative research, researchers have identified the pro-
cesses and themes offenders use to entrap minors [4], [5],
[17]. From the computational perspective, researchers have
worked on the identification of non-offenders from offenders
[8], [9] along with identifying various aspects of the groom-
ing stages [11].
2019 IEEE International Conference on Systems, Man and Cybernetics (SMC)
Bari, Italy. October 6-9, 2019
978-1-7281-4569-3/19/$31.00 ©2019 IEEE
O’Connell [4] developed the online grooming process
which consisted of the following stages: friendship forming,
relationship forming, risk assessment, exclusivity, and sexual.
The friendship forming stage consists of conversation which
would generally be considered normal. Relationship forming
is also non-sexual but includes more child-specific themes
than the friendship forming stage. Risk assessment revolves
around reducing the probability of detection. The exclusivity
stage is the stage in which a bond with the child is estab-
lished. Finally, the sexual stage is a complex amalgamation
of topics and techniques an offender uses to normalize sexual
discussion. Offenders normalize sexual discussion by starting
with innocuous questions and ramping up sexual discussion
to include hypotheticals, cyber sex, and at times images [4].
Gupta et al. [17] labeled the stages identified by O’Connell
in 75 chat conversations to identify patterns in and between
phases. The authors suggested the relationship forming stage
is the most dominant stage. Additionally, the authors found
the topic of meeting did not occur only at the end of the
chat but rather occurred in multiple places throughout the
chat [17].
Other studies have focused on the themes which occur
within the chat conversations. Kloess et al. [13] examined a
set of five cases consisting of 29 transcripts to find patterns
related to the modus operandi of the offender. The authors
found themes related to initiating online sexual activity, pur-
suing sexual information, fantasy rehearsal, and discussion of
physical meeting [13]. Barber and Bettez [10] also annotated
a series of conversations for grooming themes and identified five main
themes: assessment, enticements, cyberexploitation, control,
and self-preservation.
Aside from grooming stages and themes, research on
offender strategies has provided clues as to the pacing and
progress of the offender process. As far back as the 1970s and
1980s, authors discussed the concept of progression within
the grooming process. Peters [18] described the movement of
offenders towards sexual activity with a minor as the result
of the affection-seeking of the child. Additionally, Lang and
Frenzel [19] identified progressive horseplay as a means
to move from innocuous children’s games to seemingly
accidental touching which finally results in sexual contact.
The authors also found the progression of the sexual activity
was not always gradual - the offender would intensify efforts
when it appeared the child was either confused or curious
and would lessen the intensity of grooming efforts when the
child appeared hesitant [19].
Within online grooming, authors have also identified sev-
eral characteristics related to the pacing and distribution of
grooming. Through examining the word categories present
within offender chats, Black et al. [6] confirmed the findings
of O’Connell [4] who determined stages are non-linear.
Additionally, Black et al. [6] found within the first 20% of
a chat an offender will go through multiple stages of the
grooming process including potentially friendship forming,
risk assessment, and sexual conversation. Kloess et al. [13]
described the pacing of offender conversations as varying.
Some offenders formed a deep relationship with the minors
and incorporated sexual topics slowly. Other offenders chose
to not develop any relationship with the minor and instead
engaged in sexual conversation with the minor almost im-
mediately [13]. Similar to [13], Winters, Kaylor, and Jeglic
[20] found sexual conversation was often mentioned early
in conversation as a means to assess interest in sexual
activity. Offenders within the study also attempted to arrange
meetings within a short period of time [20].
Overall, qualitative research shows the distribution of
grooming themes and stages within a chat conversation
varies [6], [13], [20]. Additionally, the intensity of sexual
conversation and the level of risk of physical meeting varies
throughout the conversation and does not instantly change
but rather escalates and de-escalates depending upon the
situation [4], [13], [20].
Within the computational analysis of offender chats, a
large portion of research has focused on automatic triage of
offenders within chats [8], [9], [11], [21]. While features for
identification of offender conversations vary, one strategy is
to operationalize themes and stages of the grooming process
through natural language processing tools.
Cano, Fernandez, and Alani [22] use various behavioral
and lexical features to classify lines of chats into the stages of
Luring Communication Theory, which maps online grooming
themes to a communication process centered around decep-
tive trust [5]. The three stages included grooming, approach,
and trust development [22]. The features used included sentiment,
bag of words, readability, lexical category features,
and chat patterns. The authors found combining features
together resulted in higher performance when detecting both
the grooming phase and the approach phase. Additionally,
the authors found sentiment features were not good at dis-
criminating between stages. Lexical category features were
found to improve classification but on their own were not
good predictors of the phases [22].
A similar study was performed by Michalopoulos and
Mavridis [23] in which the authors also attempted to classify
the elements of grooming based on a set of features. In this
study, the stages being categorized were sexual affair, gaining
access, and deceptive relationship. The authors used TF-IDF
and other document classification methods [23].
Finally, McGhee et al. [21] compared their rules-based
approach to identifying the pursuit of personal information,
the grooming process, and the approach of the predator to the
victim to a machine learning approach which used a series
of features including various categories of words and various
groups of parts of speech. Some of the categories of words
included approach nouns, activity nouns, and family nouns.
The categories were similar to the categories which are found
in the lexical category dictionary in the Linguistic Inquiry and
Word Count (LIWC) tool [24]. The authors found their previous
work which used a rule-based approach was superior to the
use of the machine learning methods of decision trees and
instance based learning [21]. However, the accuracy of the
rule-based system only reaches 68% which leaves room for
improvement.
The majority of the computational studies rely on surface
level features. However, multiple studies have found combin-
ing such features, or using features which are representative
of the grooming process is a more successful approach [11],
[22]. Identifying only features such as n-grams, part of
speech, and word categories does not provide an accurate
picture of grooming because the strategies within an offender
conversation (1) do not occur in a particular order, (2) are
gradually shifted between, and (3) are repeated in various
combinations throughout the chat based on the input of the
minor [4], [6], [17]. As a result, annotating grooming char-
acteristics for computational use is a crucial task. However,
manual annotation of grooming characteristics is difficult for
the same reasons. The ebb and flow of themes causes issues
for annotators in terms of both complexity and code span
[14], [15]. Issues with complexity and code span result in
lower inter-annotator reliability, which affects the corpus’s
usability and validity [14], [15]. For such a task, a fuzzy
representation of annotations is preferable, as lines may
be progressive transitions between or within stages of the
grooming process.
A. Building the Corpus
To build our corpus, we choose eight chats from Perverted
Justice (PJ), comprising a total of 13,648 individual mes-
sages, with 16 unique users. PJ is a vigilante organization
which seeks to identify sexual solicitors of minors online.
The participants act as child decoys and talk to potential
solicitors as if they are minors. If the conversation progresses
to a level in which an individual is considered dangerous
and/or breaks the law, the organization contacts and works
with law enforcement to potentially capture the individual.
The chat conversations occurring between convicted offend-
ers and decoys are posted to the Perverted Justice website and
are the primary source of chat conversations for individuals
studying online solicitor communication [6], [25], [26].
As we are interested in improving the triage process of
chats for law enforcement, we annotate the eight offender-
decoy chats for risk. The literature leads us to believe fuzzy
representations of annotations are preferable. However before
fuzzy sets can be built, we need to start with a set of crisp
labels. We labeled each line of the eight chat conversations
as being low, medium, or high risk.
1) Low Risk Lines: These were defined as those lines
which could be considered typical of a non-sexual chat
between two individuals. In terms of grooming, this corre-
sponded to the friendship forming stage, relationship forming
stage, and risk assessments which did not obviously appear
to be related to sex or deception (e.g., requesting an image
of a child that was non-sexual).
Fig. 1 is an example of a set of low risk lines occurring
at the beginning of one of the conversations:
Solicitor: hey
Decoy: hey. ur in jasper?
Solicitor: yes
Decoy: kool wats u doin
Solicitor: nothing
Solicitor: i’m just laying in bed
Fig. 1: A snippet of low-risk messages
The excerpt was considered to be low risk because the
solicitor engaged in small talk typical of two individuals
meeting in a chat room.
2) Medium Risk Lines: These are lines in which explicit
affection, compliments about the decoy’s appearance or body
parts which are implicitly but not explicitly sexual, secrecy,
guilt, or the exclusivity stage are present. The medium level
of risk maps to language around an isolated, covert bond
being formed between the offender and the decoy. The
following lines were labeled as medium risk within one of
the chats (Fig. 2):
Solicitor: look at you just a lil angel
Solicitor: lol
Decoy: thanks :p
Solicitor: i think my fav is you in the green tights
Solicitor: well i like them all actually
Decoy: thanks yeah it shows the most of me
Solicitor: yeah a lil bit of your side
Solicitor: lol
Decoy: lol yeah i bet you like that >:)
Solicitor: yeah i do
Fig. 2: A snippet of medium-risk messages
In this section of the chat, the offender compliments the
decoy by calling her a lil angel and makes comments about
the clothing of the decoy in the image she had sent. The
decoy implies the solicitor likes the image for sexual reasons,
but does not directly state this. The solicitor responds in the
affirmative, but does not explicitly sexualize the conversation.
3) High Risk Lines: These are lines in which the sexual
stage or discussion of meeting and explicit risk assessment
of meeting occur. In the previous excerpt, the solicitor/decoy
was headed in a sexual direction but was not explicit. In the
following lines, the offender encourages the minor to meet
physically (Fig. 3):
Solicitor: i’m soo bored ..i’m coming to get u
Solicitor: jk
Solicitor: ouch ..good move
Decoy: ohhh ur jk?lol
Solicitor: unless u want me to ;)
Fig. 3: A snippet of high-risk messages
The above excerpt shows an offender introducing the
concept of meeting by framing it as a joke. As the conver-
sation continued, the offender began asking more and more
hypotheticals related to whether or not the child would like
to meet and what the meeting would entail. This progression
is common within offender chats. While the initial coding of
the corpus is crisp, the actual transitions between risk levels
are not. For instance, in Fig. 4, the offender and victim start
casually talking and the offender incorporates affectionate
talk which is consistent with the exclusivity stage.
Solicitor: hey
Decoy: hey
Solicitor: hey what are you up to
Solicitor: you there
Solicitor: really where you been at
Decoy: just not here much. what r u up to?
Solicitor: missing you
Decoy: u really missed me?
Solicitor: yes i try to call your number but you never answer
Fig. 4: A snippet of a risk transition
Based on the conversations that were used in this work,
in addition to previous literature, we found that the three
lines before and after the beginning of each definitive risk
level chunk approximately contribute to the transition
between risk levels [14], [15]. Thus, we build our
risk level fuzzy sets starting from the fourth line before and
after the beginning of each crisp label for the risk. In the
next section, we discuss the methodology for creating fuzzy
sets from the crisp labels we initially created.
B. Building Fuzzy Sets from Crisp Labels
The fuzzy set for each line comprises the line’s membership
degree to each class: low, medium, or high. To build
our fuzzy set from the imprecise crisp annotations, we use a
trapezoidal function, as indicated by Eq. (1):
$$
\mu_C(l) =
\begin{cases}
\dfrac{l - (a - 4)}{4} & \text{if } a - 4 \le l < a \\
1 & \text{if } a \le l \le b \\
\dfrac{(b + 4) - l}{4} & \text{if } b < l \le b + 4 \\
0 & \text{otherwise}
\end{cases}
\tag{1}
$$
where 1 indicates maximum membership of the line to
the given class $C$, $l$ is the index of the line, and $a$ and $b$
indicate the start and end points of the crisp set.
For the three classes $C_{low}$, $C_{medium}$,
and $C_{high}$, the fuzzy set describing the membership of
risk for line $l$ would be the following vector:
$\mu(l) = [\mu_{low}(l), \mu_{medium}(l), \mu_{high}(l)]^\top$. Hence, models used in the
task of measuring risk would estimate the membership of
each class rather than classify the line into a singular
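The fuzzification step in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the ramps are assumed to be linear over four lines as read from Eq. (1), and merging several chunks of the same class with `max` (fuzzy union) is an assumption the paper does not state.

```python
from itertools import groupby

CLASSES = ("low", "medium", "high")
RAMP = 4  # ramp width in lines, as read from Eq. (1)


def trapezoid(l, a, b, ramp=RAMP):
    """Membership of line index l for a crisp chunk covering lines a..b (inclusive)."""
    if a <= l <= b:
        return 1.0
    if a - ramp <= l < a:
        return (l - (a - ramp)) / ramp
    if b < l <= b + ramp:
        return ((b + ramp) - l) / ramp
    return 0.0


def crisp_chunks(labels):
    """Yield (label, start, end) for each maximal run of identical crisp labels."""
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        yield label, start, start + length - 1
        start += length


def fuzzify(labels):
    """Return one [mu_low, mu_medium, mu_high] vector per line."""
    chunks = list(crisp_chunks(labels))
    fuzzy = []
    for l in range(len(labels)):
        # Assumption: several chunks of one class are merged with max (fuzzy union).
        fuzzy.append([
            max((trapezoid(l, a, b) for lab, a, b in chunks if lab == c), default=0.0)
            for c in CLASSES
        ])
    return fuzzy


# Toy crisp annotation: six low-risk lines followed by three high-risk lines.
labels = ["low"] * 6 + ["high"] * 3
for index, memberships in enumerate(fuzzify(labels)):
    print(index, labels[index], memberships)
```

With this reading, the three lines on either side of a crisp chunk receive partial membership, and the fourth line out receives zero.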
In this section, we propose simple models that have been
shown to be competitive in regular text classification tasks.
These models are modified variants of previously successful
models that have been shown to be simple, difficult-to-beat
baselines for sentence classification. Specifically, we use two
A. Model Description
1) Deep Averaging Network (DAN): As described in Iyyer
et al. [27], this model takes the embeddings of each
word in the input and averages them to produce a sentence
representation, which is then passed as input into a Multi-
Layer Feed-Forward Network with a Softmax layer at the end
to classify the input. This model has been shown to be a strong
baseline for many classification tasks [27].
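The averaging idea behind DAN can be illustrated with a dependency-free forward pass. This is only a sketch under stated assumptions: the weights are random rather than trained, a single hidden layer with ReLU is assumed, and all names (`dan_forward`, `embeddings`, and so on) are hypothetical rather than taken from [27].

```python
import math
import random


def average(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dense(x, weights, bias):
    """Affine layer; weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def dan_forward(tokens, embeddings, hidden, output):
    """Average the word vectors, apply one hidden layer, then a softmax layer."""
    sentence = average([embeddings[t] for t in tokens])
    return softmax(dense(relu(dense(sentence, *hidden)), *output))


# Toy, untrained parameters (assumed sizes; real models use pretrained vectors).
rng = random.Random(0)
DIM, HIDDEN, N_CLASSES = 4, 5, 3


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


tokens = ["hey", "what", "are", "you", "up", "to"]
embeddings = {t: [rng.uniform(-1, 1) for _ in range(DIM)] for t in tokens}
hidden = (rand_matrix(HIDDEN, DIM), [0.0] * HIDDEN)
output = (rand_matrix(N_CLASSES, HIDDEN), [0.0] * N_CLASSES)

probs = dan_forward(tokens, embeddings, hidden, output)
print(probs)
```

Because the sentence representation is an average, the output does not depend on word order, which is the main simplification this model makes.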
2) Multi-Channel Convolutional Neural Networks (CNN):
Using CNNs for sentence classification is a technique first
proposed by Kim [28]. In this model, each word in the
sentence is initialized by a word-based pretrained representation
as well as a random vector, which comprise
the two channels. Multiple 1D convolution filters are
used to extract signals from various n-grams in the sentence.
These signals are passed through a max-pooling layer to produce a
fixed length sentence representation, which is finally used for
classification using a standard feed-forward layer.
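The convolution and max-pooling step can be sketched without any deep learning library. The example below is illustrative only: applying the same filter to both channels and summing the responses is an assumption borrowed from the multichannel variant in [28], the ReLU and the zero returned for too-short sentences are also assumptions, and the toy vectors are made up.

```python
def conv_max_pool(channels, filt, bias=0.0):
    """Slide one filter over every n-gram window and keep the largest activation.

    channels: list of sentence matrices, each [n_words][dim], one per channel.
    filt: filter matrix [width][dim], shared across channels (assumption).
    """
    width = len(filt)
    n_words = len(channels[0])
    if n_words < width:
        return 0.0  # assumption: real implementations pad short sentences instead
    activations = []
    for start in range(n_words - width + 1):
        total = bias
        for sentence in channels:
            for k in range(width):
                total += sum(w * x for w, x in zip(filt[k], sentence[start + k]))
        activations.append(max(0.0, total))  # ReLU
    return max(activations)  # max-over-time pooling


def sentence_representation(channels, filters):
    """One pooled feature per filter, so the length is independent of sentence length."""
    return [conv_max_pool(channels, f) for f in filters]


# Toy 4-word sentence with 2-dimensional vectors in each channel.
pretrained = [[1, 0], [0, 1], [1, 1], [0, 0]]
random_init = [[0, 0], [1, 0], [0, 2], [1, 1]]
bigram_filter = [[1, 0], [0, 1]]
trigram_filter = [[1, 1], [1, 1], [1, 1]]

features = sentence_representation(
    [pretrained, random_init], [bigram_filter, trigram_filter]
)
print(features)
```

The pooled features would then be fed to the feed-forward output layer.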
3) Modifications to the above models: In our case, we use
fastText vectors [29] to represent words in DAN as well as
in initializing the CNN model. Additionally, for words that
do not have a vector in the pretrained embedding matrix,
we use fastText’s approximation by summing the unknown
word’s sub-word vectors (3-6 character n-grams) to represent
the word. Since we are trying to estimate the entire fuzzy set
instead of a single class, we use the sigmoid function in the
last layer of both these models, and return a 3-dimensional
vector comprising the membership values of the input for
each risk-level.
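The two modifications can be sketched as follows. This is a simplified illustration, not the fastText implementation: the `<` and `>` boundary markers follow the description in [29], but the dictionary `subword_vectors` is a hypothetical stand-in for fastText's hashed n-gram table, and the function names are our own.

```python
import math


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def oov_vector(word, subword_vectors, dim):
    """Approximate an unknown word by summing the vectors of its known n-grams."""
    vector = [0.0] * dim
    for gram in char_ngrams(word):
        for j, value in enumerate(subword_vectors.get(gram, [0.0] * dim)):
            vector[j] += value
    return vector


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def membership_output(logits):
    """Independent sigmoid per risk level; the values need not sum to 1."""
    return [sigmoid(z) for z in logits]


# Hypothetical subword table with two known n-grams.
table = {"<ko": [1.0, 2.0], "ol>": [0.5, 0.5]}
print(char_ngrams("kool"))
print(oov_vector("kool", table, dim=2))
print(membership_output([0.0, 2.0, -2.0]))
```

Using a sigmoid per output, rather than a softmax, lets a line have high membership in more than one risk level at once, which is what the fuzzy targets require.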
B. Loss Function
To train our networks, we use the L1 loss function to
compare our output vector with the ground truth fuzzy
set. The L1 loss function computes the sum of absolute
differences between two vectors. It is also referred to as
Least Absolute Deviations, and is robust when it encounters
outliers. Our models are trained by jointly minimizing the
following function, for all classes (low, medium, high):
$$
\mathcal{L}(l) = \sum_{i \in C} \left| \mu_i(l) - \hat{y}_i \right|
$$
where $\mu_i(l)$ and $\hat{y}_i$ are the ground truth and the estimated
value of the membership for class $i$ of the given line, and
$i \in C = \{low, medium, high\}$.
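As a small worked example of the loss described above, the sum of absolute differences for one line can be computed directly. The function name and the toy vectors are illustrative assumptions.

```python
def l1_loss(truth, estimate):
    """Sum of absolute differences between two membership vectors."""
    return sum(abs(t - y) for t, y in zip(truth, estimate))


# Order of entries: [low, medium, high]
truth = [0.75, 0.0, 1.0]      # ground truth fuzzy set for one line
estimate = [0.5, 0.25, 0.75]  # hypothetical model output
print(l1_loss(truth, estimate))
```

Each class contributes 0.25 here, so the loss for this line is 0.75.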
C. Evaluation
The ground truth for risk of a given line is a fuzzy set
with membership values of the line for three classes, and
our proposed baselines produce a 3-dimensional vector by
training using the input chat message and ground truth.
Since these are not singular, precise values as observed in
binary/multinomial classification, metrics such as accuracy,
precision, F1 scores are not suitable to evaluate our results.
Thus, we use a fuzzy metric to calculate how similar the
output of the neural network is to the truth fuzzy set of the
line. In this case, we use a Fuzzy Jaccard Similarity metric
[30], where fuzzified versions of the union, intersection, and
cardinality are used. Mathematically, the Jaccard Similarity
between two sets, $A$ and $B$, is given by:
$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$
In fuzzy operations [16] between two fuzzy sets $A$ and $B$
with membership $\mu_i$, denoting the set’s membership degree
for the $i$th class, their intersection, $A \cap B$, is given by
$\min\{\mu_i(A), \mu_i(B)\}$ for each $i \in C$, and their union, $A \cup B$, is given by
$\max\{\mu_i(A), \mu_i(B)\}$ for each $i \in C$. Finally, the cardinality of a fuzzy
set, say $Q$, denoted by $|Q|$, is given by $\sum_{i \in C} \mu_i(Q)$. Thus,
the Fuzzy Jaccard Similarity between fuzzy sets $A$ and $B$
is computed as follows:
$$
J_{Fuzzy}(A, B) = \frac{\sum_{i \in C} \min\{\mu_i(A), \mu_i(B)\}}{\sum_{i \in C} \max\{\mu_i(A), \mu_i(B)\}}
$$
Fig. 5: Illustration of the truth (bottom) membership values, and values estimated by the CNN model (top) on randomly
selected chunks in the test set. Low: Blue, Medium: Yellow, High: Red
For our purposes, we calculate this metric for each line, and
report the average Fuzzy-Jaccard similarity over the entire set
(training, validation, or test). It is bounded by 0 and 1, and the
closer the output produced by the model is to the true fuzzy set
for a given line, the greater the value of this metric. Thus,
it serves as a good evaluation metric to compare different
models over a large collection of lines, as in our case.
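The metric is short enough to sketch directly. The code below is an illustration under assumptions: the function names are hypothetical, and returning 1.0 when both sets are entirely zero is our own convention for the undefined 0/0 case, which the paper does not discuss.

```python
def fuzzy_jaccard(a, b):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    intersection = sum(min(x, y) for x, y in zip(a, b))
    union = sum(max(x, y) for x, y in zip(a, b))
    if union == 0:
        return 1.0  # assumption: two all-zero sets are treated as identical
    return intersection / union


def mean_fuzzy_jaccard(truths, estimates):
    """Average per-line similarity over a whole set of lines."""
    scores = [fuzzy_jaccard(t, e) for t, e in zip(truths, estimates)]
    return sum(scores) / len(scores)


# Order of entries: [low, medium, high]; the estimates are made-up model outputs.
truths = [[0.75, 0.0, 1.0], [1.0, 0.0, 0.0]]
estimates = [[0.5, 0.25, 0.75], [1.0, 0.0, 0.0]]
print(fuzzy_jaccard(truths[0], estimates[0]))
print(mean_fuzzy_jaccard(truths, estimates))
```

For the first line the intersection sums to 1.25 and the union to 2.0, giving a similarity of 0.625. The second line is predicted exactly and scores 1.0.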
D. Experiments and Results
We split the final corpus into train, validation, and test
sets. Out of the 8 different conversations, we randomly select
two whole conversations to be our validation and test sets
respectively. We select conversations instead of chunks of
lines or randomly sampled lines because during our analysis,
we can explore and probe the model based on an entire
meaningful unit, in this case a conversation, rather than
assess how it performed on isolated lines. Table I describes
the number of lines in each of the sets.
Table I: Number of conversations and lines in each set
Set      Conversations   Size (lines)
Train    6               11900
Valid.   1               977
Test     1               771
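Holding out whole conversations, rather than individual lines, can be sketched as below. The helper name, the seed, and the toy data are hypothetical; the paper states only that two whole conversations were selected at random for validation and testing.

```python
import random


def split_by_conversation(conversations, seed=0):
    """Hold out one whole conversation for validation and one for testing.

    conversations: dict mapping a conversation id to its list of lines.
    """
    ids = sorted(conversations)
    random.Random(seed).shuffle(ids)
    valid_id, test_id, *train_ids = ids
    train = [line for cid in train_ids for line in conversations[cid]]
    return train, conversations[valid_id], conversations[test_id]


# Toy stand-in for eight annotated chats (ids and contents are made up).
chats = {
    f"chat_{k}": [f"chat_{k} line {j}" for j in range(10 + k)] for k in range(8)
}
train, valid, test = split_by_conversation(chats)
print(len(train), len(valid), len(test))
```

Splitting at the conversation level keeps the transition ramps of a chat within a single set, so no held-out line shares context with a training line.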
We train our DAN and CNN models with a dropout of 0.5
after the embedding layer, a learning rate of $1 \times 10^{-4}$, and
the Adam optimizer. The DAN model was trained for
1000 epochs with a minibatch size of 50, while the CNN
model was trained for 100 epochs with a minibatch size of
32. The best model was chosen based on the highest average
Fuzzy-Jaccard Similarity on the validation set. The results of
both models are shown in Table II.
Table II: Results of both models
Model   Epochs   J_Fuzzy
CNN     100      0.455
DAN     1000     0.380
Figure 5 shows excerpts from two of the chats in the
test set. While the ground truth and the model’s prediction
look somewhat different, the model’s output is plausible because
the various grooming stages are not independent [4], [6].
Elements of each of the grooming stages feed into the next.
For instance, the exclusivity stage is focused around trust
and isolation which is used to transition the child into the
sexual stage [4]. So, it is feasible for a high risk line to
contain elements of low and medium risk because the high
risk sexual or meeting information builds upon the low and
medium risk information.
The estimated labels in Figure 5B demonstrate the pro-
gressive decrease and increase of sexual topics throughout
the chat. As the high risk level decreases, the low risk
and medium risk related to the non-sexual aspects of the
grooming level increase. This is consistent with previous
research which indicates grooming is non-linear and grad-
ually progresses between stages [4], [6], [17]. The estimated
labels in Figure 5A are also consistent with the ebb and
flow of the grooming process and also accurately capture
the sexual nature of the chat from lines 730-745 as well as
the variability of the last 5 lines.
In this work, our aim was to estimate risk of sexual
predation within online chat conversations. We deviated from
the classical notion of crisp sets and instead used fuzzy
membership functions to quantify the amount of risk present
in each chat message. In our framework, a given chat
message can belong to low, medium and high risk with
varying membership degrees. We experimented on eight online
conversations between predators and decoys by using a
fuzzy membership function to label the amount of risk within
each line. These labels were used to train two neural network models
that were tested on a new conversation. While the models
achieved moderate values of the Fuzzy-Jaccard similarity, the
patterns of the various risks as produced from the model
on a new conversation are consistent with existing literature
on grooming processes. Leveraging contextual cues and the
relationships between the various message snippets can help
these models perform better.
This research was partially supported by Purdue Research
[1] “Online Enticement of Children: An In-Depth Analysis of
CyberTipLine Reports,” National Center for Missing and
Exploited Children, 2017.
[2] H. C. Whittle, C. Hamilton-Giachritsis, and A. R. Beech,
“Victims’ voices: The impact of online grooming
and sexual abuse,” Universal Journal of Psychology, vol. 1,
no. 2, pp. 59–71, 2013.
[3] M. M. Chiu, K. C. Seigfried-Spellar, and T. R. Ringenberg,
“Exploring detection of contact vs. fantasy online sexual of-
fenders in chats with minors: Statistical discourse analysis of
self-disclosure and emotion words,” Child Abuse & Neglect,
vol. 81, pp. 128–138, 2018.
[4] R. O’Connell, “A typology of child cybersexploitation and
online grooming practices,” Preston, UK: University of Cen-
tral Lancashire, 2003.
[5] L. N. Olson, J. L. Daggs, B. L. Ellevold, and T. Rogers,
“Entrapping the innocent: Toward a theory of child sexual
predators’ luring communication,” Communication Theory,
vol. 17, no. 3, pp. 231–251, 2007.
[6] P. J. Black, M. Wollis, M. Woodworth, and J. T. Hancock,
“A linguistic analysis of grooming strategies of online child
sex offenders: Implications for our understanding of preda-
tory sexual behavior in an increasingly computer-mediated
world,” Child Abuse & Neglect, vol. 44, pp. 140–149, 2015.
[7] K. M. Babchishin, R. K. Hanson, and H. VanZuylen, “Online
child pornography offenders are different: A meta-analysis of
the characteristics of online and offline sex offenders against
children,” Archives of Sexual Behavior, vol. 44, no. 1, pp. 45–
66, 2015.
[8] J. Parapar, D. E. Losada, and A. Barreiro, “A learning-based
approach for the identification of sexual predators in chat
logs,” in CLEF, vol. 1178, 2012.
[9] M. Ebrahimi, C. Y. Suen, and O. Ormandjieva, “Detecting
predatory conversations in social media by deep convo-
lutional neural networks,” Digital Investigation, vol. 18,
pp. 33–49, 2016.
[10] C. Barber and S. Bettez, “Deconstructing the online groom-
ing of youth: Toward improved information systems for
detection of online sexual predators,” 2014.
[11] D. Bogdanova, P. Rosso, and T. Solorio, “Exploring high-
level features for detecting cyberpedophilia,” Computer
Speech & Language, vol. 28, no. 1, pp. 108–120, 2014.
[12] P. S. Bayerl and K. I. Paul, “What determines inter-coder
agreement in manual annotations? a meta-analytic investiga-
tion,” Computational Linguistics, vol. 37, no. 4, pp. 699–725,
[13] J. A. Kloess, S. Seymour-Smith, C. E. Hamilton-Giachritsis,
M. L. Long, D. Shipley, and A. R. Beech, “A qualitative
analysis of offenders’ modus operandi in sexually exploita-
tive interactions with children online,” Sexual Abuse, vol. 29,
no. 6, pp. 563–591, 2017.
[14] P. Kingsbury, M. Palmer, and M. Marcus, “Adding seman-
tic annotation to the penn treebank,” in Proceedings of
the human language technology conference, Citeseer, 2002,
pp. 252–256.
[15] M. Bada, M. Eckert, D. Evans, K. Garcia, K. Shipley, D.
Sitnikov, W. A. Baumgartner, K. B. Cohen, K. Verspoor,
J. A. Blake, et al., “Concept annotation in the craft corpus,”
BMC bioinformatics, vol. 13, no. 1, p. 161, 2012.
[16] L. A. Zadeh, “Fuzzy sets,” Information and control, vol. 8,
no. 3, pp. 338–353, 1965.
[17] A. Gupta, P. Kumaraguru, and A. Sureka, “Characterizing
pedophile conversations on the internet using online groom-
ing,” arXiv preprint arXiv:1208.4324, 2012.
[18] J. J. Peters, “Children who are victims of sexual assault
and the psychology of offenders,” American Journal of
Psychotherapy, vol. 30, no. 3, pp. 398–421, 1976.
[19] R. A. Lang and R. R. Frenzel, “How sex offenders lure
children,” Annals of Sex Research, vol. 1, no. 2, pp. 303–317,
[20] G. M. Winters, L. E. Kaylor, and E. L. Jeglic, “Sexual
offenders contacting children online: An examination of tran-
scripts of sexual grooming,” Journal of Sexual Aggression,
vol. 23, no. 1, pp. 62–76, 2017.
[21] I. McGhee, J. Bayzick, A. Kontostathis, L. Edwards, A.
McBride, and E. Jakubowski, “Learning to identify internet
sexual predation,” International Journal of Electronic Commerce,
vol. 15, no. 3, pp. 103–122, 2011.
[22] A. Cano Basave, M. Fernández, and H. Alani, “Detecting
child grooming behaviour patterns on social media,” 2014.
[23] D. Michalopoulos and I. Mavridis, “Utilizing document
classification for grooming attack recognition,” in 2011 IEEE
Symposium on Computers and Communications (ISCC),
IEEE, 2011, pp. 864–869.
[24] Y. R. Tausczik and J. W. Pennebaker, “The psychological
meaning of words: LIWC and computerized text analysis
methods,” Journal of Language and Social Psychology,
vol. 29, no. 1, pp. 24–54, 2010.
[25] K. Guice, Predators, decoys, and teens: A corpus analysis
of online language. Hofstra University, 2016.
[26] M. Drouin, R. L. Boyd, J. T. Hancock, and A. James,
“Linguistic analysis of chat transcripts from child predator
undercover sex stings,” The Journal of Forensic Psychiatry
& Psychology, vol. 28, no. 4, pp. 437–457, 2017.
[27] M. Iyyer, V. Manjunatha, J. Boyd-Graber, and H. Daumé
III, “Deep unordered composition rivals syntactic methods
for text classification,” in Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguistics
and the 7th International Joint Conference on Natural Lan-
guage Processing (Volume 1: Long Papers), vol. 1, 2015,
pp. 1681–1691.
[28] Y. Kim, “Convolutional neural networks for sentence
classification,” in Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP), 2014,
pp. 1746–1751.
[29] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov,
“Enriching word vectors with subword information,” Transactions
of the Association for Computational Linguistics, vol. 5,
pp. 135–146, 2017.
[30] V. Zhelezniak, A. Savkov, A. Shen, F. Moramarco, J. Flann,
and N. Y. Hammerla, “Don’t settle for average, go for the
max: Fuzzy sets and max-pooled word vectors,” in
International Conference on Learning Representations, 2019.