Detection of Dangerous Interactions in Online Chat Services

Journal of Transformation of Human Behavior under the Influence of the Infosocionomics Society, Vol. 2, 2017
Yuichi Hirano (a), Fujio Toriumi (a), Masanori Takano (b), Kazuya Wada (b), and Ichiro Fukuda (b)
(a) The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan
(b) CyberAgent, Inc., Chiyoda-ku, Tokyo, Japan
Abstract
More minors are communicating online with others due to the increasing popularity of smartphones and SNSs. Although such online
communication services are very convenient, for example by allowing interaction among friends in distant places, they also
expose minors to the risks of cyberbullying and sexual predators. Therefore, a technique that estimates whether users are
minors is essential for protecting them from online risks. In this paper, we clarify the forms of interaction in online chat
services by estimating users' age groups from chat data that do not include age information and then extracting the interactions
between minors and the other age groups. We found that, in online chat services, minors were interacting with adults in the age
range that accounts for most sexual predators, and that these interactions are more dangerous for minors than interactions with
their peers.
Keywords: protection of minors, social media, online communication, luring minors, age prediction
1. Introduction
The Internet is a very useful tool that allows users worldwide to easily obtain information and interact with others online.
Communication through the Internet has become indispensable for social life, especially with the spread of mobile phones,
smartphones, e-mail, SNSs, and chat applications. Such communication is ubiquitous not only among adults but also among
minors. However, online communication poses risks to minors, such as cyberbullying [1], Internet harassment [2], and sex crimes
[3].
To deal with such problems, Toriumi et al. [4] identified potential victims and predator targets using information such as the
number of transmitted and received words and characters, but without user age information. Wolak et al. [3] concluded that in many
cases of Internet-initiated sex crimes, the victims were minors and the predators were adults over 25. Therefore, age information
must be considered to uncover both potential victims and predators. However, in many online services, the true ages of users are
not known. For example, because registration for some services does not strictly verify user age, users can register with a fake
age. To protect minors from the risks of online interaction, such as sex crimes, it is vital to estimate user ages and to detect
online communication between adults and minors. In this paper, we acquire age information by estimating it from the texts posted
by users, detect interactions between minors and adults, and show that the interactions of minors with adults are more dangerous
than those with their peers.
Specifically, we first created age-group labels as training data for the users whose ages can be determined from their posted
texts. We then trained an age-group classifier with a machine learning algorithm, applied it to the users whose ages could not be
determined, and analyzed the frequencies of both received comments and dangerous comments from adults that might lead to offline
encounters.
2. Datasets
2.1. Communication service 755
In this study, we use a communication service called 755 (https://7gogo.jp/), provided by 7gogo Inc. in Japan. In 755, users can
create chatrooms, called "talk," with their friends and interact with other members through the talk. Non-member users can also
read the talk in a chatroom and send responses, called yajiuma (onlookers) comments, to the chatroom members.
In the 755 service, the founders of a group of friends generally participate in chatrooms and invite users who post yajiuma
comments to communicate with them. We assume that the users who participate in the same chatroom are friends. In some cases,
users post personal information, such as user IDs for Twitter and other communication services and e-mail addresses, in yajiuma
comments or talks. Such posting of personal information is often prohibited by service providers. Because leaked personal
information can be exploited by sexual predators, such comments are deleted immediately and labeled as NG comments by the
site's managers.
2.2. Data
In this paper, we use talk and yajiuma comment data with NG comment flags. The following are the details of these data:
  • Users: 212,709 (the number of users who posted at least one sentence on talk between December 1, 2014 and March 27, 2015;
    the data include the sentences they posted on talk during this period)
  • Chatrooms: 481,473 (the number of talks to which users posted between December 1, 2014 and January 26, 2015)
  • Comments: 10,000,000 (the number of yajiuma comments posted by users between December 1, 2014 and February 7, 2015)
  • NG comments: 20,990 (the number of NG comments posted by users between December 1, 2014 and February 7, 2015)
3. Age prediction
3.1. Method of age prediction
In 755, although a user's age is not known directly, it can sometimes be inferred from posted sentences. One method for
estimating a user's age is to locate the phrase "happy y-th birthday" sent by friends [5]. However, in 755, it is difficult to
determine ages from birthday messages for the following reasons. First, since users cannot send a message directly to a specific
individual, determining the target of a birthday message is difficult. In addition, in Japan, people generally do not mention age
when celebrating birthdays. Therefore, in this research, we simply use the following posted sentence to determine user ages:
"I am x years old."
After determining the ages of users, we assigned each user to one of the following three categories:
  • Junior/high school students (minors): 12-17 years old
  • College students: 18-24 years old
  • Adults: 25-50 years old
These categories are based on age studies in Japan. As a result, for the period from December 1, 2014 to March 31, 2015, we
obtained the following age-determined users (Table 1).
Table 1 Age distribution
Age group          Minors   College students   Adults   Total
Number of users    2,344    2,249              907      5,500
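The labeling rule of Section 3.1 can be sketched as follows. This is a minimal illustration and not the implementation used in the study: the English pattern "I am x years old" stands in for the phrase actually matched in the Japanese-language service, and the names `AGE_PATTERN`, `extract_age`, and `age_group` are hypothetical.

```python
import re
from typing import Optional

# Assumed English stand-in for the self-declaration phrase "I am x years old".
AGE_PATTERN = re.compile(r"\bI am (\d{1,2}) years old\b", re.IGNORECASE)


def extract_age(text: str) -> Optional[int]:
    """Return the first self-declared age found in a post, or None."""
    match = AGE_PATTERN.search(text)
    return int(match.group(1)) if match else None


def age_group(age: int) -> Optional[str]:
    """Map an age to the three categories of Section 3.1 (None if out of range)."""
    if 12 <= age <= 17:
        return "minor"
    if 18 <= age <= 24:
        return "college student"
    if 25 <= age <= 50:
        return "adult"
    return None


# Hypothetical posts used only to exercise the two functions.
posts = [
    "Hello! I am 15 years old.",
    "I am 30 years old and like music.",
    "Nice weather today.",
]
labels = []
for post in posts:
    age = extract_age(post)
    labels.append(age_group(age) if age is not None else None)
print(labels)
```

Users whose posts contain no such declaration receive no label here; in the study these are the non-age-determined users that the classifier of Section 3.2 handles.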
3.2. Results of age prediction
In this study, we use the TF-IDF of the sentence data and logistic regression to classify the age groups [6][7]. First, we created
a bag-of-words vector $d_i$ for each user $i$ by extracting only nouns, excluding numerals, from all of that user's posted texts.
From the bag-of-words vectors of all the users, we calculated TF-IDF matrix $X$. Each element $x_{ij}$ of TF-IDF matrix $X$ is
calculated as follows:

$$\mathrm{tf}_{ij} = (\text{number of times word } j \text{ appears in } d_i)$$

$$\mathrm{idf}_{j} = \log \frac{N}{\left|\{\, i : \text{word } j \text{ appears in } d_i \,\}\right|}$$

$$x_{ij} = \mathrm{tf}_{ij} \cdot \mathrm{idf}_{j}$$

where $N$ is the number of users. From TF-IDF matrix $X$, we defined each user's TF-IDF vector $x_i$:

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{iM})$$

where $M$ is the number of distinct words.
Next, for each user $i$, age-group $j$ (here $j$ indexes age groups rather than words) was classified using multinomial logistic
regression based on TF-IDF vector $x_i$ and an age-group label $y_i$:

$$\hat{y}_i = \arg\max_{j} P(y_i = j \mid x_i)$$

s.t.

$$P(y_i = j \mid x_i) = \frac{\exp(w_j^{\top} x_i)}{\sum_{c} \exp(w_c^{\top} x_i)}$$

where weight $w$ is calculated from the training data as follows:

$$w = \arg\min_{w} \left( - \sum_{i} \log P(y_i \mid x_i) \right)$$

s.t.

$$\sum_{c} \lVert w_c \rVert_1 \le t$$

where $t$ is a constant that bounds the L1 norm of the weights.
We employed Lasso logistic regression [8] to avoid overfitting.
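As a rough illustration of the feature construction and the classification rule, the sketch below computes TF-IDF vectors and softmax class probabilities in plain Python. It assumes the common definition tf × log(N/df). The toy word lists, the weight values, and the function names `tfidf_vectors` and `softmax_probs` are hypothetical, and the Lasso-regularized training step is not shown.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute one TF-IDF vector (word -> weight) per user from lists of nouns."""
    n_docs = len(docs)
    # Document frequency: number of users whose bag of words contains the word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each word for this user
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors


def softmax_probs(x: Dict[str, float],
                  weights: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Multinomial logistic regression: P(group | x) from per-group weights."""
    scores = {
        group: sum(w.get(word, 0.0) * value for word, value in x.items())
        for group, w in weights.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {group: math.exp(s - top) for group, s in scores.items()}
    total = sum(exps.values())
    return {group: e / total for group, e in exps.items()}


# Toy data: nouns per user (hypothetical).
docs = [
    ["school", "exam", "club"],
    ["lecture", "exam", "job"],
    ["office", "job", "meeting"],
]
vectors = tfidf_vectors(docs)

# Hypothetical sparse weights of the kind an L1-regularized fit tends to produce.
weights = {
    "minor": {"school": 1.5, "club": 1.0},
    "college student": {"lecture": 1.5},
    "adult": {"office": 1.5, "meeting": 1.0},
}
probs = softmax_probs(vectors[0], weights)
predicted = max(probs, key=probs.get)
print(predicted)
```

Most weights are absent (zero) in this toy example, which mirrors the sparsity that L1 regularization encourages.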
Table 2 Results of age prediction
Age group          Recall   F1-score   Amount of data
Minors             0.90     0.90       237
College students   0.85     0.84       222
Adults             0.70     0.75       91
Average            0.85     0.85       550
Of the 5,500 age-determined users, we used 4,950 as training data and 550 as test data and computed the precision and recall
scores; Table 2 reports recall and F1-score. The results show that age groups were estimated accurately, especially for minors.
In fact, the average F1-score obtained by our method (0.85) is higher than that reported by Zhang et al. [5] (0.8138; Table 3).
Table 3 Results of Zhang et al.'s study
           Precision   Recall   F1-score
Average    0.8069      0.8349   0.8138
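The Average row of Table 2 is consistent with an average weighted by the number of test users in each group rather than an unweighted mean. The check below is only an arithmetic sketch over the reported values. The weighting scheme is inferred from the numbers and is not stated in the text.

```python
# Per-group recall, F1-score, and test-set size as reported in Table 2.
table2 = {
    "Minors": (0.90, 0.90, 237),
    "College students": (0.85, 0.84, 222),
    "Adults": (0.70, 0.75, 91),
}

total = sum(n for _, _, n in table2.values())

# Average weighted by the number of test users in each group.
weighted_recall = sum(r * n for r, _, n in table2.values()) / total
weighted_f1 = sum(f * n for _, f, n in table2.values()) / total

# Unweighted mean over the three groups, for comparison.
macro_recall = sum(r for r, _, _ in table2.values()) / len(table2)

print(total, round(weighted_recall, 2), round(weighted_f1, 2), round(macro_recall, 2))
```

The weighted values round to the 0.85 shown in the Average row, whereas the unweighted recall would round to 0.82.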
4. Communication among age groups
4.1. Chatroom classification
In this section, we apply this age-group classification to all non-age-determined users and examine the age distribution for each
chatroom. First, for each chatroom, the classifier estimated the age group of its participants. We classified the ages of 212,709
users (Table 4).
Table 4 Age distribution of users
Age group          Minors    College students   Adults
Number of users    71,831    118,321            21,557
Next, we classified the rooms with two or more age-estimated users into seven groups by the rule shown in Table 5.
Table 5 Age classes of chatrooms
Chatroom group   Features                                       Percentage of minors   Percentage of college students   Percentage of adults
0                Minors are the only majority                   Over 40%               Under 30%                        Under 30%
1                College students are the only majority         Under 30%              Over 40%                         Under 30%
2                Adults are the only majority                   Under 30%              Under 30%                        Over 40%
3                Minors and college students are the majority   Over 40%               Over 40%                         Under 20%
4                College students and adults are the majority   Under 20%              Over 40%                         Over 40%
5                Adults and minors are the majority             Over 40%               Under 20%                        Over 40%
6                Others                                         Others
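The rule in Table 5 can be written as a small function. This is a sketch under the assumption that "Over" and "Under" are strict inequalities and that any chatroom matching none of groups 0 to 5 falls into group 6. The function name `classify_chatroom` and the dictionary keys are hypothetical.

```python
from typing import Dict

GROUPS = ("minor", "college", "adult")


def classify_chatroom(counts: Dict[str, int]) -> int:
    """Return the chatroom group (0-6) of Table 5 from per-age-group user counts."""
    total = sum(counts.get(g, 0) for g in GROUPS)
    if total == 0:
        return 6
    # Percentages of minors, college students, and adults in the chatroom.
    m, c, a = (100.0 * counts.get(g, 0) / total for g in GROUPS)
    if m > 40 and c < 30 and a < 30:
        return 0  # minors are the only majority
    if c > 40 and m < 30 and a < 30:
        return 1  # college students are the only majority
    if a > 40 and m < 30 and c < 30:
        return 2  # adults are the only majority
    if m > 40 and c > 40 and a < 20:
        return 3  # minors and college students
    if c > 40 and a > 40 and m < 20:
        return 4  # college students and adults
    if m > 40 and a > 40 and c < 20:
        return 5  # adults and minors
    return 6      # others


print(classify_chatroom({"minor": 3, "college": 1}))
print(classify_chatroom({"minor": 1, "adult": 1}))
print(classify_chatroom({"minor": 1, "college": 1, "adult": 1}))
```

A chatroom with one minor and one adult lands in group 5, while an even three-way split matches no rule and is counted as "Others".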
As a result, 44,970 chatrooms were classified into age groups. The classification results are shown in Table 6.
Table 6 Percentage of chatroom groups
Chatroom group   Features                                       Number of chatrooms   Percentage
0                Minors are the only majority                   7,291                 16.2%
1                College students are the only majority         11,439                25.4%
2                Adults are the only majority                   916                   2.0%
3                Minors and college students are the majority   14,486                32.2%
4                College students and adults are the majority   4,261                 9.5%
5                Adults and minors are the majority             3,044                 6.8%
6                Others                                         3,533                 7.9%
For comparison, we built a null model of interaction among age groups by the following simulation. First, we derived the
probability distribution of a user's age group from the ratio of the numbers of users in Table 4. Next, we randomly assigned an
age group to each user in each chatroom according to that probability distribution. Then we classified each chatroom by the age
distribution of its users and the rules in Table 5. If chatrooms were formed at random with respect to age, the observed numbers
of chatrooms in each group should resemble the simulation results.
Table 7 shows the result of our simulation. By comparing it with the result in Table 6, we confirmed statistically significant
differences for groups 0, 1, 2, 4, and 5 in one-sided hypothesis testing at the 1% level. Naturally, interaction is prominent
among generations whose ages are close, such as minors and college students or college students and adults, because they share
interests and tastes. On the other hand, adults and minors who meet in chatrooms composed of both adults and minors are less
likely to become friends offline, so such communication is considered unique to online settings. Also, because sex crimes arising
from such encounters are relatively common, these chatrooms warrant special attention to protect minors. From this viewpoint, we
confirmed that chatrooms with many adults and minors (group 5) are significantly more numerous in the actual data than in the
comparative simulation. More potentially dangerous communication than expected therefore occurs in 755 chatrooms.
Table 7 Expected numbers of chatrooms from age distributions
Chatroom group   Features                                       Number of chatrooms   Percentage
0                Minors are the only majority                   4,839                 10.8% *
1                College students are the only majority         14,135                31.4% *
2                Adults are the only majority                   460                   1.0% *
3                Minors and college students are the majority   15,105                33.6%
4                College students and adults are the majority   4,313                 9.6% *
5                Adults and minors are the majority             2,484                 5.5% *
6                Others                                         3,634                 8.1%
* Statistically significant difference from Table 6 (one-sided test, 1% level)
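The null-model simulation can be sketched as follows. The age-group weights come from Table 4 and the classification rule from Table 5. The chatroom sizes, the random seed, and the function names `classify` and `simulate` are hypothetical, since the size distribution of the actual chatrooms is not given in the text. The sketch therefore shows the procedure only and does not reproduce the counts in Table 7.

```python
import random
from collections import Counter
from typing import Iterable

# Age-group counts from Table 4, used as sampling weights.
AGE_COUNTS = {"minor": 71831, "college": 118321, "adult": 21557}


def classify(m: float, c: float, a: float) -> int:
    """Table 5 rule on the percentages of minors, college students, and adults."""
    if m > 40 and c < 30 and a < 30:
        return 0
    if c > 40 and m < 30 and a < 30:
        return 1
    if a > 40 and m < 30 and c < 30:
        return 2
    if m > 40 and c > 40 and a < 20:
        return 3
    if c > 40 and a > 40 and m < 20:
        return 4
    if m > 40 and a > 40 and c < 20:
        return 5
    return 6


def simulate(room_sizes: Iterable[int], seed: int = 0) -> Counter:
    """Assign random age groups to each chatroom's users and count chatroom groups."""
    rng = random.Random(seed)
    groups = list(AGE_COUNTS)
    weights = list(AGE_COUNTS.values())
    result = Counter()
    for size in room_sizes:
        # Draw an age group for every user in the chatroom independently.
        members = Counter(rng.choices(groups, weights=weights, k=size))
        percentages = [100.0 * members[g] / size for g in groups]
        result[classify(*percentages)] += 1
    return result


# Hypothetical sizes: 1,000 chatrooms with two to five age-estimated users each.
room_sizes = [2, 3, 4, 5] * 250
expected = simulate(room_sizes)
print(sorted(expected.items()))
```

Comparing such simulated counts with the observed counts per group is what allows deviations from random mixing to be tested.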
4.2. Relationship between chatroom groups and comments
To determine how users communicate in each chatroom group, we analyzed the relationship among chatroom groups, comments, and NG
comments. Although NG comments are deleted by the service provider because they include personal information such as e-mail
addresses, chatrooms in which many of them are posted deserve attention because such communication carries an intrinsic risk of
luring. Table 8 and Fig. 1 show, for each chatroom group, the average number of comments per chatroom, the number of chatrooms
in which NG comments were written, and the proportion of chatrooms in which NG comments were written.
Table 8 Relationship between chatroom group and comments
Chatroom group   Features                                       Number of comments   Average number of comments per chatroom   Chatrooms with NG comments   Chatrooms with NG comments (%)
0                Minors are the only majority                   84,542               11.6                                      139                          1.91%
1                College students are the only majority         152,082              13.3                                      217                          1.90%
2                Adults are the only majority                   10,629               11.6                                      11                           1.20%
3                Minors and college students are the majority   216,150              14.9                                      313                          2.16%
4                College students and adults are the majority   85,693               20.1                                      85                           1.99%
5                Adults and minors are the majority             64,114               21.1                                      77                           2.53%
6                Others                                         17,417               4.9                                       43                           1.22%
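The derived columns of Table 8 follow from the chatroom counts in Table 6. The short sketch below recomputes the average number of comments per chatroom and the percentage of chatrooms with NG comments from the reported counts. It is an arithmetic check only and does not reproduce the significance tests.

```python
# Per group: (number of chatrooms from Table 6,
#             number of comments and chatrooms with NG comments from Table 8).
data = {
    0: (7291, 84542, 139),
    1: (11439, 152082, 217),
    2: (916, 10629, 11),
    3: (14486, 216150, 313),
    4: (4261, 85693, 85),
    5: (3044, 64114, 77),
    6: (3533, 17417, 43),
}

summary = {}
for group, (rooms, comments, ng_rooms) in data.items():
    avg_comments = round(comments / rooms, 1)      # comments per chatroom
    ng_rate = round(100.0 * ng_rooms / rooms, 2)   # % of chatrooms with NG comments
    summary[group] = (avg_comments, ng_rate)
    print(f"group {group}: {avg_comments:.1f} comments per chatroom, "
          f"{ng_rate:.2f}% with NG comments")

# Group with the highest share of chatrooms containing NG comments.
riskiest = max(summary, key=lambda g: summary[g][1])
print(riskiest)
```

Group 5 (adults and minors) has both the highest average number of comments and the highest share of chatrooms with NG comments.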
Fig. 1 Percentage of chatrooms in which NG comments were written
We found that the average number of comments is large in groups 4 and 5. In one-sided hypothesis testing at the 5% level,
chatrooms in which minors or college students interact with adults have significantly more comments than chatrooms dominated by
one age group. Comments are thus posted more frequently in mixed-generation chatrooms than in those dominated by a particular
age group.
Next, we focused on NG comments. According to t-tests, group 5, which has a majority of adults and minors, has a significantly
higher ratio of chatrooms with NG comments than groups 0, 1, 2, and 6 at the 1% level of significance (Fig. 1). Since NG
comments often include personal information that might trigger predatory behaviors, group 5 poses a higher risk than other groups
from the viewpoint of safe communication for minors. This result suggests that, for minors, chatrooms composed of both adults
and minors pose higher risks of involvement in dangerous communication. Therefore, to protect minors, we must concentrate on
chatrooms that are roughly divided between adults and minors.
[Fig. 1: bar chart; x-axis: Group 0 to Group 6; y-axis: Rate, 0.00% to 3.00%; * marks significant differences]
5. Conclusion
We proposed a method to estimate the age groups of users based on their communication in a Japanese communication service and
showed that, for minors, interaction with adults posed higher risks, such as the leak of personal information that might trigger
predatory behaviors. Although obtaining users' age information is difficult in many communication services, many minors become
involved in sex crimes through online encounters. Therefore, properly detecting minors without direct age information is critical
for their protection, and the proposed method contributes to this goal.
Furthermore, by analyzing chat data with age estimation, we found that communication across different age groups is more
dangerous than communication within a single generation. These results will help us identify minors who are at higher risk from
predators in online chat services.
Our future work will investigate the characteristics of chatrooms with many adults and minors. To clarify the details of
communication between adults and minors, we must identify the actual situations of dangerous communication. We also have to
analyze the age distribution of users who posted NG comments in each chatroom group.
References
[1] Belsey, B. "Cyberbullying: An emerging threat to the 'always on' generation." Recuperado el, 5. 2005.
[2] Ybarra, M. L. "Linkages between depressive symptomatology and Internet harassment among young regular Internet users." Cyberpsychology and Behavior 7, no. 2, 247-57. 2004.
[3] Wolak, J., Finkelhor, D., & Mitchell, K. "Internet-initiated sex crimes against minors: Implications for prevention based on findings from a national study." Journal of Adolescent Health, 35(5), 424-e11. 2004.
[4] Toriumi, F., Nakanishi, T., Tashiro, M., & Efuchi, K. "Encounters between Predators and Their Targets in Private Chat." Journal of Transformation of Human Behavior under the Influence of Infosocionomics Society, Vol. 1. 2016.
[5] Zhang, J., et al. "Your Age Is No Secret: Inferring Microbloggers' Ages via Content and Interaction Analysis." Tenth International AAAI Conference on Web and Social Media. 2016.
[6] Genkin, A., Lewis, D. D., & Madigan, D. "Sparse logistic regression for text categorization." DIMACS Working Group on Monitoring Message Streams Project Report. 2005.
[7] Rosenthal, S., & McKeown, K. "Age prediction in blogs: A study of style, content, and online behavior in pre- and post-social media generations." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Vol. 1. Association for Computational Linguistics. 2011.
[8] Sanders, M. A. "Sparse multi-class prediction based on the group lasso in multinomial logistic regression." Diss. TU Delft, Delft University of Technology, 2009.