Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020,
Thessaloniki, Greece.
An Ensemble Model Using N-grams and Statistical Features to Identify Fake News Spreaders on Twitter
Notebook for PAN at CLEF 2020
Jakab Buda, Flora Bolonyai
Eötvös Loránd University, Budapest
bakajb@gmail.com, f.bolonyai@gmail.com
Abstract. In this notebook, we summarize our work process of preparing software for the PAN 2020 Profiling Fake News Spreaders on Twitter task. Our final software was a stacking ensemble classifier of five different machine learning models; four of them use word n-grams as features, while the fifth was based on statistical features extracted from the Twitter feeds. Our software, uploaded to the TIRA platform, achieved an accuracy of 75% in English and 80.5% in Spanish. Our overall accuracy of 77.75% tied for first place in the competition.
1 Introduction
The aim of the PAN 2020 Profiling Fake News Spreaders on Twitter task [12] was
to investigate whether the author of a given Twitter feed is likely to spread fake news.
The training and test sets of the task consisted of English and Spanish Twitter feeds
[13].
We used an ensemble of different machine learning models to provide a prediction for each user. Each of our sub-models treats the Twitter feed of a user as a single unit and estimates the probability that the user is a fake news spreader. For the final predictions, these sub-models are combined using a logistic regression.
In Section 2 we present some related works on profiling fake news spreaders. In
Section 3 we describe our approach in detail together with the extracted features and
models. In Section 4 we present our results. In Section 5 we discuss some potential
future work and in Section 6 we conclude our notebook.
2 Related Works
Using word n-gram variables for author profiling has been shown to be effective
[3, 5, 9, 14, 15, 18], especially with TF-IDF weighting [20]. Identifying fake news
based on such features has been tested earlier [1]. Statistical features, such as the number of punctuation marks [15, 19], medium-specific symbols (for example hashtags and at-signs in tweets, or links in digital texts) [7, 8, 14, 15, 17, 19], emoticons [7, 8, 14, 16, 19] or stylistic features [8], are also commonly used for text classification purposes.
SVMs [3, 5, 9, 14, 15], XGBoost [21], logistic regression [19] and random forest
[2] models are commonly used for author profiling and text classification purposes.
Although the state-of-the-art results for many text classification tasks are achieved
with transformer-based language models [4, 11], these are computationally very
expensive solutions and perform better on tasks where text semantics is more
important. Ghanem et al. proposed an emotionally infused LSTM model to detect
false information in social media and news articles. Their model yielded state-of-the-
art results on three datasets, but it is also computationally expensive [6], so
experimenting with lighter approaches still has practical benefits.
3 Our Approach
3.1 The corpus and the environment setup
3.1.1 The corpus
The corpus for the PAN 2020 Profiling Fake News Spreaders on Twitter task [12]
consists of one English and one Spanish corpus, each containing 300 XML files. Each
of these files contains 100 tweets from an author. Because of the moderate size of the
corpus, we wanted to avoid splitting the corpus into a training and a development set.
Therefore, we used cross-validation techniques to prevent overfitting. As opposed to
earlier editions of the PAN competition, the dataset this year came pre-cleaned: all URLs, hashtags and user mentions in the tweets were replaced with standardized tokens.
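To make the data format concrete, the following minimal sketch (not our actual preprocessing code) shows how such a feed can be read in Python, assuming the usual PAN author-profiling XML layout in which each file stores one author's tweets in <document> elements.

```python
# Minimal sketch (assumed XML layout, not the authors' code): each XML file is
# expected to hold one author's tweets inside <document> elements.
import os
import xml.etree.ElementTree as ET

def read_feed(xml_path):
    """Return the list of tweets stored in one author's XML file."""
    root = ET.parse(xml_path).getroot()
    return [doc.text or "" for doc in root.iter("document")]

def read_corpus(folder):
    """Map author id (file name without extension) to that author's tweets."""
    return {
        os.path.splitext(name)[0]: read_feed(os.path.join(folder, name))
        for name in os.listdir(folder)
        if name.endswith(".xml")
    }
```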
3.1.2 Environment setup
We developed our software using Python (version 3.7). To build our models, we mainly used the following packages: scikit-learn (https://scikit-learn.org/), xgboost (https://xgboost.readthedocs.io/), emoji (https://pypi.org/project/emoji/), lexical-diversity (https://pypi.org/project/lexical-diversity/), pandas (https://pandas.pydata.org/) and numpy (https://numpy.org/). Our code is available on GitHub (https://github.com/pan-webis-de/bolonyai20).
3.2 Our models
3.2.1 N-gram models
We experimented with a number of machine learning models based on word n-grams extracted from the text. Specifically, we investigated the performance of regularized logistic regressions (LR), random forests (RF), XGBoost classifiers (XGB) and linear support vector machines (SVM). For all four models, we ran an extensive grid search combined with five-fold cross-validation to find the optimal text preparation method, vectorization technique and modeling parameters. We tested the same parameters for the English and Spanish data. We investigated two types of text cleaning methods for all models. The first method (M1) removed all non-alphanumeric characters (except #) from the text, while the second method (M2) removed most non-alphanumeric characters (except #) but kept emoticons and emojis.
Both methods transformed the text to lower case. Regarding the vectorization of the
corpus, we experimented with a number of parameters. We tested different word n-
gram ranges (unigrams, bigrams, unigrams and bigrams) and also looked at different
scenarios regarding the minimum overall document frequency of the word n-grams
(3, 4, 5, 6, 7, 8, 9, 10) included as features. Table 1 describes the tested model
hyperparameter values during the training phase of our models.
Table 1: Grid-searched hyperparameters for the machine learning models

Model | Hyperparameter (Python parameter name) | Values
LR | Regularization coefficient (C) | {0.1, 1, 10, 100, 1000}
RF | Number of trees (B) | {100, 300, 400}
RF | Minimum number of cases on each leaf (min_samples_leaf) | {5, 6, 7, 8, 9, 10}
SVM | Regularization coefficient (C) | {1, 10, 100, 1000}
XGB | Learning rate (eta) | {0.01, 0.1, 0.3}
XGB | Number of estimators (n_estimators) | {200, 300}
XGB | Maximum depth of a tree (max_depth) | {3, 4, 5, 6}
XGB | Subsample ratio (subsample) | {0.6, 0.7, 0.8}
XGB | Subsample ratio of columns (colsample_bytree) | {0.5, 0.6, 0.7}
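The following simplified sketch illustrates this grid search for the logistic regression model; the cleaning function is only a rough stand-in for method M1, the TF-IDF vectorizer is used for illustration, and our actual implementation is available in the repository referenced in Section 3.1.2.

```python
# Simplified sketch of the n-gram grid search; the cleaning rule is a rough
# stand-in for method M1 and the grid mirrors Table 1 for the LR model.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

def clean_m1(feed):
    """Concatenate an author's tweets, lower-case them and keep only word
    characters, whitespace and # (an approximation of method M1)."""
    text = " ".join(feed).lower()
    return re.sub(r"[^\w\s#]", " ", text)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (2, 2), (1, 2)],   # uni-, bi-, uni+bigrams
    "tfidf__min_df": list(range(3, 11)),               # minimum global occurrence
    "clf__C": [0.1, 1, 10, 100, 1000],                 # regularization coefficient
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
# docs = [clean_m1(feed) for feed in feeds.values()]   # feeds from read_corpus()
# labels = [...]                                        # 0/1 fake news spreader flags
# search.fit(docs, labels)
```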
For the early bird testing phase conducted through TIRA [10], we simply chose the model and parameter combination with the highest cross-validated accuracy in each language and fitted these models on the entire training set. However, the accuracy of our model was approximately 5 percentage points lower on the test set than in cross-validation (79% vs. 83% for the Spanish dataset and 69% vs. 76% for the English dataset), so we used a different approach during the final testing phase.
The ensemble method we used for the final version of our software (described in
detail in Section 3.2.3) required the best text cleaning and vectorization parameters
and hyperparameters for each model. These hyperparameters are summarized in Table
2.
Table 2: The best performing text cleaning methods, vectorization parameters and model hyperparameters for the n-gram based machine learning models

Language | Model | N-grams | Min. global occurrence | Model hyperparameters*
EN | LR | uni- and bigrams | 6 | C=1000
EN | RF | uni- and bigrams | 9 | B=300, min_samples_leaf=9
EN | SVM | uni- and bigrams | 5 | C=100
EN | XGB | uni- and bigrams | 8 | eta=0.01, max_depth=6, colsample_bytree=0.6, subsample=0.8, n_estimators=300
ES | LR | bigrams | 9 | C=100
ES | RF | uni- and bigrams | 3 | B=100, min_samples_leaf=8
ES | SVM | bigrams | 8 | C=10
ES | XGB | uni- and bigrams | 8 | eta=0.3, max_depth=6, colsample_bytree=0.7, subsample=0.6, n_estimators=200

* Parameter names in the relevant Python package/function. Detailed description in Table 1.
3.2.2 User-wise statistical model
Apart from the n-gram based models, we constructed a model based on statistical
variables describing all hundred tweets of each author, thus giving one more
prediction per author. The variables used in this model are as follows:
- the mean length of the 100 tweets of each author, both in words and in characters;
- the minimum length of the 100 tweets of each author, both in words and in characters;
- the maximum length of the 100 tweets of each author, both in words and in characters;
- the standard deviation of the length of the 100 tweets of each author, both in words and in characters;
- the range of the length of the 100 tweets of each author, both in words and in characters;
- the number of retweets in the dataset by each author;
- the number of URL links in the dataset by each author;
- the number of hashtags in the dataset by each author;
- the number of mentions in the dataset by each author;
- the number of emojis in the dataset by each author;
- the number of ellipses used at the end of tweets among the 100 tweets of each author;
- a stylistic feature, the type-token ratio (TTR), measuring the lexical diversity of each author (since every author has exactly 100 tweets in the dataset, the number of tokens per author does not vary enough to cause large differences in the TTRs).
This gives a total of 17 statistical variables. Since we used an XGBoost classifier, we did not need to normalize the variables, and linear correlation between them posed no problem.
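The sketch below condenses how such user-level statistics can be computed; the tokenization, the standardized token names (#URL#, #HASHTAG#, #USER#) and the retweet marker are assumptions made for illustration rather than our exact implementation.

```python
# Condensed sketch of the user-level statistics (a subset of the 17 variables);
# token names for URLs/hashtags/mentions and the retweet marker are assumed.
import statistics
import emoji  # emoji_count() is available in recent versions of the emoji package
from lexical_diversity import lex_div as ld

def user_statistics(feed):
    """feed: list of tweet strings belonging to one author."""
    char_lens = [len(t) for t in feed]
    word_lens = [len(t.split()) for t in feed]
    tokens = ld.tokenize(" ".join(feed))
    return {
        "mean_chars": statistics.mean(char_lens),
        "min_chars": min(char_lens),
        "max_chars": max(char_lens),
        "sd_chars": statistics.stdev(char_lens),
        "range_chars": max(char_lens) - min(char_lens),
        # ...the same five statistics computed on word_lens...
        "retweets": sum(t.startswith("RT") for t in feed),
        "urls": sum(t.count("#URL#") for t in feed),
        "hashtags": sum(t.count("#HASHTAG#") for t in feed),
        "mentions": sum(t.count("#USER#") for t in feed),
        "emojis": sum(emoji.emoji_count(t) for t in feed),
        "ellipses": sum(t.rstrip().endswith(("...", "…")) for t in feed),
        "ttr": ld.ttr(tokens),
    }
```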
To find the best hyperparameter set, we used a five-fold cross-validated grid search
and finally refitted the best model on the whole data. The cross-validated accuracies
achieved this way are 70% and 74% for the English and Spanish data respectively.
Table 3 contains the best hyperparameters found.
Table 3: The best model hyperparameters for the XGBoost model using statistical features

Parameter name | EN | ES
Column sample by node | 1 | 0.8
Column sample by tree | 0.9 | 0.8
gamma | 2 | 4
Learning rate | 0.2 | 0.3
Max depth | 2 | 3
Min child weight | 4 | 5
Number of estimators | 200 | 100
alpha | 0.1 | 0.3
Subsample | 0.8 | 0.8
3.2.3 Stacking ensemble
After identifying the best hyperparameters for the five models described above with cross-validation, we had to find a reliable ensemble method. To avoid overfitting this ensemble model to the training set, we did not train it on the predictions of the five final trained models. Instead, we wanted to create a dataset that represents the predictions produced by our models. To do this, we refitted each of the five sub-models, using the cross-validated hyperparameters, five times on different subsets of the original training data (each consisting of the tweets of 240 users). The predictions these models gave for the 60 remaining users were appended to the training data of the ensemble model. This training set therefore contained predictions for all 300 users in the training data, but for each model type the predictions came from five differently fitted instances of that model. The sample created this way can be interpreted as an approximation of a sample from the distribution of the predictions of the final five models on the test set. We created a test set for the ensemble with the same method but with a different split of the training data.
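This construction can be approximated with out-of-fold predictions, as in the simplified sketch below; it assumes every sub-model exposes predict_proba (for the linear SVM this requires, for example, probability calibration) and stands in for, rather than reproduces, our exact procedure.

```python
# Simplified sketch of building the ensemble's training data: out-of-fold
# probabilities from each tuned sub-model become the meta-features of the
# logistic regression. The sub-model list is a placeholder for the models
# tuned in Tables 1-3, and each must provide predict_proba.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def out_of_fold_features(sub_models, feature_sets, labels, seed=42):
    """feature_sets holds one feature matrix per sub-model (the four n-gram
    matrices and the statistical features); returns an (n_users, n_models)
    array of out-of-fold probabilities for the positive class."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    columns = [
        cross_val_predict(model, X, labels, cv=cv, method="predict_proba")[:, 1]
        for model, X in zip(sub_models, feature_sets)
    ]
    return np.column_stack(columns)

# meta_train = out_of_fold_features(sub_models, feature_sets, labels)
# ensemble = LogisticRegression().fit(meta_train, labels)  # final meta-model
```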
We then used these constructed training and test sets to choose the best ensemble from the following three methods: majority voting, linear regression of the predicted probabilities (which includes the simple mean as a special case), and a logistic regression model. The best and most reliable results were given by the logistic model; therefore, we used this model as our final ensemble method. Table 4 summarizes the logistic regression coefficients for the probabilistic predictions of each sub-model for both languages.
Table 4: Logistic regression coefficients for the predicted probabilities of each sub-model

Model | EN | ES
LR | 0.80 | 1.31
SVM | 0.48 | 1.16
RF | 0.00 | 0.00
XGB | 1.07 | 0.54
Statistical XGB | 0.20 | 0.12
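To make the role of these coefficients explicit: the ensemble's final probability for a user is the logistic transform of a weighted sum of the sub-model probabilities. The snippet below illustrates this with the English coefficients from Table 4; the intercept is a placeholder, as it is not reported here.

```python
# Illustration of how the ensemble combines sub-model probabilities
# (English coefficients from Table 4; the intercept value is a placeholder).
import math

COEF_EN = {"LR": 0.80, "SVM": 0.48, "RF": 0.00, "XGB": 1.07, "stat_XGB": 0.20}
INTERCEPT_EN = 0.0  # placeholder, not reported above

def ensemble_probability(sub_probs, coef=COEF_EN, intercept=INTERCEPT_EN):
    """sub_probs: probability of being a fake news spreader from each sub-model."""
    z = intercept + sum(coef[name] * sub_probs[name] for name in coef)
    return 1.0 / (1.0 + math.exp(-z))   # classify as spreader if > 0.5
```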
The validity of this method is backed by the fact that our results on the training sets
(an accuracy of 75% and 81% for the English and Spanish set respectively) were only
slightly better than the final test results.
4 Results
As mentioned in Section 3, we tested two versions of our software. For the early bird testing, we used the single best n-gram model for each language based on our cross-validated grid search (a random forest classifier for the English set and a support vector machine classifier for the Spanish set). With these models, we experienced a significant decrease in accuracy on the test set compared to their cross-validated performance, which was one of the reasons we decided to incorporate a number of different models into our final software. As Table 5 shows, relying on several different models combined by a statistical ensemble method proved to be a good solution. First, the cross-validated accuracies of our final models were almost the same as their accuracies on the test set, and second, our final software reached a higher accuracy in both languages than our early bird solution.
Table 5: Accuracies achieved by the two versions of our software during the cross-validation process and on the test set

Language | Early bird software: CV (training set) | Early bird software: Test set | Final software: CV (training set) | Final software: Test set
EN | 75% | 69% | 75% | 75%
ES | 83% | 79% | 81% | 80.5%
5 Future Work
One of the unanswered questions that emerged during this project concerns why our models are better at identifying fake news spreaders who tweet in Spanish. This holds for all of our individual models, regardless of the features they used, as well as for the final ensemble model. We assume that it would be beneficial to conduct some qualitative research on the tweets in the dataset to better understand why fake news spreaders who tweet in Spanish are more distinguishable from regular users than those who tweet in English.
Another promising direction for achieving higher accuracy in profiling fake news spreaders is to develop software that can determine whether a single tweet should be considered fake news. It is reasonable to assume that even users labelled as fake news spreaders post only some tweets that can be considered fake news, while the rest of their posts are regular tweets. Therefore, from the perspective of our approach, the current dataset is likely to contain a lot of noise. If we were able to identify fake news at the level of individual tweets, we could build a model relying on this information that would allow us to give predictions for each tweet. This approach was unfortunately not feasible with the PAN20 Fake News Spreaders dataset [13], as it did not provide information about single tweets; additionally, all URL links, hashtags and user mentions, which could have provided valuable clues about the credibility of a tweet, were replaced by standardized tokens in the text. Moreover, even if we had access to these tweets in their original form, manual labeling would be a tedious process even for the "small" dataset of 300 users. However, it would be interesting to investigate how software able to decide whether a single tweet is fake news would perform on this task.
6 Conclusion
In this notebook, we summarized our work process of preparing our software for the PAN 2020 Profiling Fake News Spreaders on Twitter task [12]. Originally, we looked
at a number of machine learning models using n-grams as features. To find the best
parameters for the models, we conducted an extensive grid search combined with
cross-validation. After finding the models achieving the highest accuracy during the
cross-validation, we fitted these on the entire training set. However, we realized
during the early bird testing phase that this approach results in a significantly lower
accuracy on the test set compared to its cross-validation results. Therefore, for our
final software, we decided to create a combined model which was a stacking
ensemble of five sub-models. Four of these sub-models (a logistic regression, a
support vector machine classifier, a random forest classifier and an XGBoost
classifier) used word n-grams as features, while the fifth model (another XGBoost
model) used statistical features extracted from the Twitter feed. For each sub-model,
we used grid search and cross-validation to find the best performing parameters and
fitted the models on the entire training data with these parameters. To get a final
prediction for each user, we trained a logistic regression that used the probabilistic
predictions of the sub-models as features. Using the ensemble model, we were able to
achieve the same accuracy on the test set as during the cross-validation process.
Overall, our final software was able to identify fake news spreaders with 75% accuracy among users tweeting in English and with 80.5% accuracy among users tweeting in Spanish. Our overall accuracy of 77.75% tied for the highest performance in the competition.
7 References
1. Ahmed, H., Traore, I., Saad, S.: Detection of Online Fake News Using N-Gram
Analysis and Machine Learning Techniques. In: Traore I., Woungang I., Awad
A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud
Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618.
Springer, Cham (2017)
2. Aravantinou, C., Simaki, V., Mporas, I., Megalooikonomou, V.: Gender
Classification of Web Authors Using Feature Selection and Language Models. In:
Speech and Computer. Lecture Notes in Computer Science, pp. 226-233. (2015)
3. Boulis, C., Ostendorf, M.: A quantitative analysis of lexical differences between
genders in telephone conversations. In: Proceedings of the 43rd Annual Meeting
on Association for Computational Linguistics - ACL '05. Morristown, NJ, USA:
Association for Computational Linguistics, pp. 435-442 (2005)
4. Devlin, J., Chang, M., Lee, K., Toutanova K.: BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. NAACL-HLT. (2019)
5. Garera, N., Yarowsky, D.: Modeling Latent Biographic Attributes in
Conversational Genres. In: Proceedings of the 47th Annual Meeting of the ACL
and the 4th IJCNLP of the AFNLP, pp 710-718 (2009)
6. Ghanem, B., Rosso, P., Rangel, F.: An Emotional Analysis of False Information
in Social Media and News Articles. In: ACM Transactions on Internet
Technology (TOIT) vol. 20 no. 2, pp. 1-18 (2020)
7. Gonzalez-Gallardo, C. E., Torres-Moreno, J. M., Rendon, A. M., Sierra, G.:
Efficient social network multilingual classification using character, POS n-grams
and Dynamic Normalization. In: IC3K 2016 - Proceedings of the 8th
International Joint Conference on Knowledge Discovery, Knowledge
Engineering and Knowledge Management. SciTePress, pp. 307-314. (2016)
8. Marquardt, J., Farnadi, G., Vasudevan, G., Moens, M., Davalos, S., Teredesai,
A., De Cock, M.: Age and gender identification in social media. In: CLEF 2014
working notes, pp. 1129-1136. (2014)
9. Peersman, C., Daelemans, W., Van Vaerenbergh, L.: Predicting Age and Gender
in Online Social Networks. In: Proceedings of the 3rd International Workshop on
Search and Mining User-Generated Contents. New York, NY, USA: Association
for Computing Machinery, pp. 37-44. (2011)
10. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research
Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a
Changing World - Lessons Learned from 20 Years of CLEF. Springer. (2019)
11. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D, Sutskever, I.: Language
Models are Unsupervised Multitask Learners. OpenAI, San Francisco, CA,
(2019)
12. Rangel F., Giachanou A., Ghanem B., Rosso P. Overview of the 8th Author
Profiling Task at PAN 2020: Profiling Fake News Spreaders on Twitter. In: L.
Cappellato, C. Eickhoff, N. Ferro, and A. Névéol (eds.) CLEF 2020 Labs and
Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org
(2020)
13. Rangel F., Rosso P., Ghanem B., Giachanou A. Profiling Fake News Spreaders
on Twitter [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3692319 (2020)
14. Rao, D., Yarowsky, D., Shreevats, A., Gupta, M.: Classifying latent user
attributes in Twitter. In: SMUC '10: Proceedings of the 2nd international
workshop on Search and mining user-generated contents, pp. 37-44. (2010)
15. Santosh, K., Bansal, R., Shekhar, M., Varma, V.: Author Profiling: Predicting
Age and Gender from Blogs: Notebook for PAN at CLEF 2013. In: Working
Notes for CLEF 2013 Conference. (2013)
16. Sboev, A., Litvinova, T., Voronina, I., Gudovskikh, D., Rybka, R.: Deep
Learning Network Models to Categorize Texts According to Author's Gender and
to Identify Text Sentiment. In: Proceedings - 2016 International Conference on
Computational Science and Computational Intelligence, CSCI 2016. Institute of
Electrical and Electronics Engineers Inc., pp. 1101-1106. (2017)
17. Schler, J., Koppel, M., Argamon, S., Pennebaker, J.: Effects of Age and Gender
on Blogging. In: AAAI Spring Symposium: Computational Approaches to
Analyzing Weblogs. American Association for Artificial Intelligence (AAAI),
pp. 199-205. (2006)
18. Stout, L., Musters, R., Pool, C.: Author Profiling based on Text and Images
Notebook for PAN at CLEF 2018. In: Working Notes of CLEF 2018 -
Conference and Labs of the Evaluation Forum. (2018)
19. Volkova, S., Bachrach, Y.: On Predicting Sociodemographic Traits and Emotions
from Communications in Social Networks and Their Implications to Online Self
Disclosure. In: Cyberpsychology, Behavior, and Social Networking (Mary Ann
Liebert Inc.) 2015/12, pp. 726-736. (2015)
20. Yildiz, T.: A comparative study of author gender identification. In: Turkish
Journal of Electrical Engineering and Computer Science 27, pp. 1052-1064.
(2019)
21. Zhang, X., Yu, Q.: Hotel reviews sentiment analysis based on word vector
clustering. In: 2nd IEEE International Conference on Computational Intelligence
and Applications (ICCIA), Beijing, pp. 260-264. (2017)