Lifelong Learning Natural Language Processing
Approach for Multilingual Data Classification
Jędrzej Kozal [0000-0001-7336-2561], Michał Leś,
Paweł Zyblewski [0000-0002-4224-6709], Paweł Ksieniewicz [0000-0001-9578-8395],
and Michał Woźniak [0000-0003-0146-4205]
Wroclaw University of Science and Technology,
Department of Systems and Computer Networks,
Wroclaw, Poland
jedrzej.kozal@pwr.edu.pl
Abstract. The abundance of information in digital media, which in to-
day’s world is the main source of knowledge about current events for
the masses, makes it possible to spread disinformation on a larger scale
than ever before. Consequently, there is a need to develop novel fake
news detection approaches capable of adapting to changing factual con-
texts and generalizing previously or concurrently acquired knowledge.
To deal with this problem, we propose a lifelong learning-inspired ap-
proach, which allows for fake news detection in multiple languages and
the mutual transfer of knowledge acquired in each of them. Both clas-
sical feature extractors, such as Term frequency-inverse document fre-
quency or Latent Dirichlet Allocation, and integrated deep nlp (Natural
Language Processing) bert (Bidirectional Encoder Representations from
Transformers) models paired with an mlp (Multilayer Perceptron) classifier,
were employed. The results of experiments conducted on two datasets
dedicated to the fake news classification task (in English and Spanish,
respectively), supported by statistical analysis, confirmed that utiliza-
tion of additional languages could improve performance for traditional
methods. Also, in some cases, supplementing the deep learning method
with classical ones can positively impact the obtained results. The ability
of models to generalize the knowledge acquired between the analyzed
languages was also observed.
Keywords: natural language processing · lifelong learning · classifier
ensemble · transformers · bidirectional encoder · representation learning
· bert
1 Introduction
It is almost a cliche to say that the modern economy and society strongly depend
on information. Therefore we all expect that the information we are provided
with will be reliable and credible, enabling us to make rational decisions. This
is the ideal world; in reality, the role of information and disinformation has been
appreciated since ancient times, and information manipulation has become a
critical weapon used to gain political or material benefits [23]. Nowadays, the
problem of so-called fake news is particularly acute, as manipulation on such
a scale has never been observed before. One can think of the news spread
related to the COVID-19 pandemic or the Russian invasion of Ukraine.
This problem is particularly evident in digital media; hence almost all global
Internet platforms, such as Facebook or Twitter, describe in their terms of service
mechanisms for verifying information. Unfortunately, manual verification of
information veracity that relies on human experts as fact-checkers is relatively
slow. Research from the MIT Media Lab shows that fake news travels farther and
faster than true news1. Hence, one of the challenges is to develop mechanisms
that can detect fake news automatically and have the ability to improve their
models continuously.
Among the ML-based approaches to fake news detection, the literature dis-
tinguishes methods for [6]:
- Text analysis, which consists of analyzing the Natural Language Processing
(nlp) data representation without the linguistic context [27], as well as psy-
cholinguistic analysis [32] and processing of the syntactic text structure [18].
- Reputation analysis, which measures the reputation of an article and pub-
lisher based on sources such as content, reviews, domain, IP address or
anonymity [31].
- Network analysis, which is related to graph theory and is performed to
evaluate the truthfulness of the information [33].
- Image-manipulation recognition, which is dedicated to the detection of
image tampering [3], copy-move forgeries [2], and other image modifications
such as contrast enhancement [28].
This work deals with nlp, and introduces a training procedure for a fake
news detection model that can classify news provided in different languages and
perform knowledge transfer between them.
2 Related Works
Fake news detection with nlp is based on the extraction of features directly from
the content of the analyzed texts, without taking into account their social con-
text, and is currently the basis for each of the subtasks included in the detection
of fake news [27]. Many of the classic nlp methods are based on the bag-of-words
approach [11], which creates a vector for each document containing information
about the number of times each word appears in it. An example of an extension
of this approach are n-grams, which tokenize sequences of words of a given
length rather than focusing only on individual words. The minimum range
of n-grams is one (so-called unigrams), while the upper limit strongly depends
on the length of the analyzed documents. Additionally, the weights of individual
1 https://www.technologyreview.com/2018/03/08/144839/fake-news-spreads-faster-than-the-truth-and-its-all-our-fault/
words can be normalized for a specific document using Term Frequency (tf)
[21], or for the entire analyzed corpus, using Term Frequency-Inverse Document
Frequency (tf-idf) [14]. Another noteworthy approach to extracting features
from text is Latent Dirichlet Allocation (lda), which is a generative probabilis-
tic model for topic modeling [13].
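For illustration, both classical extractors can be realized as follows. This is a minimal sketch using scikit-learn, which is an assumption about tooling; the toy corpus and parameter values are illustrative, while the 100-feature tf-idf budget matches the setup reported in Section 4.

# Illustrative sketch: classical text vectorization with tf-idf and lda.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "breaking news about the election results",
    "scientists confirm new findings about the climate",
    "celebrity spotted spreading false election rumors",
]

# tf-idf over lowercase unigrams, capped at 100 attributes
tfidf = TfidfVectorizer(lowercase=True, ngram_range=(1, 1), max_features=100)
X_tfidf = tfidf.fit_transform(corpus)      # shape: (n_documents, <=100)

# lda topic proportions as features; a small topic count is used here purely
# because the toy corpus is tiny
counts = CountVectorizer(lowercase=True).fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=0)
X_lda = lda.fit_transform(counts)          # shape: (n_documents, n_components)

print(X_tfidf.shape, X_lda.shape)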
Such approaches, despite their simplicity, are successfully used in the liter-
ature in the task of recognizing fake news. Hassan et al. classified Tweets as
credible and not credible, using five different classification algorithms and word-
based n-gram analysis including tf and tf-idf [12]. The authors proved the
effectiveness of this approach on the pheme benchmark dataset [34]. Bharadwaj
and Shao employed recurrent neural networks, a random forest classifier, and a
naive Bayes classifier coupled with n-grams of various lengths in order to classify
a fake news dataset from Kaggle [1]. Similar experiments, including, e.g., n-grams,
tf-idf and a Gradient Boosting Classifier, were conducted by Wynne and Wint
[30]. Kaur et al. and Telang et al. evaluated multiple combinations of various
classifiers and extraction techniques in the fake news detection task [15,29]. The usage
of the ensemble methods for disinformation detection was explored in [17].
Currently, Deep Neural Networks (dnn) based on pre-training general language
representations are very popular in fake news detection [26]. A strong point of the
currently used approaches, which additionally fine-tune the token representations
using supervised learning, is that only a few parameters need to be learned from
scratch [24]. Recently, many works have shown the usefulness of knowledge transfer
in the case of natural language inference [7]. bert
(Bidirectional Encoder Representations from Transformers) is particularly rec-
ognizable, which was designed for training with the use of unlabeled text [8].
Then, bert is fine-tuned for a given problem with an additional output layer.
3 Methods
Each recognition system dedicated to language data is based on two basic steps.
The first is vectorization, which transforms a set of textual documents into a
numerical data set interpretable by the modeling procedure, and the second is
the induction proper, which constructs a model for the given problem. These
procedures can be conducted separately, in a serial manner, as in the case of the
classic tf-idf and mlp pair, where the basic extractor determines the vector
representation and the neural network fits in the space designated by it, or
jointly, in a parallel manner, as is the case in deep models such as bert.
Even within the integrated processing procedure typical of deep learning
approaches, in which extraction and induction take place within the same struc-
ture, models such as bert contain distinguishable components responsible for
extraction, followed at the end of the network by fully connected layers, which
in their logic are identical to typical, shallow mlp networks. This creates some
potential for integrating classical and deep feature extraction methods for nlp.
A deep model, trained on a given problem, learns weights responsible for both
the extraction of attributes and classification, updating them so that they are
better suited to the construction of the proper problem space. It is therefore
possible to separate the extraction part of the deep learning model and, using a
standard forward pass, employ it to obtain attributes for the classification model
in the same manner as classical extraction methods such as tf-idf.
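To make this separation concrete, the sketch below uses the HuggingFace transformers API (an assumption about tooling; the paper does not name its implementation) to employ a frozen bert encoder as a feature extractor whose [CLS] representation feeds a shallow mlp, in exactly the role that tf-idf attributes would otherwise play.

# Minimal sketch: a deep model used purely as a feature extractor for a shallow MLP.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.neural_network import MLPClassifier

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def extract_features(texts):
    # Forward pass only (no gradients): the encoder acts like tf-idf or lda,
    # returning a fixed-size vector per document (here the [CLS] embedding).
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = encoder(**batch)
    return output.last_hidden_state[:, 0, :].numpy()

texts = ["a real news article", "a fabricated news article"]
labels = [0, 1]
features = extract_features(texts)

# A shallow MLP trained on the extracted attributes, as with classical extractors
clf = MLPClassifier(hidden_layer_sizes=(500, 500), max_iter=200).fit(features, labels)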
Therefore, in the proposed method, the mlp model is used as a base classifier,
the default in the structure for deep extractors, and at the same time, good at
optimizing the decision space for classical extraction methods. Moreover, for the
analysis of the integration potential of heterogeneous extractors, three methods
commonly used in the literature were selected:
- Term Frequency-Inverse Document Frequency: a legacy model based on nor-
malization to the document (tf) and to the corpora (idf), the most basic
method currently used in nlp applications.
- Latent Dirichlet Allocation: a generative statistical model, similar to the
latest unsupervised learning methods, but using approaches that identify sig-
nificant thematic factors.
- Bidirectional Encoder Representations from Transformers: the current state-
of-the-art model, being a clear baseline in contemporary nlp experiments.
The bert model, unlike tf-idf and lda, is not a language-independent nlp
task solver. To enable its operation in a multilingual environment, base models
were selected, trained on different corpora, with similar structure sizes. The first
was the standard version pre-trained on an English corpus (bert), the second
was beto [5], a variant of the bert architecture pre-trained on a Spanish dataset
(bertspa), and the third was the standard bert pre-trained on a multilingual
corpus (bertmult).
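The three variants can be loaded, e.g., from the HuggingFace hub; the checkpoint identifiers below are the commonly used public names for these models and are assumptions rather than the exact ones used in the experiments.

# Sketch: the three encoder variants used in the multilingual setting.
from transformers import AutoTokenizer, AutoModel

checkpoints = {
    "bert":      "bert-base-uncased",                        # English corpus
    "bert_spa":  "dccuchile/bert-base-spanish-wwm-uncased",  # BETO, Spanish corpus
    "bert_mult": "bert-base-multilingual-cased",             # multilingual corpus
}

models = {
    name: (AutoTokenizer.from_pretrained(ckpt), AutoModel.from_pretrained(ckpt))
    for name, ckpt in checkpoints.items()
}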
nlp extractors, regardless of the procedure used, tend to generate many re-
dundant attributes, significantly increasing the optimization space of the neural
network. In the case of an integration architecture based on the concatenation
of features obtained from a heterogeneous pool of extractors, this tendency will
be further deepened. Additionally, different extraction contexts may also allow
for the diversification of models, where different processing algorithms offer an
alternative insight into the data. For this purpose, the following dimensionality
reduction methods will be used (a minimal sketch follows the list):
- Mutual Information (mi)
- Analysis of Variance (anova)
- Principal Components Analysis (pca)
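A minimal sketch of the three reducers with scikit-learn; the feature budget k and the stand-in data are illustrative assumptions.

# Sketch: the three reduction approaches applied to a concatenated feature
# matrix X with labels y.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))      # stand-in for concatenated extractor outputs
y = rng.integers(0, 2, size=200)     # binary fake/real labels
k = 100                              # illustrative attribute budget

X_mi = SelectKBest(mutual_info_classif, k=k).fit_transform(X, y)  # Mutual Information
X_anova = SelectKBest(f_classif, k=k).fit_transform(X, y)         # ANOVA F-test
X_pca = PCA(n_components=k).fit_transform(X)                      # unsupervised PCA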
With the use of all the tools introduced above, it is possible to propose two
alternative architectures for integrating the extractors. They are presented in
Figure 1, compiling the processing schemes of sis and ers policies.
As part of the Simple Integration Schema policy (sis), each of the extractors
independently builds an extraction model which, after processing the avail-
able language corpora (A, B and C corpora) for each of them, allows for the
transformation from a text into a vector representation of pre-limited, constant
dimensionality.

[Fig. 1. Ensemble diversification policies: the simple integration schema, in which each extractor (tf-idf, lda, and the mult/eng/spa bert variants) builds a separate model for each corpus (A, B, C) and the resulting classifiers form per-corpus ensembles; and the extractor reduction schema, in which the extractor outputs are concatenated into a common feature space and passed through a reductor before a single classifier per corpus is trained.]
For each transformation, a separate classifier is built, which fi-
nally goes to the pools of models dedicated to different corpora, integrated inter-
nally and externally through the accumulation of support. Internal integration is
carried out classically, and external integration depends on the source languages
of the corpora. In the case of a linguistically homogeneous pool, the accumu-
lation of support is not modified. Otherwise, (a) the number of languages is a
multiplier of the number of problem classes, (b) monolingual models receive zero
foreign language support, and (c) multilingual models replicate class supports
divided by the language multiplier to preserve the complementarity condition of
the support vector.
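Our reading of rules (a)-(c) can be sketched as follows; the class and language counts, as well as the support values, are illustrative assumptions.

# Sketch of the external support accumulation for a linguistically mixed pool,
# following points (a)-(c) above (our interpretation, with illustrative shapes).
import numpy as np

n_classes, n_languages = 2, 2          # fake/real news in two languages

def pad_monolingual(support, lang_idx):
    # (b) a monolingual model contributes zero support for foreign languages
    padded = np.zeros(n_classes * n_languages)   # (a) classes x languages slots
    padded[lang_idx * n_classes:(lang_idx + 1) * n_classes] = support
    return padded

def replicate_multilingual(support):
    # (c) a multilingual model replicates its class supports for every language,
    # divided by the language multiplier so the vector still sums to one
    return np.tile(support / n_languages, n_languages)

english_model = pad_monolingual(np.array([0.7, 0.3]), lang_idx=0)
spanish_model = pad_monolingual(np.array([0.4, 0.6]), lang_idx=1)
multilingual = replicate_multilingual(np.array([0.5, 0.5]))

# accumulation of support across the heterogeneous pool
ensemble_support = (english_model + spanish_model + multilingual) / 3
print(ensemble_support)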
The Extractor Reduction Schema (ers) strategy performs the extraction in
the same way as sis. However, separate models are not built for each of the
extractors; instead, the object representations obtained within each corpus are
concatenated into a wide, common space of the problem. The space dimension-
ality is then reduced to a predetermined size, i.e., the upper limit of the number
of attributes per extractor. A separate classification model is trained for each
corpus, constituting a narrow pool that is integrated in the same way as in the
external sis ensemble policy.
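A compact sketch of the ers processing for a single corpus, under the assumption that each extractor already returns a per-document feature matrix; pca is used here as the reductor, and the sizes are illustrative.

# Sketch of the Extractor Reduction Schema for one corpus: concatenate the
# representations from all extractors, reduce to a fixed budget, train one classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def ers_pipeline(feature_blocks, y, k=100):
    # feature_blocks: list of (n_documents, n_features_i) matrices, one per extractor
    X_wide = np.hstack(feature_blocks)              # wide, common problem space
    X_reduced = PCA(n_components=k).fit_transform(X_wide)
    return MLPClassifier(hidden_layer_sizes=(500, 500)).fit(X_reduced, y)

rng = np.random.default_rng(0)
blocks = [rng.normal(size=(300, 100)) for _ in range(3)]  # e.g. tf-idf, lda, bert
y = rng.integers(0, 2, size=300)
model = ers_pipeline(blocks, y, k=100)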
The aim of the research is to obtain experimentally verified information about
the appropriate integration strategy of classifier ensembles for the needs of mul-
tilingual natural language processing systems. Both the potential of various
extraction methods in each of the selected corpora and the capabilities of the
various dimensionality reducers in the ers policy will be verified.
4 Experiment
The experiments aim to analyze how knowledge gained from different languages
can positively affect the overall classification performance. During the experi-
mental evaluation, we intend to answer the following research questions:
1. How does fake news classification performance depend on different extraction
methods?
2. How do ensemble models diversified by various extractors, training sets and
training features improve fake-news classification?
3. Can we utilize models trained with different language corpora to improve
other language classification?
4. What methods for feature integration can be useful in a multilingual envi-
ronment?
4.1 Setup
Datasets. Two datasets were utilized in this study, namely: The Spanish Fake
News Corpus2 [22] and Kaggle Fake News: Build a system to identify unreliable
news articles3. These datasets were treated as a representation of fake news
distribution in Spanish and English, respectively. The third variant was a mixed
dataset that combined the Spanish and English datasets. Detailed information
about the number of samples in each dataset is given in Table 1. As we can see,
both the Spanish and English datasets are internally balanced. However, please
note the significant, 1:31 imbalance in the presence of the source datasets in the
mixed article collection. For this reason, in further parts we report balanced
accuracy as the evaluation metric, which in the case of the mixed dataset
effectively measures the quality of a four-class problem and is sensitive to errors
made on the minority language.
Data preprocessing. Each word was changed to lowercase before computing
tf-idf features. For this extraction method, features were computed based on
unigrams. During lda preprocessing, we removed words that occur in fewer than
20 documents or in more than half of all documents. lda features were computed with
2 https://github.com/jpposadas/FakeNewsCorpusSpanish
3 https://www.kaggle.com/c/fake-news/data
Table 1. Description of the used datasets. Fake news learning examples are considered
as positives; the bottom row gives the imbalance ratio between the esp fake and kaggle
parts of the mixed dataset.

dataset    #samples  #negative  #positive  ir
esp fake        676        338        338  1:1
kaggle       20 800     10 387     10 413  1:1
mixed        21 476     10 725     10 751  1:1
ir             1:31       1:31       1:31
chunk size 2000, 20 passes, and 400 iterations. Both tf-idf and lda extracted
100 features from each document.
Classification models. The mlp used for training on the extracted features had
two hidden layers with 500 neurons each and was trained with the Adam optimizer
[16] with a learning rate of 0.001 for 200 epochs with Early Stopping [10]. The
bert-based models were trained for 5 epochs each with the AdamW optimizer [20]
and a learning rate of 3e-5.
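For reference, the mlp configuration can be sketched as below; the scikit-learn class is an assumption about the implementation, while the hyperparameters are those reported above.

# Sketch of the reported MLP training configuration.
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(500, 500),  # two hidden layers, 500 neurons each
    solver="adam",                  # Adam optimizer
    learning_rate_init=0.001,
    max_iter=200,                   # up to 200 epochs
    early_stopping=True,
)

# For the BERT-based models, an analogous PyTorch setup would be, e.g.:
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # 5 epochs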
Experimental protocol. To obtain more reliable results, 5x2 cross-validation
was employed. However, there is a significant difference in the number of samples
in the English and Spanish datasets. For this reason, stratification over both
the Spanish and English labels was introduced in the case of the mixed dataset.
This modification avoids over-representation of learning examples from one
language in some folds.
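A possible realization of this double stratification (an assumed implementation; the paper reports only the protocol itself) is to stratify over the joint class-language label.

# Sketch: 5x2 cross-validation stratified jointly over class and language,
# so that neither label is over-represented in any fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def five_by_two_splits(y_class, y_lang, seed=0):
    joint = np.char.add(y_class.astype(str), y_lang.astype(str))  # e.g. "1en", "0es"
    for rep in range(5):
        skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed + rep)
        for train_idx, test_idx in skf.split(np.zeros(len(joint)), joint):
            yield train_idx, test_idx

y_class = np.array([0, 1] * 20)
y_lang = np.array(["en"] * 30 + ["es"] * 10)
splits = list(five_by_two_splits(y_class, y_lang))  # 10 train/test pairs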
Result analysis. To analyse the classification performance of the chosen algo-
rithms, we employed the 5x2cv combined F test for statistical analysis with a
significance level of 0.05.
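For instance, the test can be computed with mlxtend; the choice of library and the toy data are our assumptions, not necessarily the authors' tooling.

# Sketch: comparing two classifiers with the 5x2cv combined F test at alpha = 0.05.
from mlxtend.evaluate import combined_ftest_5x2cv
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=0)
f_stat, p_value = combined_ftest_5x2cv(
    estimator1=MLPClassifier(hidden_layer_sizes=(50,), max_iter=500),
    estimator2=LogisticRegression(max_iter=500),
    X=X, y=y, random_seed=42,
)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}, significant: {p_value < 0.05}")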
Implementation and reproducibility. A GitHub repository with the code for all
experiments is available online4.
4.2 Results
E1 Classification based on a single feature extraction method. The
classification accuracies obtained for single feature extraction methods are pre-
sented in Tables 2 and 3. We use these values as a baseline for further experi-
ments. We found a significant difference between all deep models and both tf-idf
and lda for the text and title attributes. Additionally, for the text attribute, bert
was better than bertmult for the kaggle dataset, and lda was better than tf-idf
for the kaggle and mixed datasets. The solution based on bertspa was better than
bertmult for the esp fake dataset and better than any other model for the mixed
dataset.
E2 Classification based on multiple feature extraction methods. In this
part of the experiments, we create ensembles according to the Simple Integration
Schema and the Extractor Reduction Schema introduced earlier. We use multiple
base models
4 https://github.com/w4k2/nn-nlp
Table 2. Accuracy for each extractor trained and evaluated on one dataset, for the
text attribute. The numbers below each score are the indices of the columns (1: tf-idf,
2: lda, 3: bert-mult, 4: bert-eng, 5: bert-spa) over which the method was significantly
better.

dataset    tf-idf  lda    bert-mult  bert-eng  bert-spa
kaggle     .866    .911   .981       .985      -
                   1      1, 2       1, 2, 3
esp fake   .751    .618   .786       -         .853
           2              2                    1, 2, 3
mixed      .729    .768   .972       .974      .982
                   1      1, 2       1, 2      1, 2, 3, 4
Table 3. Accuracy for each extractor trained and evaluated on one dataset, for the
title attribute. Significance markers as in Table 2.

dataset    tf-idf  lda    bert-mult  bert-eng  bert-spa
kaggle     .904    .829   .958       .963      -
           2              1, 2       1, 2
esp fake   .540    .553   .675       -         .852
                          1, 2                 1, 2, 3
mixed      .646    .631   .942       .948      .982
                          1, 2       1, 2, 3   1, 2, 3, 4
trained with the same language. The results are presented in Tables 4 and 5.
For the text attribute, pca was significantly worse than sa for the esp fake and
mixed datasets, and worse than all other methods for the mixed dataset. For the
title attribute, pca was better than sa on the kaggle dataset, sa was better than
all other methods for esp fake, pca was worse than the other methods, and minfo
and anova were better than sa for the mixed dataset.
E3 Classification with different languages. We utilize a single extraction
method trained on each of the datasets to precompute features for one dataset.
The extracted features are concatenated. Then, these features are used to train a
single classifier. The impact of different-language features on fake news classifi-
cation performance is presented in Tables 6 and 7. To keep the presentation of the
results consistent, the statistical analysis was carried out by comparing different
models while keeping the dataset and ensemble construction method the same.
No significant difference was found for the text or title attributes.
Table 4. Accuracy for different ensemble construction methods with the text attribute.
The numbers below each score are the indices of the columns (1: sa, 2: minfo, 3: anova,
4: pca) over which the method was significantly better.

dataset    sa     minfo  anova  pca
kaggle     .988   .987   .987   .983
esp fake   .851   .833   .831   .788
           4
mixed      .913   .894   .903   .785
           4      4      4
Table 5. Accuracy for different ensemble construction methods with the title attribute.
Significance markers as in Table 4.

dataset    sa     minfo  anova  pca
kaggle     .955   .963   .963   .964
                  1, 3   1      1
esp fake   .730   .737   .732   .674
           4
mixed      .885   .905   .879   .760
           4      4      4
E4 Classification with different extraction methods and languages. The
classification performance of ensembles using a combination of all languages and
extraction methods is presented in Tables 8 and 9. For the text attribute, we found
significant differences between pca and the rest of the methods and between
minfo and sa for the kaggle dataset, as well as a difference between anova and
pca for the mixed dataset. For the title attribute, we found that pca is better
than minfo for kaggle and sa is better than pca for esp fake. For the mixed
dataset, anova is better than all other methods, sa is better than pca, and minfo
is better than sa and pca.
Visualization of learned features. To provide more insight into the learned rep-
resentations, we visualize average feature vectors. We consider features for the
dataset that the extractor was trained on and features generated by applying
the trained extractor to two other datasets. Next, we plot the average feature
vector as a heatmap to show the most active features for each training-evaluation
dataset pair. The results are presented in Fig. 2. First, we focus on the plots for
tf-idf and lda. The most active features are obtained for the same dataset that
the extractor was trained on. When applying the extractor to datasets in a different
language, some features have significantly higher values. This indicates that ap-
Table 6. Accuracy for ensembles obtained from evaluating one extractor on all 3
datasets, with the text attribute. No statistically significant differences were found.

integration  dataset    tf-idf  lda    bert-mult  bert-eng  beto
sa           kaggle     .933    .951   .934       .950      -
             esp fake   .741    .764   .757       -         .761
             mixed      .775    .772   .782       .740      .780
minfo        kaggle     .922    .943   .921       .941      -
             esp fake   .759    .772   .781       -         .746
             mixed      .809    .802   .805       .801      .818
anova        kaggle     .924    .945   .925       .944      -
             esp fake   .771    .773   .783       -         .748
             mixed      .801    .793   .795       .773      .791
pca          kaggle     .921    .942   .922       .944      -
             esp fake   .757    .775   .780       -         .749
             mixed      .790    .815   .782       .788      .813
plying the extraction model to a different language can provide an additional
source of information that can be beneficial for classification quality. Further-
more, it can explain why exploiting extractors trained on different languages
improved performance for tf-idf and lda in the third experiment.
These findings are strictly connected to the inner workings of the extraction al-
gorithms. When extracting features with tf-idf, the first step is to create a vo-
cabulary from the corpus. Next, feature vectors are computed based on the fre-
quency of each word. When processing text whose words mostly fall outside of the
vocabulary, tf-idf will return a sparse vector. Similar reasoning can be applied to lda.
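This behaviour is easy to demonstrate on a toy example (illustrative corpora): a tf-idf model fitted on English yields an almost entirely zero vector for Spanish input.

# Toy demonstration: a tf-idf extractor fitted on English produces a near-empty
# vector for Spanish text, since most words fall outside of the vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer

english = ["the government announced new economic measures today",
           "experts say the new measures will affect the economy"]
spanish = ["el gobierno anunció nuevas medidas económicas hoy"]

tfidf = TfidfVectorizer()
tfidf.fit(english)

print(tfidf.transform(english).nnz)  # many non-zero entries
print(tfidf.transform(spanish).nnz)  # almost none: out-of-vocabulary words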
Table 7. Accuracy for ensembles obtained from evaluating one extractor on all 3
datasets, with the title attribute. No statistically significant differences were found.

integration  dataset    tf-idf  lda    bert-mult  bert-eng  beto
sa           kaggle     .916    .931   .913       .928      -
             esp fake   .638    .668   .655       -         .622
             mixed      .676    .688   .678       .659      .669
minfo        kaggle     .903    .921   .901       .921      -
             esp fake   .628    .680   .681       -         .635
             mixed      .740    .743   .747       .742      .739
anova        kaggle     .914    .926   .901       .923      -
             esp fake   .633    .682   .685       -         .627
             mixed      .751    .756   .750       .749      .743
pca          kaggle     .913    .929   .911       .927      -
             esp fake   .618    .685   .679       -         .631
             mixed      .764    .756   .769       .727      .767
Transformers use a different procedure for text preprocessing. A bert tokenizer can
divide a single word into separate tokens if it falls outside of the vocabulary. Then
these tokens are used to generate word embeddings that are passed through the
transformer network. As a result, when dealing with a different language, the model
will not produce a sparse output, but some vector that may or may not contain
useful information. This is shown in the two lowest rows of Fig. 2. There is no
significant difference between the average beto features, regardless of whether they
are computed for the same language the model was trained on or not. This, in
Table 8. Results for ensembles utilizing both different extraction methods and differ-
ent languages, with the text attribute. Significance markers as in Table 4.

dataset    sa     minfo  anova  pca
kaggle     .986   .988   .988   .990
                                1
esp fake   .845   .830   .831   .804
mixed      .884   .875   .838   .869
                                3
Table 9. Results for ensembles utilizing both different extraction methods and differ-
ent languages, with the title attribute. Significance markers as in Table 4.

dataset    sa     minfo  anova  pca
kaggle     .965   .963   .964   .963
esp fake   .771   .750   .753   .673
           4
mixed      .762   .908   .909   .806
                  1, 4   1, 4
turn, can explain why we observe a decrease in accuracy for the deep models in
Experiment 3.
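The contrast with tf-idf can be seen directly from the tokenizer behaviour (checkpoint name assumed; the exact subword split depends on the vocabulary).

# Sketch: unlike tf-idf, a BERT tokenizer never drops an unknown word; it splits
# it into '##' subword pieces, so the encoder still produces a dense output
# for a foreign language.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("el gobierno anunció nuevas medidas"))
# unknown Spanish words are decomposed into subword tokens rather than discarded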
4.3 Lessons learned
When analyzing the accuracy obtained for single extractors (Tab. 2 and 3), one
can notice that the deep learning-based methods obtain the best results. This is
expected, as deep models currently dominate the nlp field [9,25,4]. One can also
notice that models dedicated to a specific language obtained better results than
MultiBERT. This can be explained by the low number of learning examples in
the datasets utilized in our experiments. MultiBERT is the largest model and can
therefore be prone to overfitting with a smaller amount of data. Also, it is worth
noting that the difference in performance between beto and MultiBERT is larger
than between the English version of bert and MultiBERT. This is probably caused
by the improved pretraining procedure from RoBERTa [19] that was used in beto.
Better pretraining can lead to improved performance in downstream tasks [9,25].
Although MultiBERT was initially trained with a multi-language corpus and has
the largest number of parameters, it is not the best model for the mixed dataset.
[Fig. 2. Visualization of average features learned by lda, tf-idf, and beto (one panel per extractor). Dataset labels on the y-axis correspond to training datasets and labels on the x-axis correspond to evaluation datasets. Brighter colors correspond to higher values.]
Unfortunately, we found no good explanation for this phenomenon. The results
obtained for the Spanish dataset are worse compared to kaggle and mixed. This
can be easily explained by the larger number of samples in the kaggle dataset
compared to esp fake (please refer to Tab. 1). These findings provide an answer
to research question 1. Deep models obtain better or similar results when compar-
ing the text and title attributes. Unfortunately, the same cannot be stated about
tf-idf and lda, whose performance for the title attribute is worse in most cases.
Regardless of the ensemble construction methodology, the obtained accuracy is
close for all methods (Tab. 4 and 5). Comparing the metric values to the first
experiment, we can state that employing an ensemble benefits accuracy only
for the English dataset. This is probably caused by the close performance of
all models on the kaggle dataset. In the case of the Spanish dataset, there is
a clear gap between beto and the rest of the models, and in the case of the
mixed dataset, there is a gap between the deep learning models and the two
others. When combining strong classifiers with weaker ones in one ensemble, we
can obtain worse results than for the strongest classifier alone. In this case, the
utilization of additional models with worse performance is more harmful than
helpful. These results answer research question 2.
In the third experiment, a feature extractor trained on a single dataset was
used to extract features from all datasets and to create a new ensemble from 3
different sets of features (Tab. 6 and 7). This approach is beneficial for tf-idf
and lda but detrimental for the deep models. The statistical analysis results
show no significant differences between tf-idf, lda, and the deep models. This
finding is quite interesting given the clear dominance of the deep models in the
first experiment. It is important to note that the deep models achieved worse
results than in the first experiment; however, at the same time, there was a
significant gain in performance for tf-idf and lda. We provide a possible expla-
nation when discussing the feature visualization in Fig. 2. This answers research
question 3.
Utilizing both different models and features extracted for different languages
(Tab. 8 and 9) does not provide an improvement over the baselines, nor over
constructing ensembles with the utilization of different models. These findings,
combined with the conclusions from Experiment 2, answer research question 4.
5 Conclusions
This study aimed to apply nlp methods to the problem of detecting misinforma-
tion in messages produced in different languages and to verify whether it is possi-
ble to transfer knowledge between models trained for different languages.
Based on the results of the experimental studies, it was not found that differ-
ent models or attributes extracted for different languages led to a noticeable, i.e.,
statistically significant, improvement over the baseline results. Also, no signifi-
cant improvement was confirmed for methods based on the classifier ensemble
concept. The limited topical scope of the datasets can explain this. Articles in
the Spanish dataset are written about local affairs; therefore, the intersection of
topics between the two datasets may be small. Moreover, this indicates that
preparing a fake news detection model for a single language and utilizing the
learned knowledge for other languages is a difficult problem that should be fur-
ther explored in research. With this in mind, future work should focus on prepar-
ing articles and datasets for multiple languages. By utilizing the same types of
article topics in multiple languages, one could obtain better results. Our work
showed that some classes of models can benefit from utilizing multiple languages,
so it is reasonable to expect that aligning the topics should further improve the
results.
Considering the achieved results, as well as the literature analysis, it seems
that the direction of further work may be related to the transfer of knowledge,
but rather within a single language; it seems attractive to transfer knowledge
between fake news detection tasks in different subject areas.
References
1. Bharadwaj, P., Shao, Z.: Fake news detection with semantic features and text mining. International Journal on Natural Language Computing (IJNLC) 8 (2019)
2. Bilal, M., Habib, H.A., Mehmood, Z., Saba, T., Rashid, M.: Single and multiple copy-move forgery detection and localization in digital images based on the sparsely encoded distinctive features and dbscan clustering. Arabian Journal for Science and Engineering 45(4), 2975–2992 (2020)
3. Bondi, L., Lameri, S., Guera, D., Bestagini, P., Delp, E.J., Tubaro, S., et al.: Tampering detection and localization through clustering of camera-based cnn features. In: CVPR Workshops. vol. 2 (2017)
4. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. CoRR abs/2005.14165 (2020), https://arxiv.org/abs/2005.14165
5. Cañete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Pérez, J.: Spanish pre-trained bert model and evaluation data. In: PML4DC at ICLR 2020 (2020)
6. Choraś, M., Demestichas, K., Giełczyk, A., Herrero, Á., Ksieniewicz, P., Remoundou, K., Urda, D., Woźniak, M.: Advanced machine learning techniques for fake news (online disinformation) detection: A systematic mapping study. Applied Soft Computing 101, 107050 (2021)
7. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., Bordes, A.: Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364 (2017)
8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
9. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805
10. Girosi, F., Jones, M., Poggio, T.: Regularization theory and neural networks architectures. Neural Computation 7(2), 219–269 (1995). https://doi.org/10.1162/neco.1995.7.2.219
11. Harris, Z.S.: Distributional structure. Word 10(2-3), 146–162 (1954)
12. Hassan, N., Gomaa, W., Khoriba, G., Haggag, M.: Credibility detection in twitter using word n-gram analysis and supervised machine learning techniques. International Journal of Intelligent Engineering and Systems 13(1), 291–300 (2020)
13. Hoffman, M., Blei, D., Bach, F.: Online learning for latent dirichlet allocation. In: Advances in Neural Information Processing Systems. vol. 23, pp. 856–864 (2010)
14. Jones, K.S.: A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation (1972)
15. Kaur, S., Kumar, P., Kumaraguru, P.: Automating fake news detection system using multi-level voting model. Soft Computing 24(12), 9049–9069 (2020)
16. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (2014)
17. Ksieniewicz, P., Choraś, M., Kozik, R., Woźniak, M.: Machine learning methods for fake news classification. In: International Conference on Intelligent Data Engineering and Automated Learning. pp. 332–339. Springer (2019)
18. Kumar, S., Carley, K.M.: Tree lstms with convolution units to predict stance and rumor veracity in social media conversations. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 5047–5058 (2019)
19. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019), http://arxiv.org/abs/1907.11692
20. Loshchilov, I., Hutter, F.: Fixing weight decay regularization in adam. CoRR abs/1711.05101 (2017), http://arxiv.org/abs/1711.05101
21. Luhn, H.P.: A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development 1(4), 309–317 (1957)
22. Posadas Durán, J., Gomez Adorno, H., Sidorov, G., Moreno, J.: Detection of fake news in a new corpus for the spanish language. Journal of Intelligent & Fuzzy Systems 36, 4869–4876 (2019). https://doi.org/10.3233/JIFS-179034
23. Posetti, J., Matthews, A.: A short guide to the history of 'fake news' and disinformation. International Center for Journalists 7(2018), 2018–07 (2018)
24. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding with unsupervised learning (2018)
25. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2018), https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
26. Rodríguez, Á.I., Iglesias, L.L.: Fake news detection using deep learning. arXiv preprint arXiv:1910.03496 (2019)
27. Saquete, E., Tomás, D., Moreda, P., Martínez-Barco, P., Palomar, M.: Fighting post-truth using natural language processing: A review and open challenges. Expert Systems with Applications 141, 112943 (2020)
28. Suryawanshi, P., Padiya, P., Mane, V.: Detection of contrast enhancement forgery in previously and post compressed jpeg images. In: 2019 IEEE 5th International Conference for Convergence in Technology (I2CT). pp. 1–4. IEEE (2019)
29. Telang, H., More, S., Modi, Y., Kurup, L.: An empirical analysis of classification models for detection of fake news articles. In: 2019 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT). pp. 1–7. IEEE (2019)
30. Wynne, H.E., Wint, Z.Z.: Content based fake news detection using n-gram models. In: Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services. pp. 669–673 (2019)
31. Xu, K., Wang, F., Wang, H., Yang, B.: Detecting fake news over online social media via domain reputations and content understanding. Tsinghua Science and Technology 25(1), 20–27 (2019)
32. Zhang, D., Zhou, L., Kehoe, J.L., Kilic, I.Y.: What online reviewer behaviors really matter? effects of verbal and nonverbal behaviors on detection of fake online reviews. Journal of Management Information Systems 33(2), 456–481 (2016)
33. Zhou, X., Zafarani, R.: Network-based fake news detection: A pattern-driven approach. ACM SIGKDD Explorations Newsletter 21(2), 48–60 (2019)
34. Zubiaga, A., Wong Sak Hoi, G., Liakata, M., Procter, R.: Pheme dataset of rumours and non-rumours (2016)