Lifelong Learning Natural Language Processing
Approach for Multilingual Data Classiﬁcation
Jędrzej Kozal[0000−0001−7336−2561], Michał Leś,
Paweł Zyblewski[0000−0002−4224−6709], Paweł Ksieniewicz[0000−0001−9578−8395],
and Michał Woźniak[0000−0003−0146−4205]
Wroclaw University of Science and Technology,
Department of Systems and Computer Networks,
Abstract. The abundance of information in digital media, which in to-
day’s world is the main source of knowledge about current events for
the masses, makes it possible to spread disinformation on a larger scale
than ever before. Consequently, there is a need to develop novel fake
news detection approaches capable of adapting to changing factual con-
texts and generalizing previously or concurrently acquired knowledge.
To deal with this problem, we propose a lifelong learning-inspired ap-
proach, which allows for fake news detection in multiple languages and
the mutual transfer of knowledge acquired in each of them. Both classical
feature extractors, such as Term Frequency-Inverse Document Frequency or
Latent Dirichlet Allocation, and deep nlp (Natural Language Processing)
models based on bert (Bidirectional Encoder Representations from
Transformers), paired with an mlp (Multilayer Perceptron) classifier,
were employed. The results of experiments conducted on two datasets
dedicated to the fake news classiﬁcation task (in English and Spanish,
respectively), supported by statistical analysis, conﬁrmed that utiliza-
tion of additional languages could improve performance for traditional
methods. Also, in some cases supplementing the deep learning method
with classical ones can positively impact obtained results. The ability
of models to generalize the knowledge acquired between the analyzed
languages was also observed.
Keywords: natural language processing · lifelong learning · classifier
ensemble · transformers · bidirectional encoder · representation learning
1 Introduction

It is almost a cliché to say that the modern economy and society strongly depend
on information. Therefore we all expect that the information we are provided
with will be reliable and credible, enabling us to make rational decisions. This
is the ideal; in reality, the role of information and disinformation has been
appreciated since ancient times, and information manipulation has become a
critical weapon used to gain political or material benefits.

arXiv:2206.11867v1 [cs.CL] 25 May 2022

Nowadays, the
problem of so-called fake news has attracted particular attention because
manipulation on such a scale has not been observed before. One can think of the news
spread related to the COVID-19 pandemic or the Russian invasion of Ukraine.
This problem is particularly evident in digital media, hence almost all global
Internet platforms, such as Facebook or Twitter, indicate in their terms of service
the mechanisms for verifying information. Unfortunately, manual verification
of information veracity, relying on human experts as fact-checkers, is
relatively slow. Research from the MIT Media Lab shows that fake news travels
farther and faster than legitimate news. Hence, one of the challenges is to develop
mechanisms that can detect fake news automatically and have the ability to
improve their model continuously.
Among the ML-based approaches to fake news detection, the literature
distinguishes methods for:
– Text analysis, which covers the analysis of Natural Language Processing (nlp)
data representations without the linguistic context, as well as psycholinguistic
analysis and processing of the syntactic text structure.
– Reputation analysis, which measures the reputation of an article and its
publisher based on sources such as content, reviews, domain, or IP address.
– Network analysis, which is related to graph theory and is performed to
evaluate the truthfulness of the information.
– Image-manipulation recognition, which is dedicated to the detection of
image tampering, copy-move forgeries, and other image modifications
such as contrast enhancement.
This work deals with nlp, and introduces a training procedure for a fake
news detection model that can classify news provided in diﬀerent languages and
perform knowledge transfer between them.
2 Related Works
Fake news detection with nlp is based on the extraction of features directly from
the content of the analyzed texts, without taking into account their social con-
text, and is currently the basis for each of the subtasks included in the detection
of fake news. Many of the classic nlp methods are based on the bag-of-words
approach, which creates a vector for each document containing information
about the number of times each word appears in it. An example of an extension
of this approach may be n-grams, which can tokenize a sequence of words of a
given length, rather than focusing only on individual words. The minimum range
of n-grams is one (so-called unigrams), while its upper limit strongly depends
on the length of the analyzed documents. Additionally, the weights of individual
words can be normalized for a specific document using Term Frequency (tf),
or for the entire analyzed corpus, using Term Frequency-Inverse Document
Frequency (tf-idf). Another noteworthy approach to extracting features
from text is Latent Dirichlet Allocation (lda), a generative probabilistic
model for topic modeling.
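As a toy illustration of the weighting described above (a minimal, unsmoothed tf-idf variant with invented example sentences; library implementations such as scikit-learn's TfidfVectorizer add smoothing and normalization on top of this):

```python
import math
from collections import Counter

def tfidf(corpus):
    """Toy tf-idf: tf = count / doc length, idf = ln(N / df).

    Illustrative only -- production variants differ in smoothing
    and vector normalization.
    """
    tokenized = [doc.lower().split() for doc in corpus]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # document frequency: in how many documents each word appears
    df = {w: sum(w in doc for doc in tokenized) for w in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vec = {w: (counts[w] / len(doc)) * math.log(n_docs / df[w])
               for w in vocab}
        vectors.append(vec)
    return vectors

corpus = ["fake news spreads fast", "real news is slow", "fake fake claims"]
vecs = tfidf(corpus)
# "fake" occurs once in the 4-word first document and in 2 of the 3
# documents, so its weight there is (1/4) * ln(3/2).
```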
Such approaches, despite their simplicity, are successfully used in the literature
for the task of recognizing fake news. Hassan et al. classified Tweets as
credible and not credible, using five different classification algorithms and
word-based n-gram analysis, including tf and tf-idf. The authors proved
the effectiveness of this approach on the pheme benchmark dataset. Bharadwaj
and Shao employed recurrent neural networks, a random forest classifier, and a
naive Bayes classifier coupled with n-grams of various lengths in order to classify
a fake news dataset from Kaggle. Similar experiments, including e.g. n-grams,
tf-idf, and a Gradient Boosting Classifier, were conducted by Wynne and
Wint. Kaur et al. and Telang et al. evaluated multiple combinations of various
classifiers and extraction techniques in the fake news detection task [15,29]. Usage
of ensemble methods for disinformation detection has also been explored.
Currently, Deep Neural Networks (dnn) based on pre-training general language
representations are very popular in fake news detection. A strong point of the
currently used approaches, which additionally fine-tune the token representations
using supervised learning, is that only a few parameters need to be learned from
scratch. Recently, many works have shown the usefulness of knowledge transfer
in the case of natural language inference. bert (Bidirectional Encoder
Representations from Transformers) is particularly recognizable; it was designed
for training with the use of unlabeled text. Then, bert is fine-tuned for a given
problem with an additional output layer.
3 Proposed Method

Each recognition system dedicated to language data is based on two basic steps.
The ﬁrst is vectorization, which allows for the transformation of a set of textual
documents into a numerical data set interpretable by the modeling procedure, and
the second is the proper induction, which allows for the construction of a model
in a problem marked by a given bias. These procedures can be conducted separately,
in a serial manner, as in the case of the classic tf-idf and mlp pair, where the
basic extractor determines the vector representation and the neural network ﬁts
in the space designated by it, or jointly, in a parallel manner, as is the case in
deep models, such as bert.
Regardless of the integrated processing procedure typical of deep learning
approaches, in which extraction and induction take place within the same struc-
ture, also in models such as bert we can delineate the constructive elements
responsible for the extraction and – at the end of the neural network structure –
fully connected layers, which in their logic are identical to typical, shallow mlp
networks. This creates some potential for integrating classical and deep feature
extraction methods for nlp. The deep model, getting trained on the basis of a
given bias, trains weights responsible for both extraction of attributes and clas-
siﬁcation, updating them so that they are better suited to the construction of
the proper problem space. It is therefore possible to separate the extraction part
of the deep learning model and – using typical propagation – use it to obtain
attributes for the classiﬁcation model in the same manner as classical extraction
methods such as tf-idf.
Therefore, in the proposed method, the mlp model is used as a base classiﬁer,
the default in the structure for deep extractors, and at the same time, good at
optimizing the decision space for classical extraction methods. Moreover, for the
analysis of the integration potential of heterogeneous extractors, three methods
commonly used in the literature were selected:
–Term Frequency-Inverse Document Frequency – legacy model based on nor-
malization to document (tf) and to corpora (idf) as the most basic method
currently used in nlp applications.
–Latent Dirichlet Allocation – a generative statistical model, similar to the
latest unsupervised learning methods, but using approaches to identify sig-
niﬁcant thematic factors.
–Bidirectional Encoder Representations from Transformers – the current state-
of-the-art model, being a clear baseline in contemporary nlp experiments.
The bert model, unlike tf-idf and lda, is not a language-independent nlp
task solver. To enable its operation in a multilingual environment, base models
were selected, trained on different corpora, with a similar structure size. The first
was the standard version pre-trained on an English corpus (bert); the second was
beto, a variant of the bert architecture pre-trained on a Spanish dataset
(bertspa); and the third was the standard bert pre-trained on a multilingual
corpus (bertmult).
nlp extractors, regardless of the procedure used, tend to generate many re-
dundant attributes, signiﬁcantly increasing the optimization space of the neural
network. In the case of an integration architecture based on the concatenation
of features obtained from a heterogeneous pool of extractors, this tendency will
be further deepened. Additionally, diﬀerent extraction contexts may also allow
for the diversiﬁcation of models, where diﬀerent processing algorithms allow for
alternative insight into the data. For this purpose, the following dimensionality
reduction solutions will be applied:
–Mutual Information (mi)
–Analysis of Variance (anova)
–Principal Components Analysis (pca)
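As an illustration of one of these criteria, here is a minimal sketch of per-feature anova F scoring for a binary label (our own simplified code with invented data; sklearn.feature_selection.f_classif is a production counterpart):

```python
def anova_f_scores(X, y):
    """Per-feature one-way ANOVA F statistic for a binary label vector.

    X: list of feature rows, y: list of 0/1 labels. A larger F means the
    feature's class-conditional means differ more relative to the
    within-class variance, so top-F features are kept.
    """
    n_features = len(X[0])
    groups = {0: [r for r, lbl in zip(X, y) if lbl == 0],
              1: [r for r, lbl in zip(X, y) if lbl == 1]}
    n = len(X)
    scores = []
    for j in range(n_features):
        col = [row[j] for row in X]
        grand = sum(col) / n
        # between-group sum of squares (df = k - 1 = 1 for two classes)
        ssb = sum(len(g) * ((sum(r[j] for r in g) / len(g)) - grand) ** 2
                  for g in groups.values())
        # within-group sum of squares (df = n - k = n - 2)
        ssw = sum((r[j] - sum(rr[j] for rr in g) / len(g)) ** 2
                  for g in groups.values() for r in g)
        scores.append((ssb / 1) / (ssw / (n - 2)) if ssw > 0 else float("inf"))
    return scores

# Feature 0 separates the two classes perfectly; feature 1 carries no signal.
X = [[0, 5], [0, 6], [1, 5], [1, 6]]
y = [0, 0, 1, 1]
scores = anova_f_scores(X, y)
```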
With the use of all the tools introduced above, it is possible to propose two
alternative architectures for integrating the extractors. They are presented in
Figure 1, compiling the processing schemes of sis and ers policies.
As part of the Simple Integration Schema policy (sis), each of the extractors
– independently – builds an extraction model which, after processing the avail-
able language corpora (A, B and C corpora) for each of them, allows for the
transformation from a text into a vector representation by pre-limited, constant
dimensionality.

Fig. 1. Ensemble diversification policies: the Simple Integration Schema and
the Extractor Reduction Schema, combining the tf-idf, lda, mult, eng, and spa
feature extraction approaches through concatenated feature spaces.

For each transformation, a separate classifier is built, which finally goes to
the pools of models dedicated to different corpora, integrated internally and
externally through the accumulation of support. Internal integration is
carried out classically, and external integration depends on the source languages
of the corpora. In the case of a linguistically homogeneous pool, the accumu-
lation of support is not modiﬁed. Otherwise, (a) the number of languages is a
multiplier of the number of problem classes, (b) monolingual models receive zero
foreign language support, and (c) multilingual models replicate class supports
divided by the language multiplier to preserve the complementarity condition of
the support vector.
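The three external-integration rules (a)-(c) above can be sketched as follows. This is a minimal sketch, not the paper's implementation; function names and the two-class, two-language example are ours:

```python
def extend_supports(class_support, lang_idx, n_langs, multilingual):
    """Map a model's class-support vector into the language-extended space.

    For C classes and L languages the extended space has L*C entries
    (rule a). Monolingual models contribute zeros outside their own
    language block (rule b); multilingual models replicate their supports
    across all blocks, divided by L so the vector still sums to 1 (rule c).
    """
    c = len(class_support)
    if multilingual:
        return [s / n_langs for _ in range(n_langs) for s in class_support]
    ext = [0.0] * (n_langs * c)
    ext[lang_idx * c:(lang_idx + 1) * c] = class_support
    return ext

def accumulate(extended_vectors):
    """External integration: average the extended support vectors."""
    n = len(extended_vectors)
    return [sum(v) / n for v in zip(*extended_vectors)]

# Two classes (fake / real), two languages (eng = block 0, spa = block 1):
eng_model = extend_supports([0.7, 0.3], lang_idx=0, n_langs=2,
                            multilingual=False)   # [0.7, 0.3, 0.0, 0.0]
multi_model = extend_supports([0.6, 0.4], lang_idx=None, n_langs=2,
                              multilingual=True)  # [0.3, 0.2, 0.3, 0.2]
fused = accumulate([eng_model, multi_model])
```

Both extended vectors still sum to one, so the fused support vector remains complementary as required.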
The Extractor Reduction Schema (ers) strategy performs the extraction in
the same way as sis. However, separate models are not built for each of the
extractors; instead, the object representations obtained within the corpora are
concatenated into
a wide, common space of the problem. The space dimensionality is then reduced
to a predetermined size, i.e., the upper limit of the number of attributes per
extractor. A separate classification model is trained for each corpus, constituting
a narrow pool that is integrated in the same way as in the external sis ensemble
policy.
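A minimal numpy sketch of the two ers steps under assumed toy dimensions (three extractors yielding 20 attributes each for 40 documents, random stand-in features, pca as the reducer):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for three heterogeneous extractors, 20 attributes each.
blocks = [rng.normal(size=(40, 20)) for _ in range(3)]

# ers step 1: concatenate per-extractor representations into one wide space.
wide = np.hstack(blocks)                      # shape (40, 60)

# ers step 2: reduce the wide space back to the per-extractor budget
# (here 20 attributes) via PCA computed from the SVD of centered data.
centered = wide - wide.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:20].T                # shape (40, 20)
```

Because singular values are returned in descending order, the leading reduced components carry the most variance.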
The aim of the research is to obtain experimentally veriﬁed information about
the appropriate integration strategy of classifier ensembles for the needs of
multilingual natural language processing systems. Both the potential of the
various extraction methods – in each of the selected corpora – and the
capabilities of the various dimension reducers in the ers policy will be verified.
4 Experimental Evaluation

The experiments aim to analyze how knowledge gained from different languages
can positively aﬀect the overall classiﬁcation performance. During the experi-
mental evaluation, we intend to answer the following research questions:
1. How does fake news classification performance depend on different extraction
methods?
2. How do ensemble models diversified by various extractors, training sets, and
training features improve fake-news classification?
3. Can we utilize models trained with different language corpora to improve
classification in other languages?
4. What methods for feature integration can be useful in a multilingual
environment?
Datasets. Two datasets were utilized in this study: the Spanish Fake News
Corpus and the Kaggle Fake News ("Build a system to identify unreliable news
articles") dataset. These datasets were treated as representations of the
fake-news distribution in Spanish and English, respectively. The third variant
was a mixed dataset that combined the Spanish and English datasets. Detailed
information about the number of samples for each dataset is given in Table 1. As
we can see, both the Spanish and English datasets are internally balanced. However,
please note a significant, 1:31 imbalance in the presence of the source datasets
in the mixed article collection. For this reason, in the further parts we report
balanced accuracy as the evaluation metric, which in the case of the mixed dataset
in fact reflects the quality of a four-class problem – sensitizing the measure to
errors made on the under-represented language.
Data preprocessing. Each word was changed to lowercase before computing
tf-idf features. For this extraction method, features were computed based on
unigrams. Words that occur in fewer than 20 documents or in more than
half of all documents were removed during lda preprocessing. lda features were computed with
Table 1. Description of the used datasets. Fake-news learning examples are
considered as the positive class.

dataset          #samples   #negative   #positive     ir
esp fake              676         338         338    1:1
kaggle             20 800      10 387      10 413    1:1
mixed              21 476      10 725      10 751    1:1
ir (esp:kaggle)      1:31        1:31        1:31
chunk size 2000, 20 passes, and 400 iterations. Both tf-idf and lda extracted
100 features from each document.
Classification models. The mlp used for training on extracted features had two
hidden layers with 500 neurons each and was trained with the Adam optimizer
with a learning rate of 0.001 for 200 epochs with Early Stopping. bert-based
models were trained for 5 epochs each with the AdamW optimizer.
Experimental protocol. To obtain more reliable results, 5x2 cross-validation
was employed. However, there is a significant difference in the number of
samples between the English and Spanish datasets. For this reason, stratification
was introduced over both the Spanish and English labels in the case of the mixed
dataset. This modification avoids over-representation of learning examples from
one language in any single fold.
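The stratification described above can be sketched as a split over a composite (language, class) key; this is an illustrative toy (names and the tiny imbalanced label set are ours — in practice scikit-learn's StratifiedKFold does the same job):

```python
import random
from collections import defaultdict

def stratified_two_fold(labels, seed=0):
    """Split indices into two folds, stratified on a composite key.

    For the mixed corpus the key would be (language, class), so each fold
    keeps the language ratio instead of letting one language dominate.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for idx, key in enumerate(labels):
        buckets[key].append(idx)
    folds = ([], [])
    for key in sorted(buckets):            # split every stratum separately
        idxs = buckets[key][:]
        rng.shuffle(idxs)
        half = len(idxs) // 2
        folds[0].extend(idxs[:half])
        folds[1].extend(idxs[half:])
    return folds

# Toy imbalanced corpus: 62 English examples, 2 Spanish ones.
labels = [("eng", c) for c in [0, 1] * 31] + [("spa", c) for c in [0, 1]]
fold_a, fold_b = stratified_two_fold(labels)
```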
Result analysis. To analyse the classification performance of the chosen
algorithms, we employed the 5x2cv combined F test for statistical analysis.
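The 5x2cv combined F statistic (Alpaydin's test) can be computed directly from the per-repeat, per-fold performance differences between two classifiers; a sketch with made-up numbers, not the authors' code:

```python
def combined_f_statistic(diffs):
    """Alpaydin's 5x2cv combined F statistic.

    `diffs` holds 5 pairs: the accuracy difference between two classifiers
    on each of the two folds of each cross-validation repeat. Under the
    null hypothesis of equal performance the statistic follows an
    F distribution with (10, 5) degrees of freedom.
    """
    # numerator: sum of all ten squared differences
    num = sum(p ** 2 for pair in diffs for p in pair)
    # denominator: twice the sum of the per-repeat variance estimates
    den = 0.0
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2
        den += (p1 - mean) ** 2 + (p2 - mean) ** 2
    return num / (2 * den)

# Illustrative per-fold accuracy differences from five 2-fold repeats:
diffs = [(0.02, 0.03), (0.01, 0.04), (0.03, 0.02), (0.02, 0.02), (0.05, 0.01)]
f_stat = combined_f_statistic(diffs)
```

The resulting value would be compared against the F(10, 5) critical value for the chosen significance level.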
Implementation and reproducibility. A GitHub repository with the code for all
experiments is available online.
E1: Classification based on a single feature extraction method. The
classification accuracies obtained for single feature extraction methods are
presented in Tables 2 and 3. We use these values as a baseline for further
experiments. We observe a significant difference between all deep models and
tf-idf and lda for both the text and title attributes. Additionally, for the text
attribute, bert was better than bertmult for the kaggle dataset, and lda was
better than tf-idf for the kaggle and mixed datasets. The solution based on
bertspa was better than bertmult for the esp fake dataset and better than any
other model for the mixed dataset.
E2: Classification based on multiple feature extraction methods. In this
part of the experiments, we create ensembles according to the Simple Integration
Schema and the Extractor Reduction Schema introduced earlier. We use multiple base models
Table 2. Accuracy for each extractor trained and evaluated on one dataset for
the text attribute. Numbers beneath the scores are the column indices
(1: tf-idf, 2: lda, 3: mult, 4: eng, 5: spa) of the methods that the given one
is significantly better than.

dataset     tf-idf   lda    mult     eng        spa
kaggle      .866     .911   .981     .985       —
            —        1      1, 2     1, 2, 3    —
esp fake    .751     .618   .786     —          .853
            2        —      2        —          1, 2, 3
mixed       .729     .768   .972     .974       .982
            —        1      1, 2     1, 2       1, 2, 3, 4
Table 3. Accuracy for each extractor trained and evaluated on one dataset for
the title attribute. Numbers beneath the scores are the column indices
(1: tf-idf, 2: lda, 3: mult, 4: eng, 5: spa) of the methods that the given one
is significantly better than.

dataset     tf-idf   lda    mult     eng        spa
kaggle      .904     .829   .958     .963       —
            2        —      1, 2     1, 2       —
esp fake    .540     .553   .675     —          .852
            —        —      1, 2     —          1, 2, 3
mixed       .646     .631   .942     .948       .982
            —        —      1, 2     1, 2, 3    1, 2, 3, 4
trained with the same language. Results are presented in Tables 4 and 5. pca
was significantly worse than sa (support accumulation) for esp fake and mixed,
and worse than all other methods for the mixed dataset with the text attribute.
For the title attribute, pca was better than sa on the kaggle dataset; sa was
better than all other methods for esp fake; and for the mixed dataset, pca was
worse than the other methods, while minfo and anova were better than sa.
E3: Classification with different languages. We utilize a single extraction
method trained on different datasets to precompute features for one dataset.
The extracted features are concatenated and then used to train a single
classifier. The impact of different language features on fake news classification
performance is presented in Tables 6 and 7. To keep the presentation of the
results consistent, the statistical analysis was carried out by comparing different
models while keeping the dataset and the ensemble construction method the same.
No significant difference was found for the text or title attributes.
Table 4. Accuracy for different ensemble construction methods with the text
attribute. Numbers beneath the scores are the column indices (1: sa, 2: minfo,
3: anova, 4: pca) of the methods that the given one is significantly better than.

dataset     sa      minfo   anova   pca
kaggle      .988    .987    .987    .983
            —       —       —       —
esp fake    .851    .833    .831    .788
            4       —       —       —
mixed       .913    .894    .903    .785
            4       4       4       —
Table 5. Accuracy for different ensemble construction methods with the title
attribute. Numbers beneath the scores are the column indices (1: sa, 2: minfo,
3: anova, 4: pca) of the methods that the given one is significantly better than.

dataset     sa      minfo   anova   pca
kaggle      .955    .963    .963    .964
            —       1, 3    1       1
esp fake    .730    .737    .732    .674
            4       —       —       —
mixed       .885    .905    .879    .760
            4       4       4       —
E4: Classification with different extraction methods and languages.
The classification performance of ensembles using a combination of all languages
and extraction methods is presented in Tables 8 and 9. For the text attribute, we
observe meaningful differences between pca and the rest of the methods, and
between minfo and sa, for the kaggle dataset, as well as a difference between
anova and pca for the mixed dataset. For the title attribute, we found that pca
is better than minfo for kaggle and sa is better than pca for esp fake. For the
mixed dataset, anova is better than all other methods, sa is better than pca,
and minfo is better than sa and pca.
Visualization of learned features To provide more insight into learned rep-
resentations, we visualize average feature vectors. We consider features for the
dataset that the extractor was trained on and features generated by applying
the trained extractor to two other datasets. Next, we plot the average feature
vector as a heatmap to show the most active features for each training-evaluation
dataset pair. Results are presented in Fig. 2. First, we focus on plots for tf-idf
and lda. Most active features are obtained for the same dataset that the extrac-
tor was trained on. When applying the extractor to the datasets with a diﬀerent
language, some features have signiﬁcantly higher values. This indicates that ap-
Table 6. Accuracy for ensembles obtained from evaluating one extractor on all 3
datasets with the text attribute.

dataset     tf-idf   lda    mult.   eng.   beto
kaggle      .933     .951   .934    .950   —
esp fake    .741     .764   .757    —      .761
mixed       .775     .772   .782    .740   .780

kaggle      .922     .943   .921    .941   —
esp fake    .759     .772   .781    —      .746
mixed       .809     .802   .805    .801   .818

kaggle      .924     .945   .925    .944   —
esp fake    .771     .773   .783    —      .748
mixed       .801     .793   .795    .773   .791

kaggle      .921     .942   .922    .944   —
esp fake    .757     .775   .780    —      .749
mixed       .790     .815   .782    .788   .813
plying the extraction model to a diﬀerent language can provide an additional
source of information that can be beneﬁcial for classiﬁcation quality. Further-
more, it can explain why exploiting extractors trained on diﬀerent languages
improved performance for tf-idf and lda in the third experiment.
These findings are strictly connected to the inner workings of the extraction
algorithms. When extracting features with tf-idf, the first step is to create a
vocabulary from the corpus. Next, feature vectors are computed based on the
frequency of each word. When processing text whose words mostly fall outside
of the vocabulary, tf-idf will return a sparse vector. Similar reasoning can be
applied to lda.
Table 7. Accuracy for ensembles obtained from evaluating one extractor on all 3
datasets with the title attribute.

dataset     tf-idf   lda    mult.   eng.   beto
kaggle      .916     .931   .913    .928   —
esp fake    .638     .668   .655    —      .622
mixed       .676     .688   .678    .659   .669

kaggle      .903     .921   .901    .921   —
esp fake    .628     .680   .681    —      .635
mixed       .740     .743   .747    .742   .739

kaggle      .914     .926   .901    .923   —
esp fake    .633     .682   .685    —      .627
mixed       .751     .756   .750    .749   .743

kaggle      .913     .929   .911    .927   —
esp fake    .618     .685   .679    —      .631
mixed       .764     .756   .769    .727   .767
Transformers use a different procedure for text preprocessing. The bert tokenizer
can divide a single word into separate tokens if it falls outside of the vocabulary.
These tokens are then used to generate word embeddings that are passed through
the transformer network. As a result, when dealing with a different language, the
model will not produce sparse output but some vector that may or may not contain
useful information. This is shown in the two lowest rows of Fig. 2. There is no
significant difference between the beto average features, regardless of whether
they are computed with the same language the model was trained on or not. This, in
Table 8. Results for utilizing both different extraction methods and different
languages in an ensemble, with the text attribute. Numbers beneath the scores are
the column indices (1: sa, 2: minfo, 3: anova, 4: pca) of the methods that the
given one is significantly better than.

dataset     sa      minfo   anova   pca
kaggle      .986    .988    .988    .990
            —       —       —       1
esp fake    .845    .830    .831    .804
            —       —       —       —
mixed       .884    .875    .838    .869
            3       —       —       —
Table 9. Results for utilizing both different extraction methods and different
languages in an ensemble, with the title attribute. Numbers beneath the scores are
the column indices (1: sa, 2: minfo, 3: anova, 4: pca) of the methods that the
given one is significantly better than.

dataset     sa      minfo   anova   pca
kaggle      .965    .963    .964    .963
            —       —       —       —
esp fake    .771    .750    .753    .673
            4       —       —       —
mixed       .762    .908    .909    .806
            —       1, 4    1, 4    —
turn, can explain why we observe a decrease in accuracy for the deep models in
the third experiment.

4.3 Lessons learned
When analyzing the accuracy obtained for single extractors (Tab. 2 and 3), one can
notice that deep learning-based methods obtain the best results. This is expected,
as deep models currently dominate the nlp field [9,25,4]. One can also notice that
models dedicated to a specific language obtained better results than MultiBERT.
This can be explained by the lower number of learning examples in the datasets
utilized in our experiments. MultiBERT is the largest model; therefore, it can
be prone to overfitting with a smaller amount of data. Also, it is worth noting
that the difference in performance between beto and MultiBERT is larger than
between the English version of bert and MultiBERT. This is probably caused by
the improved pretraining procedure from RoBERTa that was used in beto.
Better pretraining can lead to improved performance in downstream tasks [9,25].
Although MultiBERT was initially trained with a multi-language corpus and has
the largest number of parameters, it is not the best model for the mixed dataset.
Fig. 2. Visualization of average features learned by lda, tf-idf, and beto. Dataset
labels on the y-axis correspond to training datasets and labels on the x-axis
correspond to evaluation datasets. Brighter colors correspond to higher values.
Unfortunately, we found no good explanation for this phenomenon. Results
obtained for the Spanish dataset are worse compared to kaggle and mixed, which
can be easily explained by the larger number of samples in the kaggle dataset
compared to esp fake (please refer to Tab. 1). These findings provide an answer
to research question 1. Deep models obtain better or similar results when
comparing the text and title attributes. Unfortunately, the same cannot be said
about tf-idf and lda, where performance for the title attribute is worse in most
cases.
Regardless of the ensemble construction methodology, the obtained accuracy is
close for all methods (Tab. 4 and 5). Comparing the metric values to the first
experiment, we can state that employing an ensemble benefits accuracy only
for the English dataset. This is probably caused by the close performance of
all models on the kaggle dataset. In the case of the Spanish dataset, there is
a clear gap between beto and the rest of the models, and in the case of the
mixed dataset, there is a gap between the deep learning models and the two other
methods. When combining strong classifiers with weaker ones in one ensemble,
we can obtain worse results than with the strongest classifier alone. In this case,
the utilization of additional models with worse performance is more harmful
than helpful. These results answer research question 2.
In the third experiment, a feature extractor trained on a single dataset was
used to extract features from all datasets and to create a new ensemble from
the 3 different sets of features (Tab. 6 and 7). This approach is beneficial for
tf-idf and lda but detrimental for deep models. The statistical analysis shows
no significant differences between tf-idf, lda, and the deep models. This finding
is quite interesting given the clear dominance of deep models in the first
experiment. It is important to note that the deep models achieved worse results
than in the first experiment; at the same time, however, there was a significant
gain in performance for tf-idf and lda. We provide a possible explanation when
discussing the feature visualization from Fig. 2. This answers research question 3.
Utilizing both different models and features extracted for different languages
(Tab. 8 and 9) does not provide an improvement over the baselines, nor over
constructing ensembles from different models alone. These findings, combined
with the conclusions from Experiment 2, answer research question 4.
5 Conclusions

This study aimed to apply nlp methods to the problem of detecting misinformation
in messages produced in different languages, and to verify whether it is possible
to transfer knowledge between models trained for different languages.
Based on the results of the experimental studies, it was not found that diﬀer-
ent models or attributes extracted for diﬀerent languages led to a noticeable, i.e.,
statistically signiﬁcant, improvement over the baseline results. Also, no signiﬁ-
cant improvement was conﬁrmed using methods based on the classiﬁer ensemble
concept. This can be explained by the limited scope of the datasets' contents.
Articles in the Spanish dataset concern local affairs; therefore, the intersection
of topics between the two datasets can be small. Moreover, this indicates that
preparing a fake-news detection model with a single language and utilizing the
learned knowledge for other languages is a difficult problem that should be
further explored in research. With this in mind, future work should focus on
preparing articles and datasets covering the same topics in multiple languages.
By utilizing the same type of article topics in multiple languages, one can
obtain better results. Our work showed that some classes of models could benefit
from utilizing multiple languages, so it is reasonable to expect that aligning
the topics should further improve results.
Considering the achieved results, as well as the literature analysis, it seems
that the direction of further work may be related to the transfer of knowledge,
but rather within a single language; in particular, it seems attractive to
transfer knowledge between fake news detection tasks concerning different
subject areas.
References

1. Bharadwaj, P., Shao, Z.: Fake news detection with semantic features and text
mining. International Journal on Natural Language Computing (IJNLC) Vol 8
2. Bilal, M., Habib, H.A., Mehmood, Z., Saba, T., Rashid, M.: Single and multi-
ple copy–move forgery detection and localization in digital images based on the
sparsely encoded distinctive features and dbscan clustering. Arabian Journal for
Science and Engineering 45(4), 2975–2992 (2020)
3. Bondi, L., Lameri, S., Guera, D., Bestagini, P., Delp, E.J., Tubaro, S., et al.: Tam-
pering detection and localization through clustering of camera-based cnn features.
In: CVPR Workshops. vol. 2 (2017)
4. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Nee-
lakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A.,
Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Win-
ter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark,
J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language
models are few-shot learners. CoRR abs/2005.14165 (2020), https://arxiv.
5. Cañete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Pérez, J.: Spanish pre-
trained bert model and evaluation data. In: PML4DC at ICLR 2020 (2020)
6. Choraś, M., Demestichas, K., Giełczyk, A., Herrero, Á., Ksieniewicz, P., Re-
moundou, K., Urda, D., Woźniak, M.: Advanced machine learning techniques for
fake news (online disinformation) detection: A systematic mapping study. Applied
Soft Computing 101, 107050 (2021)
7. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., Bordes, A.: Supervised learning
of universal sentence representations from natural language inference data. arXiv
preprint arXiv:1705.02364 (2017)
8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirec-
tional transformers for language understanding. arXiv preprint arXiv:1810.04805
9. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirec-
tional transformers for language understanding. CoRR abs/1810.04805 (2018),
10. Girosi, F., Jones, M., Poggio, T.: Regularization theory and neu-
ral networks architectures. Neural Computation 7(2), 219–269 (1995).
11. Harris, Z.S.: Distributional structure. Word 10(2-3), 146–162 (1954)
12. Hassan, N., Gomaa, W., Khoriba, G., Haggag, M.: Credibility detection in twitter
using word n-gram analysis and supervised machine learning techniques. Interna-
tional Journal of Intelligent Engineering and Systems 13(1), 291–300 (2020)
13. Hoﬀman, M., Blei, D., Bach, F.: Online learning for latent dirichlet allocation.
vol. 23, pp. 856–864 (11 2010)
14. Jones, K.S.: A statistical interpretation of term speciﬁcity and its application in
retrieval. Journal of documentation (1972)
15. Kaur, S., Kumar, P., Kumaraguru, P.: Automating fake news detection system
using multi-level voting model. Soft Computing 24(12), 9049–9069 (2020)
16. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. International
Conference on Learning Representations (12 2014)
17. Ksieniewicz, P., Choraś, M., Kozik, R., Woźniak, M.: Machine learning methods
for fake news classiﬁcation. In: International Conference on Intelligent Data Engi-
neering and Automated Learning. pp. 332–339. Springer (2019)
18. Kumar, S., Carley, K.M.: Tree lstms with convolution units to predict stance and
rumor veracity in social media conversations. In: Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics. pp. 5047–5058 (2019)
19. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M.,
Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized BERT pretraining
approach. CoRR abs/1907.11692 (2019), http://arxiv.org/abs/1907.11692
20. Loshchilov, I., Hutter, F.: Fixing weight decay regularization in adam. CoRR
abs/1711.05101 (2017), http://arxiv.org/abs/1711.05101
21. Luhn, H.P.: A statistical approach to mechanized encoding and searching of literary
information. IBM Journal of research and development 1(4), 309–317 (1957)
22. Posadas Durán, J., Gomez Adorno, H., Sidorov, G., Moreno, J.: Detection of fake
news in a new corpus for the spanish language. Journal of Intelligent & Fuzzy
Systems 36, 4869–4876 (05 2019). https://doi.org/10.3233/JIFS-179034
23. Posetti, J., Matthews, A.: A short guide to the history of ‘fake news’ and disinfor-
mation. International Center for Journalists 7(2018), 2018–07 (2018)
24. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language un-
derstanding with unsupervised learning (2018)
25. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Lan-
guage models are unsupervised multitask learners (2018), https://d4mucfpksywv.
26. Rodríguez, Á.I., Iglesias, L.L.: Fake news detection using deep learning. arXiv
preprint arXiv:1910.03496 (2019)
27. Saquete, E., Tomás, D., Moreda, P., Martínez-Barco, P., Palomar, M.: Fighting
post-truth using natural language processing: A review and open challenges. Expert
systems with applications 141, 112943 (2020)
28. Suryawanshi, P., Padiya, P., Mane, V.: Detection of contrast enhancement forgery
in previously and post compressed jpeg images. In: 2019 IEEE 5th International
Conference for Convergence in Technology (I2CT). pp. 1–4. IEEE (2019)
29. Telang, H., More, S., Modi, Y., Kurup, L.: An empirical analysis of classiﬁcation
models for detection of fake news articles. In: 2019 IEEE International Conference
on Electrical, Computer and Communication Technologies (ICECCT). pp. 1–7.
30. Wynne, H.E., Wint, Z.Z.: Content based fake news detection using n-gram models.
In: Proceedings of the 21st International Conference on Information Integration
and Web-based Applications & Services. pp. 669–673 (2019)
31. Xu, K., Wang, F., Wang, H., Yang, B.: Detecting fake news over online social
media via domain reputations and content understanding. Tsinghua Science and
Technology 25(1), 20–27 (2019)
32. Zhang, D., Zhou, L., Kehoe, J.L., Kilic, I.Y.: What online reviewer behaviors re-
ally matter? eﬀects of verbal and nonverbal behaviors on detection of fake online
reviews. Journal of Management Information Systems 33(2), 456–481 (2016)
33. Zhou, X., Zafarani, R.: Network-based fake news detection: A pattern-driven ap-
proach. ACM SIGKDD explorations newsletter 21(2), 48–60 (2019)
34. Zubiaga, A., Wong Sak Hoi, G., Liakata, M., Procter, R.: Pheme dataset of rumours
and non-rumours (Oct 2016)