
From Stances' Imbalance to Their Hierarchical Representation and Detection



Stance detection has gained increasing interest from the research community due to its importance for fake news detection. The goal of stance detection is to categorize the overall position of a subject towards an object into one of four classes: agree, disagree, discuss, and unrelated. One of the major problems faced by current machine learning models used for stance detection is a severe imbalance among these classes. Hence, most models fail to correctly classify instances that fall into minority classes. In this paper, we address this problem by proposing a hierarchical representation of these classes, which combines the agree, disagree, and discuss classes under a new related class. Further, we propose a two-layer neural network that learns from this hierarchical representation and controls the error propagation between the two layers using the Maximum Mean Discrepancy regularizer. Compared with conventional four-way classifiers, this model has two advantages: (1) the hierarchical architecture mitigates the class imbalance problem; (2) the regularization helps the model better discern between the related and unrelated stances. Extensive experiments demonstrate state-of-the-art accuracy of the proposed model for stance detection.
Qiang Zhang
University College London
London, United Kingdom
Shangsong Liang
Sun Yat-sen University
Guangzhou, China
Aldo Lipani
University College London
London, United Kingdom
Zhaochun Ren
Shandong University
Qingdao, China
Emine Yilmaz
University College London
London, United Kingdom
CCS Concepts: Information systems → Information extraction
Keywords: hierarchical classifier, maximum mean discrepancy
ACM Reference Format:
Qiang Zhang, Shangsong Liang, Aldo Lipani, Zhaochun Ren, and Emine
Yilmaz. 2019. From Stances’ Imbalance to Their Hierarchical Representation
and Detection. In Proceedings of the 2019 World Wide Web Conference (WWW
’19), May 13–17, 2019, San Francisco, CA, USA. ACM, New York, NY, USA,
10 pages.
This paper is published under the Creative Commons Attribution 4.0 International
(CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their
personal and corporate Web sites with the appropriate attribution.
WWW ’19, May 13–17, 2019, San Francisco, CA, USA
© 2019 IW3C2 (International World Wide Web Conference Committee), published
under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-6674-8/19/05.
The quality of online news is usually less substantiated than that
of traditional news services such as magazines or newspapers. A
large volume of fake news is being produced for political or
economic purposes. Fake news articles are those that purport to be
factual but contain misstatements of fact with the intention to
arouse passions, attract viewership, or deceive. Verifying news
content requires retrieving evidences and determining their stance
with respect to the news claims, which poses new challenges for the
conventional stance detection task. We define evidence as text,
e.g., web pages and documents, that can be used to prove whether
news content is or is not true. Moreover, automatic stance detection
has broad applications in information retrieval and text entailment
[34, 42].
The task of stance detection is to identify the stance of an
evidence towards a given news claim. Stances can be categorized into
four classes: agree, disagree, discuss and unrelated. Two
characteristics make the stance detection task peculiar. On the one
hand, news claims and evidences are often unrelated, which generates
a severe class imbalance problem. On the other hand, since the
agree, disagree and discuss classes are by definition related,
intuitively, the identification of an evidence as related or
unrelated to a news claim is semantically different from the
identification of an evidence as belonging to one of the other three
classes. These two characteristics suggest the natural presence of a
hierarchical structure among stance classes.
Stance detection has been studied in the areas of information
extraction and natural language processing. However, previous
methods tackle the task as a multiclass classification problem,
neglecting the hierarchical structure in stance classes. Also, the
commonly used four-way classifiers are easily influenced by the
class imbalance problem. In this paper, we address this issue by
modeling the stance detection task as a two-layer neural network.
The first layer aims at identifying the relatedness of the evidence,
while the second layer aims at classifying the evidences identified
as related into the other three classes: agree, disagree and
discuss. Moreover, we study three levels of dependence assumptions
between the two layers: (1) independent, when there is no error
propagation between the two layers; (2) dependent, when the error
propagation is left free; and (3) learned, when the error
propagation is controlled by Maximum Mean Discrepancy (MMD). We
show that, when learned, the neural network (a) better separates the
distributions of related and unrelated stances and (b) outperforms
the state-of-the-art accuracy for the stance detection task.
The remainder of the paper is organized as follows: § 2 summarizes
the related work; § 3 defines the stance detection task; § 4 details
the proposed hierarchical classification model and the
regularization term; § 5 describes the datasets used and the
experimental setup; § 6 is devoted to experimental results; and § 7
concludes the paper.
Machine learning techniques are widely researched to tackle the
stance detection task. Previous works focus on political or
congressional floor debates and online forums. Most of these works
rely on content-based features, such as sentiment analysis and
topic-specific features learned from labeled datasets for a closed
set of topics.
Two methods only consider the agree, disagree and discuss classes:
Bar-Haim et al. [7] split the stance detection task into three
sub-tasks and propose a Contrast Classification Algorithm to
distinguish the agree and disagree classes; Augenstein et al. [4]
build a neural network architecture based on bidirectional
conditional encoding on a Twitter dataset. A long short-term memory
(LSTM) network encodes the claim and another LSTM encodes the text
with the encoded claim as its initial state. These methods fail to
consider the unrelated class.
Two other methods consider all the classes, but use two different
models: Bourgonje et al. [10] use lemmatized n-gram matching and a
rule-based procedure to decide the evidence relatedness, and a
three-way logistic regression classifier to distinguish among the
related classes; Wang et al. [43] first develop a gradient boosted
decision tree (GBDT) model to determine the evidence relatedness,
then use another GBDT model to distinguish the stances of the text
towards the claim. These methods involve feature engineering in
separate models and cannot be jointly optimized to achieve the best
performance.
Other methods that also consider all the classes have been
developed during the Fake News Challenge stage 1 (FNC-1). The
winning team uses a 50%/50% weighted average between a GBDT model
and a convolutional neural network (CNN). The second best
performance is achieved by an ensemble of five multi-layer
perceptrons (MLPs) whose input features include bag-of-words and
semantic analysis features in addition to the baseline features
developed by the challenge organizers. In contrast to these two
solutions, the third best team does not use ensemble methods. They
use TF-IDF features and an MLP as a four-way classifier. Zhang et
al. [48] propose a ranking method to tackle the task and achieve
empirical performance improvements. However, these methods all
neglect the hierarchical structure among the four types of stances
and suffer from class imbalance.
Deep learning-based methods have also been applied in the FNC-1.
Bajaj utilizes LSTM, CNN and their variants to detect stances, and
finds that an attention-augmented CNN obtains the best performance.
Rakholia and Bhargava analyze the effectiveness of different ways of
encoding text, such as independent encoding, bidirectional
conditional encoding and attentive readers, and conclude that the
attentive reader model is the most suitable for the task. Ma et al.
[23] propose a multi-task learning algorithm that jointly detects
rumours and stances. However, all these methods fail to achieve high
accuracy for the agree and disagree classes.
There are three major defects in all the aforementioned methods:
(a) they neglect the hierarchical relationships among the four
stances; (b) they suffer from the class imbalance problem; and (c)
they fail to achieve acceptable detection performance for the agree
and disagree classes.
The stance detection task consists in classifying the stance of an
evidence towards a claim as one of four classes: agree, disagree,
discuss and unrelated. Formal definitions of these four stances are:
agree – the evidence supports the claim;
disagree – the evidence denies the claim;
discuss – the evidence does not have a position about the claim;
unrelated – the evidence is not about the claim.
In this section, we detail our proposed two-layer neural network
for stance detection. § 4.1 outlines the model. In order to better
differentiate between the related and unrelated classes, we design
an MMD regularization term in § 4.2. This is then integrated into
the two-layer neural network loss function in § 4.3. In Figure 1, we
show the architecture of our model.
4.1 Two-Layer Neural Network
Let the input space be formed by real-valued vectors of fixed
dimension, with an input to the neural network denoted as $v$. The
four-class label can be transformed into a one-hot vector $y$. The
$i$-th dimension of $y$ ($y_i$) is 1 when the stance is the $i$-th
element in the label set {agree, disagree, discuss, unrelated} and 0
otherwise. The hidden layer with parameters $\theta_u$ learns to map
$v$ to a $k$-dimensional hidden representation $u$:
$$u = f_u(v; \theta_u). \qquad (1)$$
For the two-layer classification, the first layer decides whether
the evidence is related to a claim. Hence, the first classification
layer is called the relatedness layer. This layer is parameterized
by $\theta_r$ and learns to produce a 2-dimensional normalized
vector $\hat{r}$ as follows:
$$\hat{r} = f_r(u; \theta_r). \qquad (2)$$
Note that the softmax function is included in $f_r$ to normalize the
2-dimensional vector, so each component of the vector $\hat{r}$
denotes the probability that the neural network assigns $v$ to the
related and unrelated classes, i.e., $p(\text{related})$ and
$p(\text{unrelated})$.
The second layer classifies the evidence into the related classes,
i.e., agree, disagree, or discuss stances. Hence, the second
classification layer is called the stance layer. The stance layer is
parameterized by $\theta_s$ and learns to produce a 3-dimensional
normalized vector $\hat{s}$ as follows:
$$\hat{s} = f_s(u, \hat{r} \cdot (1, 0); \theta_s), \qquad (3)$$
where the vector multiplication $\hat{r} \cdot (1, 0)$ extracts the
first element of $\hat{r}$. Note that the softmax function is also
included in $f_s$ to normalize the 3-dimensional vector, so that
each component of $\hat{s}$ denotes the conditional probability that
the neural network assigns $v$ to agree, disagree and discuss given
that $v$ is related, i.e., $p(\text{agree} \mid \text{related})$,
$p(\text{disagree} \mid \text{related})$, and
$p(\text{discuss} \mid \text{related})$.

Figure 1: The architecture of our proposed two-layer neural network.
(Components shown: Input, Relatedness Layer, Stance Layer, Output.)
We define the classification loss by the Kullback-Leibler (KL)
divergence, which measures the difference between the network
outputs and labels:
$$l_r(\theta_u, \theta_r) = D_{KL}(r \,\|\, \hat{r}), \qquad (4)$$
where $r$ is the ground-truth relatedness of the input data. $r$ is
computed from a label $y$ as follows:
$$r = (\mathbb{1}(y \neq e_4), \mathbb{1}(y = e_4)), \qquad (5)$$
where $\mathbb{1}$ is the indicator function and $e_4$ is a
4-dimensional one-hot vector with fourth element equal to 1. When
$y = e_4$ is verified, it indicates that the label belongs to the
unrelated class. Similarly, the stance classification loss can be
defined as:
$$l_s(\theta_u, \theta_r, \theta_s) = D_{KL}(s \,\|\, \hat{s}), \qquad (6)$$
where $s$ is the ground-truth stance of the input data. $s$ is
computed from a label $y$ as follows:
$$s = (\mathbb{1}(y = e_1), \mathbb{1}(y = e_2), \mathbb{1}(y = e_3)), \qquad (7)$$
where $e_1$, $e_2$ and $e_3$ are 4-dimensional one-hot vectors with
first, second, and third elements equal to 1. When $y = e_1$ is
verified, it indicates that the label belongs to the agree class;
when $y = e_2$ is verified, it indicates that the label belongs to
the disagree class; and when $y = e_3$ is verified, it indicates
that the label belongs to the discuss class.
Finally, we define the loss function for the two-layer neural
network as the linear combination of the loss function of the
relatedness layer ($l_r$) and the loss function of the stance layer
($l_s$):
$$l_c(\theta_u, \theta_r, \theta_s) = l_r(\theta_u, \theta_r) + \alpha \cdot l_s(\theta_u, \theta_r, \theta_s), \qquad (8)$$
where $\alpha$ balances the importance of the two classification
layers.
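As a minimal sketch of the label decomposition and the combined loss described above, the following standard-library Python snippet derives the relatedness target r and the stance target s from a four-class label and combines two KL-divergence terms. The function names, the placement of the weight `alpha` on the stance term, and the convention that the stance loss is zero for unrelated instances (both taken from the update rules in Algorithm 1) are assumptions made for illustration; this is not the authors' implementation.

```python
import math

LABELS = ("agree", "disagree", "discuss", "unrelated")


def hierarchical_targets(label):
    """Map a four-class label to (relatedness target r, stance target s)."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    related = label != "unrelated"
    # r = (1(y != e4), 1(y = e4)): first component marks "related".
    r = (1.0, 0.0) if related else (0.0, 1.0)
    # s = (1(y = e1), 1(y = e2), 1(y = e3)); all zeros for unrelated labels.
    s = tuple(1.0 if label == name else 0.0 for name in LABELS[:3])
    return r, s


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for discrete vectors; terms with target 0 contribute 0."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0.0
    )


def two_layer_loss(label, r_hat, s_hat, alpha=1.0):
    """Relatedness loss plus alpha times the stance loss (stance loss is 0 if unrelated)."""
    r, s = hierarchical_targets(label)
    l_r = kl_divergence(r, r_hat)
    l_s = kl_divergence(s, s_hat) if r[0] == 1.0 else 0.0
    return l_r + alpha * l_s
```

With one-hot targets the KL divergence reduces to the negative log-probability of the correct class, so an unrelated instance predicted with probability 0.5 contributes log 2 to the loss regardless of the stance layer output.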
4.2 Maximum Mean Discrepancy
The classification of related/unrelated stances is a different task
from that of agree/disagree/discuss stances. Therefore, data
representations from the relatedness layer and the stance layer can
be seen as samples drawn from two different distributions. In order
to measure the distribution discrepancy between these two layers, we
employ the Maximum Mean Discrepancy (MMD) as a regularization term.
The MMD does not involve density estimation and thus is a
non-parametric way of measuring the difference between
distributions. MMD has achieved success in face recognition and
image annotation [15].
MMD is defined as follows:
Definition 4.1. Maximum Mean Discrepancy: “Let $p$ and $q$ be two
Borel probability distributions over a space $\mathcal{X}$ and let
$X$ and $Z$ be sets with independent identically distributed samples
drawn from $p$ and $q$. The MMD is defined by a class $\Psi$ of map
functions $\psi: \mathcal{X} \to \mathcal{H}$ as:
$$\mathrm{MMD}[\Psi, p, q] := \sup_{\psi \in \Psi} \left( E_p[\psi(x)] - E_q[\psi(z)] \right). \qquad (9)$$
Here, $x$ and $z$ are samples from $X$ and $Z$.”
In other words, the MMD equation defines the largest possible
distance between two expectations over the set of functions $\Psi$.
Moreover, “when $\mathcal{H}$ is the reproducing kernel Hilbert
space (RKHS), this means that for all $x \in \mathcal{X}$, the
linear point evaluation function $\psi \mapsto \psi(x)$ exists and
is continuous. When $\Psi$ is the unit ball in a universal RKHS, it
is guaranteed that the MMD will detect any discrepancy between $p$
and $q$” [9, 35].
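Let $p$ denote the distribution for the first layer samples
(unrelated hidden representations) in our model, with sample set
$U^1 = \{u^1_1, \ldots, u^1_{n_1}\}$ and, according to Eq. (1),
their generating set $V^1 = \{v^1_1, \ldots, v^1_{n_1}\}$. Let $q$
denote the distribution for the second layer samples (agree,
disagree and discuss hidden representations), with sample set
$U^2 = \{u^2_1, \ldots, u^2_{n_2}\}$ and, according to Eq. (1),
generating set $V^2 = \{v^2_1, \ldots, v^2_{n_2}\}$. $n_1$ and $n_2$
are the numbers of samples in $U^1$ and $U^2$. Thus we have
$\psi(u) = \theta_d u$, where $\theta_d$ is a $j \times k$ matrix in
the projection layer, and $k$ and $j$ are the space dimensions.
According to Eq. (1), the hidden representation $u$ is parameterized
by $\theta_u$, thus the empirical expression of MMD is parameterized
by $\theta_u$ and $\theta_d$:
$$d(\theta_u, \theta_d) = \left\| \frac{1}{n_1} \sum_{i=1}^{n_1} \theta_d\, f_u(v^1_i; \theta_u) - \frac{1}{n_2} \sum_{i=1}^{n_2} \theta_d\, f_u(v^2_i; \theta_u) \right\|.$$
By constantly changing the projection layer parameterized by
$\theta_d$, we find the maximum expectation difference between the
representations of the two classification layers.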
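To make the empirical discrepancy concrete, the sketch below projects the hidden representations of unrelated and related instances with a linear map and measures the distance between the two projected sample means. The helper names, the list-of-lists matrix layout, and the use of the Euclidean norm are assumptions for illustration; the paper's exact norm and implementation are not recoverable from the text here.

```python
import math


def project(theta_d, u):
    """Linear projection psi(u) = theta_d u; theta_d is given as j rows of length k."""
    return [sum(w * x for w, x in zip(row, u)) for row in theta_d]


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def empirical_mmd(unrelated, related, theta_d):
    """Euclidean distance between the projected means of the two sample sets."""
    mean_unrelated = mean_vector([project(theta_d, u) for u in unrelated])
    mean_related = mean_vector([project(theta_d, u) for u in related])
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(mean_unrelated, mean_related))
    )
```

The value depends on the projection: two sample sets whose means differ only along one hidden dimension look identical under a projection that ignores that dimension, which is why the projection parameters are adjusted to increase the discrepancy.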
4.3 Optimization
The more different two distributions are, the larger the MMD is.
Hence, in order to make the distributions easier to distinguish, a
larger MMD regularization term is preferred, and we treat the
regularization term as an extra goal besides classification. We
integrate the two-layer classification loss (see Eq. (8)) and the
MMD regularization term $d(\theta_u, \theta_d)$ of § 4.2 into a
single objective function. Specifically, we combine these two
sub-goals with a hyperparameter $\beta$ as follows:
$$L(\theta_u, \theta_r, \theta_s, \theta_d) = l_c(\theta_u, \theta_r, \theta_s) - \beta \cdot d(\theta_u, \theta_d), \qquad (12)$$
where $\beta$ controls the importance of the regularization. The
larger the MMD regularization term is, the easier it is for the
classifier to distinguish between the related and unrelated stances.
Thus, the sign of the regularization term is negative.
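The optimization involves the minimization of the loss $L$ with
respect to $\theta_u$, $\theta_r$, $\theta_s$, and $\theta_d$ as
follows:
$$\min_{\theta_u, \theta_r, \theta_s, \theta_d} L(\theta_u, \theta_r, \theta_s, \theta_d). \qquad (13)$$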
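Optimizing the model consists of two sub-goals. On the one hand,
we want to maximize the distribution discrepancy between the two
classification layers. On the other hand, we want to minimize the
classification loss of both layers. Both of these sub-goals involve
updating the feature layer parameter $\theta_u$, but in opposite
update directions. The optimization process does not stop until a
saddle point is reached, i.e., a point at which the feature layer
parameters serve both sub-goals well. Algorithm 1 shows the
parameter update process, which is based on the mini-batch gradient
descent algorithm.

Algorithm 1: Parameter update process based on the mini-batch
gradient descent algorithm.
  input : Sample mini-batch $\{v_i, r_i, s_i\}_{i=1}^{n}$, mini-batch size $n$,
          hyperparameters $\alpha$, $\beta$, and $\mu$
  output: $\theta_u$, $\theta_r$, $\theta_s$, $\theta_d$
  Initialize $\theta_u$, $\theta_r$, $\theta_s$, $\theta_d$;
  repeat
    /* forward propagation */
    $l_r \leftarrow 0$; $l_s \leftarrow 0$;
    for $i$ from 1 to $n$ do
      $u_i \leftarrow f_u(v_i; \theta_u)$;
      $\hat{r}_i \leftarrow f_r(u_i; \theta_r)$;
      $l_r^i \leftarrow D_{KL}(r_i \,\|\, \hat{r}_i)$;
      $l_r \leftarrow l_r + l_r^i$;
      if $r_i \cdot (1, 0) = 1$ then
        /* classify related */
        $\hat{s}_i \leftarrow f_s(u_i, \hat{r}_i \cdot (1, 0); \theta_s)$;
        $l_s^i \leftarrow D_{KL}(s_i \,\|\, \hat{s}_i)$;
      else
        /* unrelated */
        $l_s^i \leftarrow 0$;
      $l_s \leftarrow l_s + l_s^i$;
    $d = \mathrm{MMD}(\{u_i, r_i\}_{i=1}^{n})$;
    /* backward propagation */
    $\theta_s \leftarrow \theta_s - \mu \cdot \alpha \cdot \partial l_s / \partial \theta_s$;
    $\theta_r \leftarrow \theta_r - \mu \cdot (\partial l_r / \partial \theta_r + \alpha \cdot \partial l_s / \partial \theta_r)$;
    $\theta_d \leftarrow \theta_d + \mu \cdot \beta \cdot \partial d / \partial \theta_d$;
    $\theta_u \leftarrow \theta_u - \mu \cdot (\partial l_r / \partial \theta_u + \alpha \cdot \partial l_s / \partial \theta_u - \beta \cdot \partial d / \partial \theta_u)$;
  until $\theta_u$, $\theta_r$, $\theta_s$, $\theta_d$ converge;

4.4 Prediction
Given as input a feature vector $v$, the classifier outputs the
following probabilities: $p(\text{unrelated})$,
$p(\text{agree} \mid \text{related})$,
$p(\text{disagree} \mid \text{related})$, and
$p(\text{discuss} \mid \text{related})$. However, these last three
probabilities are not comparable with the first one. To make them
comparable we derive $p(\text{agree})$, $p(\text{disagree})$, and
$p(\text{discuss})$. Since the class agree is by assumption related,
$p(\text{agree}, \text{related}) = p(\text{agree})$, and we derive
that:
$$p(\text{agree}) = p(\text{agree}, \text{related}) = p(\text{agree} \mid \text{related}) \times p(\text{related}) = p(\text{agree} \mid \text{related}) \times (1 - p(\text{unrelated})). \qquad (14)$$
Similarly, for the other two classes we derive that:
$$p(\text{disagree}) = p(\text{disagree} \mid \text{related}) \times (1 - p(\text{unrelated})),$$
$$p(\text{discuss}) = p(\text{discuss} \mid \text{related}) \times (1 - p(\text{unrelated})). \qquad (15)$$
Thereby, the actual output $\hat{y}$ of the model is:
$$\hat{y} = (p(\text{agree}), p(\text{disagree}), p(\text{discuss}), p(\text{unrelated})), \qquad (16)$$
where the class with the highest probability corresponds to the
predicted stance.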
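The prediction rule above can be sketched in a few lines of Python. The function names and the tuple layout are illustrative assumptions; the arithmetic follows the derivation of the marginal probabilities from p(unrelated) and the three conditional stance probabilities.

```python
LABELS = ("agree", "disagree", "discuss", "unrelated")


def combine_probabilities(p_unrelated, p_stance_given_related):
    """Turn p(unrelated) and p(stance | related) into one four-class distribution."""
    p_related = 1.0 - p_unrelated
    p_agree, p_disagree, p_discuss = (
        p * p_related for p in p_stance_given_related
    )
    return (p_agree, p_disagree, p_discuss, p_unrelated)


def predict(p_unrelated, p_stance_given_related):
    """Return the label with the highest combined probability."""
    probabilities = combine_probabilities(p_unrelated, p_stance_given_related)
    best_index = max(range(len(LABELS)), key=lambda i: probabilities[i])
    return LABELS[best_index]
```

For example, with p(unrelated) = 0.2 and conditional stance probabilities (0.5, 0.25, 0.25), the combined distribution is approximately (0.4, 0.2, 0.2, 0.2), so the predicted stance is agree.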
We start this section by presenting the datasets and evaluation
measures relevant to the stance detection task. Then, we describe
the features used by our model and the model parameterization.
Finally, we present the baselines. The software used to run the
experiments of this paper is available on the website of the first
author.
Experiments are conducted on two publicly available datasets: the
Emergent dataset and the FNC-1 dataset. In these two datasets, a
claim consists of a news article headline and an evidence consists
of a news article content. These datasets are split into train and
test subsets; see Table 1 for statistics about the splits.
The FNC-1 dataset consists of 75,385 instances. Each instance in
the dataset is a claim-evidence pair labeled as one of the four
stances: agree, disagree, discuss and unrelated. The ratio of
training data to testing data in the FNC-1 dataset is approximately
2:1. Every class accounts for a similar percentage in the train and
test subsets. The unrelated stances are the majority (over 70%) in
both subsets, while the disagree stances account for less than 3%.
The agree and discuss stances account for less than 10% and 20%,
respectively.
The Emergent dataset is similar to the FNC-1 dataset; however, it
contains only agree, disagree and discuss stances. Hence, it needs
to be augmented with unrelated stances. Similarly to how the
unrelated stances of the FNC-1 dataset have been labeled, we
manually labeled unrelated stances by pairing a claim with an
unrelated evidence, i.e., an evidence paired with another claim.
Moreover, to make the class distributions less imbalanced, we make
the ratio of related to unrelated stances approximately 1:1. The
augmented Emergent dataset contains 4,071 training labels and 1,024
testing labels, a ratio of approximately 4:1. Class distributions
between train and test subsets are similar.
Compared to the FNC-1 dataset, the class distribution of the
augmented Emergent dataset is more balanced. The percentage of
unrelated stances is about 50%, whereas the percentages of agree
and disagree stances are about 24% and 8%. Both datasets have
similar percentages of discuss stances.
5.2 Evaluation Measure
In line with the FNC-1 challenge, the evaluation is based on a
weighted two-level scoring system built on the accuracy measure.
This evaluation measure, called relative score, evaluates a model
by splitting the stance detection task into two sub-tasks: the
related/unrelated classification sub-task and the
agree/disagree/discuss classification sub-task. The former sub-task
is given a 25% weight because it is considered easier than the
latter sub-task, which is given a 75% weight.
We report the evaluation measures: relative score, accuracy, and
accuracy on a per class basis.
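The two-level scoring can be sketched as follows. This is a minimal illustration that assumes 0.25 points are awarded whenever the related/unrelated decision is correct and a further 0.75 points when a related instance also receives the correct stance, with the relative score obtained by normalizing against the score of a perfect prediction. The function names are hypothetical, and the official challenge scorer should be consulted for the authoritative definition.

```python
RELATED = {"agree", "disagree", "discuss"}


def weighted_score(gold, predicted):
    """Sum of per-instance points under the assumed 25%/75% weighting."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    score = 0.0
    for g, p in zip(gold, predicted):
        gold_related = g in RELATED
        if gold_related == (p in RELATED):
            score += 0.25  # related/unrelated sub-task solved
            if gold_related and g == p:
                score += 0.75  # stance sub-task solved as well
    return score


def relative_score(gold, predicted):
    """Weighted score as a percentage of the best achievable score."""
    best = weighted_score(gold, gold)
    if best == 0.0:
        raise ValueError("gold labels must not be empty")
    return 100.0 * weighted_score(gold, predicted) / best
```

Under this weighting, always predicting unrelated earns only the 0.25 points of each truly unrelated instance, so a majority-class predictor cannot reach a high relative score even on a heavily imbalanced test set.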
5.3 Feature Extraction
To represent claims and evidences we use a bag-of-words approach.
For each claim and evidence we generate a TF-IDF vector, and for
each claim-evidence pair we compute their cosine similarity. We
also include the FNC-1 official features in the input feature
vector. The final set of features includes:
TF-IDF vectors of claims;
TF-IDF vectors of evidences;
Cosine similarity (CosSim) between the claim vector and the
evidence vector;
Ratio of word overlap (WordLap) between the claim and the
evidence;
An indicator of whether a claim has refuting words (RefWord);
The polarity (Pol) of the claim and the evidence;
The number of overlapping n-grams (NGrams) for
$n \in \{4, 5, 6\}$ between the claim and the evidence.
For the TF-IDF vectors, we only use the 2,000 most frequent terms,
excluding stop words. All of these features are concatenated to
form the input feature vector v.
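A simplified view of the TF-IDF and cosine-similarity part of this feature vector is sketched below using only the Python standard library. Whitespace tokenization, the smoothed IDF formula, and all function names are assumptions made for illustration; the remaining features (word overlap, refuting words, polarity, n-gram counts) are omitted.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenization (an illustrative simplification)."""
    return text.lower().split()


def fit_tfidf(corpus, max_terms=2000, stop_words=frozenset()):
    """Build a vocabulary of the most frequent non-stop-word terms and their IDF."""
    term_counts, doc_counts = Counter(), Counter()
    for doc in corpus:
        tokens = [t for t in tokenize(doc) if t not in stop_words]
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    vocabulary = [term for term, _ in term_counts.most_common(max_terms)]
    n_docs = len(corpus)
    # Smoothed IDF; the exact weighting used in the paper is not stated.
    idf = {t: math.log((1 + n_docs) / (1 + doc_counts[t])) + 1.0 for t in vocabulary}
    return vocabulary, idf


def tfidf_vector(text, vocabulary, idf):
    """Term frequency times IDF for every vocabulary term."""
    tf = Counter(tokenize(text))
    return [tf[t] * idf[t] for t in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def pair_features(claim, evidence, vocabulary, idf):
    """Concatenate claim TF-IDF, evidence TF-IDF, and their cosine similarity."""
    c = tfidf_vector(claim, vocabulary, idf)
    e = tfidf_vector(evidence, vocabulary, idf)
    return c + e + [cosine_similarity(c, e)]
```

A claim and an evidence with no shared vocabulary terms receive a cosine similarity of zero, which is the kind of signal the relatedness layer can exploit.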
5.4 Experimental Setting
The following hyperparameters have been set via five-fold
cross-validation on the train subsets:
The dimension $k$ of hidden representations is set to 100;
The dimension $j$ of the MMD is set to 10;
The activation function used in the hidden layers is set to
The parameter $\alpha$ is set to 1.5 and 1.3 for the Emergent and
FNC-1 datasets, respectively;
The parameter $\beta$ is set to 0.001.
We include an L2 regularization term for the MLP weight parameters
in the final loss function to mitigate overfitting. Dropout is also
used to mitigate overfitting, with the rate set to 0.6. We train in
mini-batches of size 64 over the entire train subset. Note that the
gradient steps in Algorithm 1 can easily be replaced by a more
powerful optimizer such as the Adam optimizer. Early stopping is
applied when the classification loss on the validation subset does
not decrease for three consecutive iterations. The whole model is
implemented with TensorFlow.
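The early-stopping rule can be sketched as a small helper. It assumes that "does not decrease" is measured against the best validation loss seen so far and that the patience counter resets on every improvement; both are assumptions, as is the function name.

```python
def early_stopping_iteration(validation_losses, patience=3):
    """Return the index of the iteration at which training stops, or None.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive iterations.
    """
    best_loss = float("inf")
    iterations_without_improvement = 0
    for iteration, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            iterations_without_improvement = 0
        else:
            iterations_without_improvement += 1
            if iterations_without_improvement >= patience:
                return iteration
    return None
```

In practice the parameters saved at the iteration with the best validation loss would be restored after stopping; that bookkeeping is left out of the sketch.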
5.5 Baselines
We compare our model against the methods mentioned in Section 2.
These methods are detailed in the following. Among them we
distinguish between methods that use the same features as ours and
methods that learn their representations. We start with the latter
type, which we call representation learning-based baselines:
Bidirectional LSTM (BiLSTM). Augenstein et al. [4] build a neural
network architecture based on bidirectional LSTM on a Twitter
dataset. An LSTM encodes the claim, and another LSTM encodes the
evidence with the encoded claim set as its initial state. The
100-d GloVe word embedding is used as input [30];
Attentive CNN (AtCNN). Bajaj builds an attention-augmented CNN.
The claim and the evidence are input to a convolutional neural
network to obtain hidden representations, and the attention
mechanism is employed to locate the words or phrases most
influential on the final results;
Memory Network (MN). Mohtarami et al. [26] develop an end-to-end
memory network for stance detection. The network operates at the
paragraph level and integrates convolutional and recurrent neural
networks, as well as a similarity matrix as part of the overall
architecture;
Ranking Model (RM). Zhang et al. [48] build a ranking method to
tackle stance detection and achieve empirical performance
improvements. A ranking loss function is proposed to replace
Softmax and maximize the representation difference between the
four classes of stance.

Table 1: Statistics of the datasets.

Subset    Stance      Emergent               FNC-1
                      Number   Percentage    Number   Percentage
Training  agree          992   24.37          3,678    7.36
          disagree       303    7.44            840    1.68
          discuss        776   19.06          8,909   17.83
          unrelated    2,000   49.13         36,545   73.13
          total        4,071                 49,972
Testing   agree          246   24.02          1,903    7.49
          disagree        91    8.89            697    2.74
          discuss        187   18.26          4,464   17.57
          unrelated      500   48.83         18,349   72.20
          total        1,024                 25,413
We now review the second type of baselines: those methods that use
the same features as our method, which we call feature
engineering-based baselines:
Official Baseline (OB). This is the FNC-1 official baseline that
uses one gradient boosting decision trees model for four-way
classification;
Logistic Regression (LR). Bourgonje et al. [10] use n-gram
matching and a rule-based procedure to decide relatedness, and
three-way logistic regression to distinguish among the related
classes;
Gradient Boosted Decision Trees (GBDT). Wang et al. [43] develop
two GBDT models, one to determine the relatedness of an evidence
to a claim, and another to distinguish among the related classes;
Multi-Layer Perceptron (MLP). This model achieved the third best
performance in FNC-1. It extracts TF-IDF and cosine similarity
between claims and evidences as input features, and uses an MLP as
the four-class classifier.
In this section, we start by analyzing the dependency assumption.
Then, we compare and contrast our model against the baselines.
Next, we provide a sensitivity analysis of the hyperparameters. We
conclude with an impact analysis of the features used by the model.
6.1 Dependency Assumption
In Figure 2 we show the effect of the three dependency assumptions
by visualizing the learned representations using a t-SNE
projection. We observe that when the classifiers are assumed
independent, i.e., the classification is performed in cascade and
no error is propagated from the second layer to the first during
training, the learned representation separates the unrelated class
from the related ones well. When the classifiers are assumed
dependent, i.e., the two classifiers are trained together and the
error is left free to propagate from the second layer to the first,
the learned representation is not very well separated. However,
when the dependence assumption of the two classifiers is learned
via the MMD regularization, i.e., the two classifiers are trained
together with the error propagation controlled by the regularizer,
the learned representation is again well separated, as in the first
case. Well-separated representations suggest a greater
discriminative power of the model: the unrelated and related
classes are almost linearly separable.
The last three rows of Tables 2 and 3 show the performance of our
model on the two test subsets for each of the three assumptions:
independent, dependent, and learned. Looking at the accuracy of the
unrelated class, we observe that the accuracy is greater when the
learned representations are well separated, as in the independent
and learned cases. Furthermore, looking at all the other scores, we
observe that the learned assumption outperforms both the
independent and dependent assumptions in all other cases,
demonstrating that jointly learning the relatedness and the stance
of the evidences towards claims is beneficial to the stance
detection task.
In Tables 2 and 3 we compare our model against the state-of-the-art
models. Our model achieves the best stance detection performance
for the relative score on both datasets. The model achieves 89.30%
on the augmented Emergent test subset and 88.15% on the FNC-1
test subset.
By comparing with four-way classification baselines (OB, MLP,
BiLSTM, AtCNN, MN and RM) we demonstrate the advantage of
separating the relatedness detection from the stance detection. We
observe that these classifiers perform poorly on the disagree
class, which is caused by the large percentage difference between
the minority disagree class and the majority unrelated class.
Further, the more imbalanced the evaluation dataset becomes, the
worse the four-way classifiers perform on the minority disagree
class.

Figure 2: t-SNE visualization of the hidden representations on the
training data. The hidden representations of the model trained (a)
with separated layers (Independent), (b) together but without
regularization (Dependent), and (c) with MMD regularization
(Learned).
Table 2: Performance comparison of our model against the
state-of-the-art models on the augmented Emergent dataset.

Model                        Accuracy (%)                          Relative
                             agree  disagree  discuss  unrelated   Score (%)
Feature Engineering-Based Baselines
OB                           33.56  23.44     70.23    84.00       74.86
LR (Bourgonje et al.)        66.73  40.51     78.33    78.00       83.45
GBDT (Wang et al.)           80.62  50.42     83.52    88.00       87.53
MLP (Riedel et al.)          58.53  23.64     79.05    95.00       85.43
RM (Zhang et al.)            64.56  40.42     85.45    96.00       87.69
Representation Learning-Based Baselines
BiLSTM (Augenstein et al.)   43.21  12.57     78.55    96.00       81.37
AtCNN (Bajaj)                44.78  14.60     72.44    97.00       83.56
MN (Mohtarami et al.)        54.64  40.05     72.10    89.00       85.92
Our Models
Independent                  74.54  45.32     82.59    95.49       86.33
Dependent                    63.54  44.68     68.35    95.00       86.72
Learned                      82.52  69.05     84.30    97.00       89.30
By comparing with baselines that separate the relatedness
detection from the stance detection (LR and GBDT) we demonstrate
the superiority of a single end-to-end model. LR and GBDT handle
the disagree class better than most four-way baselines, although
their overall performance is worse than that of our model.
In Figure 3 we show the confusion matrices of our model. Here we
observe the detection performance on a per class basis. For the
related/unrelated classification, we correctly classify 97.00% and
99.53% of the unrelated instances on the augmented Emergent and the
FNC-1 test subsets. We can see that there is some misclassification
between the agree and unrelated classes, and between the discuss
and unrelated classes. The misclassification of the disagree class
accounts for the largest error of the unrelated instances.
Our model achieves an accuracy of 69.05% and 72.35% for the
disagree class on the Emergent and the FNC-1 test subsets. The
classification accuracy is largely improved compared to the state
of the art. Some misclassification error exists between agree and
disagree. However, our model can distinguish between the discuss
and the disagree classes with few errors. While the number of
discuss cases is the largest and the number of disagree instances
is the smallest, our model does not mistake disagree instances for
discuss ones, i.e., the model has learned the core representation
difference between these two classes. Due to ambiguous expressions,
misclassification between agree and discuss is the cause of most
errors between these classes, which leads to a slightly worse
accuracy for the discuss class on the Emergent (84.30%) and FNC-1
(77.49%) test subsets.
Table 3: Performance comparison of our model against the state-of-the-art models on the FNC-1 dataset.

                                            Accuracy (%)
Model                              agree  disagree   discuss  unrelated  Relative Score (%)

Feature Engineering-Based Baselines
OB                                 10.51      1.00     79.66      97.98               75.20
LR (Bourgonje et al.)              67.42     31.61     75.23      95.36               80.63
GBDT (Wang et al.)                 82.93     69.82     33.52      95.42               86.72
MLP (Riedel et al.)                44.04      6.60     81.38      97.90               81.72
RM (Zhang et al.)                  64.90     27.26     84.41      99.12               86.66

Representation Learning-Based Baselines
BiLSTM (Augenstein et al.)         35.96      0.94     80.33      98.54               78.70
AtCNN (Bajaj)                      38.67      8.24     70.63      91.25               75.77
MN (Mohtarami et al.)              16.92     60.22     81.27      95.50               79.92

Our Models
Independent                        72.41     37.90     68.23      97.43               83.47
Dependent                          61.34     42.93     59.38      99.05               85.32
Learned                            80.61     72.35     77.49      99.53               88.15

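For readers unfamiliar with the relative score column, the sketch below shows one way it could be computed. It assumes that the relative score is the weighted FNC-1 challenge score (0.25 for a correct related/unrelated decision, plus 0.75 for the exact stance of a related pair) divided by the maximum achievable score; the function names and the toy labels are hypothetical.

```python
RELATED = {"agree", "disagree", "discuss"}


def fnc_score(gold, pred):
    """Weighted score, assuming the FNC-1 challenge weighting."""
    score = 0.0
    for g, p in zip(gold, pred):
        if g == "unrelated" and p == "unrelated":
            score += 0.25              # correct unrelated decision
        elif g in RELATED and p in RELATED:
            score += 0.25              # correct related decision
            if g == p:
                score += 0.75          # exact stance among related classes
    return score


def relative_score(gold, pred):
    """Score as a percentage of the best achievable score on `gold`."""
    best = fnc_score(gold, gold)
    return 100.0 * fnc_score(gold, pred) / best if best else 0.0


# Toy labels, invented for illustration only.
gold = ["agree", "disagree", "discuss", "unrelated"]
pred = ["agree", "discuss", "discuss", "unrelated"]
raw = fnc_score(gold, pred)
relative = relative_score(gold, pred)
```

Under this weighting a correct stance on a related pair is worth four times a correct unrelated decision, so a model cannot obtain a high relative score by predicting the majority unrelated class alone.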
Figure 3: The confusion matrices of our model for the augmented Emergent (on the left) and FNC-1 (on the right) datasets.
Two reasons account for the improved empirical performance
observed with our model. The first is the mitigation of the class
imbalance problem. Contrary to the four-way classifiers that directly
compare the disagree and unrelated instances, the hierarchical model
avoids the direct comparison of the minority disagree class (which is
less than 2% of the FNC-1 dataset) with the majority unrelated one
(which is more than 70% of the FNC-1 dataset). The second is the MMD
term, which maximizes the discrepancy between the unrelated class and
the aggregated related classes. Since agree, disagree and discuss
belong to the same class, the related class, the MMD regularization
promotes the emergence of features that are useful to separate the
class pairs: agree from unrelated, disagree from unrelated, and
discuss from unrelated.
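To make the role of the discrepancy term concrete, the following is a minimal sketch of an empirical squared MMD between two sets of feature vectors. The Gaussian kernel, its bandwidth, the biased estimator and the toy vectors are all assumptions made for illustration; they are not claimed to match the configuration used in the model.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased empirical estimate of the squared MMD between samples xs and ys."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy representations, invented for illustration only.
related_features = [[0.9, 1.1], [1.0, 0.8]]
unrelated_features = [[-1.0, -0.9], [-1.2, -1.1]]

same = mmd_squared(related_features, related_features)
different = mmd_squared(related_features, unrelated_features)
```

The estimate is zero when both samples coincide and grows as the two sets of representations move apart, which is the behaviour a term that pushes the related and unrelated classes away from each other relies on.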
6.3 Hyperparameters Sensitivity
In this subsection we discuss the sensitivity to the hyperparameters
of our model. The most influential hyperparameters for the proposed
model are α and β. The former controls the relative importance of the
classification layers. The latter weights the regularization.
In Figures 4(a) and 4(b) we show how the performance of the model
changes when varying α and β for the augmented Emergent and FNC-1 test
subsets. α is searched between 0.1 and 3 in steps of 0.1, and β is
searched over a discrete set of candidate values. For α, we observe
that the performance of the model improves quickly as α increases and
peaks at 1.5 and 1.3 for the FNC-1 and augmented Emergent datasets,
respectively; the performance then decreases slightly as α is
increased further. We hypothesize that the optimal α is related to the
class balance between the unrelated class and the related ones: the
more unbalanced the dataset is towards the unrelated class, the larger
the optimal α. For β, we observe that the performance is the highest
when β is set to 0.001. This happens for both the augmented Emergent
and the FNC-1 test subsets. These optimal values of α and β observed
on the test subsets are equal to the ones found when training the
model.
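A sensitivity study of this kind can be sketched as a simple grid search. In the sketch below the α grid follows the range stated above (0.1 to 3 in steps of 0.1), whereas the β candidates, the function names and the toy objective are hypothetical stand-ins; in practice the objective would train the model with the given pair of values and return its score.

```python
import math


def alpha_grid(start=0.1, stop=3.0, step=0.1):
    """Values from start to stop (inclusive), rounded to avoid float drift."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 10) for i in range(count)]


# Hypothetical candidate values for beta, chosen only for illustration.
BETA_GRID = [0.0001, 0.001, 0.01, 0.1]


def grid_search(evaluate, alphas, betas):
    """Return (best_score, best_alpha, best_beta) over all pairs."""
    best = None
    for alpha in alphas:
        for beta in betas:
            score = evaluate(alpha, beta)
            if best is None or score > best[0]:
                best = (score, alpha, beta)
    return best


def toy_objective(alpha, beta):
    """Stand-in for 'train and evaluate'; peaks at alpha=1.5, beta=0.001."""
    return -((alpha - 1.5) ** 2) - (math.log10(beta) + 3.0) ** 2


best_score, best_alpha, best_beta = grid_search(
    toy_objective, alpha_grid(), BETA_GRID
)
```

Recording the score of every pair, rather than only the best one, yields the curves needed for plots such as those in Figure 4.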
Figure 4: Sensitivity of the trained model when varying the parameters α and β on the test subset of the (a) augmented Emergent
(on the left) and (b) FNC-1 (on the right) datasets.
Table 4: Performance of our model with different feature sets on the FNC-1 dataset. “/” denotes no feature set is removed.

                                   Accuracy (%)
Removed Feature Set       agree  disagree   discuss  unrelated
CosSim                    71.53     85.08     78.76      69.37
WordLap                   67.43     80.49     77.31      77.89
RefWord                   74.43     64.37     77.03      97.49
Pol                       60.49     67.93     80.92      98.79
NGrams                    74.27     75.73     87.82      84.52
/                         80.61     82.35     77.49      99.53

6.4 Feature Analysis
In this subsection we evaluate and discuss the importance of each
feature towards the final prediction. To examine the influence of
each feature on the final performance, we use a leave-one-feature-set-out
approach and record the classification accuracy on the stance
detection task. The following analysis is based only on the FNC-1
dataset. Similar results are observed on the augmented Emergent
dataset.
In Table 4 we show the results of this analysis. We observe that
removing the CosSim feature leads to a large decrease in accuracy
for the unrelated class. Similarly, the use of WordLap has a positive
effect for the agree class, and it also contributes to the unrelated
class. The RefWord and Pol features help for the classes agree and
disagree, while removing the NGram feature leads to an increase on
the discuss class, i.e., the NGram feature causes confusion between
the discuss and the other classes.
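The ablation procedure described in this subsection can be sketched as follows. The feature set names are those of Table 4, while the function names, the toy gains and the toy evaluation function are hypothetical; a real run would retrain the model on the kept feature sets and report per-class accuracy.

```python
FEATURE_SETS = ["CosSim", "WordLap", "RefWord", "Pol", "NGrams"]


def leave_one_out(train_and_eval, feature_sets=FEATURE_SETS):
    """Evaluate the full model ("/") and every variant with one set removed."""
    results = {"/": train_and_eval(list(feature_sets))}
    for removed in feature_sets:
        kept = [name for name in feature_sets if name != removed]
        results[removed] = train_and_eval(kept)
    return results


# Invented contributions, used only to make the sketch runnable.
TOY_GAIN = {"CosSim": 0.30, "WordLap": 0.20, "RefWord": 0.10,
            "Pol": 0.10, "NGrams": 0.05}


def toy_train_and_eval(kept):
    """Stand-in for training on `kept` and returning a single score."""
    return round(sum(TOY_GAIN[name] for name in kept), 2)


ablation = leave_one_out(toy_train_and_eval)
most_important = min(ablation, key=ablation.get)
```

The feature set whose removal causes the largest drop relative to the "/" row is the one the model depends on most; an increase after removal, as reported for NGram on the discuss class, indicates a feature that hurts that class.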
In this paper, we studied the problem of stance detection: the
classification of the stance of a piece of evidence towards a claim
into one of four classes: agree, disagree, discuss and unrelated.
We proposed a hierarchical representation of the stance classes,
where the classes agree, disagree and discuss are combined
into a class referred to as the related class. The main idea here is
to divide a concept into sub-concepts that are organized in a
hierarchical structure, and to design constraints between sub-concepts
in order to make the model parameter optimization more sensible.
The primary advantage of this hierarchical representation is that it
helps to overcome the class imbalance problem.
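The hierarchical representation summarized above can be illustrated with a small label mapping. This is only a sketch: the function name and the toy label list are hypothetical, and the counts are invented to show how merging the three related classes changes the class balance at each level.

```python
from collections import Counter

RELATED_STANCES = ("agree", "disagree", "discuss")


def to_hierarchy(label):
    """Map a four-way label to (top-level label, fine-grained label or None)."""
    if label == "unrelated":
        return ("unrelated", None)
    if label in RELATED_STANCES:
        return ("related", label)
    raise ValueError(f"unknown stance label: {label!r}")


# Toy label list, invented for illustration (not the paper's data).
toy_labels = ["unrelated"] * 7 + ["discuss"] * 2 + ["agree"]

# Level 1 sees only related versus unrelated.
top_level = Counter(to_hierarchy(y)[0] for y in toy_labels)
# Level 2 sees only the related instances and their fine-grained stance.
second_level = Counter(
    to_hierarchy(y)[1] for y in toy_labels if y != "unrelated"
)
```

At the second level the unrelated majority is absent altogether, so a minority class such as disagree is compared only with the other related stances.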
This hierarchical representation has inspired the proposed
two-layer neural network to tackle the stance detection task. The
first layer performs a related-unrelated classification, while the
second layer performs a more fine-grained classification among the
related classes. Furthermore, we have empirically demonstrated that
(1) it is advantageous to learn these two classification tasks
together, and (2) the dependency between these two layers can be
learned through an MMD regularization term, which measures the
representation discrepancy between the two layers. Experiments on two
publicly available datasets have shown that our model is able to
outperform the state-of-the-art stance detection methods.
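As a rough illustration of how the outputs of two such layers could be combined into a four-way prediction, consider the sketch below. The decision rule, the 0.5 threshold and the function name are assumptions made for illustration and are not claimed to be the inference procedure of the proposed model.

```python
RELATED_STANCES = ["agree", "disagree", "discuss"]


def decode(p_related, stance_probs, threshold=0.5):
    """Combine a related/unrelated probability with fine-grained stance scores.

    p_related: probability from the first layer that the pair is related.
    stance_probs: scores from the second layer, ordered as RELATED_STANCES.
    """
    if p_related < threshold:
        return "unrelated"
    # Only related pairs reach the fine-grained decision.
    best_index = max(range(len(RELATED_STANCES)), key=lambda i: stance_probs[i])
    return RELATED_STANCES[best_index]


prediction_a = decode(0.10, [0.2, 0.7, 0.1])
prediction_b = decode(0.90, [0.2, 0.7, 0.1])
```

With such a rule, an error in the first decision propagates to the final label, which is why the dependency between the two layers matters.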
As future work we plan to enrich the proposed model in two ways.
First, by integrating a credibility evaluation of information
sources as features. Second, by improving the explainability of the
model, using attention mechanisms to show which words or phrases are
the most influential in predicting the stance.
This project was funded by the EPSRC Fellowship titled "Task Based
Information Retrieval", grant reference number EP/P024289/1. We
acknowledge the support of NVIDIA Corporation with the donation
of the Titan Xp GPU used for this research.
Hunt Allcott and Matthew Gentzkow. 2017. Social media and fake news in the
2016 election. Technical Report. National Bureau of Economic Research.
Pranav Anand, Marilyn Walker, Rob Abbott, Jean E Fox Tree, Robeson Bowmani,
and Michael Minor. 2011. Cats rule and dogs drool!: Classifying stance in on-
line debate. In Proceedings of the 2nd workshop on computational approaches to
subjectivity and sentiment analysis. Association for Computational Linguistics,
Nachman Aronszajn. 1950. Theory of reproducing kernels. Transactions of the
American mathematical society 68, 3 (1950), 337–404.
Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, and Kalina Bontcheva.
2016. Stance Detection with Bidirectional Conditional Encoding. In Proceedings
of the 2016 Conference on Empirical Methods in Natural Language Processing.
Association for Computational Linguistics, 876–885.
Sean Baird, Sibley Doug, and Yuxi Pan. 2017. Talos Targets Disinformation with
Fake News Challenge Victory. (2017).
06/talos-fake- news-challenge.html
Samir Bajaj. 2017. “The Pope Has a New Baby!” Fake News Detection Using
Deep Learning. (2017).
Roy Bar-Haim, Indrajit Bhattacharya, Francesco Dinuzzo, Amrita Saha, and Noam
Slonim. 2017. Stance classification of context-dependent claims. In Proceedings of
the 15th Conference of the European Chapter of the Association for Computational
Linguistics: Volume 1, Long Papers, Vol. 1. 251–261.
Adam J Berinsky. 2017. Rumors and health care reform: experiments in political
misinformation. British Journal of Political Science 47, 2 (2017), 241–262.
Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bern-
hard Schölkopf, and Alex J Smola. 2006. Integrating structured biological data by
kernel maximum mean discrepancy. Bioinformatics 22, 14 (2006), e49–e57.
Peter Bourgonje, Julian Moreno Schneider, and Georg Rehm. 2017. From clickbait
to fake news detection: an approach based on detecting the stance of headlines to
articles. In Proceedings of the 2017 EMNLP Workshop: Natural Language Processing
meets Journalism. 84–89.
Clinton Burfoot, Steven Bird, and Timothy Baldwin. 2011. Collective
classification of congressional floor-debate transcripts. In Proceedings of the 49th An-
nual Meeting of the Association for Computational Linguistics: Human Language
Technologies-Volume 1. Association for Computational Linguistics, 1506–1515.
Sophie Chesney, Maria Liakata, Massimo Poesio, and Matthew Purver. 2017.
Incongruent headlines: Yet another way to mislead your readers. In Proceedings
of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism.
Jiachen Du, Ruifeng Xu, Yulan He, and Lin Gui. 2017. Stance classification with
target-specific neural attention networks. International Joint Conferences on
Artificial Intelligence.
William Ferreira and Andreas Vlachos. 2016. Emergent: a novel data-set for stance
classification. In Proceedings of the 2016 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies.
Bo Geng, Dacheng Tao, and Chao Xu. 2011. DAML: Domain adaptation metric
learning. IEEE Transactions on Image Processing 20, 10 (2011), 2980–2989.
Andreas Hanselowski, PVS Avinesh, Benjamin Schiller, and Felix Caspelherr.
2017. Description of the system developed by team athene in the FNC-1. Technical
Report. Technical report.
Andreas Hanselowski, Avinesh PVS, Benjamin Schiller, Felix Caspelherr, Deban-
jan Chaudhuri, Christian M Meyer, and Iryna Gurevych. 2018. A Retrospective
Analysis of the Fake News Challenge Stance Detection Task. arXiv preprint
arXiv:1806.05180 (2018).
Andreas Hanselowski, Avinesh PVS, Benjamin Schiller, Felix Caspelherr, Deban-
jan Chaudhuri, Christian M. Meyer, and Iryna Gurevych. 2018. A Retrospective
Analysis of the Fake News Challenge Stance-Detection Task. In Proceedings of
the 27th International Conference on Computational Linguistics. Association for
Computational Linguistics, 1859–1874.
Kazi Saidul Hasan and Vincent Ng. 2013. Stance classification of ideological
debates: Data, models, features, and constraints. In Proceedings of the Sixth Inter-
national Joint Conference on Natural Language Processing. 1348–1356.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic opti-
mization. arXiv preprint arXiv:1412.6980 (2014).
Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency.
The annals of mathematical statistics 22, 1 (1951), 79–86.
Srijan Kumar and Neil Shah. 2018. False information on web and social media: A
survey. arXiv preprint arXiv:1804.08559 (2018).
Jing Ma, Wei Gao, and Kam-Fai Wong. 2018. Detect Rumor and Stance Jointly
by Neural Multi-task Learning. In Companion of the The Web Conference 2018 on
The Web Conference 2018. International World Wide Web Conferences Steering
Committee, 585–593.
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE.
Journal of machine learning research 9, Nov (2008), 2579–2605.
Todor Mihaylov, Georgi Georgiev, and Preslav Nakov. 2015. Finding opinion
manipulation trolls in news community forums. In Proceedings of the Nineteenth
Conference on Computational Natural Language Learning. 310–314.
Mitra Mohtarami, Ramy Baly, James Glass, Preslav Nakov, Lluís Màrquez, and
Alessandro Moschitti. 2018. Automatic Stance Detection Using End-to-End
Memory Networks. arXiv preprint arXiv:1804.07581 (2018).
Akiko Murakami and Rudy Raymond. 2010. Support or oppose?: classifying
positions in online debates from reply activities and opinion expressions. In
Proceedings of the 23rd International Conference on Computational Linguistics:
Posters. Association for Computational Linguistics, 869–875.
Nasser M Nasrabadi. 2007. Pattern recognition and machine learning. Journal of
electronic imaging 16, 4 (2007), 049901.
Arnold Neumaier. 1998. Solving ill-conditioned and singular linear systems: A
tutorial on regularization. SIAM review 40, 3 (1998), 636–666.
Jerey Pennington, Richard Socher, and Christopher Manning. 2014. Glove:
Global vectors for word representation. In Proceedings of the 2014 conference on
empirical methods in natural language processing (EMNLP). 1532–1543.
Kashyap Popat, Subhabrata Mukherjee, Jannik Strötgen, and Gerhard Weikum.
2017. Where the truth lies: Explaining the credibility of emerging claims on
the web and social media. In Proceedings of the 26th International Conference
on World Wide Web Companion. International World Wide Web Conferences
Steering Committee, 1003–1012.
Neel Rakholia and Shruti Bhargava. 2017. “Is it true?” – Deep Learning for Stance
Detection in News. (2017).
Benjamin Riedel, Isabelle Augenstein, Georgios P Spithourakis, and Sebastian
Riedel. 2017. A simple but tough-to-beat baseline for the Fake News Challenge
stance detection task. arXiv preprint arXiv:1707.03264 (2017).
Sebastian Ruder, John Glover, Afshin Mehrabani, and Parsa Ghaffari. 2018. 360
Stance Detection. In Proceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics: Demonstrations. 31–35.
Bernhard Schölkopf, Koji Tsuda, and Jean-Philippe Vert. 2004. Kernel methods in
computational biology. MIT press.
Prashant Shiralkar, Alessandro Flammini, Filippo Menczer, and Giovanni Luca
Ciampaglia. 2017. Finding streams in knowledge graphs to support fact checking.
In Data Mining (ICDM), 2017 IEEE International Conference on. IEEE, 859–864.
Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017. Fake news
detection on social media: A data mining perspective. ACM SIGKDD Explorations
Newsletter 19, 1 (2017), 22–36.
Swapna Somasundaran and Janyce Wiebe. 2009. Recognizing stances in online
debates. In Proceedings of the Joint Conference of the 47th Annual Meeting of the
ACL and the 4th International Joint Conference on Natural Language Processing
of the AFNLP: Volume 1-Volume 1. Association for Computational Linguistics,
Swapna Somasundaran and Janyce Wiebe. 2010. Recognizing stances in ide-
ological on-line debates. In Proceedings of the NAACL HLT 2010 Workshop on
Computational Approaches to Analysis and Generation of Emotion in Text. Associ-
ation for Computational Linguistics, 116–124.
Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining
support or opposition from Congressional floor-debate transcripts. In Proceedings
of the 2006 conference on empirical methods in natural language processing.
Association for Computational Linguistics, 327–335.
Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false
news online. Science 359, 6380 (2018), 1146–1151.
Marilyn A Walker, Pranav Anand, Robert Abbott, and Ricky Grant. 2012. Stance
classication using dialogic properties of persuasion. In Proceedings of the 2012
conference of the North American chapter of the association for computational lin-
guistics: Human language technologies. Association for Computational Linguistics,
Xuezhi Wang, Cong Yu, Simon Baumgartner, and Flip Korn. 2018. Relevant
Document Discovery for Fact-Checking Articles. In Companion of the The Web
Conference 2018 on The Web Conference 2018. International World Wide Web
Conferences Steering Committee, 525–533.
Jen Weedon, William Nuland, and Alex Stamos. 2017. Information operations
and Facebook. version 1 (2017), 27.
Houping Xiao et al. 2018. Multi-sourced Information Trustworthiness Analysis:
Applications and Theory. Ph.D. Dissertation. State University of New York at
Ainur Yessenalina, Yisong Yue, and Claire Cardie. 2010. Multi-level structured
models for document-level sentiment classification. In Proceedings of the 2010
Conference on Empirical Methods in Natural Language Processing. Association for
Computational Linguistics, 1046–1056.
Qiang Zhang, Aldo Lipani, Shangsong Liang, and Emine Yilmaz. 2019. Reply-
Aided Detection of Misinformation via Bayesian Deep Learning. In Companion
Proceedings of the The Web Conference 2019. ACM Press.
Qiang Zhang, Emine Yilmaz, and Shangsong Liang. 2018. Ranking-based Method
for News Stance Detection. In Companion Proceedings of the The Web Conference
2018. ACM Press.
... Existing approaches that try to cope with this problem are ineffective in detecting instances of minority classes. For instance, whereas the overall performance of state-of-theart systems (Baird et al. 2017;Hanselowski et al. 2017;Riedel et al. 2017;Bhatt et al. 2018;Hanselowski et al. 2018;Zhang et al. 2019; Masood and Aker 2018) ranges between 58% and 61% (F1 macro-average), the performance on the disagree class ranges between 3% and 18% only. However, this class is of key importance in fact-checking since it enables detecting documents that provide evidence for invalidating false claims. ...
... Stance detection is a classification problem in natural language processing where the stance of a (piece of) text towards a particular target is explored. Stance detection has been applied in different contexts, including social media (stance of a tweet towards an entity or topic) (Mohammad et al. 2016;Du et al. 2017;Augenstein et al. 2016;Lai et al. 2017;Sun et al. 2018;Ebrahimi et al. 2016;Xu et al. 2018), online debates (stance of a user post or argument/claim towards a controversial topic or statement) (Walker et al. 2012;Sridhar et al. 2015;Bar-Haim et al. 2017;Guggilla et al. 2016), and news media (stance of an article towards a claim) (Pomerleau and Rao 2017;Hanselowski et al. 2018;Bhatt et al. 2018;Wang et al. 2018;Zhang et al. 2019). Our work falls under the context of news media where the ultimate objective is the detection of fake news. ...
... Moreover, the authors do not provide the evaluation datasets and the used contradiction vocabulary. A more recent work (Zhang et al. 2019) also applies a two-stage approach where a first stage distinguishes related from unrelated documents and a second detects the actual stance (agree, disagree, neutral). A hierarchical neural network that controls the error propagation between the two stages has been proposed. ...
Full-text available
Fact checking is an essential challenge when combating fake news. Identifying documents that agree or disagree with a particular statement (claim) is a core task in this process. In this context, stance detection aims at identifying the position (stance) of a document towards a claim. Most approaches address this task through classification models that do not consider the highly imbalanced class distribution. Therefore, they are particularly ineffective in detecting the minority classes (for instance, ‘disagree’), even though such instances are crucial for tasks such as fact-checking by providing evidence for detecting false claims. In this paper, we exploit the hierarchical nature of stance classes which allows us to propose a modular pipeline of cascading binary classifiers, enabling performance tuning on a per step and class basis. We implement our approach through a combination of neural and traditional classification models that highlight the misclassification costs of minority classes. Evaluation results demonstrate state-of-the-art performance of our approach and its ability to significantly improve the classification performance of the important ‘disagree’ class.
... Outside the FNC-1 competition but using its dataset other work and experiments have been carried out. [40] addressed the problem proposing a hierarchical representation of the classes, which combines agree, disagree and discuss in a new related class. A two-layer neural network is learning from this hierarchical representation of classes and a weighted accuracy of 88.15% is obtained with their proposal. ...
... Next, the fifth and sixth rows include the results of recent approaches [40,12] that also addressed the headline stance detection task using the FNC-1 dataset, but did not participate in this challenge. Since there was no public code available, these results were also calculated from the confusion matrices provided in their respective papers. ...
... Whereas our approach outperforms the other automatic systems in terms of agree and discuss classes, accuracy, and relative score, it was outperformed in the disagree class by [40] and in the unrelated class by top-3 best systems that participated in the FNC-1 challenge and [40]. When the results obtained by the participants in the FNC-1 competition are analyzed independently for each of the classes, it can be seen that except for the classification of unrelated headlines -whose results are close to 100% in F1 measure, and this happens also for the remaining approaches as well-for the remaining classes, the results are very limited. ...
The spread of fake news and misinformation is causing serious problems to society, partly due to the fact that more and more people only read headlines or highlights of news assuming that everything is reliable, instead of carefully analysing whether it can contain distorted or false information. Specifically, the headline of a correctly designed news item must correspond to a summary of the main information of that news item. Unfortunately, this is not always happening, since various interests, such as increasing the number of clicks as well as political interests can be behind of the generation of a headlines that does not meet its intended original purpose. This paper analyses the use of automatic news summaries to determine the stance (i.e., position) of a headline with respect to the body of text associated with it. To this end, we propose a two-stage approach that uses summary techniques as input for both classifiers instead of the full text of the news body, thus reducing the amount of information that must be processed while maintaining the important information. The experimentation has been carried out using the Fake News Challenge FNC-1 dataset, leading to a 94.13% accuracy, surpassing the state of the art. It is especially remarkable that the proposed approach, which uses only the relevant information provided by the automatic summaries instead of the full text, is able to classify the different stance categories with very competitive results, so it can be concluded that the use of the automatic extractive summaries has a positive impact for determining the stance of very short information (i.e., headline, sentence) with respect to its whole content.
... Some studies proposed to train supervised models based on hand-crafted features [3,50,56]. To alleviate feature engineering, Zhang et al. [51] proposed to learn a hierarchical representation of stance classes to overcome the class imbalance problem. Further, multi-task learning framework was utilized to mutually reinforce stance detection and rumor classification simultaneously [20,21,34,46]. ...
... In the follow-up studies, a range of hand-crafted features [3,50,56] as well as temporal traits [29,30] were studied to train stance detection models. More recently, deep neural networks were utilized for stance representation learning and classification to alleviate feature engineering and pursue stronger generalizability, such as bidirectional RNNs [2] and two-layer neural networks for learning hierarchical representation of stance classes [51]. Some studies further took into account conversation structure, such as the tree-based LSTM model for detecting stances [19,55] and the tree-structured multi-task framework for joint detection of rumors and stances [46]. ...
Conference Paper
Full-text available
The diffusion of rumors on social media generally follows a propagation tree structure, which provides valuable clues on how an original message is transmitted and responded by users over time. Recent studies reveal that rumor verification and stance detection are two relevant tasks that can jointly enhance each other despite their differences. For example, rumors can be debunked by cross-checking the stances conveyed by their relevant posts, and stances are also conditioned on the nature of the rumor. However, stance detection typically requires a large training set of labeled stances at post level, which are rare and costly to annotate. Enlightened by Multiple Instance Learning (MIL) scheme, we propose a novel weakly supervised joint learning framework for rumor verification and stance detection which only requires bag-level class labels concerning the rumor's veracity. Specifically, based on the propagation trees of source posts, we convert the two multi-class problems into multiple MIL-based binary classification problems where each binary model is focused on differentiating a target class (of rumor or stance) from the remaining classes. Then, we propose a hierarchical attention mechanism to aggregate the binary predictions, including (1) a bottom-up/top-down tree attention layer to aggregate binary stances into binary veracity; and (2) a discriminative attention layer to aggregate the binary class into finer-grained classes. Extensive experiments conducted on three Twitter-based datasets demonstrate promising performance of our model on both claim-level rumor detection and post-level stance classification compared with state-of-the-art methods.
... In the semantic web community and the fields of knowledge representation and knowledge base construction/augmentation, facts are seen as the knowledge that is represented in KGs or KBs [6,9,12,28,31,46,47,50,110,110,115,131,153,165,189,193,196,200,216,223]. More precisely, items in KGs or KBs are coined statements of facts or assertions or triples encoding/representing facts [28,31,115,165,193], with the facts being assumed to be true, can be proven to be true or are likely to hold [31,131,142]. ...
... Viewpoint extraction is closely connected to the stance detection problem, a supervised classification problem in NLP where the stance of a piece of text towards a particular target is explored. Stance detection has been applied in different contexts, including social media (stance of a tweet towards an entity or topic) [10,38,41,93,116,174,210], online debates (stance of a user post or argument/claim towards a controversial topic or statement) [13,67,167,198], and news media (stance of an article towards a claim) [20,70,141,203,216]. A recent work by Schiller et al. [158] details the different and varying task definitions found in previous works, diverging not only with regard to domains, but also classes and number and type of inputs, and introduce a benchmark for stance detection that allows the comparison of models against a variety of heterogeneous datasets. ...
Full-text available
Analyzing statements of facts and claims in online discourse is subject of a multitude of research areas. Methods from natural language processing and computational linguistics help investigate issues such as the spread of biased narratives and falsehoods on the Web. Related tasks include fact-checking, stance detection and argumentation mining. Knowledge-based approaches, in particular works in knowledge base construction and augmentation, are concerned with mining, verifying and representing factual knowledge. While all these fields are concerned with strongly related notions, such as claims, facts and evidence, terminology and conceptualisations used across and within communities vary heavily, making it hard to assess commonalities and relations of related works and how research in one field may contribute to address problems in another. We survey the state-of-the-art from a range of fields in this interdisciplinary area across a range of research tasks. We assess varying definitions and propose a conceptual model – Open Claims – for claims and related notions that takes into consideration their inherent complexity, distinguishing between their meaning, linguistic representation and context. We also introduce an implementation of this model by using established vocabularies and discuss applications across various tasks related to online discourse analysis.
... The random pairing of the claims to other articles in FNC resulted in a significant amount of claim-document pairs that are unrelated to each other. There are several approaches attempted to predict stance on the FNC dataset that include LSTMs, memory networks, and transformers (Hanselowski et al., 2018;Conforti et al., 2018a;Zhang et al., 2019;Schiller et al., 2020;Schütz et al., 2021). ...
... First, the task can be approached differently by only doing stance detection on the three related classes (Conforti et al., 2018b), or merging the discuss and unrelated classes to one class called neutral or other (Khouja, 2020). Second, by keeping all classes but training a model to predict relatedness first then predict stance from the three related classes only (Zhang et al., 2019). Third, by developing an evaluation metric that rewards models that make correct predictions among the related classes more than correct predictions from the unrelated class such as the one used in the Fake News Challenge (Pomerleau and Rao, 2017). ...
Full-text available
With the continuing spread of misinformation and disinformation online, it is of increasing importance to develop combating mechanisms at scale in the form of automated systems that support multiple languages. One task of interest is claim veracity prediction, which can be addressed using stance detection with respect to relevant documents retrieved online. To this end, we present our new Arabic Stance Detection dataset (AraStance) of 910 claims from a diverse set of sources comprising three fact-checking websites and one news website. AraStance covers false and true claims from multiple domains (e.g., politics, sports, health) and several Arab countries, and it is wellbalanced between related and unrelated documents with respect to the claims. We benchmark AraStance, along with two other stance detection datasets, using a number of BERTbased models. Our best model achieves an accuracy of 85% and a macro F1 score of 78%, which leaves room for improvement and reflects the challenging nature of AraStance and the task of stance detection in general.
... AI-based fact-checking support in the multiple steps above is fundamentally based on document to claim mapping (document being a news article/blog, a social media post, etc.) and more specifically on two IR/NLP tasks: presence detection and stance classification [1,3,12,13,26,39,43]. The detection of previously factchecked claims (step 2), became a target of research interest only recently [28] and is one of the least studied research problems related to fact-checking [24]. ...
False information has a significant negative influence on individuals as well as on the whole society. Especially in the current COVID-19 era, we witness an unprecedented growth of medical misinformation. To help tackle this problem with machine learning approaches, we are publishing a feature-rich dataset of approx. 317k medical news articles/blogs and 3.5k fact-checked claims. It also contains 573 manually and more than 51k automatically labelled mappings between claims and articles. Mappings consist of claim presence, i.e., whether a claim is contained in a given article, and article stance towards the claim. We provide several baselines for these two tasks and evaluate them on the manually labelled part of the dataset. The dataset enables a number of additional tasks related to medical misinformation, such as misinformation characterisation studies or studies of misinformation diffusion between sources.
... More so than ever before, social media has a responsibility for our mental wellbeing, as the arbiter of interactions between colleagues, friends and loved ones [24,13]. It is therefore a matter of the utmost importance that we make this platform a safe environment, protected against those wishing to corrupt the service with fake news [20]. ...
Conference Paper
Misinformation takes the form of a false claim under the guise of fact. It is necessary to protect social media against misinformation by means of effective misinformation detection and analysis. To this end, we formulate misinformation propagation as a dynamic graph, then extract the temporal evolution patterns and geometric features of the propagation graph based on Temporal Point Processes (TPPs). TPPs provide the appropriate modelling framework for a list of stochastic, discrete events. In this context, that is a sequence of social user engagements. Furthermore, we forecast the cumulative number of engaged users based on a power law. Such forecasting capabilities can be useful in assessing the threat level of misinformation pieces. By jointly considering the geometric and temporal propagation patterns, our model has achieved comparable performance with state-of-the-art baselines on two well known datasets.
... In this context, we develop a mixed approach grounded on the theories about attitude formation instead of following a hate speech approach, which is limited to hate but not necessarily opposition/approval or feelings of threat/empathy toward migration. Each formation theory defines an attitude, and, in cases where the classifier confidence is low, we define an undisclosed stance to account for participation in the debate without disclosing attitude [51]. Particularly, we build upon our previous work to classify users into attitudes as political stances using a tree-based classifier [17]. ...
Understanding public opinion towards immigrants is key to preventing acts of violence, discrimination and abuse. Traditional data sources, such as surveys, provide rich insights into the formation of such attitudes; yet they are costly and offer limited temporal granularity, providing only a partial understanding of the dynamics of attitudes towards immigrants. Leveraging Twitter data and natural language processing, we propose a framework to measure attitudes towards immigration in online discussions. Grounded in theories of social psychology, the proposed framework enables the classification of users into profile stances of positive and negative attitudes towards immigrants, and the characterisation of these profiles by quantitatively summarising users’ content and temporal stance trends. We use a Twitter sample composed of 36 K users and 160 K tweets discussing the topic in 2017, when the immigrant population in the country recorded an increase by a factor of four from 2010. We found that the negative attitude group of users is smaller than the positive group, and that both attitudes have different distributions of the volume of content. Both types of attitudes show fluctuations over time that seem to be influenced by news events related to immigration. Accounts with negative attitudes use arguments of labour competition and stricter regulation of immigration. In contrast, accounts with positive attitudes reflect arguments in support of immigrants’ human and civil rights. The framework and its application can inform policy makers about how people feel about immigration, with possible implications for policy communication and the design of interventions to improve negative attitudes.
... Unfortunately, the accuracy was meager due to the presence of human bias in the annotations [16]. To counter this drawback, a multi-step approach to annotation could be used [3,24]. The intermediate steps can be more rigorously defined, and thus, annotation subjectivity can be lowered. ...
The headline of a news article is designed to succinctly summarize its content, providing the reader with a clear understanding of the news item. Unfortunately, in the post-truth era, headlines are more focused on attracting the reader’s attention for ideological or commercial reasons, thus leading to mis- or disinformation through false or distorted headlines. One way of combating this, although a challenging task, is by determining the relation between the headline and the body text to establish the stance. Hence, to contribute to the detection of mis- and disinformation, this paper proposes an approach, HeadlineStanceChecker, that determines the stance of a headline with respect to the body text with which it is associated. The novelty rests on the use of a two-stage classification architecture that uses summarization techniques to shape the input for both classifiers instead of directly passing the full news body text, thereby reducing the amount of information to be processed while keeping important information. Specifically, summarization is done through Positional Language Models that leverage semantic resources to identify salient information in the body text, which is then compared to its corresponding headline. The results obtained show that our approach achieves 94.31% accuracy for the overall classification and the best FNC-1 relative score compared with the state of the art. It is especially remarkable that the system, which uses only the relevant information provided by the automatic summaries instead of the whole text, is able to classify the different stance categories with very competitive results, especially for the discuss stance between the headline and the news body text. It can be concluded that using automatic extractive summaries as input to our approach, together with the two-stage architecture, is an appropriate solution to the problem.
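To make the two-stage idea and the FNC-1 relative score concrete, here is a minimal sketch. It is not the HeadlineStanceChecker implementation: the summarization step is omitted, `toy_relatedness` and `toy_stance` are hypothetical stand-in classifiers, and the weighting in `fnc1_score` (0.25 for a correct related/unrelated decision, a further 0.75 for the exact stance on related pairs) reflects the publicly released Fake News Challenge scorer as commonly described, so treat it as an assumption to verify against the official script.

```python
RELATED = {"agree", "disagree", "discuss"}

def two_stage_predict(relatedness_clf, stance_clf, headline, body):
    """Stage 1 filters unrelated pairs; stage 2 labels only the related ones."""
    if relatedness_clf(headline, body) == "unrelated":
        return "unrelated"
    return stance_clf(headline, body)

def fnc1_score(gold, predicted):
    """Weighted score: exact related stance = 1.0, correct unrelated = 0.25,
    related predicted as a different related stance = 0.25 (assumed weights)."""
    score = 0.0
    for g, p in zip(gold, predicted):
        if g == p:
            score += 0.25
            if g != "unrelated":
                score += 0.50
        if g in RELATED and p in RELATED:
            score += 0.25
    return score

def fnc1_relative_score(gold, predicted):
    """Score divided by the score of a perfect prediction."""
    return fnc1_score(gold, predicted) / fnc1_score(gold, gold)

# Hypothetical toy classifiers, for illustration only.
def toy_relatedness(headline, body):
    shared = set(headline.lower().split()) & set(body.lower().split())
    return "related" if shared else "unrelated"

def toy_stance(headline, body):
    return "disagree" if "not" in body.lower().split() else "discuss"

label = two_stage_predict(
    toy_relatedness, toy_stance,
    "Vaccine causes illness",
    "Officials say the vaccine does not cause illness",
)

gold = ["agree", "unrelated", "discuss", "disagree"]
pred = ["agree", "unrelated", "agree", "unrelated"]
relative = fnc1_relative_score(gold, pred)
```

Because the first stage handles the dominant unrelated class on its own, the second classifier only has to separate the three minority stances, which is the usual motivation for splitting the decision in two.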
Conference Paper
Social media platforms are rife with misinformation, and its potential negative influence on the public is a growing concern. This concern has drawn the attention of the research community to developing mechanisms to detect misinformation. The task of misinformation detection consists of classifying whether a claim is True or False. Most research concentrates on developing machine learning models, such as neural networks, that output a single value in order to predict the veracity of a claim. One of the major problems faced by these models is their inability to represent the uncertainty of the prediction, which is due to incomplete or finite available information about the claim being examined. We address this problem by proposing a Bayesian deep learning model. The Bayesian model outputs a distribution used to represent both the prediction and its uncertainty. In addition to the claim content, we also encode auxiliary information given by people's replies to the claim. First, the model encodes a claim to be verified and generates a prior belief distribution from which we sample a latent variable. Second, the model encodes all the people's replies to the claim in temporal order through a Long Short-Term Memory network in order to summarize their content. This summary is then used to update the prior belief, generating the posterior belief. Moreover, in order to train this model, we develop a Stochastic Gradient Variational Bayes algorithm to approximate the analytically intractable posterior distribution. Experiments conducted on two public datasets demonstrate that our model outperforms the state-of-the-art detection models.
Conference Paper
We present an effective end-to-end memory network (MN) model that jointly (i) predicts whether a given document can be considered as relevant evidence for a given claim, and (ii) extracts snippets of evidence that can be used to reason about the factuality of the target claim. Our model combines the advantages of convolutional and recurrent neural networks as part of an MN. We further introduce a similarity-based matrix at the inference level of the MN in order to extract snippets of evidence for input claims more accurately. Our experiments on the Fake News Challenge dataset demonstrate the effectiveness of our approach.
Conference Paper
In recent years, an unhealthy phenomenon characterized as the massive spread of fake news or unverified information (i.e., rumors) has become an increasingly daunting issue in human society. The rumors commonly originate from social media outlets, primarily microblogging platforms, and then go viral through wild, willful propagation by a large number of participants. It is observed that rumorous posts often trigger varied, mostly controversial stances among participating users. Thus, determining the stances on the posts in question can be pertinent to the successful detection of rumors, and vice versa. Existing studies, however, mainly regard rumor detection and stance classification as separate tasks. In this paper, we argue that they should be treated as a joint, collaborative effort, considering the strong connections between the veracity of a claim and the stances expressed in responsive posts. Inspired by the multi-task learning scheme, we propose a joint framework that unifies the two highly pertinent tasks, i.e., rumor detection and stance classification. Based on deep neural networks, we train both tasks jointly using weight sharing to extract the common and task-invariant features, while each task can still learn its task-specific features. Extensive experiments on real-world datasets gathered from Twitter and news portals demonstrate that our proposed framework consistently improves both rumor detection and stance classification with the help of the strong inter-task connections, achieving much better performance than state-of-the-art methods.
Conference Paper
With the support of major search platforms such as Google and Bing, fact-checking articles, which can be identified by their adoption of the ClaimReview structured markup, have gained widespread recognition for their role in the fight against digital misinformation. A claim-relevant document is an online document that addresses, and potentially expresses a stance towards, some claim. The claim-relevance discovery problem, then, is to find claim-relevant documents. Depending on the verdict from the fact check, claim-relevance discovery can help identify online misinformation. In this paper, we provide an initial approach to the claim-relevance discovery problem by leveraging various information retrieval and machine learning techniques. The system consists of three phases. First, we retrieve candidate documents based on various features in the fact-checking article. Second, we apply a relevance classifier to filter away documents that do not address the claim. Third, we apply a language feature based classifier to distinguish documents with different stances towards the claim. We experimentally demonstrate that our solution achieves solid results on a large-scale dataset and beats state-of-the-art baselines. Finally, we highlight a rich set of case studies to demonstrate the myriad of remaining challenges and that this problem is far from being solved.
Conference Paper
A valuable step towards news veracity assessment is to understand the stance of different information sources, a process known as stance detection. Specifically, stance detection aims to detect four kinds of stances ("agree", "disagree", "discuss" and "unrelated") of the news towards a claim. Existing methods have tried to tackle the stance detection problem with classification-based algorithms. However, classification-based algorithms make the strong assumption that there is a clear distinction between any two stances, which may not hold in the context of stance detection. Accordingly, we frame the detection problem as a ranking problem and propose a ranking-based method to improve detection performance. In contrast to classification-based methods, the ranking-based method compares the true stance with the false stances and maximizes the difference between them. Experimental results demonstrate the effectiveness of our proposed method.
We investigated the differential diffusion of all of the verified true and false news stories distributed on Twitter from 2006 to 2017. The data comprise ~126,000 stories tweeted by ~3 million people more than 4.5 million times. We classified news as true or false using information from six independent fact-checking organizations that exhibited 95 to 98% agreement on the classifications. Falsehood diffused significantly farther, faster, deeper, and more broadly than the truth in all categories of information, and the effects were more pronounced for false political news than for false news about terrorism, natural disasters, science, urban legends, or financial information. We found that false news was more novel than true news, which suggests that people were more likely to share novel information. Whereas false stories inspired fear, disgust, and surprise in replies, true stories inspired anticipation, sadness, joy, and trust. Contrary to conventional wisdom, robots accelerated the spread of true and false news at the same rate, implying that false news spreads more than the truth because humans, not robots, are more likely to spread it.