From Stances’ Imbalance to Their Hierarchical
Representation and Detection
Qiang Zhang
University College London
London, United Kingdom
qiang.zhang.16@ucl.ac.uk
Shangsong Liang
Sun Yat-sen University
Guangzhou, China
liangshangsong@gmail.com
Aldo Lipani
University College London
London, United Kingdom
aldo.lipani@ucl.ac.uk
Zhaochun Ren
Shandong University
Qingdao, China
zhaochun.ren@sdu.edu.cn
Emine Yilmaz
University College London
London, United Kingdom
emine.yilmaz@ucl.ac.uk
ABSTRACT
Stance detection has gained increasing interest from the research community due to its importance for fake news detection. The goal of stance detection is to categorize the overall position of a subject towards an object into one of four classes: agree, disagree, discuss, and unrelated. One of the major problems faced by current machine learning models used for stance detection is a severe class imbalance among these classes. Hence, most models fail to correctly classify instances that fall into minority classes. In this paper, we address this problem by proposing a hierarchical representation of these classes, which combines the agree, disagree, and discuss classes under a new related class. Further, we propose a two-layer neural network that learns from this hierarchical representation and controls the error propagation between the two layers using the Maximum Mean Discrepancy regularizer. Compared with conventional four-way classifiers, this model has two advantages: (1) the hierarchical architecture mitigates the class imbalance problem; (2) the regularization makes the model better discern between the related and unrelated stances. An extensive experimental evaluation demonstrates state-of-the-art accuracy of the proposed model for stance detection.
CCS CONCEPTS
• Information systems → Information extraction; Sentiment analysis.
KEYWORDS
hierarchical classier, maximum mean discrepancy
ACM Reference Format:
Qiang Zhang, Shangsong Liang, Aldo Lipani, Zhaochun Ren, and Emine
Yilmaz. 2019. From Stances’ Imbalance to Their Hierarchical Representation
and Detection. In Proceedings of the 2019 World Wide Web Conference (WWW
’19), May 13–17, 2019, San Francisco, CA, USA. ACM, New York, NY, USA,
10 pages. https://doi.org/10.1145/3308558.3313724
This paper is published under the Creative Commons Attribution 4.0 International
(CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their
personal and corporate Web sites with the appropriate attribution.
WWW ’19, May 13–17, 2019, San Francisco, CA, USA
© 2019 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-6674-8/19/05.
https://doi.org/10.1145/3308558.3313724
1 INTRODUCTION
The quality of online news is usually less substantiated than that of traditional news services such as magazines or newspapers [1, 45, 47]. A large volume of fake news is being produced for political or economic purposes [8, 22, 41]. Fake news are those news articles that purport to be factual, but which contain misstatements of fact with the intention to arouse passions, attract viewership, or deceive [25, 37, 44]. Verifying news content requires retrieving evidences and determining their stance with respect to the news claims, which poses new challenges for the conventional stance detection task [31, 36]. We define evidence as text, e.g., web pages and documents, that can be used to prove whether news content is or is not true. Moreover, automatic stance detection has broad applications in information retrieval and text entailment [34, 42].
The task of stance detection is to identify the stance of an evidence towards a given news claim [12, 13]. Stances can be categorized into four classes: agree, disagree, discuss and unrelated [17]. Two characteristics make the stance detection task peculiar. On the one hand, news claims and evidences are often unrelated, which generates a severe class imbalance problem. On the other hand, since the non-unrelated classes are by definition related, intuitively, the identification of an evidence as related or unrelated to a news claim is semantically different from the identification of an evidence as belonging to one of the other three classes. These two characteristics suggest the natural presence of a hierarchical structure among stance classes.
Stance detection has been studied in the areas of information extraction and natural language processing [11, 40]. However, previous methods tackle the task as a multiclass classification problem, neglecting the hierarchical structure in stance classes. Also, the commonly-used four-way classifiers are easily influenced by the class imbalance problem. In this paper, we address this issue by modeling the stance detection task as a two-layer neural network. The first layer aims at identifying the relatedness of the evidence, while the second layer aims at classifying those evidences identified as related into the other three classes: agree, disagree and discuss. Moreover, we study various levels of dependence assumptions between the two layers: (1) independent, when there is no error propagation between the two layers; (2) dependent, when the error propagation is left free, and; (3) learned, when the error propagation is controlled by Maximum Mean Discrepancy (MMD). We show that when learned, the neural network (a) better separates the distributions of related and unrelated stances and (b) outperforms the state-of-the-art accuracy for the stance detection task.
The remainder of the paper is organized as follows: § 2 summarizes the related work; § 3 defines the stance detection task; § 4 details the proposed hierarchical classification model and the regularization term; § 5 describes the used datasets and experimental setup; § 6 is devoted to experimental results, and; § 7 concludes the paper.
2 RELATED WORK
Machine learning techniques have been widely researched to tackle the stance detection task. Previous works focus on political or congressional floor debates [11, 40, 46] and online forums [2, 19, 27, 38, 39, 42]. Most of these works rely on content-based features, such as sentiment analysis and topic-specific features learned from labeled datasets for a closed set of topics.
Two methods only consider the agree, disagree and discuss classes. Bar-Haim et al. [7] split the stance detection task into three sub-tasks and propose a Contrast Classification Algorithm to distinguish the agree and disagree classes; Augenstein et al. [4] build a neural network architecture based on bidirectional conditional encoding on a Twitter dataset. A long short-term memory (LSTM) network encodes the claim and another LSTM encodes the text with the encoded claim as initial states. These methods fail to consider the unrelated class.
Two other methods consider all the classes, but use two different models. Bourgonje et al. [10] use lemmatized n-gram matching and a rule-based procedure to decide the evidence relatedness, and a three-way logistic regression classifier to distinguish among the related classes; Wang et al. [43] first develop a gradient boosted decision tree (GBDT) model [28] to determine the evidence relatedness, and then another GBDT model is used to distinguish the stances of the text towards the claim. These methods involve feature engineering in separate models and cannot be jointly optimized to achieve the best performance.
Other methods that also consider all the classes have been developed during the Fake News Challenge stage 1 (FNC-1) [18]. The winning team uses a 50%/50% weighted average between a GBDT model and a convolutional neural network (CNN) [5]. The second best performance is achieved by an ensemble of five multi-layer perceptrons (MLPs) whose input features include bag-of-words and semantic analysis, in addition to the baseline features developed by the challenge organizers [16]. Compared to the above two solutions, the third best team does not try ensemble methods; they use TF-IDF features and an MLP as a four-way classifier [33]. Zhang et al. [48] propose a ranking method to tackle the task and achieve empirical performance improvements. However, these methods all neglect the hierarchical structure among the four types of stances and suffer from class imbalance.
Deep learning-based methods have also been applied in the FNC-1. Bajaj [6] utilizes LSTM, CNN and their variants to detect stances, and finds that an attention-augmented CNN obtains the best performance. Rakholia and Bhargava [32] analyze the effectiveness of different ways of text encoding, such as independent encoding, bidirectional conditional encoding and attentive readers, and conclude that the attentive reader model is the most suitable for the task. Ma et al. [23] propose a multi-task learning algorithm that jointly detects rumours and stances. However, all these methods fail to achieve high accuracy for the agree and disagree classes.
There are three major defects in all the aforementioned methods: (a) they neglect the hierarchical relationships among the four stances; (b) they suffer from the class imbalance problem, and; (c) they fail to achieve acceptable detection performance for the agree and disagree classes.
3 STANCE DETECTION TASK
The stance detection task consists of classifying the stance of an evidence towards a claim as one of four classes: agree, disagree, discuss and unrelated. The formal definitions of these four stances are:
agree – the evidence supports the claim;
disagree – the evidence denies the claim;
discuss – the evidence does not have a position about the claim;
unrelated – the evidence is not about the claim.
4 HIERARCHICAL CLASSIFICATION
In this section, we detail our proposed two-layer neural network
for stance detection. § 4.1 outlines the model. In order to better
dierentiate between the related and unrelated classes, we design
an MMD regularization term in § 4.2. This is then integrated into
the two-layer neural network loss function in § 4.3. In Figure 1, we
show the architecture of our model.
4.1 Two-Layer Neural Network
Let the input space be formed by $m$-dimensional real vectors, denoted as $v \in \mathbb{R}^m$. The four-class label can be transformed into a one-hot vector $y$. The $i$-th dimension of $y$, $y_i$, is 1 when the stance is the $i$-th element of the label set {agree, disagree, discuss, unrelated} and 0 otherwise. The hidden layer with parameters $\theta_u$ learns to map $v$ to a $k$-dimensional hidden representation $u \in \mathbb{R}^k$:
$u = f(v; \theta_u)$.  (1)
For the two-layer classification, the first layer decides whether the evidence is related to a claim. Hence, the first classification layer is called the relatedness layer. This layer is parameterized by $\theta_r$ and learns to produce a 2-dimensional normalized vector $\hat{r}$ as follows:
$\hat{r} = g(u; \theta_r)$.  (2)
Note that the Softmax function is included in $g$ to normalize the 2-dimensional vector, so each component of the vector $\hat{r}$ denotes the probability that the neural network assigns $v$ to the related or the unrelated class, i.e., p(related) and p(unrelated).
The second layer classifies the evidence into the related classes, i.e., the agree, disagree, or discuss stances. Hence, the second classification layer is called the stance layer. The stance layer is parameterized by $\theta_s$ and learns to produce a 3-dimensional normalized vector $\hat{s}$:
$\hat{s} = h(\hat{r} \cdot (1, 0); \theta_s)$,  (3)
where the vector multiplication $\hat{r} \cdot (1, 0)$ extracts the first element of $\hat{r}$. Note that the Softmax function is also included in $h$ to normalize the 3-dimensional vector, so that each component of the vector $\hat{s}$ denotes the conditional probability that the neural network assigns $v$ to agree, disagree or discuss given that $v$ is related, i.e., p(agree|related), p(disagree|related), and p(discuss|related).

Figure 1: The architecture of our proposed two-layer neural network.
We dene the classication loss by the Kullback-Leibler (KL) di-
vergence [
21
], which measures the dierence between the network
outputs and labels:
lr(θu,θr):=KL(rˆ
r),(4)
where
r
is the ground-truth relatedness of the input data.
r
is com-
puted from a label yas follows:
r=( (y,e4),(y=e4)),(5)
where is the indicator function,
e4
is a 4-dimensional one-hot
vector with fourth element equal to 1. When
y=e4
is veried, it
indicates that the label belongs to the unrelated class. Similarly, the
stance classication loss can be dened as:
ls(θu,θr,θs):=KL(sˆ
s),(6)
where
s
is the ground-truth stance of the input data.
s
is computed
from a label yas follows:
s=( (y=e1),(y=e2),(y=e3)),(7)
where
e1
,
e2
,
e3
are 4-dimensional one-hot vectors with rst, second,
and third elements equal to 1. When
y=e1
is veried, it indicates
that the label belongs to the agree class, when
y=e2
is veried,
it indicates that the label belongs to the disagree class, and when
y=e3
is veried, it indicates that the label belongs to the discuss
class.
Finally, we define the loss function for the two-layer neural network as a linear combination of the loss function of the relatedness layer ($l_r$) and the loss function of the stance layer ($l_s$):
$l_c(\theta_u, \theta_r, \theta_s) := l_r(\theta_u, \theta_r) + \alpha \cdot l_s(\theta_u, \theta_r, \theta_s)$,  (8)
where $\alpha$ weighs the relative importance of the two classification layers.
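To make the layered structure concrete, the following is a minimal NumPy sketch of the forward pass and the combined loss of Eqs. (1)-(8). The parameter container theta, the ReLU choice for f, the small epsilon in the KL term, and the literal reading of Eq. (3) (the stance layer consuming only the related probability) are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q, eps=1e-9):
    # KL(p || q); p is a (possibly one-hot) target distribution, q the prediction.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def targets_from_label(y):
    # Eq. (5): r = (1[y != e4], 1[y = e4]); Eq. (7): s = (1[y = e1], 1[y = e2], 1[y = e3]).
    r = np.array([1.0 - y[3], y[3]])
    s = y[:3].copy()
    return r, s

def forward(v, theta):
    # Eq. (1): hidden representation u = f(v; theta_u), here a ReLU layer.
    u = np.maximum(0.0, theta["Wu"] @ v + theta["bu"])
    # Eq. (2): relatedness layer, 2-dimensional softmax output.
    r_hat = softmax(theta["Wr"] @ u + theta["br"])
    # Eq. (3), read literally: the stance layer consumes the related probability r_hat[0].
    s_hat = softmax(theta["Ws"] @ np.array([r_hat[0]]) + theta["bs"])
    return u, r_hat, s_hat

def combined_loss(v, y, theta, alpha=1.0):
    # Eq. (8): l_c = l_r + alpha * l_s; the stance loss only applies to related labels.
    r, s = targets_from_label(y)
    _, r_hat, s_hat = forward(v, theta)
    l_r = kl(r, r_hat)
    l_s = kl(s, s_hat) if r[0] == 1.0 else 0.0
    return l_r + alpha * l_s
```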
4.2 Maximum Mean Discrepancy
The classication of related/unrelated stances is a dierent task
from that of agree/disagree/discuss stances. Therefore, data repre-
sentations from the relatedness layer and the stance layer can be
seen as samples drawn from two dierent distributions. In order
to measure distribution discrepancy between these two layers, we
employ the Maximum Mean Discrepancy (MMD) [
9
] as a regular-
ization term. The MMD does not involve density estimation and
thus is a non-parametric way of measuring the dierence between
distributions. MMD has achieved success in face recognition and
image annotation [15].
MMD is dened as follows:
Denition 4.1. Maximum Mean Discrepancy [
9
]: “Let
p
and
q
be
two Borel probability distributions over a space
X
and let
X
and
Z
be sets with independent identically distributed samples drawn
from
p
and
q
. The MMD is dened by a class
Ψ
of map functions
ψ
:
X → H as:
MMD(p,q,Ψ)=sup
ψΨ
(Ep[ψ(x)] − Eq[ψ(z)]).(9)
Here, xand zare samples from Xand Z.
In other words, the MMD equation denes the largest possible
distance between two expectations over the set of function
Ψ
. More-
over, “when
H
is the reproducing kernel Hilbert space (RKHS) [
3
],
this means that for all
x∈ X
, the linear point evaluation function
mapping
ψψ(x)
exists and is continuous. When
Ψ
is the unit
ball in a universal RKHS, it is guaranteed that
MMD(p,q,Ψ)
will
detect any discrepancy between pand q[9, 35].
Let $p$ denote the distribution of the first layer samples (the unrelated hidden representations) in our model, with sample set $U^1 = \{u^1_1, \ldots, u^1_{n_1}\}$ and, according to Eq. (1), their generating set $V^1 = \{v^1_1, \ldots, v^1_{n_1}\}$. And, $q$ denotes the distribution of the second layer samples (the agree, disagree and discuss hidden representations), with sample set $U^2 = \{u^2_1, \ldots, u^2_{n_2}\}$ and, according to Eq. (1), their generating set $V^2 = \{v^2_1, \ldots, v^2_{n_2}\}$. $n_1$ and $n_2$ are the number of samples in $U^1$ and $U^2$. Thus we have $\mathcal{X} = \mathbb{R}^k$ and $\mathcal{H} = \mathbb{R}^j$ with $\psi(x) = \theta_d x$, where $\theta_d$ is a $j \times k$ matrix in the projection layer. $k$ and $j$ are the space dimensions. According to Eq. (1), the hidden representation $u$ is parameterized by $\theta_u$, thus the empirical expression of MMD is parameterized by $\theta_u$ and $\theta_d$:
$d(\theta_u, \theta_d) = \frac{1}{n_1} \sum_{i=1}^{n_1} \theta_d u^1_i - \frac{1}{n_2} \sum_{i=n_1+1}^{n_1+n_2} \theta_d u^2_i$  (10)
$= \frac{1}{n_1} \sum_{i=1}^{n_1} \theta_d f(v^1_i; \theta_u) - \frac{1}{n_2} \sum_{i=n_1+1}^{n_1+n_2} \theta_d f(v^2_i; \theta_u)$.  (11)
By constantly changing the projection layer parameterized by $\theta_d$, we find the maximum expectation difference between the representations of the two classification layers.
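The following is a minimal NumPy sketch of the empirical MMD term of Eqs. (10)-(11) under the linear projection $\psi(x) = \theta_d x$; the Euclidean norm used to reduce the projected mean difference to a scalar regularizer, and the array shape conventions, are assumptions of this sketch.

```python
import numpy as np

def empirical_mmd(U1, U2, theta_d):
    """Empirical MMD of Eq. (10) with a linear projection psi(x) = theta_d @ x.

    U1: (n1, k) hidden representations of unrelated inputs.
    U2: (n2, k) hidden representations of related (agree/disagree/discuss) inputs.
    theta_d: (j, k) projection-layer matrix.
    """
    mean1 = (theta_d @ U1.T).mean(axis=1)   # (1/n1) * sum_i theta_d u^1_i
    mean2 = (theta_d @ U2.T).mean(axis=1)   # (1/n2) * sum_i theta_d u^2_i
    diff = mean1 - mean2
    # Reduce the j-dimensional difference to a scalar regularizer; the Euclidean
    # norm used here is an assumption about how the difference is aggregated.
    return float(np.linalg.norm(diff))
```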
4.3 Optimization
The more different two distributions are, the larger the MMD is. Hence, in order to make the distributions easier to distinguish, a larger MMD regularization term is preferred, and we treat the regularization term as an extra goal besides classification. We integrate the two-layer classification loss (see Eq. (8)) and the MMD regularization term (see Eq. (10)) into a single objective function ($\mathcal{L}$). Specifically, we add these two sub-goals with a hyperparameter $\beta$ as follows:
$\mathcal{L}(\theta_u, \theta_r, \theta_s, \theta_d) = l_c(\theta_u, \theta_r, \theta_s) - \beta \cdot d(\theta_u, \theta_d)$,  (12)
where $\beta$ weighs the importance of the regularization. The larger the MMD regularization term is, the easier it is for the classifier to distinguish between the related and unrelated stances. Thus, the sign of the regularization term is negative.
The optimization involves the minimization of the objective $\mathcal{L}$ with respect to $\theta_u$, $\theta_r$, $\theta_s$, and $\theta_d$ as follows:
$\min_{\theta_u, \theta_r, \theta_s, \theta_d} \mathcal{L}(\theta_u, \theta_r, \theta_s, \theta_d)$.  (13)
Optimizing the model consists of two sub-goals. On the one hand, we want to maximize the distribution discrepancy between the two classification layers. On the other hand, we want to minimize the classification loss of both layers. Both sub-goals involve updating the feature layer parameters θu, but in opposite update directions. The optimization process does not stop until a saddle point (where the feature layer parameters serve both sub-goals well) is reached. Algorithm 1 shows the parameter update process, which is based on the mini-batch gradient descent algorithm.
Algorithm 1: Parameter update process based on the mini-batch gradient descent algorithm.
input: sample mini-batch {vi, ri, si} for i = 1..n, mini-batch size n, hyperparameters α, β, and µ
output: θu, θr, θs, θd
begin
  Initialize θu, θr, θs, θd;
  repeat
    /* forward propagation */
    lr, ls ← 0;
    for i from 1 to n do
      ui ← f(vi; θu);
      r̂i ← g(ui; θr);
      lr_i ← KL(ri ∥ r̂i);
      lr ← lr + lr_i;
      if ri · (1, 0) = 1 then
        /* related: classify the stance */
        ŝi ← h(r̂i · (1, 0); θs);
        ls_i ← KL(si ∥ ŝi);
      else
        /* unrelated */
        ls_i ← 0;
      ls ← ls + ls_i;
    d ← MMD({ui, ri} for i = 1..n; θd);
    /* backward propagation */
    θs ← θs − µ · α · ∂ls/∂θs;
    θr ← θr − µ · (∂lr/∂θr + α · ∂ls/∂θr);
    θd ← θd + µ · β · ∂d/∂θd;
    θu ← θu − µ · (∂lr/∂θu + α · ∂ls/∂θu − β · ∂d/∂θu);
  until θu, θr, θs, θd converge;

4.4 Prediction
Given as input a feature vector v, the classifier outputs the following probabilities: p(unrelated), p(agree|related), p(disagree|related), and p(discuss|related). However, the last three probabilities are not directly comparable with the first one. To make them comparable we derive
p(agree), p(disagree) and p(discuss). By observing that the agree class is assumed to be related, thus p(agree, related) = p(agree), we derive that:
p(agree) = p(agree, related)
= p(agree|related) × p(related)
= p(agree|related) × (1 − p(unrelated)).  (14)
Similarly, for the other two classes we derive that:
p(disagree) = p(disagree|related) × (1 − p(unrelated)),
p(discuss) = p(discuss|related) × (1 − p(unrelated)).  (15)
Thereby, the actual output of the model ŷ is:
ŷ = (p(agree), p(disagree), p(discuss), p(unrelated)),  (16)
where the class with the highest probability corresponds to the predicted stance.
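A small sketch of this prediction step (Eqs. (14)-(16)), assuming the two softmax outputs described above are available as NumPy arrays:

```python
import numpy as np

STANCES = ["agree", "disagree", "discuss", "unrelated"]

def predict(r_hat, s_hat):
    # r_hat = (p(related), p(unrelated));
    # s_hat = (p(agree|related), p(disagree|related), p(discuss|related)).
    p_unrelated = r_hat[1]
    y_hat = np.append(s_hat * (1.0 - p_unrelated), p_unrelated)  # Eqs. (14)-(16)
    return STANCES[int(np.argmax(y_hat))], y_hat
```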
5 EXPERIMENTAL SETUP
We start this section by presenting the datasets and evaluation
measures relevant to the stance detection task. Then, we describe
the features used by our model and the model parameterization.
Finally, we present the baselines. The software used to run the
experiments of this paper is available on the website of the first
author.
5.1 Datasets
Experiments are conducted on two publicly available datasets: the Emergent dataset [14] (https://github.com/willferreira/mscproject) and the FNC-1 dataset (https://github.com/FakeNewsChallenge/fnc-1). In these two datasets, a claim consists of a news article headline, and an evidence consists of the news article content. These datasets are split into train and test subsets; see Table 1 for statistics about the splits.
The FNC-1 dataset consists of 75,385 instances. Each instance in the dataset is a claim-evidence pair labeled with one of the four stances: agree, disagree, discuss and unrelated. The ratio of training data to testing data in the FNC-1 dataset is 2:1. Every class accounts for a similar percentage in the train and test subsets. The unrelated stances are the majority (over 70%) in both subsets, while the disagree stances are less than 3%. The discuss and agree stances account for less than 20% and 10%, respectively.
The Emergent dataset is similar to the FNC-1 dataset; however, it contains only agree, disagree and discuss stances. Hence, it needs to be augmented with unrelated stances. Similarly to how the unrelated stances of the FNC-1 dataset have been labeled, we manually labeled unrelated stances by pairing a claim with an unrelated evidence, i.e., an evidence originally paired with another claim. Moreover, to make the class distributions less imbalanced, we set the ratio of related to unrelated stances to 1:1. The augmented Emergent dataset contains 4,071 training labels and 1,024 testing labels, with a ratio of 4:1. Class distributions between train and test subsets are similar.
Compared to the FNC-1 dataset, the class distributions of the augmented Emergent dataset are more balanced. The percentage of unrelated stances is about 50%, whereas the percentages of agree and disagree stances are about 24% and 8%. Both datasets have similar percentages of the discuss stances.
5.2 Evaluation Measure
In line with the FNC-1 challenge, the evaluation is based on a weighted two-level scoring system built on the accuracy measure. This evaluation measure, called relative score, evaluates a model by splitting the stance detection task into two sub-tasks: the related/unrelated and the agree/disagree/discuss classification sub-tasks. The former sub-task is given a 25% weight, because it is considered to be easier than the latter sub-task, which is given a 75% weight.
We report the following evaluation measures: relative score, accuracy, and accuracy on a per class basis.
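The following is a hedged sketch of how such a weighted two-level score can be computed; the exact point assignment (0.25 for a correct related/unrelated decision, an additional 0.75 for a correct stance on related instances, normalized by the maximum attainable score) follows our reading of the FNC-1 scheme and may differ in detail from the official scorer.

```python
RELATED = {"agree", "disagree", "discuss"}

def relative_score(gold, pred):
    # gold, pred: lists of stance labels ("agree", "disagree", "discuss", "unrelated").
    score, max_score = 0.0, 0.0
    for g, p in zip(gold, pred):
        max_score += 0.25 + (0.75 if g in RELATED else 0.0)
        if (g in RELATED) == (p in RELATED):
            score += 0.25                      # correct related/unrelated decision
            if g in RELATED and g == p:
                score += 0.75                  # correct fine-grained stance
    return score / max_score
```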
5.3 Feature Extraction
To represent claims and evidences we use a bag-of-words approach.
For each claim and evidence we generate a TF-IDF vector, and for
each claim-evidence pair we compute their cosine similarity. We also include the FNC-1 official features in the input feature vector.
The nal set of features include:
TF-IDF vectors of claims;
TF-IDF vectors of evidences;
Cosine similarity (CosSim) between the claim vector and the
evidence;
Ratio of word overlap (WordLap) between the claim and the
evidence;
An Indicator whether a claim has refuting words (RefWord);
The polarity (Pol) of the claim and the evidence;
The number of overlapping
n
-grams (NGrams) for
n∈ {
2
,
3
,
4,5,6}between the claim and the evidence.
For the TF-IDF vectors, we only use the top 2,000 most frequent terms, excluding stop-words. All of these features are concatenated to form the input feature vector v.
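As a minimal sketch of the TF-IDF and cosine-similarity part of this feature vector, the following uses scikit-learn; the vectorizer settings and the handling of the remaining hand-crafted features (WordLap, RefWord, Pol, NGrams) are illustrative assumptions rather than the exact feature pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_features(claims, evidences):
    # Fit a shared vocabulary of the 2,000 most frequent non-stop-word terms.
    vectorizer = TfidfVectorizer(max_features=2000, stop_words="english")
    vectorizer.fit(claims + evidences)
    claim_vecs = vectorizer.transform(claims).toarray()
    evid_vecs = vectorizer.transform(evidences).toarray()
    # Cosine similarity between each claim and its paired evidence.
    cos_sim = np.array([
        cosine_similarity(c.reshape(1, -1), e.reshape(1, -1))[0, 0]
        for c, e in zip(claim_vecs, evid_vecs)
    ]).reshape(-1, 1)
    # The remaining hand-crafted features (WordLap, RefWord, Pol, NGrams) would
    # be concatenated here in the same way.
    return np.hstack([claim_vecs, evid_vecs, cos_sim])
```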
5.4 Experimental Setting
The following hyperparameters have been set via a five-fold cross-validation on the train subsets:
The dimension k of the hidden representations is set to 100;
The dimension j of the MMD projection is set to 10;
The activation function used in the hidden layers is set to ReLU;
The parameter α is set to 1.5 and 1.3 for the Emergent and FNC-1 datasets, respectively;
The parameter β is set to 0.001.
We include an L2 regularization term [29] for the MLP weight parameters in the final loss function to mitigate overfitting. Dropout is also used to mitigate overfitting, with the rate set to 0.6. We train in mini-batches of size 64 over the entire train subset. Note that the gradient steps in Algorithm 1 can easily be replaced with a more powerful optimizer such as the Adam optimizer [20]. Early stopping is applied when the classification loss on the validation subset does not decrease for three consecutive iterations. The whole model is implemented with TensorFlow.
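As a rough illustration of this configuration, the following is a simplified Keras-style sketch using the listed hyperparameters (hidden size 100, ReLU, dropout 0.6, L2 weight regularization, Adam, KL-divergence losses). It wires the stance head directly to the hidden representation and omits the MMD term and the gating of Eq. (3), so it is an assumption-laden simplification rather than the authors' implementation; the L2 weight and the layer names are also arbitrary choices.

```python
import tensorflow as tf

def build_model(input_dim, hidden_dim=100, dropout_rate=0.6, l2_weight=1e-4, alpha=1.3):
    reg = tf.keras.regularizers.l2(l2_weight)
    v = tf.keras.Input(shape=(input_dim,))
    u = tf.keras.layers.Dense(hidden_dim, activation="relu", kernel_regularizer=reg)(v)
    u = tf.keras.layers.Dropout(dropout_rate)(u)
    # Two classification heads: relatedness (2-way) and stance (3-way).
    r_hat = tf.keras.layers.Dense(2, activation="softmax", name="relatedness")(u)
    s_hat = tf.keras.layers.Dense(3, activation="softmax", name="stance")(u)
    model = tf.keras.Model(inputs=v, outputs=[r_hat, s_hat])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss={"relatedness": tf.keras.losses.KLDivergence(),
              "stance": tf.keras.losses.KLDivergence()},
        loss_weights={"relatedness": 1.0, "stance": alpha},  # alpha as in Eq. (8)
    )
    return model
```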
5.5 Baselines
We compare our model against the methods mentioned in Section 2. These methods are detailed in the following. Among them we distinguish between methods that use the same features as ours and methods that learn their representations. We start with the latter type, which we call representation learning-based baselines:
Bidirectional LSTM (BiLSTM). Augenstein et al. [4] build a neural network architecture based on bidirectional LSTMs on a Twitter dataset. An LSTM encodes the claim, and another LSTM encodes the evidence with the encoded claim set as the initial states. The 100-dimensional GloVe word embeddings are used as input [30];
Attentive CNN (AtCNN). Bajaj [6] builds an attention-augmented CNN. The claim and the evidence are input to a convolutional neural network to obtain hidden representations, and the attention mechanism is employed to locate the words or phrases most influential on the final results;
Table 1: Statistics of the datasets.
Subset | Stance | Emergent: Number / Percentage | FNC-1: Number / Percentage
Training | agree | 992 / 24.37 | 3,678 / 7.36
Training | disagree | 303 / 7.44 | 840 / 1.68
Training | discuss | 776 / 19.06 | 8,909 / 17.83
Training | unrelated | 2,000 / 49.13 | 36,545 / 73.13
Training | total | 4,071 | 49,972
Testing | agree | 246 / 24.02 | 1,903 / 7.49
Testing | disagree | 91 / 8.89 | 697 / 2.74
Testing | discuss | 776 / 19.06 | 4,464 / 17.57
Testing | unrelated | 500 / 48.83 | 18,349 / 72.20
Testing | total | 1,024 | 25,413
Memory Network (MN). Mohtarami et al. [26] develop an end-to-end memory network for stance detection. The network operates at the paragraph level and integrates convolutional and recurrent neural networks, as well as a similarity matrix, as part of the overall architecture;
Ranking Model (RM). Zhang et al. [48] build a ranking method to tackle stance detection and achieve empirical performance improvements. A ranking loss function is proposed to replace Softmax and maximize the representation difference between the four stance classes.
We now review the second type of baselines, those methods that use the same features as our method, which we call feature engineering-based baselines:
Official Baseline (OB). This is the FNC-1 official baseline that uses one gradient boosted decision trees model for four-way classification;
Logistic Regression (LR). Bourgonje et al. [10] use n-gram matching and a rule-based procedure to decide relatedness, and three-way logistic regression to distinguish among the related classes;
Gradient Boosted Decision Trees (GBDT). Wang et al. [43] develop two GBDT models, one to determine the relatedness of an evidence to a claim, and another to distinguish among the related classes;
Multi-Layer Perceptron (MLP). This model [33] achieved the third best performance in FNC-1. It extracts TF-IDF vectors and the cosine similarity between claims and evidences as input features, and uses an MLP as the four-class classifier.
6 RESULTS AND DISCUSSION
In this section, we start by analyzing the dependency assumption.
Then, we compare and contrast our model against the baselines.
Next, we provide a sensitivity analysis of the hyperparameters. We
conclude with an impact analysis of the features used by the model.
6.1 Dependency Assumption
In Figure 2 we show the effect of the three dependency assumptions by visualizing the learned representations using a t-SNE projection [24]. We observe that when the classifiers are assumed independent, i.e., the classification is performed in cascade (no error is propagated from the second layer to the first during training), then the learned representation well separates the unrelated class from the related ones. When the classifiers are assumed dependent, i.e., the two classifiers are trained together (the error is left free to propagate from the second layer to the first), then the learned representation is not very well separated. However, when the dependence assumption of the two classifiers is learned via the MMD regularization, i.e., the two classifiers are trained together with the error propagation controlled by the regularizer, then the learned representation is again well separated as in the first case. Well-separated representations suggest a greater discriminative power of the model: the unrelated and related classes are almost linearly separable.
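A small sketch of how such a visualization can be produced from the hidden representations u, using scikit-learn's t-SNE and matplotlib; the perplexity, point size and output path are arbitrary choices of this sketch.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_hidden_representations(U, related_mask, path="tsne.png"):
    # U: (n, k) matrix of hidden representations; related_mask: boolean array of length n.
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(U)
    plt.scatter(coords[related_mask, 0], coords[related_mask, 1], s=4, label="related")
    plt.scatter(coords[~related_mask, 0], coords[~related_mask, 1], s=4, label="unrelated")
    plt.legend()
    plt.savefig(path, dpi=200)
```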
The last three rows of Tables 2 and 3 show the performance
of our model on the two test subsets for each one of the three
assumptions: independent, dependent, and learned. Looking at the
accuracy of the unrelated class, we observe that the accuracy is
greater when the learned representations are well-separated, as in
the independent and learned cases. Furthermore, looking at all the
other scores, we observe that the learned assumption outperforms
both the independent and dependent assumptions in all other cases,
demonstrating that jointly learning both the relatedness and the stance of the evidences towards claims is beneficial to the stance detection task.
6.2 Overall Performance
In Tables 2 and 3 we compare our model against the state-of-the-art
models. Our model achieves the best stance detection performance
for the relative score on both datasets. The model achieves 89.30%
on the augmented Emergent test subset and 88.15% on the FNC-1
test subset.
By comparing with the four-way classification baselines (OB, MLP, BiLSTM, AtCNN, MN and RM) we demonstrate the advantage of separating the relatedness detection from the stance detection. We observe that these classifiers perform poorly on the disagree class, which is caused by the large percentage difference between the minority disagree class and the majority unrelated class. Further, the more imbalanced the evaluation dataset becomes, the worse the performance achieved by the four-way classifiers on the minority disagree class.
Figure 2: t-SNE visualization of the hidden representations on the training data: (a) Independent, the model trained with separated layers; (b) Dependent, trained together but without regularization; (c) Learned, trained together with MMD regularization.
Table 2: Performance comparison of our model against the State-of-the-Art models on the augmented Emergent dataset.
Model | Accuracy (%) on agree, disagree, discuss, unrelated | Relative Score (%)
Feature Engineering-Based Baselines
OB 33.56 23.44 70.23 84.00 74.86
LR (Bourgonje et al.) 66.73 40.51 78.33 78.00 83.45
GBDT (Wang et al.) 80.62 50.42 83.52 88.00 87.53
MLP (Riedel et al.) 58.53 23.64 79.05 95.00 85.43
RM (Zhang et al.) 64.56 40.42 85.45 96.00 87.69
Representation Learning-Based Baselines
BiLSTM (Augenstein et al.) 43.21 12.57 78.55 96.00 81.37
AtCNN (Bajaj) 44.78 14.60 72.44 97.00 83.56
MN (Mohtarami et al.) 54.64 40.05 72.10 89.00 85.92
Our Models
Independent 74.54 45.32 82.59 95.49 86.33
Dependent 63.54 44.68 68.35 95.00 86.72
Learned 82.52 69.05 84.30 97.00 89.30
By comparing with the baselines that separate the relatedness detection from the stance detection (LR and GBDT) we demonstrate the superiority of a single end-to-end model. LR and GBDT are better on the disagree class, although their overall performance is worse than that of our model.
In Figure 3 we show the confusion matrices of our model. Here we observe the detection performance on a per class basis. For the related/unrelated classification, we correctly classify 97.00% and 99.53% of the unrelated instances on the augmented Emergent and the FNC-1 test subsets. We can see that there is some misclassification between the agree and unrelated classes, and between the discuss and unrelated classes. The misclassification of the disagree class accounts for the largest error on the unrelated instances.
Our model achieves an accuracy of 69.05% and 72.35% for the disagree class on the Emergent and the FNC-1 test subsets. The classification accuracy is largely improved compared to the state-of-the-art. Some misclassification error exists between agree and disagree. However, our model can distinguish between the discuss and the disagree classes with few errors. While the number of discuss cases is the largest and the number of disagree instances is the smallest, our model does not mistake disagree instances for discuss ones, i.e., the model has learned the core representation difference between these two classes. Due to ambiguous expressions, misclassification between agree and discuss is the cause of most errors between these classes, which leads to a slightly worse accuracy for the discuss class on the Emergent (84.30%) and FNC-1 (77.49%) test subsets.
Table 3: Performance comparison of our model against the State-of-the-Art models on the FNC-1 dataset.
Model | Accuracy (%) on agree, disagree, discuss, unrelated | Relative Score (%)
Feature Engineering-Based Baselines
OB 10.51 1.00 79.66 97.98 75.20
LR (Bourgonje et al.) 67.42 31.61 75.23 95.36 80.63
GBDT (Wang et al.) 82.93 69.82 33.52 95.42 86.72
MLP (Riedel et al.) 44.04 6.60 81.38 97.90 81.72
RM (Zhang et al.) 64.90 27.26 84.41 99.12 86.66
Representation Learning-Based Baselines
BiLSTM (Augenstein et al.) 35.96 0.94 80.33 98.54 78.70
AtCNN (Bajaj) 38.67 8.24 70.63 91.25 75.77
MN (Mohtarami et al.) 16.92 60.22 81.27 95.50 79.92
Our Models
Independent 72.41 37.90 68.23 97.43 83.47
Dependent 61.34 42.93 59.38 99.05 85.32
Learned 80.61 72.35 77.49 99.53 88.15
Figure 3: The confusion matrices of our model for the augmented Emergent (on the left) and FNC-1 (on the right) datasets.
Two reasons account for the improved empirical performance observed for our model. On the one hand, the mitigation of the class imbalance problem: contrary to the four-way classifiers that directly compare the disagree and unrelated instances, the hierarchical model avoids the direct comparison of the minority disagree class (which is less than 2% of the FNC-1 dataset) with the majority unrelated one (which is more than 70% of the FNC-1 dataset). On the other hand, the MMD term maximizes the discrepancy between the unrelated class and the aggregated related classes. Since agree, disagree and discuss belong to the same related class, the MMD regularization promotes the emergence of features that are useful to separate the class pairs: agree with unrelated, disagree with unrelated, and discuss with unrelated.
6.3 Hyperparameter Sensitivity
In this subsection we discuss the sensitivity of our model to its hyperparameters. The most influential hyperparameters for the proposed model are α and β. The former controls the relative importance of the classification layers; the latter weighs the regularization.
In Figures 4(a) and 4(b) we show how the performance of the model changes when varying α and β on the augmented Emergent and FNC-1 test subsets.
α is searched between 0.1 and 3.0 with steps of 0.1, and β is searched in {0, 0.1, 0.01, 0.001, 0.0001, 0.00001}.
For α, we observe that the performance of the model improves quickly as α increases, peaks at 1.5 and 1.3 for the FNC-1 and augmented Emergent datasets, and then experiences a slight decrease when α is increased further. We hypothesize that the optimal α is related to the class balance between the unrelated class and the related ones: the more unbalanced the dataset is towards the unrelated class, the larger the optimal α. For β, we observe that the performance is the highest when β is set to 0.001. This happens for both the augmented Emergent and FNC-1 test subsets. These optimal values of α and β observed on the test subsets are equal to the ones found when training the model.
Figure 4: Sensitivity of the trained model when varying the parameters α and β on the test subsets of the augmented Emergent (a, left) and FNC-1 (b, right) datasets.
Table 4: Performance of our model with different feature sets on the FNC-1 dataset. "/" denotes that no feature set is removed.
Removed Feature Set | Accuracy (%) on agree, disagree, discuss, unrelated
CosSim 71.53 85.08 78.76 69.37
WordLap 67.43 80.49 77.31 77.89
RefWord 74.43 64.37 77.03 97.49
Pol 60.49 67.93 80.92 98.79
NGrams 74.27 75.73 87.82 84.52
/ 80.61 82.35 77.49 99.53
6.4 Feature Analysis
In this subsection we evaluate and discuss the importance of each feature towards the final prediction. To examine the influence of each feature on the final performance, we adopt a leave-one-feature-set-out approach and record the classification accuracy on the stance detection task. The following analysis is based only on the FNC-1 dataset; similar results are observed on the augmented Emergent dataset.
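A sketch of the leave-one-feature-set-out procedure is shown below; the feature_blocks dictionary and the train_and_evaluate helper are hypothetical names introduced only for illustration.

```python
import numpy as np

def ablation(feature_blocks, labels, train_and_evaluate):
    """feature_blocks: dict mapping a feature-set name to its (n, d_i) matrix.

    train_and_evaluate is a hypothetical helper that trains the model on the
    given feature matrix and returns per-class accuracies on the test subset.
    """
    results = {}
    for removed in [None] + list(feature_blocks):
        kept = [m for name, m in feature_blocks.items() if name != removed]
        X = np.hstack(kept)
        results[removed or "/"] = train_and_evaluate(X, labels)
    return results
```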
In Table 4 we show the results of this analysis. We observe that removing the CosSim feature leads to a large decrease in accuracy for the unrelated class. Similarly, the use of WordLap has a positive effect for the agree class, and it also contributes to the unrelated class. The RefWord and Pol features help for the agree and disagree classes, while removing the NGrams feature leads to an increase on the discuss class, i.e., the NGrams feature causes confusion between the discuss and the other classes.
7 CONCLUSION
In this paper, we studied the problem of stance detection: the classification of the stance of an evidence towards a claim into one of four classes: agree, disagree, discuss and unrelated.
We proposed a hierarchical representation of the stance classes, where the classes agree, disagree and discuss are combined together into a class referred to as the related class. The main idea here is to divide a concept into sub-concepts that are organized in a hierarchical structure, and to design constraints between sub-concepts in order to make the model parameter optimization more sensible. The primary advantage of this hierarchical representation is that it helps overcome the class imbalance problem.
This hierarchical representation has inspired the proposed two-layer neural network to tackle the stance detection task. The first layer performs a related/unrelated classification, while the second layer performs a more fine-grained classification among the related classes. Furthermore, we have empirically demonstrated that (1) it is advantageous to learn these two classification tasks together, and (2) the dependency between these two layers can be learned through an MMD regularization term, which measures the representation discrepancy between the two layers. Experiments on two publicly available datasets have shown that our model is able to outperform the state-of-the-art stance detection methods.
As future work we consider enriching the proposed model as follows. First, integrating a credibility evaluation of information sources as features. Second, improving the explainability of the model by showing which words or phrases are the most influential in predicting the stance via attention mechanisms.
ACKNOWLEDGMENTS
This project was funded by the EPSRC Fellowship titled "Task Based
Information Retrieval", grant reference number EP/P024289/1. We
acknowledge the support of NVIDIA Corporation with the donation
of the Titan Xp GPU used for this research.
REFERENCES
[1]
Hunt Allcott and Matthew Gentzkow. 2017. Social media and fake news in the
2016 election. Technical Report. National Bureau of Economic Research.
[2]
Pranav Anand, Marilyn Walker, Rob Abbott, Jean E Fox Tree, Robeson Bowmani,
and Michael Minor. 2011. Cats rule and dogs drool!: Classifying stance in on-
line debate. In Proceedings of the 2nd workshop on computational approaches to
subjectivity and sentiment analysis. Association for Computational Linguistics,
1–9.
[3]
Nachman Aronszajn. 1950. Theory of reproducing kernels. Transactions of the
American mathematical society 68, 3 (1950), 337–404.
[4]
Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, and Kalina Bontcheva.
2016. Stance Detection with Bidirectional Conditional Encoding. In Proceedings
of the 2016 Conference on Empirical Methods in Natural Language Processing.
Association for Computational Linguistics, 876–885. https://doi.org/10.18653/
v1/D16-1084
[5]
Sean Baird, Sibley Doug, and Yuxi Pan. 2017. Talos Targets Disinformation with
Fake News Challenge Victory. (2017). https://blog.talosintelligence.com/2017/
06/talos-fake-news-challenge.html
[6]
Samir Bajaj. 2017. “The Pope Has a New Baby!” Fake News Detection Using
Deep Learning. (2017).
[7]
Roy Bar-Haim, Indrajit Bhattacharya, Francesco Dinuzzo, Amrita Saha, and Noam
Slonim. 2017. Stance classification of context-dependent claims. In Proceedings of
the 15th Conference of the European Chapter of the Association for Computational
Linguistics: Volume 1, Long Papers, Vol. 1. 251–261.
[8]
Adam J Berinsky. 2017. Rumors and health care reform: experiments in political
misinformation. British Journal of Political Science 47, 2 (2017), 241–262.
[9]
Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bern-
hard Schölkopf, and Alex J Smola. 2006. Integrating structured biological data by
kernel maximum mean discrepancy. Bioinformatics 22, 14 (2006), e49–e57.
[10]
Peter Bourgonje, Julian Moreno Schneider, and Georg Rehm. 2017. From clickbait
to fake news detection: an approach based on detecting the stance of headlines to
articles. In Proceedings of the 2017 EMNLP Workshop: Natural Language Processing
meets Journalism. 84–89.
[11]
Clinton Burfoot, Steven Bird, and Timothy Baldwin. 2011. Collective classification of congressional floor-debate transcripts. In Proceedings of the 49th An-
nual Meeting of the Association for Computational Linguistics: Human Language
Technologies-Volume 1. Association for Computational Linguistics, 1506–1515.
[12]
Sophie Chesney, Maria Liakata, Massimo Poesio, and Matthew Purver. 2017.
Incongruent headlines: Yet another way to mislead your readers. In Proceedings
of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism.
56–61.
[13]
Jiachen Du, Ruifeng Xu, Yulan He, and Lin Gui. 2017. Stance classification with target-specific neural attention networks. International Joint Conferences on Artificial Intelligence.
[14]
William Ferreira and Andreas Vlachos. 2016. Emergent: a novel data-set for stance
classication. In Proceedings of the 2016 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies.
ACL.
[15]
Bo Geng, Dacheng Tao, and Chao Xu. 2011. DAML: Domain adaptation metric
learning. IEEE Transactions on Image Processing 20, 10 (2011), 2980–2989.
[16]
Andreas Hanselowski, PVS Avinesh, Benjamin Schiller, and Felix Caspelherr.
2017. Description of the system developed by team athene in the FNC-1. Technical Report.
[17]
Andreas Hanselowski, Avinesh PVS, Benjamin Schiller, Felix Caspelherr, Deban-
jan Chaudhuri, Christian M Meyer, and Iryna Gurevych. 2018. A Retrospective
Analysis of the Fake News Challenge Stance Detection Task. arXiv preprint
arXiv:1806.05180 (2018).
[18]
Andreas Hanselowski, Avinesh PVS, Benjamin Schiller, Felix Caspelherr, Deban-
jan Chaudhuri, Christian M. Meyer, and Iryna Gurevych. 2018. A Retrospective
Analysis of the Fake News Challenge Stance-Detection Task. In Proceedings of
the 27th International Conference on Computational Linguistics. Association for
Computational Linguistics, 1859–1874. http://aclweb.org/anthology/C18-1158
[19]
Kazi Saidul Hasan and Vincent Ng. 2013. Stance classification of ideological
debates: Data, models, features, and constraints. In Proceedings of the Sixth Inter-
national Joint Conference on Natural Language Processing. 1348–1356.
[20]
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic opti-
mization. arXiv preprint arXiv:1412.6980 (2014).
[21]
Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency.
The annals of mathematical statistics 22, 1 (1951), 79–86.
[22]
Srijan Kumar and Neil Shah. 2018. False information on web and social media: A
survey. arXiv preprint arXiv:1804.08559 (2018).
[23]
Jing Ma, Wei Gao, and Kam-Fai Wong. 2018. Detect Rumor and Stance Jointly
by Neural Multi-task Learning. In Companion of the The Web Conference 2018 on
The Web Conference 2018. International World Wide Web Conferences Steering
Committee, 585–593.
[24]
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE.
Journal of machine learning research 9, Nov (2008), 2579–2605.
[25]
Todor Mihaylov, Georgi Georgiev, and Preslav Nakov. 2015. Finding opinion
manipulation trolls in news community forums. In Proceedings of the Nineteenth
Conference on Computational Natural Language Learning. 310–314.
[26]
Mitra Mohtarami, Ramy Baly, James Glass, Preslav Nakov, Lluís Màrquez, and
Alessandro Moschitti. 2018. Automatic Stance Detection Using End-to-End
Memory Networks. arXiv preprint arXiv:1804.07581 (2018).
[27]
Akiko Murakami and Rudy Raymond. 2010. Support or oppose?: classifying
positions in online debates from reply activities and opinion expressions. In
Proceedings of the 23rd International Conference on Computational Linguistics:
Posters. Association for Computational Linguistics, 869–875.
[28]
Nasser M Nasrabadi. 2007. Pattern recognition and machine learning. Journal of
electronic imaging 16, 4 (2007), 049901.
[29]
Arnold Neumaier. 1998. Solving ill-conditioned and singular linear systems: A
tutorial on regularization. SIAM review 40, 3 (1998), 636–666.
[30]
Jerey Pennington, Richard Socher, and Christopher Manning. 2014. Glove:
Global vectors for word representation. In Proceedings of the 2014 conference on
empirical methods in natural language processing (EMNLP). 1532–1543.
[31]
Kashyap Popat, Subhabrata Mukherjee, Jannik Strötgen, and Gerhard Weikum.
2017. Where the truth lies: Explaining the credibility of emerging claims on
the web and social media. In Proceedings of the 26th International Conference
on World Wide Web Companion. International World Wide Web Conferences
Steering Committee, 1003–1012.
[32]
Neel Rakholia and Shruti Bhargava. 2017. “Is it true?” – Deep Learning for Stance
Detection in News. (2017).
[33]
Benjamin Riedel, Isabelle Augenstein, Georgios P Spithourakis, and Sebastian
Riedel. 2017. A simple but tough-to-beat baseline for the Fake News Challenge
stance detection task. arXiv preprint arXiv:1707.03264 (2017).
[34]
Sebastian Ruder, John Glover, Afshin Mehrabani, and Parsa Ghaffari. 2018. 360
Stance Detection. In Proceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics: Demonstrations. 31–35.
[35]
Bernhard Schölkopf, Koji Tsuda, and Jean-Philippe Vert. 2004. Kernel methods in
computational biology. MIT press.
[36]
Prashant Shiralkar, Alessandro Flammini, Filippo Menczer, and Giovanni Luca
Ciampaglia. 2017. Finding streams in knowledge graphs to support fact checking.
In Data Mining (ICDM), 2017 IEEE International Conference on. IEEE, 859–864.
[37]
Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017. Fake news
detection on social media: A data mining perspective. ACM SIGKDD Explorations
Newsletter 19, 1 (2017), 22–36.
[38]
Swapna Somasundaran and Janyce Wiebe. 2009. Recognizing stances in online
debates. In Proceedings of the Joint Conference of the 47th Annual Meeting of the
ACL and the 4th International Joint Conference on Natural Language Processing
of the AFNLP: Volume 1-Volume 1. Association for Computational Linguistics,
226–234.
[39]
Swapna Somasundaran and Janyce Wiebe. 2010. Recognizing stances in ide-
ological on-line debates. In Proceedings of the NAACL HLT 2010 Workshop on
Computational Approaches to Analysis and Generation of Emotion in Text. Associ-
ation for Computational Linguistics, 116–124.
[40]
Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining
support or opposition from Congressional floor-debate transcripts. In Proceed-
ings of the 2006 conference on empirical methods in natural language processing.
Association for Computational Linguistics, 327–335.
[41]
Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false
news online. Science 359, 6380 (2018), 1146–1151.
[42]
Marilyn A Walker, Pranav Anand, Robert Abbott, and Ricky Grant. 2012. Stance
classication using dialogic properties of persuasion. In Proceedings of the 2012
conference of the North American chapter of the association for computational lin-
guistics: Human language technologies. Association for Computational Linguistics,
592–596.
[43]
Xuezhi Wang, Cong Yu, Simon Baumgartner, and Flip Korn. 2018. Relevant
Document Discovery for Fact-Checking Articles. In Companion of the The Web
Conference 2018 on The Web Conference 2018. International World Wide Web
Conferences Steering Committee, 525–533.
[44]
Jen Weedon, William Nuland, and Alex Stamos. 2017. Information operations
and Facebook. version 1 (2017), 27.
[45]
Houping Xiao et al. 2018. Multi-sourced Information Trustworthiness Analysis:
Applications and Theory. Ph.D. Dissertation. State University of New York at
Bualo.
[46]
Ainur Yessenalina, Yisong Yue, and Claire Cardie. 2010. Multi-level structured
models for document-level sentiment classification. In Proceedings of the 2010
Conference on Empirical Methods in Natural Language Processing. Association for
Computational Linguistics, 1046–1056.
[47]
Qiang Zhang, Aldo Lipani, Shangsong Liang, and Emine Yilmaz. 2019. Reply-
Aided Detection of Misinformation via Bayesian Deep Learning. In Companion
Proceedings of the The Web Conference 2019. ACM Press.
[48]
Qiang Zhang, Emine Yilmaz, and Shangsong Liang. 2018. Ranking-based Method
for News Stance Detection. In Companion Proceedings of the The Web Conference
2018. ACM Press.
... In this sense, considerable research uses the stance datasets Emergent [25] or its extended version FNC-1 [26] to create misleading headline detection approaches. Some research using these datasets are [27]- [29]. Although it is a methodology widely used in the treatment of misleading headlines, as [11] indicated, determining the stance between headline and body text may not carry enough weight to determine incongruency between the two textual elements. ...
... The results obtained in the class unrelated to predict the test set indicate that the systems are capable of detecting this class, with high performance, corroborating the results obtained in the literature on this type of semantic relationship between texts [27]. With respect to the other two classes, the systems achieved remarkable results, but there is room for improvement. ...
Article
Full-text available
Misleading headlines are part of the disinformation problem. Headlines should give a concise summary of the news story helping the reader to decide whether to read the body text of the article, which is why headline accuracy is a crucial element of a news story. This work focuses on detecting misleading headlines through the automatic identification of contradiction between the headline and body text of a news item. When the contradiction is detected, the reader is alerted to the lack of precision or trustworthiness of the headline in relation to the body text. To facilitate the automatic detection of misleading headlines, a new Spanish dataset is created (ES_Headline_Contradiction) for the purpose of identifying contradictory information between a headline and its body text. This dataset annotates the semantic relationship between headlines and body text by categorising the relation between texts as compatible , contradictory and unrelated . Furthermore, another novel aspect of this dataset is that it distinguishes between different types of contradictions, thereby enabling a more fine-grain identification of them. The dataset was built via a novel semi-automatic methodology, which resulted in a more cost-efficient development process. The results of the experiments show that pre-trained language models can be fine-tuned with this dataset, producing very encouraging results for detecting incongruency or non-relation between headline and body text.
... This weighting is static during the training, and the value is set constantly to one-fifth. Recently, Zhang et al. (2019) investigated the problem of stances' imbalance. The authors proposed a hierarchical model (i.e., a two-layer neural network) for stance detection. ...
... Support (Source message) Query Support Comment Comment Comment Deny Therefore, the data obtained from the social network is highly imbalanced in the "query" and "deny" classes. Due to severe class imbalance in rumor stance data, current machine learning models often fail to classify instances that fall into "query" and "deny" classes correctly (Lukasik et al. 2019;Zhang et al. 2019). Hence, the stance classification model should effectively detect imbalanced data, especially in the "deny" and "query" classes. ...
Article
Full-text available
As online social networks are experiencing extreme popularity growth, determining the veracity of online statements denoted by rumors automatically as earliest as possible is essential to prevent the harmful effects of propagating misinformation. Early detection of rumors is facilitated by considering the wisdom of the crowd through analyzing different attitudes expressed toward a rumor (i.e., users’ stances). Stance detection is an imbalanced problem as the querying and denying stances against a given rumor are significantly less than supportive and commenting stances. However, the success of stance-based rumor detection significantly depends on the efficient detection of “query” and “deny” classes. The imbalance problem has led the previous stance classifier models to bias toward the majority classes and ignore the minority ones. Consequently, the stance and subsequently rumor classifiers have been faced with the problem of low performance. This paper proposes a novel adaptive cost-sensitive loss function for learning imbalanced stance data using deep neural networks, which improves the performance of stance classifiers in rare classes. The proposed loss function is a cost-sensitive form of cross-entropy loss. In contrast to most of the existing cost-sensitive deep neural network models, the utilized cost matrix is not manually set but adaptively tuned during the learning process. Hence, the contributions of the proposed method are both in the formulation of the loss function and the algorithm for calculating adaptive costs. The experimental results of applying the proposed algorithm to stance classification of real Twitter and Reddit data demonstrate its capability in detecting rare classes while improving the overall performance. The proposed method improves the mean F-score of rare classes by about 13% in RumorEval 2017 dataset and about 20% in RumorEval 2019 dataset.
... Some studies proposed to train supervised models based on hand-crafted features [3,50,56]. To alleviate feature engineering, Zhang et al. [51] proposed to learn a hierarchical representation of stance classes to overcome the class imbalance problem. Further, multi-task learning framework was utilized to mutually reinforce stance detection and rumor classification simultaneously [20,21,34,46]. ...
... In the follow-up studies, a range of hand-crafted features [3,50,56] as well as temporal traits [29,30] were studied to train stance detection models. More recently, deep neural networks were utilized for stance representation learning and classification to alleviate feature engineering and pursue stronger generalizability, such as bidirectional RNNs [2] and two-layer neural networks for learning hierarchical representation of stance classes [51]. Some studies further took into account conversation structure, such as the tree-based LSTM model for detecting stances [19,55] and the tree-structured multi-task framework for joint detection of rumors and stances [46]. ...
Conference Paper
Full-text available
The diffusion of rumors on social media generally follows a propagation tree structure, which provides valuable clues on how an original message is transmitted and responded by users over time. Recent studies reveal that rumor verification and stance detection are two relevant tasks that can jointly enhance each other despite their differences. For example, rumors can be debunked by cross-checking the stances conveyed by their relevant posts, and stances are also conditioned on the nature of the rumor. However, stance detection typically requires a large training set of labeled stances at post level, which are rare and costly to annotate. Enlightened by Multiple Instance Learning (MIL) scheme, we propose a novel weakly supervised joint learning framework for rumor verification and stance detection which only requires bag-level class labels concerning the rumor's veracity. Specifically, based on the propagation trees of source posts, we convert the two multi-class problems into multiple MIL-based binary classification problems where each binary model is focused on differentiating a target class (of rumor or stance) from the remaining classes. Then, we propose a hierarchical attention mechanism to aggregate the binary predictions, including (1) a bottom-up/top-down tree attention layer to aggregate binary stances into binary veracity; and (2) a discriminative attention layer to aggregate the binary class into finer-grained classes. Extensive experiments conducted on three Twitter-based datasets demonstrate promising performance of our model on both claim-level rumor detection and post-level stance classification compared with state-of-the-art methods.
... Meanwhile, Zhang [12] introduced an end-to-end ranking algorithm using a Multi-Layer Perceptron (MLP), where TF-IDF was utilized to extract features from headlines and article bodies. In a subsequent study, Zhang [54] addressed the classification challenge by proposing a hierarchical model that grouped the 'agree', 'disagree', and 'discuss' categories into a single 'related' class for more effective categorization. ...
Article
Online social networks (OSNs) are inundated with an enormous daily influx of news shared by users worldwide. Information can originate from any OSN user and quickly spread, making the task of fact-checking news both time-consuming and resource-intensive. To address this challenge, researchers are exploring machine learning techniques to automate fake news detection. This paper specifically focuses on detecting the stance of content producers—whether they support or oppose the subject of the content. Our study aims to develop and evaluate advanced text-mining models that leverage pre-trained language models enhanced with meta features derived from headlines and article bodies. We sought to determine whether incorporating the cosine distance feature could improve model prediction accuracy. After analyzing and assessing several previous competition entries, we identified three key tasks for achieving high accuracy: (1) a multi-stage approach that integrates classical and neural network classifiers, (2) the extraction of additional text-based meta features from headline and article body columns, and (3) the utilization of recent pre-trained embeddings and transformer models.
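The cosine-distance meta feature mentioned above can be computed, for example, from TF-IDF vectors of the headline and the article body. The scikit-learn sketch below uses made-up headline/body strings and is only one plausible realisation of that feature.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical headline/body pairs; in practice these come from the dataset.
headlines = ["New vaccine approved by regulators"]
bodies = ["Regulators announced today that the new vaccine has been approved."]

# Fit one vectorizer on both fields so headline and body share a vocabulary.
vec = TfidfVectorizer()
vec.fit(headlines + bodies)

h = vec.transform(headlines)
b = vec.transform(bodies)

# Cosine distance = 1 - cosine similarity; used as an extra meta feature
# alongside the pre-trained language-model representations.
cosine_distance = 1.0 - cosine_similarity(h, b).diagonal()
print(cosine_distance)
```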
... Note that the criterion for determining the relevance to the target differs from that for the prediction of stance. Focusing on this point, Zhang et al. [25] divided the task into relevance classification and stance classification for relevant data, and trained a multilayer perceptron using term frequency-inverse document frequency-based feature vectors. In addition, Roy et al. [20] divided stance classification into three hierarchical classification stages: "related/unrelated," "stance/neutral," and "agree/disagree." ...
Article
This study focuses on a method for differentiating between the stances of citizens and city councilors on political issues (i.e., in favor or against) and attempts to compare the arguments of both sides. We created a dataset by annotating citizen tweets and city council minutes with labels for four attributes: stance, usefulness, regional dependence, and relevance. We then fine-tuned a pretrained large language model on this dataset to automatically assign the attribute labels to a large quantity of unlabeled data. We introduced multitask learning to train each attribute jointly with relevance, identifying clues by focusing on the sentences relevant to the political issues. Our prediction models are based on T5, a large language model suitable for multitask learning. We compared the results from our system with those obtained using BERT or RoBERTa. Our experimental results showed that multitask learning improved the macro-F1 scores for stance by 1.8% for citizen tweets and 1.7% for city council minutes. Using the fine-tuned model to analyze real opinion gaps, we found that although the vaccination regime was positively evaluated by city councilors in Fukuoka city, it was not rated very highly by citizens.
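Although the paper's exact prompt design is not reproduced here, T5-style multitask learning is usually set up by casting each attribute as a text-to-text task with its own prefix. The sketch below uses hypothetical prefixes and label strings to show the idea.

```python
# Minimal sketch of casting two attributes into T5's text-to-text format so
# that stance and relevance can be learned jointly. The task prefixes and
# label strings are illustrative assumptions, not the paper's exact setup.
def make_examples(text: str, stance: str, relevance: str):
    return [
        {"input": f"stance: {text}", "target": stance},        # e.g. "in favor" / "against"
        {"input": f"relevance: {text}", "target": relevance},  # e.g. "relevant" / "irrelevant"
    ]

examples = make_examples(
    "The city should speed up the vaccination programme.",
    stance="in favor",
    relevance="relevant",
)
# Both examples are mixed into one training stream, so the shared T5 encoder
# is updated by the stance task and the relevance task at the same time.
```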
... The output of this CNN is then sent to a multilayer perceptron (MLP) with a 4-class output (agree, disagree, discuss, and unrelated) and trained end-to-end. Zhang et al. (2019) addressed the problem by proposing a hierarchical representation of the classes, which combines agree, disagree, and discuss into a new related class. A two-layer neural network is used to learn from this hierarchical representation of the classes, and their proposal obtains a weighted accuracy of 88.15%. ...
Article
Identification of stance has recently gained a lot of attention with the extreme growth of fake news and filter bubbles. Over the last decade, many feature-based and deep-learning approaches have been proposed to solve stance detection. However, almost none of the existing works focus on providing a meaningful explanation for their predictions. In this work, we study stance detection with an emphasis on generating explanations for the predicted stance by capturing the pivotal argumentative structure embedded in a document. We propose to build a Stance Tree that utilizes rhetorical parsing to construct an evidence tree and Dempster-Shafer theory to aggregate the evidence. Human studies show that our unsupervised technique for generating stance explanations outperforms the SOTA extractive summarization method in terms of informativeness, non-redundancy, coverage, and overall quality. Furthermore, experiments show that our explanation-based stance prediction matches or exceeds the performance of the SOTA model on various benchmark datasets.
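The evidence aggregation step relies on Dempster-Shafer theory. The sketch below implements the standard Dempster's rule of combination over stance hypotheses, with toy mass assignments, rather than the authors' full Stance Tree pipeline.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions over frozenset-valued focal elements
    using Dempster's rule: conflicting mass is discarded and the rest
    is renormalised."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Two pieces of evidence over the hypotheses {agree, disagree} (toy numbers).
AGREE, DISAGREE = frozenset({"agree"}), frozenset({"disagree"})
EITHER = AGREE | DISAGREE
m1 = {AGREE: 0.6, EITHER: 0.4}
m2 = {AGREE: 0.5, DISAGREE: 0.3, EITHER: 0.2}
print(dempster_combine(m1, m2))
```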
... AI-based fact-checking support in the multiple steps above is fundamentally based on document-to-claim mapping (a document being a news article/blog, a social media post, etc.) and, more specifically, on two IR/NLP tasks: presence detection and stance classification [1,3,12,13,26,39,43]. The detection of previously fact-checked claims (step 2) became a target of research interest only recently [28] and is one of the least studied research problems related to fact-checking [24]. ...
Preprint
False information has a significant negative influence on individuals as well as on the whole society. Especially in the current COVID-19 era, we witness an unprecedented growth of medical misinformation. To help tackle this problem with machine learning approaches, we are publishing a feature-rich dataset of approx. 317k medical news articles/blogs and 3.5k fact-checked claims. It also contains 573 manually and more than 51k automatically labelled mappings between claims and articles. Mappings consist of claim presence, i.e., whether a claim is contained in a given article, and article stance towards the claim. We provide several baselines for these two tasks and evaluate them on the manually labelled part of the dataset. The dataset enables a number of additional tasks related to medical misinformation, such as misinformation characterisation studies or studies of misinformation diffusion between sources.
Article
Users rely heavily on social media to consume and share news, facilitating the mass dissemination of genuine and fake stories. The proliferation of misinformation on various social media platforms has serious consequences for society. The inability to differentiate between the several forms of false news on Twitter is a major obstacle to the effective detection of fake news. Researchers have made progress toward a solution by placing a greater emphasis on methods for identifying bogus news. The FNC-1 dataset, which includes four categories for identifying false news, is used in this study. The state-of-the-art methods for spotting fake news are evaluated and compared using big data technology (Spark) and machine learning. The methodology of this study employed a distributed Spark cluster to create a stacked ensemble model. Following feature extraction using N-grams, hashing TF-IDF, and a count vectorizer, we applied the proposed stacked ensemble classification model. The results show that the suggested model has a superior classification performance of 92.45% F1 score compared to the 83.10% F1 score of the baseline approach. The proposed model achieved an additional 9.35% F1 score compared to the state-of-the-art techniques.
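A stacked ensemble over TF-IDF n-gram features can be approximated on a single machine with scikit-learn, as in the sketch below; the base learners, the meta-learner, and the toy headline/body strings are assumptions, and the paper's actual system runs on a Spark cluster.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Toy FNC-1-style examples: "headline [SEP] body" text with stance labels.
texts = ["tax cut announced [SEP] the government confirmed the tax cut today",
         "tax cut announced [SEP] the weather will be sunny this weekend",
         "new stadium planned [SEP] officials confirmed plans for a new stadium",
         "new stadium planned [SEP] the local team lost its match yesterday"]
labels = ["agree", "unrelated", "agree", "unrelated"]

# TF-IDF uni/bi-gram features feed a stacked ensemble: two base learners
# whose predictions are combined by a random forest meta-learner.
stack = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    StackingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("svm", LinearSVC())],
        final_estimator=RandomForestClassifier(n_estimators=100),
        cv=2,   # tiny toy dataset; a real run would use more folds
    ),
)
stack.fit(texts, labels)
print(stack.predict(texts))
```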
Conference Paper
Social media platforms are awash with misinformation, and its potential negative influence on the public is a growing concern. This concern has drawn the attention of the research community to developing mechanisms to detect misinformation. The task of misinformation detection consists of classifying whether a claim is true or false. Most research concentrates on developing machine learning models, such as neural networks, that output a single value to predict the veracity of a claim. One of the major problems faced by these models is their inability to represent the uncertainty of the prediction, which is due to the incomplete or finite information available about the claim being examined. We address this problem by proposing a Bayesian deep learning model. The Bayesian model outputs a distribution used to represent both the prediction and its uncertainty. In addition to the claim content, we also encode auxiliary information given by people's replies to the claim. First, the model encodes the claim to be verified and generates a prior belief distribution from which we sample a latent variable. Second, the model encodes all the replies to the claim in temporal order through a Long Short-Term Memory network in order to summarize their content. This summary is then used to update the prior belief, generating the posterior belief. Moreover, to train this model, we develop a Stochastic Gradient Variational Bayes algorithm to approximate the analytically intractable posterior distribution. Experiments conducted on two public datasets demonstrate that our model outperforms state-of-the-art detection models.
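The variational training objective sketched below combines a reparameterised sample from the posterior belief with the KL divergence from the claim-conditioned prior, in the spirit of Stochastic Gradient Variational Bayes; the network shapes and the two-way veracity head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ClaimReplyVAE(nn.Module):
    """Sketch of an SGVB-style training step: a Gaussian prior belief is
    produced from the claim encoding, the posterior is produced after the
    reply summary is added, a latent variable is sampled with the
    reparameterisation trick, and the loss is prediction error plus the
    KL divergence between posterior and prior. Sizes are assumptions."""

    def __init__(self, enc_dim=128, z_dim=32):
        super().__init__()
        self.prior_net = nn.Linear(enc_dim, 2 * z_dim)      # claim -> prior params
        self.post_net = nn.Linear(2 * enc_dim, 2 * z_dim)   # claim+replies -> posterior params
        self.classifier = nn.Linear(z_dim, 2)               # veracity head

    def forward(self, claim_enc, reply_summary, label):
        p_mu, p_logvar = self.prior_net(claim_enc).chunk(2, dim=-1)
        q_mu, q_logvar = self.post_net(
            torch.cat([claim_enc, reply_summary], dim=-1)).chunk(2, dim=-1)

        # Reparameterisation trick: sample z from the posterior belief.
        z = q_mu + torch.exp(0.5 * q_logvar) * torch.randn_like(q_mu)
        nll = nn.functional.cross_entropy(self.classifier(z), label)

        # KL( q(z | claim, replies) || p(z | claim) ) for diagonal Gaussians.
        kl = 0.5 * (p_logvar - q_logvar
                    + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp()
                    - 1).sum(-1).mean()
        return nll + kl
```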
Article
False information can be created and spread easily through the web and social media platforms, resulting in widespread real-world impact. Characterizing how false information proliferates on social platforms and why it succeeds in deceiving readers is critical to developing efficient detection algorithms and tools for early detection. A recent surge of research in this area has aimed to address the key issues using methods based on feature engineering, graph mining, and information modeling. The majority of this research has primarily focused on two broad categories of false information: opinion-based (e.g., fake reviews) and fact-based (e.g., false news and hoaxes). In this work, we therefore present a comprehensive survey spanning diverse aspects of false information, namely (i) the actors involved in spreading false information, (ii) the rationale behind successfully deceiving readers, (iii) quantifying the impact of false information, (iv) measuring its characteristics across different dimensions, and finally, (v) the algorithms developed to detect false information. In doing so, we create a unified framework to describe these recent methods and highlight a number of important directions for future research.
Conference Paper
We present an effective end-to-end memory network (MN) model that jointly (i) predicts whether a given document can be considered as relevant evidence for a given claim, and (ii) extracts snippets of evidence that can be used to reason about the factuality of the target claim. Our model combines the advantages of convolutional and recurrent neural networks as part of a MN. We further introduce a similarity-based matrix at the inference level of the MN in order to extract snippets of evidence for input claims more accurately. Our experiments on the Fake News Challenge dataset demonstrate the effectiveness of our approach.
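A similarity-based matrix of the kind mentioned above can be built from token embeddings of the claim and the document; the sketch below uses cosine similarity and random embeddings purely for illustration, not the authors' memory network.

```python
import torch
import torch.nn.functional as F

def similarity_matrix(claim_tokens, doc_tokens):
    """Cosine-similarity matrix between claim and document token embeddings;
    positions with high values point to candidate evidence snippets.
    Shapes are illustrative: (claim_len, dim) and (doc_len, dim)."""
    c = F.normalize(claim_tokens, dim=-1)
    d = F.normalize(doc_tokens, dim=-1)
    return c @ d.T                                   # (claim_len, doc_len)

sim = similarity_matrix(torch.randn(6, 128), torch.randn(40, 128))
# Document positions that best match any claim token are likely evidence.
best_doc_positions = sim.max(dim=0).values.topk(k=5).indices
```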
Conference Paper
In recent years, an unhealthy phenomenon characterized by the massive spread of fake news or unverified information (i.e., rumors) has become an increasingly daunting issue in human society. Rumors commonly originate from social media outlets, primarily microblogging platforms, and go viral afterwards through wild, willful propagation by a large number of participants. It is observed that rumorous posts often trigger diverse, mostly controversial stances among participating users. Thus, determining the stances of the posts in question can be pertinent to the successful detection of rumors, and vice versa. Existing studies, however, mainly regard rumor detection and stance classification as separate tasks. In this paper, we argue that they should be treated as a joint, collaborative effort, considering the strong connections between the veracity of a claim and the stances expressed in responsive posts. Inspired by the multi-task learning scheme, we propose a joint framework that unifies the two highly related tasks, i.e., rumor detection and stance classification. Based on deep neural networks, we train both tasks jointly using weight sharing to extract the common and task-invariant features, while each task can still learn its task-specific features. Extensive experiments on real-world datasets gathered from Twitter and news portals demonstrate that our proposed framework consistently improves both rumor detection and stance classification with the help of the strong inter-task connections, achieving much better performance than state-of-the-art methods.
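Hard parameter sharing, the most common form of the weight sharing described above, can be sketched as one shared encoder feeding two task-specific heads; the GRU encoder and all dimensions below are assumptions rather than the paper's architecture.

```python
import torch.nn as nn

class JointRumorStanceModel(nn.Module):
    """Sketch of hard parameter sharing for joint rumor detection and
    stance classification: one shared encoder plus two task heads.
    The GRU encoder and the dimensions are illustrative assumptions."""

    def __init__(self, vocab_size=30000, emb_dim=128, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)  # shared layers
        self.rumor_head = nn.Linear(hidden, 2)    # task-specific: rumor veracity
        self.stance_head = nn.Linear(hidden, 4)   # task-specific: 4-way stance

    def forward(self, token_ids):
        _, h = self.encoder(self.embed(token_ids))
        h = h.squeeze(0)
        return self.rumor_head(h), self.stance_head(h)

# Training would simply sum the two cross-entropy losses, e.g.
# loss = ce(rumor_logits, rumor_y) + ce(stance_logits, stance_y)
```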
Conference Paper
With the support of major search platforms such as Google and Bing, fact-checking articles, which can be identified by their adoption of the schema.org ClaimReview structured markup, have gained widespread recognition for their role in the fight against digital misinformation. A claim-relevant document is an online document that addresses, and potentially expresses a stance towards, some claim. The claim-relevance discovery problem, then, is to find claim-relevant documents. Depending on the verdict from the fact check, claim-relevance discovery can help identify online misinformation. In this paper, we provide an initial approach to the claim-relevance discovery problem by leveraging various information retrieval and machine learning techniques. The system consists of three phases. First, we retrieve candidate documents based on various features in the fact-checking article. Second, we apply a relevance classifier to filter away documents that do not address the claim. Third, we apply a language feature based classifier to distinguish documents with different stances towards the claim. We experimentally demonstrate that our solution achieves solid results on a large-scale dataset and beats state-of-the-art baselines. Finally, we highlight a rich set of case studies to demonstrate the myriad of remaining challenges and that this problem is far from being solved.
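The three-phase system can be summarised as retrieve, filter, and label; the sketch below strings these phases together with TF-IDF retrieval and two hypothetical pre-trained classifiers, and is not the authors' implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def discover_claim_relevant(fact_check_text, corpus, relevance_clf, stance_clf, k=50):
    """Sketch of a three-phase claim-relevance discovery pipeline:
    (1) retrieve candidate documents by TF-IDF similarity to the fact-check,
    (2) keep only documents the relevance classifier accepts,
    (3) label the survivors with a stance classifier.
    relevance_clf and stance_clf are hypothetical pre-trained text classifiers."""
    vec = TfidfVectorizer().fit(corpus + [fact_check_text])
    sims = cosine_similarity(vec.transform([fact_check_text]), vec.transform(corpus))[0]
    candidates = [corpus[i] for i in sims.argsort()[::-1][:k]]           # phase 1
    relevant = [d for d in candidates if relevance_clf.predict([d])[0]]  # phase 2
    return [(d, stance_clf.predict([d])[0]) for d in relevant]           # phase 3
```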
Conference Paper
A valuable step towards news veracity assessment is to understand the stance of different information sources, a process known as stance detection. Specifically, stance detection aims to detect four kinds of stances ("agree", "disagree", "discuss", and "unrelated") of a news article towards a claim. Existing methods tried to tackle the stance detection problem with classification-based algorithms. However, classification-based algorithms make the strong assumption that there is a clear distinction between any two stances, which may not hold in the context of stance detection. Accordingly, we frame the detection problem as a ranking problem and propose a ranking-based method to improve detection performance. Compared with classification-based methods, the ranking-based method compares the true stance with the false stances and maximizes the difference between them. Experimental results demonstrate the effectiveness of our proposed method.
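A ranking formulation of this kind is typically trained with a margin-based loss that pushes the score of the true stance above the scores of the false stances. The sketch below uses PyTorch's MarginRankingLoss; the margin value and the scoring setup are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the ranking view of stance detection: the score of the true
# stance is pushed above the scores of all false stances by a margin.
margin_loss = nn.MarginRankingLoss(margin=1.0)

def ranking_loss(stance_scores, true_idx):
    # stance_scores: (4,) one score per stance; true_idx: index of the gold stance
    true_score = stance_scores[true_idx].expand(3)
    false_scores = torch.cat([stance_scores[:true_idx], stance_scores[true_idx + 1:]])
    target = torch.ones(3)           # "the first argument should be ranked higher"
    return margin_loss(true_score, false_scores, target)

loss = ranking_loss(torch.randn(4, requires_grad=True), true_idx=2)
loss.backward()
```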
Article
Lies spread faster than the truth. There is worldwide concern over false news and the possibility that it can influence political, economic, and social well-being. To understand how false news spreads, Vosoughi et al. used a data set of rumor cascades on Twitter from 2006 to 2017. About 126,000 rumors were spread by ∼3 million people. False news reached more people than the truth; the top 1% of false news cascades diffused to between 1,000 and 100,000 people, whereas the truth rarely diffused to more than 1,000 people. Falsehood also diffused faster than the truth. The degree of novelty and the emotional reactions of recipients may be responsible for the differences observed. Science, this issue, p. 1146.