The Quest to Automate Fact-Checking

Authors: Naeemul Hassan 1, Bill Adair 2, James T. Hamilton 3, Chengkai Li 1, Mark Tremayne 1, Jun Yang 2, Cong Yu 4
1 University of Texas at Arlington, 2 Duke University, 3 Stanford University, 4 Google Research
1. INTRODUCTION
The growing movement of political fact-checking plays
an important role in increasing democratic accountability
and improving political discourse [7, 3]. Politicians and
media figures make claims about “facts” all the time, but
the new army of fact-checkers can often expose claims that
are false, exaggerated or half-truths. The number of active
fact-checking websites has grown from 44 a year ago to 64
in 2015, according to the Duke Reporters' Lab. 1
The challenge is that human fact-checkers frequently
have difficulty keeping up with the rapid spread of misinfor-
mation. Technology, social media and new forms of journal-
ism have made it easier than ever to disseminate falsehoods
and half-truths faster than the fact-checkers can expose
them. There are several reasons why falsehoods frequently
outpace the truth. One reason is that fact-checking is an
intellectually demanding and laborious process. It requires
more research and a more advanced style of writing than
ordinary journalism. The difficulty of fact-checking, exac-
erbated by a lack of resources for investigative journalism,
leaves many harmful claims unchecked, particularly at the
local level. Another reason is that fact-checking is time-
consuming. It takes about one day to research and write a
typical article, which means a lot of time can elapse after the
political message is made. Even if a fact-check has already been
published, the voter must undertake research to look it up.
This “gap” in time and availability limits the effectiveness of
fact-checking.
Computation may hold the key to far more effective and ef-
ficient fact-checking, as Cohen et al. [1, 2] and Diakopoulos 2
have pointed out. Our eternal quest, the “Holy Grail”, is a
completely automatic fact-checking platform that can detect
a claim as it appears in real time, and instantly provide the
voter with a rating about its accuracy. It makes its calls by
consulting databases of already checked claims, and by an-
alyzing relevant data from reputable sources. In this paper,
we advocate the pursuit of the “Holy Grail” and make a call
to arms to the computing and journalism communities. We
discuss the technical challenges we will face in automating
fact-checking and potential solutions.
The “Holy Grail” may remain far beyond our reach for
many, many years to come. But in pursuing this ambi-
tious goal, we can advance fact-checking and improve political
discourse. One such advancement is our own progress on
ClaimBuster, a tool that helps journalists find political claims
to fact-check. We will use it during the presidential debates
of the 2016 U.S. election. We envision that, during a debate,
for every sentence spoken by the candidates and extracted
into transcripts, ClaimBuster will immediately determine whether
the sentence contains a factual claim and whether its truthfulness
is important to the public.
1 http://reporterslab.org/snapshot-of-fact-checking-around-the-world-july-2015/
2 http://towknight.org/research/thinking/scaling-fact-checking/
2. LIMITATIONS OF CURRENT
PRACTICES OF FACT-CHECKING
Fact-checking is difficult and time-consuming for journal-
ists, which creates a significant gap between the moment
a politician makes a statement and when the fact-check is
ultimately published.
The growth of fact-checking has been hampered by the
nature of the work. It is time-consuming to find claims
to check. Journalists have to spend hours going through
transcripts of speeches, debates and interviews to identify
claims they will research.
Also, fact-checking requires advanced research techniques.
While ordinary journalism can rely on simple “on-the-one-
hand, on-the-other-hand” quotations, a fact-check requires
more thorough research so the journalist can determine the
accuracy of a claim.
Fact-checking also requires advanced writing skills that go
beyond “just the facts” to persuade the reader whether the
statement was true, false or somewhere in between. Fact-
checking is a new form that has been called “reported con-
clusion” journalism.
Those factors mean that fact-checking often takes longer
to produce than traditional journalism, which puts a strain
on staffing and reduces the number of claims that can be
checked. It also creates a time gap between the moment the
statement was made and when the fact-check is ultimately
published. That can range from as little as 15 to 30 minutes for
the simplest fact-check to a full day for a more typical
one. A complicated fact-check can take two or more days.
(By contrast, Leskovec, Backstrom and Kleinberg [6] found
a meme typically moves from the news media to blogs in
just 2.5 hours.)
For voters, that means a long gap between the politician’s
claim and a determination whether it was true. The voters
don’t get the information when they really need it. They
must wait and look up on a fact-checking site to find out if
the claim was accurate. This is one of several factors that
emboldens politicians to keep repeating claims even when
they are false.
Another limitation is the outdated nature of fact-checkers'
publishing platforms. Many fact-checking sites still use older
content management systems built for newspapers and blogs
that are not designed for modern, structured
journalism. This limits how well they can be used in com-
putational projects.
3. THE "HOLY GRAIL"
We should not be surprised if we can get very close but
never reach the “Holy Grail”. A fully automated fact-checker
calls for fundamental breakthroughs on multiple fronts and,
ultimately, it represents a form of Artificial Intelligence (AI).
As remote and intangible as AI may have appeared initially,
though, in merely 60 years scientists have made leaps and
bounds that profoundly changed our world forever. The
quest for the “Holy Grail” of fact-checking will likewise drive
us to constantly improve this important journalistic activity.
The Turing test was proposed by Alan Turing as a way of
gauging a machine’s ability to exhibit artificial intelligence.
Although heavily criticized, the concept has served well in
helping advance the field. Similarly, we need explicit and
tangible measures for assessing the ultimate success of a fact-
checking machine. The “Holy Grail” is a computer-based
fact-checking system bearing the following characteristics:
Fully automated: It checks facts without human interven-
tion. It takes as input the video/audio signals and texts of a
political discourse and returns factual claims and a truthfulness
rating for each claim (e.g., the Truth-O-Meter by PolitiFact).
Instant: It immediately reaches conclusions and returns
results after claims are made, without noticeable delays.
Accurate: It is equally or more accurate than any human
fact-checker.
Accountable: It self-documents its data sources and anal-
ysis, and makes the process of each fact-check transparent.
This process can then be independently verified, critiqued,
improved, and even extended to other situations.
Such a system mandates many complex steps: extracting
natural language sentences from textual/audio sources; sepa-
rating factual claims from opinions, beliefs, hyperboles, ques-
tions, and so on; detecting topics of factual claims and dis-
cerning which are the “check-worthy” claims; assessing the
veracity of such claims, which itself requires collecting infor-
mation and data, analyzing claims, matching claims with ev-
idence, and presenting conclusions and explanations. Each
step is full of challenges. We now discuss in more detail
these challenges and potential solutions.
3.1 Computational Challenges
On the computational side, there are two main funda-
mental challenges. One is understanding what a speaker says.
Computer scientists have made leaps and bounds in speech
recognition and Natural Language Processing (NLP), but
these technologies are far from perfect. The other chal-
lenge lies in our ability to collect sufficient evidence
for checking facts. We are in the big-data era. A huge
amount of useful data is accessible to us and more is being
made available every second. Semantic web, knowledge
base, database and data mining technologies help us link
together such data, reason about the data, efficiently process
the data and discover patterns. But, what is being recorded
is still tiny compared to the vast amount of information the
universe holds. Below we list some of the more important
computational hurdles to solve.
Finding claims to check
—Going from raw audio/video signals to natural language.
Extracting contextual information such as speaker, time,
and occasion.
—Defining what makes a claim "checkable" and "check-worthy". Is the
claim factual (falsifiable) or is it opinion? Should or can we
check opinions? How “interesting” is the claim? How do
we balance “what the public should know” and “what the
public wants to consume”? Can these judgements be made
computationally?
—Extracting claims from natural language. What to do
when a claim spans multiple sentences? What are the rel-
evant features useful for determining whether a claim is
“checkable” or “check-worthy”?
Getting data to check claims
—We should consider at least two types of data sources:
1) claims already checked by various organizations; 2) un-
structured, semi-structured and structured data sources that
provide raw data useful for checking claims, e.g., voting
records, government budget, historical stock data, crime
records, weather records, sports statistics, and Wikipedia.
—Evaluating quality and completeness of sources.
—Matching claims with data sources. This requires struc-
ture/metadata in the database of already checked claims, as
well as data sources.
—Synthesizing and corroborating multiple sources.
—Cleansing data. Given a goal (e.g., to verify a particular
claim), help journalists decide which data sources, or even
which data items, are worth investigating as a high priority.
Checking claims
—How to remove (sometimes intentional) vagueness, how to
spot cherry-picking of data (beyond correctness), how to
evaluate claims, and how to come up with convincing counterar-
guments using data [12, 11, 9].
—The methods in [12, 11, 9] rely on being able to cast a
claim as a mathematical function that can be evaluated over
structured data. Who translates a claim into this function?
Can the translation process be automated?
—Fact verification may need human participation (e.g., so-
cial media as social sensors) or even crowdsourcing (e.g.,
checking whether a bridge really just collapsed). Can a com-
puter system help coordinate and plan human participation
on an unprecedented level? How to remove bias and do
quality control of human inputs? Should such a system be
even considered fully automated?
Monitoring and anticipating claims
—Given evolving data, we can monitor when a claim turns
false/true [5, 9]. Can we anticipate what claims may be
made soon? That way, we can plan ahead and be proactive.
—Challenges in scalable monitoring and parallel detection of
a massive number of claim types/templates.
3.2 Journalistic Challenges
A major barrier to automation is the lack of structured
journalism in fact-checking. Although there’s been tremen-
dous growth in the past few years – 20 new sites around the
world just in the last year, according to the Duke Reporters’
Lab – the vast majority of the world’s fact-checkers are still
relying on old-style blog platforms to publish their articles.
That limits the articles to a traditional headline and text
rather than a newer structured journalism approach that
would include fields such as statement, speaker and location
that would allow for real-time matching. There are no
standards for data fields or formatting. The articles are just
published as plain text.
There also is no single repository where fact-checks from
various news organizations are catalogued. They are kept
in the individual archives of many different publications,
another factor that makes real-time matching difficult.
Another journalistic barrier is the inconsistency of trans-
parency. Some fact-checkers distill their work to very short
summaries, while others publish lengthy articles with many
quotations and citations. 3 The lack of structure, the absence
of a repository and the inconsistency in publishing result in
a lack of uniformity for search engines, which do not dis-
tinguish fact-checks from other types of editorial content in
their search results.
Another challenge is the length of time it takes to publish
more difficult fact-checks and to check multiple claims from
the same event. PolitiFact, for example, boasted that it
published 20 separate checks from the Aug. 6 Republican
presidential debate. But it took six days for it to complete
all of those checks. 4
4. CLAIMBUSTER
ClaimBuster is a tool that helps journalists find claims
to fact-check. Figure 1 is a screenshot of the current ver-
sion of ClaimBuster. For every sentence spoken by the par-
ticipants of a presidential debate, ClaimBuster determines
whether the sentence has a factual claim and whether its
truthfulness is important to the public. As shown in Fig-
ure 1, to the left of each sentence there is a score ranging
from 0 (least likely an important factual claim) to 1 (most
likely). The calculation is based on machine learning models
built from thousands of sentences from past debates labeled
by humans. The ranking scores help journalists prioritize
their efforts in assessing the veracity of claims. ClaimBuster
will free journalists from the time-consuming task of finding
check-worthy claims, leaving them with more time for report-
ing and writing. Ultimately, ClaimBuster can be expanded
to other discourses (such as interviews and speeches) and
also adapted for use with social media. Note that the task of
determining check-worthiness of sentences is different from
subjectivity analysis [10]. A sentence can be objective in
nature but not check-worthy. Similarly, a sentence can be
subjective in nature and check-worthy.
4.1 Classification and Ranking
We model ClaimBuster as a classifier and ranker and we
take a supervised learning approach to construct it. We cate-
gorize sentences in presidential debates into three categories:
Non-Factual Sentence (NFS): Subjective sentences (opin-
ions, beliefs, declarations) and many questions fall under this
category. These sentences do not contain any factual claim.
Below are some examples.
But I think it’s time to talk about the future.
You remember the last time you said that?
Unimportant Factual Sentence (UFS): These are fac-
tual claims but not check-worthy. The general public will
not be interested in knowing whether these sentences are
true or false. Fact-checkers do not find these sentences
important for checking. Some examples are as follows.
Next Tuesday is Election Day.
Two days ago we ate lunch at a restaurant.
3http://reporterslab.org/study-explores-new-questions-about-
quality-of-global-fact-checking/
4http://www.politifact.com/truth-o-
meter/article/2015/aug/12/20-fact-checks-republican-debate/
Check-worthy Factual Sentence (CFS): They contain
factual claims and the general public will be interested in
knowing whether the claims are true. Journalists look for
these types of claims for fact-checking. Some examples are:
He voted against the first Gulf War.
Over a million and a quarter Americans are HIV-positive.
Figure 1: ClaimBuster
Given a sentence, the objective of ClaimBuster is to derive
a score that reflects the degree to which the sentence belongs
to CFS. Many widely-used classification methods support
ranking naturally. For instance, consider a Support Vector
Machine (SVM). We treat CFSs as positive examples and
both NFSs and UFSs as negative examples. SVM finds a
decision boundary between the two types of training exam-
ples. Following Platt’s scaling technique [8], for a given sen-
tence x to be classified, we calculate the posterior probability
P(class = CFS | x) using the SVM's decision function. The
probability scores of all sentences are used to rank them.
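As a concrete illustration of this classify-then-rank step, the following Python sketch uses scikit-learn, whose SVC with probability=True fits a Platt-style sigmoid on top of the SVM decision values; the feature matrix and labels are random placeholders, not the ClaimBuster data.

```python
# Minimal sketch: rank sentences by the SVM's posterior probability of the CFS class.
# X and y are random placeholders standing in for real sentence features and labels.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(200, 6)                                   # placeholder feature vectors
y = rng.choice(["NFS", "UFS", "CFS"], size=200)        # placeholder labels

# CFS sentences are positive examples; NFS and UFS are negative examples.
y_binary = (y == "CFS").astype(int)

# probability=True calibrates the decision values with a sigmoid (Platt scaling).
clf = SVC(kernel="linear", probability=True, random_state=0)
clf.fit(X, y_binary)

cfs_column = list(clf.classes_).index(1)
scores = clf.predict_proba(X)[:, cfs_column]           # P(class = CFS | x) per sentence
ranking = np.argsort(-scores)                          # indices, most check-worthy first
print(ranking[:10], scores[ranking[:10]])
```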
4.2 Data Collection
We constructed a labeled dataset of sentences spoken by
presidential candidates in all past general election presiden-
tial debates. Each sentence is given one of three possible
labels: NFS, UFS, or CFS.
Figure 2: Data Collection Interface
There have been a total of 30 presidential debates in the
past. We parsed the debate transcripts and extracted 23,075
sentences spoken by the candidates. We kept only the
20,788 sentences that have at least five words.
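A rough sketch of this preprocessing step is shown below; the transcript file name is hypothetical, and NLTK's tokenizers stand in for whatever parser was actually used.

```python
# Sketch: split a debate transcript into sentences and keep those with at least 5 words.
# "debate_transcript.txt" is a hypothetical plain-text transcript of candidate speech.
import nltk

nltk.download("punkt", quiet=True)   # sentence/word tokenizer models

with open("debate_transcript.txt", encoding="utf-8") as f:
    text = f.read()

sentences = nltk.sent_tokenize(text)
kept = [s for s in sentences if len(nltk.word_tokenize(s)) >= 5]
print(f"{len(sentences)} sentences extracted, {len(kept)} kept")
```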
To label the sentences, we developed a data collection
website. Journalists, professors and university students were
invited to participate. A participant was given one sentence
at a time and was asked to label it with one of the three
possible options as shown in Figure 2, corresponding to the
three labels (NFS, UFS, CFS).
In 3 months, we accumulated 226 participants. To detect
spammers and low-quality participants, we used 600 screening
sentences, picked from all debate episodes. Three experts
agreed upon their labels.

Table 1: Performance
     Precision  Recall  F-measure
NFS  0.90       0.96    0.93
UFS  0.65       0.26    0.37
CFS  0.79       0.74    0.77

Table 2: Ranking Accuracy: Past Presidential Debates
k    P@k    AvgP   nDCG
10   1.000  0.024  1.000
25   1.000  0.059  1.000
50   1.000  0.118  1.000
100  0.960  0.223  0.970
200  0.940  0.429  0.951
300  0.853  0.575  0.881
400  0.760  0.667  0.802
500  0.690  0.737  0.840

Table 3: Ranking Accuracy: 2015 Republican Debate
k    P@k    AvgP   nDCG
10   0.400  0.046  0.441
20   0.450  0.084  0.456
30   0.367  0.098  0.401
40   0.325  0.111  0.368
50   0.300  0.122  0.346
60   0.300  0.139  0.356
70   0.300  0.154  0.390
80   0.275  0.159  0.401
90   0.267  0.169  0.422
100  0.270  0.184  0.452

On average, one out of
every ten sentences given to a participant (without letting
the participant know) was randomly chosen to be a screen-
ing sentence selected from the pool. The participants were
ranked by the degree of agreement on screening sentences
between them and the three experts. The top 30% of partici-
pants were considered top-quality participants. There was
a reward system to encourage high-quality participation. For
training and evaluating our classification models, we only
used a sentence if its label was agreed upon by two top-
quality participants. Thereby we got 8015 sentences (5860
NFSs, 482 UFSs, 1673 CFSs).
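The sketch below illustrates the final label-agreement rule under simplifying assumptions: it takes a hypothetical mapping from each sentence to the labels assigned by top-quality participants and keeps a sentence only when at least two of them agree.

```python
# Sketch: keep a sentence only if two top-quality participants agree on its label.
# labels_by_sentence is a hypothetical stand-in for the collected annotations.
from collections import Counter

labels_by_sentence = {
    "Next Tuesday is Election Day.": ["UFS", "UFS", "NFS"],
    "He voted against the first Gulf War.": ["CFS", "CFS"],
    "But I think it's time to talk about the future.": ["NFS"],
}

final_labels = {}
for sentence, labels in labels_by_sentence.items():
    label, count = Counter(labels).most_common(1)[0]
    if count >= 2:                     # agreement between at least two participants
        final_labels[sentence] = label

print(final_labels)                    # the last sentence is dropped: only one label
```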
4.3 Feature Extraction
We extracted multiple categories of features from the sen-
tences. We use the following sentence to explain the features.
When President Bush came into office, we had a budget
surplus and the national debt was a little over five trillion.
Sentiment: We used the natural language processing tool Alche-
myAPI 5 to calculate a sentiment score for each sentence.
The score ranges from -1 (most negative sentiment) to 1
(most positive sentiment). The above sentence has a senti-
ment score of -0.846376.
Length: This is the word count of a sentence. The Natural
Language Toolkit (NLTK) was used for tokenizing a sentence
into words. The example sentence has length 21.
Word: We used words in sentences to build tf-idf features.
After discarding rare words that appear in fewer than three
sentences, we got 6,130 words. We did not apply stemming
or stopword removal.
Part-of-Speech (POS) Tag: We applied the NLTK POS tag-
ger to all sentences. There are 43 POS tags in the corpus.
We constructed a feature for each tag. For a sentence, the
count of words belonging to a POS tag is the value of the
corresponding feature. In the example sentence, there are
3 words (came, had, was) with POS tag VBD (Verb, Past
Tense) and 2 words (five, trillion) with POS tag CD (Cardi-
nal Number).
Entity Type: We used AlchemyAPI to extract entities
from sentences. There are 2727 entities in the labeled sen-
tences. They belong to 26 types. The above sentence has an
entity “Bush” of type “Person”. We constructed a feature for
each entity type. For a sentence, its number of entities of a
particular type is the value of the corresponding feature.
5 http://www.alchemyapi.com/
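A condensed sketch of how such a feature vector can be assembled is given below; it covers the length, tf-idf word, and POS-tag-count features with NLTK and scikit-learn, while the sentiment and entity-type features (AlchemyAPI in our pipeline) are omitted since that client is not shown here.

```python
# Sketch: length, tf-idf word, and POS-tag-count features for a few example sentences.
# Sentiment and entity-type features (from AlchemyAPI) are omitted here.
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentences = [
    "When President Bush came into office, we had a budget surplus "
    "and the national debt was a little over five trillion.",
    "Next Tuesday is Election Day.",
    "But I think it's time to talk about the future.",
]

# Word features: tf-idf over the corpus. The real model discards words appearing
# in fewer than three sentences (min_df=3); min_df=1 keeps this toy example non-empty.
tfidf = TfidfVectorizer(min_df=1)
word_features = tfidf.fit_transform(sentences)

# Length and POS-tag-count features, one dict per sentence.
other_features = []
for s in sentences:
    tokens = nltk.word_tokenize(s)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    feats = {"length": len(tokens)}
    for tag in set(tags):
        feats["P_" + tag] = tags.count(tag)
    other_features.append(feats)

print(other_features[0]["length"], other_features[0].get("P_CD"))
```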
Figure 3: Feature Importance (the 30 most important features, in decreasing order of importance: P_CD, length, P_VBD, sentiment, P_IN, P_NNS, P_NN, W_to, ET_Quantity, P_NNP, P_VBN, P_VB, W_in, W_the, P_DT, P_PRP, P_,, W_that, W_and, W_of, W_we, P_VBP, P_JJ, P_$, P_RB, W_said, W_it, P_VBZ, P_TO, W_was)
Feature Selection: There are 6,201 features in total. To
avoid over-fitting and attain a simpler model, we performed
feature selection. We trained a random forest classifier and
used the Gini index to measure the importance of
features in constructing each decision tree. The overall im-
portance of a feature is its average importance over all the
trees. Figure 3 shows the importance of the 30 best features
in the forest. The black solid lines indicate the standard
deviations of importance values. Category types are prefixes
to feature names. We observed that, unsurprisingly, POS
tag CD (Cardinal Number) is the best feature: check-worthy
factual claims are more likely to contain numeric values (45%
of CFSs in our dataset) and non-factual sentences are less
likely to contain numeric values (6% of NFSs in our dataset).
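A minimal sketch of this Gini-based importance ranking is shown below; the feature matrix, labels, and feature names are random placeholders rather than the real 6,201-dimensional vectors.

```python
# Sketch: average Gini importance of features across the trees of a random forest,
# with standard deviations, computed on placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(500, 20)                                  # placeholder feature matrix
y = rng.choice(["NFS", "UFS", "CFS"], size=500)        # placeholder labels
feature_names = [f"f{i}" for i in range(X.shape[1])]

# criterion="gini" (the default) makes feature_importances_ the mean decrease in impurity.
forest = RandomForestClassifier(n_estimators=200, criterion="gini", random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
for i in np.argsort(importances)[::-1][:10]:
    print(f"{feature_names[i]}: {importances[i]:.4f} (+/- {std[i]:.4f})")
```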
4.4 Evaluation
We performed 3-class (NFS/UFS/CFS) classification us-
ing several supervised learning methods, including Multino-
mial Naive Bayes Classifier (NBC), Support Vector Machine
(SVM) and Random Forest Classifier (RFC). These methods
were evaluated by 4-fold cross-validation. SVM had the
best accuracy in general. We experimented with various
combinations of the extracted features. Table 1 shows the
performance of SVM using words and POS tag features. On
the CFS class, ClaimBuster achieved 79% precision (i.e., it
is accurate 79% of the time when it declares a CFS sentence)
and 74% recall (i.e., 74% of true CFSs are classified as CFSs).
The classification models had better accuracy on NFS and
CFS than UFS. This is not surprising, since UFS is between
the other two classes and thus the most ambiguous. More
detailed results and analyses, based on data collected as of an
earlier date, can be found in [4].
We used SVM to rank all 8,015 sentences (cf. Section 4.2)
by the method in Section 4.1. We measured the accuracy
of the top-k sentences by several commonly-used measures,
including Precision-at-k (P@k), AvgP (Average Precision),
and nDCG (Normalized Discounted Cumulative Gain). Table 2
shows these measure values for various k values. In general,
ClaimBuster achieved excellent performance in ranking. For
instance, for the top 100 sentences, its precision is 0.96. This
indicates ClaimBuster has strong agreement with high-
quality human coders on the check-worthiness of sentences.
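For reference, these ranking measures can be computed as in the sketch below, assuming binary relevance (1 if a sentence is a CFS, 0 otherwise) listed in ranked order; the AvgP normalization shown is one standard choice and may differ from the exact variant used for the tables.

```python
# Sketch: Precision-at-k, Average Precision at k, and nDCG at k over a ranked list
# of binary relevance values (1 = check-worthy factual sentence).
import math

def precision_at_k(rels, k):
    return sum(rels[:k]) / k

def average_precision_at_k(rels, k, total_relevant):
    hits, score = 0, 0.0
    for i, r in enumerate(rels[:k], start=1):
        if r:
            hits += 1
            score += hits / i          # precision at each relevant rank
    return score / total_relevant if total_relevant else 0.0

def ndcg_at_k(rels, k):
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg else 0.0

rels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]  # toy ranked relevance list
print(precision_at_k(rels, 5), average_precision_at_k(rels, 5, sum(rels)), ndcg_at_k(rels, 5))
```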
5. CASE STUDY: 2015 GOP DEBATE
The first Republican primary debate of 2015 (the top-ten
polling candidates) provided an opportunity for a near real-
time test of ClaimBuster. Closed captions of the debate
on Fox News were converted to a text file via TextGrabber,
a device for the hearing impaired, and run through Claim-
Buster. It parsed 1,393 sentences spoken by the candidates
and moderators. ClaimBuster’s scores on these sentences
ranged from a low of 0.045 to a high of 0.861 with a mean
of 0.263. Most sentences (87%) scored below 0.40.
We can compare ClaimBuster’s identification of check-
worthy factual claims against the judgement of professional
journalists and fact-checkers. Note that the accuracy of
ClaimBuster is affected by the quality of TextGrabber in
extracting closed captions. In general, the extracted closed
captions demonstrated satisfactory quality. We also per-
formed the same experiments using a human-refined version
of the debate transcript and observed slightly better accu-
racy from ClaimBuster. Due to space limitations, we omit
discussing that result.
Table 4 shows scores ClaimBuster gave to the claims fact-
checked by CNN. 6 The average for these six claims was 0.457 com-
pared to 0.262 for those sentences not selected by CNN, a
significant difference (t=3.83, p<.001). As the transcript
is from closed captions, some words and sentences are mis-
spelled or missing (e.g., Claim 6 was not found in the TextGrab-
ber transcript). Note that Claim 4 spans two sentences.
There were 9 sentences in our data that were selected for
checking by FactCheck.org. 7 Due to space limitations, we do
not show the text of the claims. These sentences averaged
0.558 compared to 0.261 for those not checked, a significant
difference (t=7.23, p<.00001). PolitiFact 8 checked 20
claims. The average ClaimBuster score for those sentences is
0.433 compared to 0.260 for those not checked by PolitiFact,
also significant (t=6.67, p<.00001).
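A sketch of this comparison using SciPy is shown below; the six checked scores are the Table 4 values, while the unchecked scores are synthetic placeholders, and equal_var=False (Welch's t-test) is one reasonable choice since the exact test variant is not specified.

```python
# Sketch: compare mean ClaimBuster scores of checked vs. unchecked sentences with a t-test.
# The unchecked scores are synthetic placeholders, not the actual remaining sentences.
import numpy as np
from scipy import stats

checked_scores = np.array([0.415, 0.511, 0.295, 0.534, 0.773, 0.215])  # CNN-checked (Table 4)
rng = np.random.RandomState(0)
unchecked_scores = rng.beta(2.0, 6.0, size=1387) * 0.8                 # placeholder distribution

t_stat, p_value = stats.ttest_ind(checked_scores, unchecked_scores, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```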
In addition to the claims fact-checked by FactCheck.org,
CNN and PolitiFact, we also had a larger "buffet" file from
PolitiFact. 9 This file contained 59 claims from the debate
which PolitiFact employees marked as possible items for fact-
checking. We used ClaimBuster to rank these claims with
respect to all the sentences (1,393) in the transcript. Table 3
shows the quality of this ranking in terms of P@k, AvgP and
nDCG, in the same way we used these measures to evaluate
ClaimBuster’s ranking accuracy on past debate sentences.
Overall, sentences receiving a high ClaimBuster score were
much more likely to have been checked by professionals than
those with low scores. Most of those checked by CNN,
FactCheck.org and PolitiFact (27 of 38 or 71%) appeared
in the top 250 of 1,393 sentences. A lower percentage of
sentences associated with items in the PolitiFact "buffet" file
(53 of 83 or 64%) appeared in ClaimBuster’s top 250. This
is not surprising since these items were merely placed on the
buffet by individual employees and not necessarily selected
by the group for checking.
There were still many sentences ranked high by Claim-
Buster and not chosen for fact-checking by these organiza-
tions. Reasons may include 1) the claims were previously
made and checked; 2) they are not considered factual or
important by the checker; 3) time and resource limitations.
6http://www.cnn.com/2015/08/06/politics/republican-debate-
fact-check/
7http://www.factcheck.org/2015/08/factchecking-the-gop-
debate-late-edition/
8http://www.politifact.com/truth-o-
meter/article/2015/aug/12/20-fact-checks-republican-debate/
9PolitiFact, List of possible claims to check, Republican
presidential debate, Aug. 6, 2015.
Table 4: ClaimBuster Performance on CNN-checked claims
Claim  Associated sentence(s) [From TextGrabber]                                      Score
1      Part of this iranian deal was lifting the international sanctions on general
       sulemani.                                                                      0.415
2      I would go on to add – >> you don't favor >> i have never said that.           0.511
3      A majority of the candidates on this stage supported amnesty.                  0.295
4      Timely the medicaid is growing at one of the lowest rates in the country.      0.534
4      We went from $8 billion in the hole to $5 million in the black.                0.773
5      And the mexican government is much smarter, much sharper, much more cunning
       and they send the bad ones over because they don't want to pay for them.       0.215
6      [Not found in the transcript]                                                  N/A
6. CONCLUSION
Live, fully-automated fact-checking may remain an unattain-
able ideal but serves as a useful guidepost for researchers
in computational journalism. Progress on the first
steps of fact-checking has already been achieved. Our ClaimBuster
tool, still imperfect, can quickly extract and order sentences
in ways that will aid in the identification of important factual
claims. But there is still much work to be done. Discrep-
ancies between the human checkers and the machine have
provided us with avenues for improvement of the algorithm
in time for the upcoming 2016 debates. An even bigger step will
be the adjudication of identified check-worthy claims. A
repository of already-checked facts would be a good starting
point. We are also interested in using ClaimBuster to check
content on popular social platforms where much political
information is being generated and shared. Each of these
areas is demanding and worthy of attention by the growing
field of computational journalism.
Acknowledgements This work is partially supported
by NSF grants IIS-1018865, CCF-1117369 and IIS-1408928.
Any opinions, findings, and conclusions or recommendations
expressed in this publication are those of the author(s) and
do not necessarily reflect the views of the funding agencies.
We thank Minumol Joseph for her contribution.
7. REFERENCES
[1] S. Cohen, J. T. Hamilton, and F. Turner. Computational
journalism. CACM, 54(10):66–71, Oct. 2011.
[2] S. Cohen, C. Li, J. Yang, and C. Yu. Computational journalism:
A call to arms to database researchers. In CIDR, 2011.
[3] L. Graves. Deciding What’s True: Fact-Checking Journalism
and the New Ecology of News. PhD thesis, Columbia
University, 2013.
[4] N. Hassan, C. Li, and M. Tremayne. Detecting check-worthy
factual claims in presidential debates. In CIKM, 2015.
[5] N. Hassan, A. Sultana, Y. Wu, G. Zhang, C. Li, J. Yang, and
C. Yu. Data in, fact out: Automated monitoring of facts by
FactWatcher. PVLDB, 7(13):1557–1560, 2014.
[6] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking
and the dynamics of the news cycle. In KDD, 2009.
[7] B. Nyhan and J. Reifler. The effect of fact-checking on elites: A
field experiment on U.S. state legislators. American Journal of
Political Science, 59(3):628–640, 2015.
[8] J. Platt et al. Probabilistic outputs for support vector machines
and comparisons to regularized likelihood methods. Advances
in large margin classifiers, 10(3), 1999.
[9] B. Walenz et al. Finding, monitoring, and checking claims
computationally based on structured data. In
Computation+Journalism Symposium, 2014.
[10] J. Wiebe and E. Riloff. Creating subjective and objective
sentence classifiers from unannotated texts. In CICLing. 2005.
[11] Y. Wu, P. K. Agarwal, C. Li, J. Yang, and C. Yu. Toward
computational fact-checking. In PVLDB, 2014.
[12] Y. Wu, B. Walenz, P. Li, A. Shim, E. Sonmez, P. K. Agarwal,
C. Li, J. Yang, and C. Yu. iCheck: computationally combating
"lies, d---ned lies, and statistics". In SIGMOD, 2014.
S. Cohen, J. T. Hamilton, and F. Turner. Computational journalism. CACM, 54(10):66–71, Oct. 2011.