
The Ethical Dimensions of Data Quality for Automated Fact-Checking



The Ethical Dimensions of Data ality for Automated
Laurence Dierickx
University of Bergen
Bergen, Norway
Carl-Gustav Lindén
University of Bergen
Bergen, Norway
Andreas Lothe Opdahl
University of Bergen
Bergen, Norway
Automated fact-checking (AFC) has grown in popularity to address the online spread of misinformation, disinformation, and propaganda about critical contemporary issues. Various natural language processing, machine learning, knowledge representation and database techniques have been used in AFC, while, from an end-user perspective, little attention has been paid to the quality of the datasets feeding these information systems. Considering the recognised need to blend AI-based tools with journalistic values, this research proposes a practical framework for assessing and improving data quality when developing or implementing AFC systems. Drawing on an interdisciplinary approach, it contributes to understanding how to better align AI-based solutions with ethical standards in journalism and fact-checking.
Keywords: automated fact-checking, datasets, data quality, ethics
Automated fact-checking (AFC) has attracted growing interest in the wake of the online spread of misinformation, disinformation, and propaganda on significant issues of our contemporary world, such as the US presidential elections, the COVID-19 pandemic, the global warming crisis, and the Russia-Ukraine war. Since online lies spread faster than the truth [ ], automated fact-checking aims to provide practical answers to speed up a process that is time-consuming when performed manually [ ]. AFC can be used for claim identification, evidence retrieval, which consists of finding information beyond the claim, and claim classification [48] [29] [38].
Research has explored several tools and techniques based on natural language processing, machine learning, knowledge representation and databases, which play a pivotal role in claim detection and verification [ ]. However, the journalistic field and journalists as end-users have been less considered. In a systematic literature review of papers devoted to AFC and published over the last five years, we found only 21 papers out of 267 that considered them. These works focused mainly on the complementarity between the journalist and the tool. Less attention was paid to the quality of the datasets that feed these systems, especially from an end-user perspective according to the fitness-for-use principle, which relates to data that adapt to the use of their final users. This principle therefore goes beyond the sole concern of accuracy in data [44].
At the same time, there is a recognised need to embed journalistic values within AI systems to better integrate them into journalism workflows [ ] [ ] [ ]. AFC systems work well when the domains of facts are restricted and on English corpora, but they are often not scalable to real-time content spread on social media, and pre-existing fact datasets appear insufficient [ ]. Such datasets nevertheless remain a helpful means of feeding these systems, insofar as information disorders are not solely agenda-related: for instance, conspiracy theories do not go away once they are debunked [19].
This research aims to question the quality of the data used in AFC systems and to define how to blend datasets with the professional values of their potential end-users. Hence, we have developed a data quality assessment that provides a method to evaluate issues and define the levels to improve when building (or using) datasets in automated fact-checking. This framework is grounded in data science and previous work on data quality in data-driven journalism [ ]. From an end-user perspective, it is built on the ethical standards of journalism and fact-checking, to contribute to aligning AFC systems with professional values. It can therefore be considered a practical tool for infusing end-users' values into AFC systems.
The denition of data quality is protean insofar as it encompasses a
set of complementary dimensions which were extended and rened
over time. Accuracy was approached as a measure of agreement
with an identied source [
], the level of precision and reliability of
the data [
], or as the representation of a dierent real-world state
from the one that should have been represented [
]. Scientic
literature also refers to the completeness of a given dataset, its
consistency (in terms of meeting formal requirements), timeliness
and reliability. Considering that dening data quality remains a
complex task due to the multidimensionality of the concept, an
agreement was found on the tness-for-use principle, according to
which quality data meet explicit or implicit user needs [
]. In other
words, data quality refers to data that adapt to their nal use, also
in terms of relevance and comprehensibility [53].
The rise of big data added extra layers to these concerns, challenging the quality dimensions of believability, verifiability and the reputation of the data in the context of data collected online or through sensors [ ]. Beyond the correctness of the data, it is also a matter of trusting them [ ] [ ]. Considering that building trust is essential for adopting a machine learning application [42], all of these considerations are far from trivial in the wake of the growing development of artificial intelligence systems because of their strong dependence on data. Nonetheless, a system's performance also depends on the algorithm at work, whose behaviour may in turn depend on the intrinsic characteristics of the data, especially in terms of volume and completeness [ ] [ ] [ ]. These concerns often remain confined to specialised research areas, and journalistic aspects have been little considered. In journalism studies, research on data-driven journalism recognised the structuring role
played by computerised databases, which is probably exacerbated by the introduction of AI technologies in newsrooms. In these fields, the need for high-quality data is a prerequisite because if the data are bad or biased, the information will be bad or biased too [ ] [ ]. Nonetheless, aspects related to data quality have been little addressed, although data quality was also considered a critical issue [ ] [ ]. Furthermore, it has been suggested that data selection and evaluation should be journalistic tasks, considering that they are related to journalistic human expertise, while validation, standardisation and normalisation should be the programmers' domain [32].
2.1 Building the Assessment Framework
According to the tness-for-use principle, data quality assessment
is use and context-dependent. It encompasses various strategies,
methods and techniques to identify erroneous data and measure
their impact on the processes. Its objective is to improve the overall
quality of the data [
] [
]. In this research, we dened data qual-
ity indicators that t journalistic and fact-checking ethical values,
considering that automated fact-checking systems are likely to be
used by journalists and fact-checkers to support or augment their
professional practices. Also, we considered that fact-checking ac-
tivities relate to journalism practices as a distinct sub-genre and a
form of accountable journalism [20] [36] [43].
The core ethical standards of journalism are grounded in the social responsibility of journalism, which refers indistinctly to the content of the news, the function of news media in society and the responsibility of news media towards society [ ]. Although ethical journalism is first and foremost a matter of practice, it is framed by commonly acknowledged principles: respect for the truth, which means providing verified facts based on reliable sources; reporting with accuracy; and providing well-balanced information with fairness, independence and non-partisanship [ ]. Objectivity is another standard promoted in journalism as constitutive of professional self-perception and identity [ ]. However, this concept is regularly criticised as it appears as an ideal, or even a myth, because it relies on the individual subjectivity of the journalist [ ] [ ]. Choosing a topic, an angle, sources, and a narrative also illustrates the impossibility of objectivity insofar as it implies human and organisational choices [45] [51] [55].
Considering that explaining these choices contributes to increasing the credibility of the news and to (re)building trust with audiences, transparency has been presented as an alternative to the disputed concept of objectivity [ ] [ ] [ ]. Transparency means that journalists remain "open and explicit about their processes, methods, limitations and assumptions" [ ]: 1507. This concept gained interest in the context of digital environments, seen as a means to open the "black box" of professional practices. In data journalism, for instance, transparency is considered a normative value that contributes to open journalism [ ]. Transparency is also at the heart of the guidelines promoted by the international fact-checking organisations, the International Fact-Checking Network (IFCN) and the European Fact-Checking Standards Network (EFCSN). Practically, their members must be transparent about their organisational structure, funding, partnerships and agreements. They must also be committed to non-partisanship and fairness. Last but not least, fact-checkers must provide their narratives with all the details, methods and sources that allow readers to replicate their work. Much more than a discursive stance, transparency goes hand in hand with professional practice in fact-checking, as it is a practical requirement.
Table 1: Assessment of the data quality dimensions

Dimension: Verification criteria
Accuracy: Level of interoperability, standardisation; ratio of accurate values to total values (measure of erroneous data); uniqueness (measurement of duplicate entries and redundancies)
Consistency: No encoding problems, no information loss; well-defined data structure (percentage of data with consistent format and values); homogeneity in the format, structure, and values; unambiguous and explicit labelling
Correctness: Identifying abnormal values; identifying the causes of NULL values; spelling coherence
Comprehensibility: Data documented with metadata; compliance with metadata; the extent to which data are understandable by the end-user
Timeliness: Currentness (percentage of updated data)
Completeness: Appropriate amount of data (ratio of missing values to total values; ratio of NULL values to total values)
Accessibility: Right to use the data; level of retrievability of the data
Objectivity: Unbiased data (size and representativity of the sample); identification of human bias (data and/or annotations)
Relevance: The extent to which the data are relevant for the purpose; data scarcity (fraction of data containing relevant information)
Usability: Making sense in a journalistic context
Reliability: Authenticity (source)
Credibility: Authority and reputation (source, annotators); degree of believability and expertise (data source, annotated data and annotation process - annotators)
Verifiability: Fact-checking the source, the data, the annotation process, and the annotated data
The Ethical Dimensions of Data ality for Automated Fact-Checking
Table 2: Sample of Fact-Checking Datasets

Alhindi et al.: Multidomain dataset based on 4K+ claim-article pairs from diverse sources.
Arslan et al.: Dataset of 23K+ statements extracted from U.S. general election presidential debates, annotated by human coders.
Drchal et al.: Derived from the FEVER dataset, CsFEVER contains 127K+ claims; CTKFacts contains 3K+ claims from a corpus of more than two million Czech News Agency news reports.
Sepúlveda-Torres et al.: Contains 7K+ news items classified as Compatible, Contradiction, or Unrelated.
Samarinas et al.: Large-scale dataset based on the FEVER dataset, used for evidence retrieval, and on MS MARCO, a collection of large-scale datasets for deep learning.
Shahi and Nandini: Multilingual cross-domain dataset of 5K+ fact-checked news articles on COVID-19, collected from 04/01/2020 to 01/07/2020.
Kotonya and Toni: Dataset based on 11.8K claims collected from 5 fact-checking websites.
Sathe et al.: Dataset of 124K+ triples consisting of a claim, context and evidence document extracted from English Wikipedia articles and citations, and 34K+ manually written claims refuted by evidence documents.
Gupta and Srikumar: Multilingual dataset for factual verification of naturally existing real-world claims, composed of 38K+ short statements.
2.2 Method
Data quality assessments usually consist of defining data quality indicators and providing tools for measurement [ ]. However, data quality also depends on the design and production processes at work to generate the data [ ]. Moreover, in a data quality assessment, subjective considerations intertwine with objective ones, insofar as the assessment reflects human needs, experiences and contexts of use [39].
The framework to assess data quality for automated fact-checking is built upon three core ethical principles in journalism and fact-checking (Table 1): the principle of "truth" relates to the data quality dimensions of accuracy, consistency, correctness and comprehensibility; the principle of "fairness" encompasses the dimensions of timeliness, completeness, accessibility, objectivity, relevance and usability; and the principle of "transparency", as a lever for trust, is related to the reliability, credibility and verifiability of the data. This three-level segmentation assumes that telling the truth involves knowledge of the application domain the data refer to, that being fair refers to unbiased and well-balanced information, and that transparency gathers the means for remaining trustworthy.
An extensive literature review of papers, pre-prints and proceedings published between 2020 and 2022 allowed us to identify a sample of nine publicly available datasets developed for automated fact-checking (Table 2). This sample only included textual data because a corpus of images involves other types of considerations related to the intrinsic characteristics of images in terms of blur, noise, contrast, format and compression [ ]. Nonetheless, data quality challenges also encompass the diversity of datasets and the quality of annotations [ ]. Google Refine (now OpenRefine) was used as a data quality tool for data profiling, to identify the overall data quality challenges from formal and empirical perspectives [ ]. Due to the vast amount of data to assess, we considered the Pareto principle relevant, as "most of the errors are attributable to just a few variables" [47]: 237.
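The Pareto-style profiling step can be sketched as follows. The error criterion used here (missing or NULL-like entries) and the field names are illustrative assumptions, not the exact checks run in OpenRefine:

```python
# Rank fields by error count so that the few variables carrying
# most of the errors surface first (Pareto principle). Here an
# "error" is simply a missing or NULL-like value; a real profiling
# pass would add format and range checks per field.
from collections import Counter

def error_profile(rows):
    errors = Counter()
    for row in rows:
        for field, value in row.items():
            if value in (None, "", "NULL"):
                errors[field] += 1
    return errors.most_common()  # worst fields first

rows = [
    {"claim": "A", "source": None, "date": "2020-01-01"},
    {"claim": "B", "source": "NULL", "date": None},
    {"claim": "C", "source": "", "date": "2020-02-05"},
]

print(error_profile(rows))  # [('source', 3), ('date', 1)]
```

Concentrating the manual inspection effort on the top-ranked fields is what makes the assessment tractable for large datasets.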
The analysis aimed to identify the limitations or issues regarding the ethical principles of truth, fairness and transparency. As the purpose was not to attribute good and bad marks to each examined dataset, the analysis adopted a transversal approach.
3.1 Truth
The nine datasets of our corpus have different characteristics in terms of size, domains, languages and formats (JSON, CSV, TXT, TSV), which do not seem to be an obstacle to reusing them. However, cross-domain approaches (e.g., politics, sports, health) appear the most challenging to deal with, considering the knowledge required to handle each domain well. Four datasets were not documented by metadata or lacked explicit labelling. The use of a sentiment score in one dataset was unclear, as was the labelling used to assess the validity of a claim. Three datasets contained NULL values, which may have various causes and require human knowledge to interpret (e.g., the NULL value equals zero, or the information exists but is not known or is irrelevant to the variable). The overall understandability
of the datasets was not always guaranteed because of a lack of documentation, although academic papers documented the processes. As the datasets relied on textual data, the question of the standardisation and harmonisation of the language arose, also in multilingual datasets.
3.2 Fairness
In terms of relevance, the language and context-dependency of the datasets raised the issue of using them in other languages or national contexts. The datasets' usability (and reusability) is also challenged by the dimension of accessibility, as most of the datasets did not have an attached licence. The dimension of timeliness is also problematic for several reasons: missing dates (1 dataset), no mention of the last update (1 dataset), and a corpus collected over a limited period (3 datasets). Hence, the currentness of the datasets was not always guaranteed and raised questions about the relevance of their reusability, even though they can be useful for fact-checking old propaganda discourses or conspiracy theories. However, the lack of maintenance of the datasets remains an obstacle to meeting the two ethical principles of truth and fairness, since information disorders are a dynamic phenomenon that can vary or change over time; this also applies to concepts and definitions, considering that the construction of knowledge is an ongoing process. In addition, a cross-domain approach made it difficult to assess the completeness dimension. We also found two datasets with missing values, in respective proportions of 11.37% and 24.53%. Nevertheless, the completeness of the datasets remained difficult to evaluate, whether for recent or older phenomena, because there is no absolute referent against which to assess it. As a corollary, the dimension of objectivity appeared problematic when looking at the annotations used for classification purposes: from "True" to "False", "Half-true", "Unproven", "Contradiction", "Compatible" or "Unrelated", there was no consensus among researchers.
3.3 Transparency
The majority of the datasets had no issues related to source trustworthiness, as they mostly relied on specialised fact-checking and news websites. The pitfalls underlined in previous research were globally avoided, considering that several potential data quality issues are likely to appear with open data, user-generated data and data from multiple sources [ ]. However, three datasets used Wikipedia as a primary source, which raised questions related to their reliability, credibility and verifiability; they also challenged the fairness principle in terms of objectivity and relevance. In journalism, Wikipedia is treated with caution because the content comes from users about whose expertise nothing is known [ ]. Also, the Wikipedian, or encyclopaedic, writing style differs from journalistic writing, making it less useful for training. The same applies to the social media content used in one dataset, perhaps exacerbated by the unknown and volatile nature of the users. The annotation processes did not appear particularly problematic. Datasets were mostly well documented, except for one with no indication of the level of human expertise behind the annotations. In this regard, research has emphasised that, whether manual or automated, annotations are inherently error-prone and that, when performed manually, human subjective factors should also be considered [ ] [22] [41].
Results showed that adapting data to the ethical values of journalists and fact-checkers does not only mean ensuring the reliability and credibility of the data source and the accuracy of the data. One of the main challenges relates to maintenance over time with regard to the dimension of currentness, insofar as information disorders are an ongoing process. Several of the examined datasets might still be useful for older cases, considering that history tends to repeat itself. Still, the question of maintenance remains critical as domains and concepts evolve over time. Further, fact-checking requires a critical approach toward the source of the data, including annotated data. Datasets based on Wikipedia and on social media raised questions about their fairness and trustworthiness. Acknowledging that the relationship between journalists and AI-driven systems is built on trust, the data that feed these systems should also be trustworthy.
Despite limitations due to its normative lenses and the sample size, the data quality assessment framework developed in this research aimed to provide clues for improving overall data quality when using technologies that rely so heavily on large volumes of data. In many ways, the developed approach shares common concerns with computer and data science, such as those set out in the FAIR principles, which propose guidelines for improving the findability, accessibility, interoperability and reuse of digital assets [ ]. As end-users of AI-based systems, journalists and fact-checkers are not always aware of or informed about the data that feed the systems they use. At the same time, their expertise in assessing a data source's reliability and credibility and their knowledge of the context should not be overlooked. Therefore, better fine-tuning AI-based systems with their end-users would strengthen collaborations and favour cross-discipline approaches.
The research was funded by EU CEF grant number 2394203.
References

[1] C.W. Anderson. 2018. Apostles of Certainty: Data Journalism and the Politics of Doubt. Oxford University Press.
[2] Jo Bardoel and Leen d'Haenens. 2004. Media Responsibility and Accountability: New Conceptualizations and Practices. Communications 29, 1 (2004).
[3] Carlo Batini. 2009. Data Quality Assessment. In Encyclopedia of Database Systems. Springer US, 608–612.
[4] Carlo Batini, Anisa Rula, Monica Scannapieco, and Gianluigi Viscusi. 2015. From Data Quality to Big Data Quality. Journal of Database Management 26, 1 (2015).
[5] Isabelle Boydens and Seth van Hooland. 2011. Hermeneutics Applied to the Quality of Empirical Databases. Journal of Documentation 67, 2 (2011), 279–289.
[6] Paul Bradshaw. 2017. Data Journalism. In The Online Journalism Handbook. Routledge, 250–280.
[7] Meredith Broussard, Nicholas Diakopoulos, Andrea L. Guzman, Rediet Abebe, Michel Dupagne, and Ching-Hua Chuan. 2019. Artificial Intelligence and Journalism. Journalism & Mass Communication Quarterly 96, 3 (2019), 673–695.
[8] Li Cai and Yangyong Zhu. 2015. The Challenges of Data Quality and Data Quality Assessment in the Big Data Era. Data Science Journal 14 (2015), 2.
[9] Corinna Cichy and Stefan Rass. 2019. An Overview of Data Quality Frameworks. IEEE Access 7 (2019), 24634–24648.
[10] Stephanie Craft and Tim P. Vos. 2021. The Ethics of Transparency. In The Routledge Companion to Journalism Ethics. Routledge, 175–183.
[11] Mark Deuze. 2005. What Is Journalism?: Professional Identity and Ideology of Journalists Reconsidered. Journalism 6, 4 (2005), 442–464.
[12] Nicholas Diakopoulos. 2019. Automating the News. Harvard University Press.
[13] Laurence Dierickx. 2017. News Bot for the Newsroom: How Building Data Quality Indicators Can Support Journalistic Projects Relying on Real-Time Open Data. In Global Investigative Journalism Conference 2017 Academic Track.
[14] Samuel Dodge and Lina Karam. 2016. Understanding How Image Quality Affects Deep Neural Networks. In 8th International Conference on Quality of Multimedia Experience (QoMEX). IEEE.
[15] Konstantin Nicholas Dörr and Katharina Hollnbuchner. 2016. Ethical Challenges of Algorithmic Journalism. Digital Journalism 5, 4 (2016), 404–419.
[16] Lisa Ehrlinger, Verena Haunschmid, Davide Palazzini, and Christian Lettner. 2019. A DaQL to Monitor Data Quality in Machine Learning Applications. In Lecture Notes in Computer Science. Springer International Publishing, 227–237.
[17] Harald Foidl and Michael Felderer. 2019. Risk-Based Data Validation in Machine Learning-Based Software Systems. In Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation. ACM.
[18] Christopher Fox, Anany Levitin, and Thomas Redman. 1994. The Notion of Data and its Quality Dimensions. Information Processing & Management 30, 1 (1994), 9–19.
[19] Ted Goertzel. 1994. Belief in Conspiracy Theories. Political Psychology 15, 4 (1994), 731.
[20] Lucas Graves and C.W. Anderson. 2020. Discipline and Promote: Building Infrastructure and Managing Algorithms in a "Structured Journalism" Project by Professional Fact-Checking Groups. New Media & Society 22, 2 (2020), 342–360.
[21] Zhijiang Guo, Michael Schlichtkrull, and Andreas Vlachos. 2022. A Survey on Automated Fact-Checking. Transactions of the Association for Computational Linguistics 10 (2022), 178–206.
[22] Nitin Gupta, Shashank Mujumdar, Hima Patel, Satoshi Masuda, Naveen Panwar, Sambaran Bandyopadhyay, Sameep Mehta, Shanmukha Guttula, Shazia Afzal, Ruhi Sharma Mittal, and Vitobha Munigala. 2021. Data Quality for Machine Learning Tasks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. ACM.
[23] Kai Hafez. 2002. Journalism Ethics Revisited: A Comparison of Ethics Codes in Europe, North Africa, the Middle East, and Muslim Asia. Political Communication 19, 2 (2002), 225–250.
[24] Joseph F. Hair and Marko Sarstedt. 2021. Data, Measurement, and Causal Inferences in Machine Learning: Opportunities and Challenges for Marketing. Journal of Marketing Theory and Practice 29, 1 (2021), 65–77.
[25] Y.U. Huh, F.R. Keller, T.C. Redman, and A.R. Watkins. 1990. Data Quality. Information and Software Technology 32, 8 (1990), 559–565.
[26] Michael Karlsson. 2020. Dispersing the Opacity of Transparency in Journalism on the Appeal of Different Forms of Transparency to the General Public. Journalism Studies 21, 13 (2020), 1795–1814.
[27] Michael Koliska. 2022. Trust and Journalistic Transparency Online. Journalism Studies 23, 12 (2022), 1488–1509.
[28] Tomoko Komatsu, Marisela Gutierrez Lopez, Stephann Makri, Colin Porlezza, Glenda Cooper, Andrew MacFarlane, and Sondess Missaoui. 2020. AI Should Embody Our Values: Investigating Journalistic Values to Inform AI Technology Design. In Proceedings of the 11th Nordic Conference on Human-Computer Interaction: Shaping Experiences, Shaping Society. ACM.
[29] Lev Konstantinovskiy, Oliver Price, Mevan Babakar, and Arkaitz Zubiaga. 2021. Toward Automated Factchecking: Developing an Annotation Schema and Benchmark for Consistent Automated Claim Detection. Digital Threats: Research and Practice 2, 2 (2021), 1–16.
[30] Tien Fabrianti Kusumasari and Fitria. 2016. Data Profiling for Data Quality Improvement with OpenRefine. In 2016 International Conference on Information Technology Systems and Innovation (ICITSI). IEEE.
[31] Eric Lazarski, Mahmood Al-Khassaweneh, and Cynthia Howard. 2021. Using NLP for Fact Checking: A Survey. Designs 5, 3 (2021), 42.
[32] Carl-Gustav Lindén. 2016. Decades of Automation in the Newsroom. Digital Journalism 5, 2 (2016), 123–140.
[33] Jianzheng Liu, Jie Li, Weifeng Li, and Jiansheng Wu. 2016. Rethinking Big Data: A Review on the Data Quality and Usage Issues. ISPRS Journal of Photogrammetry and Remote Sensing 115 (2016), 134–142.
[34] Marisela Gutierrez Lopez, Colin Porlezza, Glenda Cooper, Stephann Makri, Andrew MacFarlane, and Sondess Missaoui. 2022. A Question of Design: Strategies for Embedding AI-Driven Tools into Journalistic Work Routines. Digital Journalism (2022), 1–20.
[35] Wilson Lowrey, Ryan Broussard, and Lindsey A. Sherrill. 2019. Data Journalism and Black-Boxed Data Sets. Newspaper Research Journal 40, 1 (2019), 69–82.
[36] Paul Mena. 2018. Principles and Boundaries of Fact-checking: Journalists' Perceptions. Journalism Practice 13, 6 (2018), 657–672.
[37] Juan Ramón Muñoz-Torres. 2012. Truth and Objectivity in Journalism. Journalism Studies 13, 4 (2012), 566–582.
[38] Preslav Nakov, David Corney, Maram Hasanain, Firoj Alam, Tamer Elsayed, Alberto Barrón-Cedeño, Paolo Papotti, Shaden Shaar, and Giovanni Da San Martino. 2021. Automated Fact-Checking for Assisting Human Fact-Checkers. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence Organization.
[39] Leo L. Pipino, Yang W. Lee, and Richard Y. Wang. 2002. Data Quality Assessment. Commun. ACM 45, 4 (2002), 211–218.
[40] Colin Porlezza and Sergio Splendore. 2019. From Open Journalism to Closed Data: Data Journalism in Italy. Digital Journalism 7, 9 (2019), 1230–1252.
[41] Fakhitah Ridzuan, Wan Mohd Nazmee Wan Zainon, and Mohd Zairul. 2021. A Thematic Review on Data Quality Challenges and Dimension in the Era of Big Data. In Lecture Notes in Electrical Engineering. Springer Singapore, 725–737.
[42] Keng Siau and Weiyu Wang. 2018. Building Trust in Artificial Intelligence, Machine Learning, and Robotics. Cutter Business Technology Journal 31, 2 (2018), 47–53.
[43] Jane B. Singer. 2020. Border Patrol: The Rise and Role of Fact-Checkers and Their Challenge to Journalists' Normative Boundaries. Journalism 22, 8 (2020).
[44] Giri Kumar Tayi and Donald P. Ballou. 1998. Examining Data Quality. Commun. ACM 41, 2 (1998), 54–57.
[45] Jingrong Tong and Landong Zuo. 2019. The Inapplicability of Objectivity: Understanding the Work of Data Journalism. Journalism Practice 15, 2 (2019), 153–169.
[46] Khonzodakhon Umarova and Eni Mustafaraj. 2019. How Partisanship and Perceived Political Bias Affect Wikipedia Entries of News Sources. In Companion Proceedings of The 2019 World Wide Web Conference. ACM.
[47] Richard D. De Veaux and David J. Hand. 2005. How to Lie with Bad Data. Statistical Science 20, 3 (2005).
[48] Andreas Vlachos and Sebastian Riedel. 2014. Fact Checking: Task Definition and Dataset Construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science. Association for Computational Linguistics.
[49] Tim P. Vos and Stephanie Craft. 2016. The Discursive Construction of Journalistic Transparency. Journalism Studies 18, 12 (2016), 1505–1522.
[50] Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The Spread of True and False News Online. Science 359, 6380 (2018), 1146–1151.
[51] Karin Wahl-Jorgensen. 2013. Subjectivity and Story-telling in Journalism. Journalism Studies 14, 3 (2013), 305–320.
[52] Yair Wand and Richard Y. Wang. 1996. Anchoring Data Quality Dimensions in Ontological Foundations. Commun. ACM 39, 11 (1996), 86–95.
[53] Richard Y. Wang and Diane M. Strong. 1996. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.
[54] Stephen J. A. Ward. 2019. Journalism Ethics. In The Handbook of Journalism Studies. Routledge, 307–323.
[55] Charlotte Wien. 2005. Defining Objectivity within Journalism. Nordicom Review 26, 2 (2005), 3–15.
[56] Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak ..., and Barend Mons. 2016. The FAIR Guiding Principles for Scientific Data Management and Stewardship. Scientific Data 3, 1 (2016).
[57] Yazhou Yao, Jian Zhang, Fumin Shen, Li Liu, Fan Zhu, Dongxiang Zhang, and Heng Tao Shen. 2020. Towards Automatic Construction of Diverse, High-Quality Image Datasets. IEEE Transactions on Knowledge and Data Engineering 32, 6 (2020), 1199–1211.