Conference PaperPDF Available

Automatic Deception Detection: Methods for Finding Fake News

Authors:

Abstract

This research surveys the current state-of-the-art technologies that are instrumental in the adoption and development of fake news detection. " Fake news detection " is defined as the task of categorizing news along a continuum of veracity, with an associated measure of certainty. Veracity is compromised by the occurrence of intentional deceptions. The nature of online news publication has changed, such that traditional fact checking and vetting from potential deception is impossible against the flood arising from content generators, as well as various formats and genres. The paper provides a typology of several varieties of veracity assessment methods emerging from two major categories – linguistic cue approaches (with machine learning), and network analysis approaches. We see promise in an innovative hybrid approach that combines linguistic cue and machine learning, with network-based behavioral data. Although designing a fake news detector is not a straightforward problem, we propose operational guidelines for a feasible fake news detecting system.
Automatic Deception Detection: Methods for Finding Fake
News
Niall J. Conroy, Victoria L. Rubin, and Yimin Chen
Language and Information Technology Research Lab (LIT.RL)
Faculty of Information and Media Studies
University of Western Ontario, London, Ontario, CANADA
nconroy1@uwo.ca, vrubin@uwo.ca, ychen582@uwo.ca
!
ABSTRACT
This research surveys the current state-of-the-art
technologies that are instrumental in the adoption and
development of fake news detection. “Fake news detection”
is defined as the task of categorizing news along a
continuum of veracity, with an associated measure of
certainty. Veracity is compromised by the occurrence of
intentional deceptions. The nature of online news
publication has changed, such that traditional fact checking
and vetting from potential deception is impossible against
the flood arising from content generators, as well as various
formats and genres.
The paper provides a typology of several varieties of
veracity assessment methods emerging from two major
categories linguistic cue approaches (with machine
learning), and network analysis approaches. We see promise
in an innovative hybrid approach that combines linguistic
cue and machine learning, with network-based behavioral
data. Although designing a fake news detector is not a
straightforward problem, we propose operational guidelines
for a feasible fake news detecting system.
Keywords
Deception detection, fake news detection, veracity
assessment, news verification, methods, automation, SVM,
knowledge networks, predictive modelling, fraud
INTRODUCTION
News verification aims to employ technology to identify
intentionally deceptive news content online, and is an
important issue within certain streams of library and
information science (LIS). Fake news detection is defined
as the prediction of the chances of a particular news article
(news report, editorial, expose, etc.) being intentionally
deceptive (Rubin, Conroy & Chen, 2015). Tools aim to
mimic certain filtering tasks which have, to this point, been
the purview of journalists and other publishers of traditional
news content. The proliferation of user-generated content,
and Com put er Media ted Com mun i cat ion (CM C )
technologies such as blogs, Twitter, and other social media
have the capacity of news delivery mechanisms on a mass
scale— yet much of the information is of questionable
veracity (Ciampaglia, Shiralkar, Rocha, Bollen, Menczer &
Flammini, 2015). Establishing the reliability of information
online is a daunting but critical challenge. Four decades of
deception detection research has helped us learn about how
well humans are able detect lies in text. The findings show
we are not so good at it. In fact, just 4% better than chance,
based on a meta-analysis of more than 200 experiments
(Bond & DePaulo, 2006). This problem has led researchers
and technical developers to look at several automated ways
of assessing the truth value of potentially deceptive text
based on the properties of the content and the patterns of
computer-mediated communication .
Structured datasets are easier to verify than non-structured
(or semi-structured) data such as texts. When we know the
language domain (e.g., insurance claims or health-related
news) we can make better guesses about the nature and use
of deception. Semi-structured non-domain specific web
data come in many formats and demand flexible methods
for veracity verification. For some time, however, the
development and evaluation of different methods have
remained in isolated corners, relatively unknown in LIS.
More recently, efforts of methodological cross-pollination
and hybrid approaches have produced promising results
(Rubin et al., 2015A). The range of journalistic practices
and available news sources (see Rubin et al. (2015B) for an
overview) demand consideration of multiple methods since
one approach often addresses known weaknesses in another.
How then is it possible to gauge the veracity of online
news?
This paper provides researchers with a map of the current
landscape of veracity (or deception) assessment methods,
their major classes and goals, all with the aim of proposing
a hybrid approach to system design. These methods have
emerged from separate development streams, utilizing
disparate techniques. In this survey, two major categories of
methods emerge: 1. Linguistic Approaches in which the
content of deceptive messages is extracted and analyzed to
associate language patterns with deception; and 2. Network
Approaches in which network information, such as message
metadata or structured knowledge network queries can be
harnessed to provide aggregate deception measures. Both
forms typically incorporate machine learning techniques for
training classifiers to suit the analysis. It is incumbent upon
researchers to understand these different areas, yet no
known typology of methods exists in the current literature.
The goal is to provide a survey of the existing research
while proposing a hybrid approach, which utilizes the most
eff e c t i v e dec e p t i o n dete c t i o n meth o d s for th e
implementation of a fake news detection tool.
!
ASIST 2015, November 6-10, 2015, St. Louis, MO, USA.
Copyright © 2015 Niall J. Conroy, Victoria L. Rubin & Yimin Chen
LINGUISTIC APPROACHES
Most liars use their language strategically to avoid being
caught. In spite of the attempt to control what they are
saying, language “leakage” occurs with certain verbal
aspects that are hard to monitor such as frequencies and
patterns of pronoun, conjunction, and negative emotion
word usage (Feng & Hirst, 2013). The goal in the linguistic
approach is to look for such instances of leakage or, so
called “predictive deception cues” found in the content of a
message.
Data Representation
Perhaps the simplest method of representing texts is the
“bag of words” approach, which regards each word as a
single, equally significant unit. In the bag of words
approach, individual words or “n-grams” (multiword)
frequencies are aggregated and analyzed to reveal cues of
deception. Further tagging of words into respective lexical
cu es for e xam ple, pa rts of s pee ch or “s hal low
syntax (Hancock & Markowitz, 2013), affec ti ve
dimensions (Vrij, 2006), or location-based words (Hancock,
et al, 2013) are all ways of providing frequency sets to
reveal linguistic cues of deception.
The simplicity of this representation also leads to its biggest
shortcoming. In addition to relying exclusively on
language, the method relies on isolated n-grams, often
divorced from useful context information. In this method,
any resolution of ambiguous word sense remains non-
existent (Larcker & Zakolyukina 2012). Many deception
detection researchers have found this method useful in
tandem with different, complementary analysis (Zhang,
Fan, Zeng & Liu, 2012; Lary, Nikitov & Stone, 2010; Ott,
Cardi, & Hancock, 2013), several of which are discussed in
the remainder of this proposal.
Deep Syntax
Analysis of word use is often not enough in predicting
deception. Deeper language structures (syntax) have been
analyzed to predict instances of deception. Deep syntax
analysis is implemented through Probability Context Free
Grammars (PCFG). Sentences are transformed to a set of
rewrite rules (a parse tree) to describe syntax structure, for
example noun and verb phrases, which are in turn rewritten
by their syntactic constituent parts (Feng, Banerjee & Choi,
2012). The final set of rewrites produces a parse tree with a
certain probability assigned. This method is used to
distinguish rule categories (lexicalized, unlexicalized,
parent nodes, etc.) for deception detection with 85-91%
accuracy (depending on the rule category used) (Feng et al.,
2012).
Third-party tools, such as the Stanford Parser (de Marneffe,
MacCartney, Manning, 2006; Rahangdale & Agrawa,
2014), AutoSlog-TS syntax analyzer (Oraby, Reed,
Compton, Riloff, Walker, & Whittaker, 2015) and others
assist in the automation. Alone, syntax analysis might not
be sufficiently capable of identifying deception, and studies
often combine this approach with other linguistic or
network analysis techniques (e.g., Feng et al., 2012; Feng &
Hirst, 2013).
Semantic Analysis
As an alternative to deception cues, signals of truthfulness
have also been analyzed and achieved by characterizing the
degree of compatibility between a personal experience (e.g.,
a hotel review) as compared to a content “profile” derived
from a collection of analogous data. This approach extends
the n-gram plus syntax model by incorporating profile
compatibility features, showing the addition significantly
improves classification performance. (Feng & Hirst, 2013).
The intuition is that a deceptive writer with no experience
with an event or object (e.g., never visited the hotel in
question) may include contradictions or omission of facts
present in profiles on similar topics. For product reviews, a
writer of a truthful review is more likely to make similar
comments about aspects of the product as other truthful
reviewers. Extracted content from key words consists of
attribute:descriptor pair. By aligning profiles and the
description of the writers personal experience, veracity
assessment is a function of the compatibility scores: 1.
Compatibility with the existence of some distinct aspect
(eg. an art museum near the hotel); 2. Compatibility with
the description of some general aspect, such as location or
serv ic e. Pr ed iction of f al se ho od is shown to be
approximately 91% accurate with this method.
Although demonstrated useful in the above context of
reviews, this method has so far been restricted to the
domain of application. There are two potential limitations
in this method: the ability to determine alignment between
Figure 1: Fact-checking statements. (a) Structured information about President
Obama contained in the “infoboxes” of Wikipedia articles. (b) Shortest knowl-
edge graph path returned for the false statement “Barack Obama is a Muslim”.
The path traverses high-degree nodes representing generic entities, such as
Canada, and is assigned a low truth value. (Ciampiaglia et al., 2015)
attributes and descriptors depends on a sufficient amount of
mined content for profiles, and the challenge of correctly
associating descriptors with extracted attributes.
Rhetorical Structure and Discourse Analysis
At the discourse level, deception cues present themselves
both in CMC communication and in news content. A
description of discourse can be achieved through the
Rhetorical Structure Theory (RST) analytic framework, that
identifies instances of rhetoric relations between linguistic
elements. Systematic differences between deceptive and
truthful messages in terms of their coherence and structure
has been combined with a Vector Space Model (VSM) that
assesses each message’s position in multi-dimensional RST
space with respect to its distance to truth and deceptive
centers (Rubin & Lukoianova, 2014). At this level of
linguistic analysis, the prominent use of certain rhetorical
relations can be indicative of deception. Tools to automate
rhetorical classification are becoming available, although
not yet employed in the context of veracity assessment.
Classifiers
Sets of word and category frequencies are useful for
subsequent automated numerical analysis. One common use
is for the training of “classifiers” as in Support Vector
Machines (SVM) (Zhang et al., 2012) and Naïve Bayesian
models (Oraby et al., 2015). Simply put, when a
mathematical model is sufficiently trained from pre-coded
examples in one of two categories, it can predict instances
of future deception on the basis of numeric clustering and
distances. The use of different clustering methods and
distance functions between data points shape the accuracy
of SVM (Strehl, Ghosh & Mooney, 2000), which invites
new experimentation on the net effect of these variables.
Naïve Bayes algorithms make classifications based on
accumulated evidence of the correlation between a given
variable (e.g., syntax) and the other variables present in the
model (Mihalcea & Strapparava, 2009).
The classification of sentiment (Pang & Lee, 2008; Ott et
al., 2013) is based on the underlying intuition that deceivers
use unintended emotional communication, judgment or
evaluation of affective state (Hancock, Woodworth, &
Porter, 2011). Likewise, syntactic patterns may be used in
distinguishing feeling from fact-based arguments by
associating learned patterns of argumentation style classes.
In studies of business communication, performance is
significantly better than a random guess by 16%, and the
language of deceptive executives exhibits fewer non-
extreme positive emotions (Larcker & Zakolyukina, 2012).
Compari son b etween hum an judgement and SVM
classifiers showed 86% performance accuracy on negative
deceptive opinion spam (Ott et al., 2013). Fake negative
reviewers over-produced negative emotion terms relative to
the truthful reviews. These were deemed not the result of
“leakage cues” from the emotional distress of lying, but
exaggerations of the sentiment deceivers are trying to
convey.
These linguistic approaches all rely on language usage and
its analysis, and are promising when used in hybrid
approaches. However, findings emerging from topic-
specific studies (product reviews, business) may have
limited generalizability towards real-time veracity detection
of news.
NETWORK APPROACHES
Innovative and varied, using network properties and
behavior are ways to complement content-based approaches
that rely on deceptive language and leakage cues to predict
deception. As real-time content on current events is
in cre asi ngl y pr oli fer ate d thro ugh mic ro- blo ggi ng
applications such as Twitter, deception analysis tools are all
the more important.
Linked data
The use of knowledge networks may represent a significant
step towards scalable computational fact-checking methods.
For certain data, false “factual statements” can represent a
form of deception since they can be extracted and examined
alongside findable statements about the known world. This
approach leverages an existing body of collective human
knowledge to assess the truth of new statements. The
method depends on querying existing knowledge networks,
or publicly available structured data, such as DBpedia
ontology, or the Google Relation Extraction Corpus
(GREC).
The inherently structured data network of entities is
connected through a predicate relationship. Fact checking
can be effectively reduced to a simple network analysis
problem: the computation of the simple shortest path (see
Figure 1). Queries based on extracted fact statements are
assigned semantic proximity as a function of the transitive
relationship between subject and predicate via other nodes.
The closer the nodes, the higher the likelihood that a
particular subject-predicate-object statement is true.
There are several so-called ‘network effect’ variables that
are exploited to derive truth probabilities (Ciampaglia et al.,
2015), so the outlook for exploiting structured data
repositories for fact-checking remains promising. From the
short list of existing published work in this area, results
using sample facts from four different subject areas range
from 61% to 95%. Success was measured based on whether
the machine was able to assign higher true values to true
statements than to false ones (Ciampaglia, et al., 2015). A
problem with this method, however, rests in the fact that
statements must reside in a pre-existing knowledge base.
Social Network Behavior
Authentication of identity on social media is paramount to
the notion of trust. The proliferation of news in the form of
current events through mass technologies like micro-blogs
invites ways of ascertaining the difference between fake
and genuine content. Outside of the analysis of content
comes the use of metadata and telltale behavior of
questionable sources (Chu, Gianvecchio, Wang & Jajodia,
2010). The recent use of twitter in influencing political
perceptions (Cook et al., 2013) is one scenario where
certain data, namely the inclusion of hyperlinks or
associated metadata, can be compiled to establish veracity
assessments. Centering resonance analysis (CRA), a mode
of network-based text analysis, represents the content of
large sets of texts by identifying the most important words
that link other words in the network. This was employed by
Papacharissi & Oliviera to identify content patterns in posts
about Egypt’s elections (2012). Combining sentiment and
behaviour studies have demonstrated the contention that
sentiment-focused reviews from singleton contributors
significantly affects online ranking (Wu, Greene, Smyth &
Cunningham, 2010), and that this is an indicator of
“shilling” or contributing fake reviews to artificially distort
a ranking.
CONCLUSION
Linguistic and network-based approaches have shown high
accuracy results in classification tasks within limited
domains. This discussion drafts a basic typology of
methods available for further refinement and evaluation,
and provides a basis for the design of a comprehensive fake
news detection tool. Techniques arising from disparate
approaches may be utilized together in a hybrid system,
whose features are summarized:
Linguistic processing should be built on multiple layers
from word/lexical analysis to highest discourse-level
analysis for maximum performance.
As a viable alternative to strictly content-based
approaches, network behavior should be combined to
incorporate the ‘trust’ dimension by identifying credible
sources.
Tools should be designed to augment human judgement,
not replace it. Relations between machine output and
methods should be transparent.
Contributions in the form of publicly available gold
standard datasets should be in linked data format to assist
in up-to-date fact checking.
ACKNOWLEDGMENTS
This research has been funded by the Government of
Canada Social Sciences and Humanities Research Council
(SSHRC) Insight Grant (#435-2015-0065) awarded to Dr.
Rubin for the project entitled Digital Deception Detection:
Identifying Deliberate Misinformation in Online News.
REFERENCES
Ciampaglia, G., Shiralkar, P., Rocha, L., Bollen, J. Menczer, F., &
Flammini, A. (2015). Computational fact checking from
knowledge networks.
Chen, Y., Conroy, N. J., & Rubin, V. L. (2015). News in an Online
World: The Need for an “Automatic Crap Detector”. In The
Proceedings of the Association for Information Science and
Technology Annual Meeting (ASIST2015), Nov. 6-10, St. Louis.
Chu, Z., Gianvecchio, S., Wang, H., & Jajodia, S. (2010). Who is
tweeting on Twitter: Human, Bot, or Cyborg? in the
Proceedings of the 26th Annual Computer Security Applications
Conference, ACSAC ’10, pp. 21-30.
Cook, D., Waugh, B., Abdipanab, M, Hashemi, O., Rahman, S.
(2013). Twitter Deception and Influence: Issues of Identity,
Slacktivism and Puppetry
de Marneffe, M., MacCartney, B. & Manning, C. (2006).
Generating typed dependency parses from phrase structure
parses. In Proceedings of the 5th International Conference on
Language Resources and Evaluation (LREC 2006).
Hancock, J., Woodworth, M. & Porter, S. (2011). Hungry like a
wolf: A word pattern analysis of the language of psychopaths.
Legal and Criminological Psychology. 113.
Hancock, J. & Markowitz, D. (2014). Linguistic Traces of a
Scientific Fraud: The Case of Diederik Stapel. PLoS ONE,9(8)
Feng, V. & Hirst, G. (2013) Detecting deceptive opinion with
profile compatibility.
Feng, S., Banerjee, R. & Choi, Y. (2012). Syntactic Stylometry for
Deception Detection. 50th Annual Meeting of the Association
for Computational Linguistics. Association for Computational
Linguistics, 171–175.
Larcker, D., Zakolyukina, A. (2012). Detecting Deceptive
Discussions in Conference Calls. Journal of Accounting
Research, 50(2), 495540.
Mihalcea, R. & Strapparava, C. (2009). The Lie Detector:
Explorations in the Automatic Recognition of Deceptive
Language. Proceedings of the ACL-IJCNLP Conference Short
Papers, pp. 309–312,
Oraby, S., Reed, L., Compton, R., Riloff, E., Walker, M. &
Whittaker, S. (2015). And That’s A Fact: Distinguishing Factual
and Emotional Argumentation in Online Dialogue
Ott, M., Cardie, C. & Hancock, J. (2013). Negative Deceptive
Opinion Spam. Proceedings of NAACLHLT. pp. 497–501,
Pang, B. & Lee, L. (2008). Opinion mining and sentiment
analysis. Foundations and Trends in Information Retrieval,
2(1-2), pp. 1–135.
Papacharissi, Z. & Oliveira, M. (2012).The Rhythms of News
Storytelling on #Egypt. Journal of Communication. 62. pp.
266–282.
Rahangdale, A. & Agrawa, A. (2014). Information extraction using
discourse analysis from newswires. International Journal of
Information Technology Convergence and Services. 4(3), pp.
21-30.
Rubin, V., Conroy, N. & Chen, Y. (2015)A. Towards News
Verification: Deception Detection Methods for News Discourse.
Hawaii International Conference on System Sciences.
Rubin, V. L. Chen, Y.,& Conroy, N. J. (2015)B. Deception
Detection for News: Three Types of Fakes. In The Proceedings
of the Association for Information Science and Technology
Annual Meeting (ASIST2015), Nov. 6-10, St. Louis.
Rubin, V. & Lukoianova, T. (2014). Truth and deception at the
rhetorical structure level. Journal of the American Society for
Information Science and Technology, 66(5).DOI: 10.1002/asi.
23216 ·
Strehl, A. Ghosh, J. & Mooney, R. (2000). Impact of Similarity
Measures on Web-page Clustering. AAAI Technical Report
WS-00-01.
Wu, G., Greene, D. Smyth, B. & Cunningham P. (2010).
Distortion as a Validation Criterion in the Identification of
Suspicious Reviews. 1st Workshop on Social Media Analytics.
Zhang, H., Fan, Z., Zeng, J. & Liu, Q. (2012). An Improving
De cep ti on De tec ti on Me thod in Co mpute r- Media ted
Communication. Journal of Networks, 7 (11),
... Furthermore, the spread of inaccurate information has the potential to weaken democratic procedures by establishing an impact on public opinion and electoral results. Additionally, it can worsen inequality in society and contribute to an atmosphere of fear and mistrust [2]. Moreover, the spreading of false information can result in adverse economic effects, including financial loss for enterprises and investors that depend on precise data for their decision-making processes. ...
... The rapid growth of communication technology and the widespread use of smart devices have led to a significant increase in data traffic, resulting in the creation of enormous volumes of data every second [2]. ...
Conference Paper
Full-text available
In an era where misleading information may quickly circulate on digital news channels, it is crucial to have efficient and trustworthy methods to detect and reduce the impact of misinformation. This research proposes an innovative framework that combines Natural Language Processing (NLP), Reinforcement Learning (RL), and Blockchain technologies to precisely detect and minimize the spread of false information in news articles on social media. The framework starts by gathering a variety of news items from different social media sites and performing preprocessing on the data to ensure its quality and uniformity. NLP methods are utilized to extract complete linguistic and semantic characteristics, effectively capturing the subtleties and contextual aspects of the language used. These features are utilized as input for a RL model. This model acquires the most effective tactics for detecting and mitigating the impact of false material by modeling the intricate dynamics of user engagements and incentives on social media platforms. The integration of blockchain technology establishes a decentralized and transparent method for storing and verifying the accuracy of information. The Blockchain component guarantees the unchangeability and safety of verified news records, while encouraging user engagement for detecting and fighting false information through an incentive system based on tokens. The suggested framework seeks to provide a thorough and resilient solution to the problems presented by misinformation in social media articles.
... In response to these challenges, machine learning (ML) and deep learning (DL) techniques have emerged as powerful tools for detecting deceptive content on social media platforms (Zhou & Zafarani, 2020). ML algorithms, such as Support Vector Machines (SVM), Random Forests, and Logistic Regression, have been employed to classify content based on features extracted from textual, visual, and network data (Conroy, Rubin, & Chen, 2015). These models are highly effective at identifying patterns in the data that distinguish between legitimate and deceptive content, but they often rely on handcrafted features, which can limit their adaptability to new types of misinformation. ...
Preprint
Full-text available
Social media platforms like Twitter, Facebook, and Instagram have facilitated the spread of misinformation, necessitating automated detection systems. This systematic review evaluates 36 studies that apply machine learning (ML) and deep learning (DL) models to detect fake news, spam, and fake accounts on social media. Using the Prediction model Risk Of Bias ASsessment Tool (PROBAST), the review identified key biases across the ML lifecycle: selection bias due to non-representative sampling, inadequate handling of class imbalance, insufficient linguistic preprocessing (e.g., negations), and inconsistent hyperparameter tuning. Although models such as Support Vector Machines (SVM), Random Forests, and Long Short-Term Memory (LSTM) networks showed strong potential, over-reliance on accuracy as an evaluation metric in imbalanced data settings was a common flaw. The review highlights the need for improved data preprocessing (e.g., resampling techniques), consistent hyperparameter tuning, and the use of appropriate metrics like precision, recall, F1 score, and AUROC. Addressing these limitations can lead to more reliable and generalizable ML/DL models for detecting deceptive content, ultimately contributing to the reduction of misinformation on social media.
... Consequently, automated detection of fake news has become an issue of great concern in recent years [2,4,5]. Early fake news detection works focused on text-only [6] or image-only [7] content analysis, without fully considering the potential correlation between the two. For example, in some fake news cases, real images may be accompanied by false textual content, or correct textual content is used to describe doctored images. ...
Article
Full-text available
Fake news detection is growing in importance as a key topic in the information age. However, most current methods rely on pre-trained small language models (SLMs), which face significant limitations in processing news content that requires specialized knowledge, thereby constraining the efficiency of fake news detection. To address these limitations, we propose the FND-LLM Framework, which effectively combines SLMs and LLMs to enhance their complementary strengths and explore the capabilities of LLMs in multimodal fake news detection. The FND-LLM framework integrates the textual feature branch, the visual semantic branch, the visual tampering branch, the co-attention network, the cross-modal feature branch and the large language model branch. The textual feature branch and visual semantic branch are responsible for extracting the textual and visual information of the news content, respectively, while the co-attention network is used to refine the interrelationship between the textual and visual information. The visual tampering branch is responsible for extracting news image tampering features. The cross-modal feature branch enhances inter-modal complementarity through the CLIP model, while the large language model branch utilizes the inference capability of LLMs to provide auxiliary explanation for the detection process. Our experimental results indicate that the FND-LLM framework outperforms existing models, achieving improvements of 0.7%, 6.8% and 1.3% improvements in overall accuracy on Weibo, Gossipcop, and Politifact, respectively.
Article
Full-text available
With the ever-increasing volume and variety of data generated, organisations have to ensure their truthfulness and reliability. This paper provides overview of current research on managing data veracity in a business environment where misinformation is growing. A literature analysis from 2002 to 2023 identified three major themes: methods for ensuring data validity, data processing and optimisation, and data veracity in sustainability performance. In addition, the study highlights the gaps in the current research and proposes future research directions to help develop a better understanding of the themes and organisational implications. The study concludes that data veracity is crucial for future organisational research. Nevertheless, further work is required to refine the definition of data veracity to incorporate ‘truthfulness’ better, understand human capabilities to support it, examine firms’ governance of truthfulness and measure data veracity for social impact. The implications of these findings for data management and the development of relevant theories are discussed.
Article
Fake news poses a significant threat to societies worldwide, including in Hausa-speaking regions, where misinformation is rapidly disseminated via social media. The lack of NLP resources tailored to this language exacerbated the problem of fake news in the Hausa language. While extensive research has been conducted on counterfeit news detection in languages such as English, little attention has been paid to languages like Hausa, leaving a significant portion of the global population vulnerable to misinformation. Traditional machine-learning approaches often fail to perform well in low-resource settings due to insufficient training data and linguistic resources. This study aims to develop a robust model for detecting fake news in the Hausa language by leveraging transfer learning techniques with adaptive fine-tuning. A dataset of over 6,600 news articles, including both fake and truthful articles, was collected from various sources between January 2022 and December 2023. Cross-lingual transfer Learning (XLT) was employed to adapt pre- trained models for the low-resource Hausa language. The model was fine-tuned and evaluated using performance metrics such as accuracy, precision, recall, F-score, AUC-ROC, and PR curves. Results demonstrated a high accuracy rate in identifying fake news, with significant improvements in detecting misinformation within political and world news categories. This study addresses the gap in Hausa- language natural language processing (NLP) and contributes to the fight against misinformation in Nigeria. The findings are relevant for developing AI- driven tools to curb fake news dissemination in African languages.
Article
Full-text available
: Fake news has spread more widely over the past few years. The development of social media andinternet websites has fueled the spread of fake news, causing it to mix with accurate information. The majority ofstudies on Fake News Detection FND were in English, but recent attention has been focused on Arabic. However,there aren't many studies on Arabic fake news detection. In this work, a new Arabic fake news detection approachhas been proposed using Arabic dataset publically available and a translated English fake news dataset into Arabic.A new model Text-CNNs based on 1D Convolution Neural Networks CNNs has been used for classification realand fake news. Extensive experimental results on the Arabic fake news dataset show that our proposed approachprovided high detection accuracy about (99.67%), Precision (99.45), Recall (99.65) and F1-score (99.50).
Conference Paper
Full-text available
Widespread adoption of internet technologies has changed the way that news is created and consumed. The current online news environment is one that incentivizes speed and spectacle in reporting, at the cost of fact-checking and verification. The line between user generated content and traditional news has also become increasingly blurred. This poster reviews some of the professional and cultural issues surrounding online news and argues for a two-pronged approach inspired by Hemingway's " automatic crap detector " (Manning, 1965) in order to address these problems: a) proactive public engagement by educators, librarians, and information specialists to promote digital literacy practices; b) the development of automated tools and technologies to assist journalists in vetting, verifying, and fact-checking, and to assist news readers by filtering and flagging dubious information.
Conference Paper
Full-text available
A fake news detection system aims to assist users in detecting and filtering out varieties of potentially deceptive news. The prediction of the chances that a particular news item is intentionally deceptive is based on the analysis of previously seen truthful and deceptive news. A scarcity of deceptive news, available as corpora for predictive modeling, is a major stumbling block in this field of natural language processing (NLP) and deception detection. This paper discusses three types of fake news, each in contrast to genuine serious reporting, and weighs their pros and cons as a corpus for text analytics and predictive modeling. Filtering, vetting, and verifying online information continues to be essential in library and information science (LIS), as the lines between traditional news and online information are blurring.
Conference Paper
Full-text available
We investigate the characteristics of factual and emotional argumentation styles observed in online debates. Using an annotated set of FACTUAL and FEELING debate forum posts, we extract patterns that are highly correlated with factual and emotional arguments, and then apply a bootstrapping methodology to find new patterns in a larger pool of unanno-tated forum posts. This process automatically produces a large set of patterns representing linguistic expressions that are highly correlated with factual and emotional language. Finally , we analyze the most discriminating patterns to better understand the defining characteristics of factual and emotional arguments.
Article
Full-text available
This paper proposes Natural language based Discourse Analysis method used for extracting information from the news article of different domain. The Discourse analysis used the Rhetorical Structure theory which is used to find coherent group of text which are most prominent for extracting information from text. RST theory used the Nucleus-Satellite concept for finding most prominent text from the text document. After Discourse analysis the text analysis has been done for extracting domain related object and relates this object. For extracting the information knowledge based system has been used which consist of domain dictionary .The domain dictionary has a bag of words for domain. The system is evaluated according gold-of-art analysis and human decision for extracted information.
Conference Paper
Full-text available
News verification is a process of determining whether a particular news report is truthful or deceptive. Deliberately deceptive (fabricated) news creates false conclusions in the readers' minds. Truthful (authentic) news matches the writer's knowledge. How do you tell the difference between the two in an automated way? To investigate this question, we analyzed rhetorical structures, discourse constituent parts and their coherence relations in deceptive and truthful news sample from NPR's "Bluff the Listener". Subsequently, we applied a vector space model to cluster the news by discourse feature similarity, achieving 63% accuracy. Our predictive model is not significantly better than chance (56% accuracy), though comparable to average human lie detection abilities (54%). Methodological limitations and future improvements are discussed. The long-term goal is to uncover systematic language differences and inform the core methodology of the news verification system.
Article
Full-text available
When scientists report false data, does their writing style reflect their deception? In this study, we investigated the linguistic patterns of fraudulent (N = 24; 170,008 words) and genuine publications (N = 25; 189,705 words) first-authored by social psychologist Diederik Stapel. The analysis revealed that Stapel's fraudulent papers contained linguistic changes in science-related discourse dimensions, including more terms pertaining to methods, investigation, and certainty than his genuine papers. His writing style also matched patterns in other deceptive language, including fewer adjectives in fraudulent publications relative to genuine publications. Using differences in language dimensions we were able to classify Stapel's publications with above chance accuracy. Beyond these discourse dimensions, Stapel included fewer co-authors when reporting fake data than genuine data, although other evidentiary claims (e.g., number of references and experiments) did not differ across the two article types. This research supports recent findings that language cues vary systematically with deception, and that deception can be revealed in fraudulent scientific discourse.
Article
Full-text available
There is a lack of clarity within the social media domain about the number of discrete participants. Influence and measurement within new media is skewed towards the biggest numbers, resulting in fake tweets, sock puppets, and a range of force multipliers such as botnets, application programming interfaces (APIs), and cyborgs. Social media metrics are sufficiently manipulated away from authentic discrete usage so that the trustworthiness of identity, narrative, and authority are constantly uncertain. Elections, social causes, political agendas and new modes of online governance can now be influenced by a range of virtual entities that can cajole and redirect opinions without affirming identity or allegiance. Using the 2013 Australian Federal Election as a case study, this study demonstrates the need to increase legitimacy and validity in micro-blogging forms of new media and the need for multi-factor authentication.
Article
The rising influence of user-generated online reviews (Cone, 2011) has led to growing incentive for businesses to solicit and manufacture DECEPTIVE OPINION SPAM-fictitious reviews that have been deliberately written to sound authentic and deceive the reader. Recently, Ott et al. (2011) have introduced an opinion spam dataset containing gold standard deceptive positive hotel reviews. However, the complementary problem of negative deceptive opinion spam, intended to slander competitive offerings, remains largely unstudied. Following an approach similar to Ott et al. (2011), in this work we create and study the first dataset of deceptive opinion spam with negative sentiment reviews. Based on this dataset, we find that standard n-gram text categorization techniques can detect negative deceptive opinion spam with performance far surpassing that of human judges. Finally, in conjunction with the aforementioned positive review dataset, we consider the possible interactions between sentiment and deception, and present initial results that encourage further exploration of this relationship.
Article
Online deception is disrupting our daily life, organizational process, and even national security. Existing deception detection approaches followed a traditional paradigm by using a set of cues as antecedents, and used a variety of data sets and common classification models to detect deception, which were demonstrated to be an accurate technique, but the previous results also showed the necessity to expand the deception feature set in order to improve the accuracy. In our study, we propose a novel feature selection method of the combination of CHI statistics and hypothesis testing, and achieve the accuracy level of 86% and F-measure of 0.84 by using the novel feature sets and SVM classification models, which exceeds the previous experiment results.
Article
This paper furthers the development of methods to distinguish truth from deception in textual data. We use rhetorical structure theory (RST) as the analytic framework to identify systematic differences between deceptive and truthful stories in terms of their coherence and structure. A sample of 36 elicited personal stories, self-ranked as truthful or deceptive, is manually analyzed by assigning RST discourse relations among each story's constituent parts. A vector space model (VSM) assesses each story's position in multidimensional RST space with respect to its distance from truthful and deceptive centers as measures of the story's level of deception and truthfulness. Ten human judges evaluate independently whether each story is deceptive and assign their confidence levels (360 evaluations total), producing measures of the expected human ability to recognize deception. As a robustness check, a test sample of 18 truthful stories (with 180 additional evaluations) is used to determine the reliability of our RST-VSM method in determining deception. The contribution is in demonstration of the discourse structure analysis as a significant method for automated deception detection and an effective complement to lexicosemantic analysis. The potential is in developing novel discourse-based tools to alert information users to potential deception in computer-mediated texts.