Automatic Deception Detection: Methods for Finding Fake News
Niall J. Conroy, Victoria L. Rubin, and Yimin Chen
Language and Information Technology Research Lab (LIT.RL)
Faculty of Information and Media Studies
University of Western Ontario, London, Ontario, CANADA
nconroy1@uwo.ca, vrubin@uwo.ca, ychen582@uwo.ca
ABSTRACT
This research surveys the current state-of-the-art
technologies that are instrumental in the adoption and
development of fake news detection. “Fake news detection”
is defined as the task of categorizing news along a
continuum of veracity, with an associated measure of
certainty. Veracity is compromised by the occurrence of
intentional deceptions. The nature of online news publication has changed such that traditional fact checking and vetting for potential deception is impossible against the flood of content arising from myriad generators, formats, and genres.
The paper provides a typology of several varieties of veracity assessment methods emerging from two major categories: linguistic cue approaches (with machine learning) and network analysis approaches. We see promise in an innovative hybrid approach that combines linguistic cues and machine learning with network-based behavioral data. Although designing a fake news detector is not a straightforward problem, we propose operational guidelines for a feasible fake news detecting system.
Keywords
Deception detection, fake news detection, veracity
assessment, news verification, methods, automation, SVM,
knowledge networks, predictive modelling, fraud
INTRODUCTION
News verification aims to employ technology to identify
intentionally deceptive news content online, and is an
important issue within certain streams of library and
information science (LIS). Fake news detection is defined
as the prediction of the chances of a particular news article
(news report, editorial, exposé, etc.) being intentionally
deceptive (Rubin, Conroy & Chen, 2015). Tools aim to
mimic certain filtering tasks which have, to this point, been
the purview of journalists and other publishers of traditional
news content. The proliferation of user-generated content and computer-mediated communication (CMC) technologies, such as blogs, Twitter, and other social media, has created news delivery mechanisms on a mass scale, yet much of the information they carry is of questionable veracity (Ciampaglia, Shiralkar, Rocha, Bollen, Menczer & Flammini, 2015). Establishing the reliability of information
online is a daunting but critical challenge. Four decades of deception detection research have helped us learn how well humans are able to detect lies in text. The findings show we are not very good at it: just 4% better than chance, based on a meta-analysis of more than 200 experiments (Bond & DePaulo, 2006). This problem has led researchers
and technical developers to look at several automated ways
of assessing the truth value of potentially deceptive text
based on the properties of the content and the patterns of
computer-mediated communication.
Structured datasets are easier to verify than non-structured
(or semi-structured) data such as texts. When we know the
language domain (e.g., insurance claims or health-related
news) we can make better guesses about the nature and use
of deception. Semi-structured non-domain specific web
data come in many formats and demand flexible methods
for veracity verification. For some time, however, the
development and evaluation of different methods have
remained in isolated corners, relatively unknown in LIS.
More recently, efforts of methodological cross-pollination
and hybrid approaches have produced promising results
(Rubin et al., 2015a). The range of journalistic practices and available news sources (see Rubin et al. (2015b) for an
overview) demand consideration of multiple methods since
one approach often addresses known weaknesses in another.
How then is it possible to gauge the veracity of online
news?
This paper provides researchers with a map of the current
landscape of veracity (or deception) assessment methods,
their major classes and goals, all with the aim of proposing
a hybrid approach to system design. These methods have
emerged from separate development streams, utilizing
disparate techniques. In this survey, two major categories of
methods emerge: 1. Linguistic Approaches in which the
content of deceptive messages is extracted and analyzed to
associate language patterns with deception; and 2. Network
Approaches, in which network information, such as message metadata or structured knowledge network queries, can be harnessed to provide aggregate deception measures. Both
forms typically incorporate machine learning techniques for
training classifiers to suit the analysis. It is incumbent upon
researchers to understand these different areas, yet no
known typology of methods exists in the current literature.
The goal is to provide a survey of the existing research while proposing a hybrid approach, which utilizes the most effective deception detection methods for the implementation of a fake news detection tool.
LINGUISTIC APPROACHES
Most liars use their language strategically to avoid being
caught. In spite of the attempt to control what they are
saying, language “leakage” occurs with certain verbal
aspects that are hard to monitor, such as frequencies and patterns of pronoun, conjunction, and negative emotion
word usage (Feng & Hirst, 2013). The goal in the linguistic
approach is to look for such instances of leakage, or so-called "predictive deception cues", found in the content of a message.
Data Representation
Perhaps the simplest method of representing texts is the
“bag of words” approach, which regards each word as a
single, equally significant unit. In the bag of words approach, individual word or "n-gram" (multiword) frequencies are aggregated and analyzed to reveal cues of deception. Further tagging of words into respective lexical cues, for example parts of speech or "shallow syntax" (Hancock & Markowitz, 2014), affective dimensions (Vrij, 2006), or location-based words (Hancock et al., 2013), provides frequency sets that can reveal linguistic cues of deception.
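To make the representation concrete, the following is a minimal Python sketch of a bag-of-words/n-gram frequency extractor using scikit-learn; the two example sentences are invented placeholders rather than data from any cited study.

```python
# A minimal sketch of the "bag of words" representation described above.
# The toy texts are invented placeholders, not data from a cited study.
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "I never saw the report before it was published.",
    "We believe the figures speak for themselves.",
]

# Unigrams and bigrams ("n-grams"), each treated as an equally weighted unit.
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(texts)

# Aggregated frequencies per n-gram: the raw material for cue analysis.
for term, idx in sorted(vectorizer.vocabulary_.items()):
    print(term, counts[:, idx].sum())
```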
The simplicity of this representation also leads to its biggest
shortcoming. In addition to relying exclusively on
language, the method relies on isolated n-grams, often
divorced from useful context information. In this method,
any resolution of ambiguous word sense remains non-
existent (Larcker & Zakolyukina 2012). Many deception
detection researchers have found this method useful in
tandem with different, complementary analyses (Zhang, Fan, Zeng & Liu, 2012; Lary, Nikitov & Stone, 2010; Ott, Cardie, & Hancock, 2013), several of which are discussed in the remainder of this paper.
Deep Syntax
Analysis of word use is often not enough in predicting
deception. Deeper language structures (syntax) have been
analyzed to predict instances of deception. Deep syntax analysis is implemented through Probabilistic Context Free Grammars (PCFG). Sentences are transformed into a set of
rewrite rules (a parse tree) to describe syntax structure, for
example noun and verb phrases, which are in turn rewritten
by their syntactic constituent parts (Feng, Banerjee & Choi,
2012). The final set of rewrites produces a parse tree with a
certain probability assigned. This method is used to
distinguish rule categories (lexicalized, unlexicalized,
parent nodes, etc.) for deception detection with 85-91%
accuracy (depending on the rule category used) (Feng et al.,
2012).
Third-party tools, such as the Stanford Parser (de Marneffe,
MacCartney, Manning, 2006; Rahangdale & Agrawa,
2014), AutoSlog-TS syntax analyzer (Oraby, Reed,
Compton, Riloff, Walker, & Whittaker, 2015) and others
assist in the automation. Alone, syntax analysis might not
be sufficiently capable of identifying deception, and studies
often combine this approach with other linguistic or
network analysis techniques (e.g., Feng et al., 2012; Feng &
Hirst, 2013).
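As an illustration of the rewrite-rule features described above, the following minimal Python sketch extracts productions from a single parse tree with NLTK. The bracketed parse is hand-written for this example; in practice it would come from a parser such as the Stanford Parser mentioned above.

```python
# A minimal sketch of extracting rewrite rules from a parse tree, in the
# spirit of the PCFG-based features of Feng et al. (2012). The parse below
# is hand-written for illustration, not produced by a real parser run.
from nltk.tree import Tree

parse = Tree.fromstring(
    "(S (NP (PRP I)) (VP (VBD saw) (NP (DT the) (NN report))) (. .))"
)

# Each production is one rewrite rule; lexicalized rules include the word,
# unlexicalized rules stop at the part-of-speech tags.
for rule in parse.productions():
    print(rule)
```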
Semantic Analysis
As an alternative to deception cues, signals of truthfulness have also been analyzed, by characterizing the degree of compatibility between a personal experience (e.g., a hotel review) and a content "profile" derived from a collection of analogous data. This approach extends the n-gram plus syntax model by incorporating profile compatibility features, showing that the addition significantly improves classification performance (Feng & Hirst, 2013).
The intuition is that a deceptive writer with no experience
with an event or object (e.g., never visited the hotel in
question) may include contradictions or omission of facts
present in profiles on similar topics. For product reviews, a
writer of a truthful review is more likely to make similar
comments about aspects of the product as other truthful
reviewers. Content extracted from key words consists of attribute:descriptor pairs. By aligning profiles and the description of the writer's personal experience, veracity assessment is a function of two compatibility scores: 1. Compatibility with the existence of some distinct aspect (e.g., an art museum near the hotel); 2. Compatibility with the description of some general aspect, such as location or service. Prediction of falsehood is shown to be approximately 91% accurate with this method.
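The following minimal Python sketch illustrates the profile-compatibility intuition under simplifying assumptions: the attribute:descriptor pairs and profile counts are invented placeholders, and pair extraction (itself a parsing task) is assumed to have happened already.

```python
# A hedged sketch of profile compatibility: compare the attribute:descriptor
# pairs of one review against a "profile" aggregated from analogous reviews.
# All data here are invented placeholders.
from collections import Counter

# Profile: descriptor frequencies per aspect, mined from truthful reviews.
profile = {
    "location": Counter({"downtown": 12, "quiet": 7}),
    "service": Counter({"friendly": 15, "slow": 2}),
    "museum": Counter({"nearby": 9}),
}

review_pairs = [("location", "downtown"), ("service", "rude"), ("pool", "heated")]

def compatibility(pairs, profile):
    # Fraction of the review's pairs also attested in the profile: a crude
    # stand-in for the distinct- and general-aspect scores in the text.
    if not pairs:
        return 0.0
    hits = sum(1 for aspect, desc in pairs
               if aspect in profile and desc in profile[aspect])
    return hits / len(pairs)

print(compatibility(review_pairs, profile))  # 1/3 here: low compatibility
```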
Although demonstrated useful in the above context of reviews, this method has so far been restricted to its domain of application. There are two potential limitations: the ability to determine alignment between attributes and descriptors depends on a sufficient amount of mined content for profiles, and correctly associating descriptors with extracted attributes remains a challenge.

[Figure 1: Fact-checking statements. (a) Structured information about President Obama contained in the "infoboxes" of Wikipedia articles. (b) Shortest knowledge graph path returned for the false statement "Barack Obama is a Muslim". The path traverses high-degree nodes representing generic entities, such as Canada, and is assigned a low truth value. (Ciampaglia et al., 2015)]
Rhetorical Structure and Discourse Analysis
At the discourse level, deception cues present themselves
both in CMC communication and in news content. A
description of discourse can be achieved through the Rhetorical Structure Theory (RST) analytic framework, which identifies instances of rhetorical relations between linguistic elements. Systematic differences between deceptive and truthful messages in terms of their coherence and structure have been combined with a Vector Space Model (VSM) that
assesses each message’s position in multi-dimensional RST
space with respect to its distance to truth and deceptive
centers (Rubin & Lukoianova, 2014). At this level of
linguistic analysis, the prominent use of certain rhetorical
relations can be indicative of deception. Tools to automate
rhetorical classification are becoming available, although
not yet employed in the context of veracity assessment.
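A minimal sketch of the RST-plus-VSM idea follows, assuming rhetorical relations have already been identified: each message becomes a vector of relation frequencies whose distances to the truthful and deceptive centers decide the verdict. The relation set and center values are invented placeholders, not figures from Rubin & Lukoianova (2014).

```python
# A minimal sketch, with invented placeholder values, of the RST + vector
# space model: messages are vectors of rhetorical-relation frequencies,
# classified by distance to the truthful vs. deceptive cluster centers.
import numpy as np

# Dimensions: frequencies of four example rhetorical relations.
relations = ["elaboration", "evidence", "contrast", "condition"]

truthful_center = np.array([0.40, 0.30, 0.20, 0.10])   # placeholder centers
deceptive_center = np.array([0.55, 0.10, 0.05, 0.30])

message = np.array([0.50, 0.15, 0.10, 0.25])  # one message's relation profile

def verdict(vec):
    # Nearer center wins; the distances themselves can serve as graded
    # deception / truthfulness scores.
    d_true = np.linalg.norm(vec - truthful_center)
    d_fake = np.linalg.norm(vec - deceptive_center)
    return "deceptive" if d_fake < d_true else "truthful"

print(verdict(message))
```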
Classifiers
Sets of word and category frequencies are useful for
subsequent automated numerical analysis. One common use
is for the training of “classifiers” as in Support Vector
Machines (SVM) (Zhang et al., 2012) and Naïve Bayesian
models (Oraby et al., 2015). Simply put, when a
mathematical model is sufficiently trained from pre-coded
examples in one of two categories, it can predict instances
of future deception on the basis of numeric clustering and
distances. The choice of clustering method and distance function between data points shapes the accuracy of SVM classification (Strehl, Ghosh & Mooney, 2000), which invites new experimentation on the net effect of these variables.
Naïve Bayes algorithms make classifications based on
accumulated evidence of the correlation between a given
variable (e.g., syntax) and the other variables present in the
model (Mihalcea & Strapparava, 2009).
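A minimal Python sketch of training the two classifier families named above on pre-coded examples, using scikit-learn; the four labelled reviews are invented placeholders, whereas real studies train on thousands of coded instances.

```python
# A minimal sketch, on invented placeholder data, of training the
# classifiers discussed above: a Support Vector Machine and a multinomial
# Naive Bayes model over simple n-gram counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = [  # pre-coded training examples (invented placeholders)
    "the room was clean and the staff were helpful",
    "this was absolutely the most amazing perfect hotel ever",
    "breakfast was average but the location was convenient",
    "never staying anywhere else, unbelievable perfect luxury",
]
labels = [0, 1, 0, 1]  # 0 = truthful, 1 = deceptive

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

svm = LinearSVC().fit(X, labels)
nb = MultinomialNB().fit(X, labels)

# Predict on an unseen message using the same vocabulary.
new = vectorizer.transform(["the staff were rude but the room was clean"])
print(svm.predict(new), nb.predict(new))
```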
The classification of sentiment (Pang & Lee, 2008; Ott et
al., 2013) is based on the underlying intuition that deceivers
use unintended emotional communication, judgment or
evaluation of affective state (Hancock, Woodworth, &
Porter, 2011). Likewise, syntactic patterns may be used in
distinguishing feeling from fact-based arguments by
associating learned patterns of argumentation style classes.
In studies of business communication, classification performance exceeds a random guess by 16%, and the language of deceptive executives exhibits fewer non-extreme positive emotion words (Larcker & Zakolyukina, 2012).
Comparison between human judgement and SVM
classifiers showed 86% performance accuracy on negative
deceptive opinion spam (Ott et al., 2013). Fake negative
reviewers over-produced negative emotion terms relative to
the truthful reviews. These were deemed not the result of
“leakage cues” from the emotional distress of lying, but
exaggerations of the sentiment deceivers are trying to
convey.
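One such sentiment cue can be computed very simply. The sketch below counts negative emotion terms relative to text length; the tiny word list is an invented stand-in for a full affect lexicon, not the feature set of Ott et al. (2013).

```python
# A small sketch of the sentiment cue discussed above: the rate of negative
# emotion terms per token, the kind of feature found to be over-produced in
# fake negative reviews. The word list is a tiny invented stand-in for a
# real affect lexicon.
NEGATIVE_TERMS = {"terrible", "horrible", "disgusting", "worst", "awful"}

def negative_emotion_rate(text):
    tokens = text.lower().split()
    return sum(t.strip(".,!") in NEGATIVE_TERMS for t in tokens) / max(len(tokens), 1)

print(negative_emotion_rate("The worst, most horrible hotel. Terrible service!"))
```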
These linguistic approaches all rely on language usage and
its analysis, and are promising when used in hybrid
approaches. However, findings emerging from topic-
specific studies (product reviews, business) may have
limited generalizability towards real-time veracity detection
of news.
NETWORK APPROACHES
Innovative and varied network approaches use network properties and behavior to complement content-based approaches that rely on deceptive language and leakage cues to predict deception. As real-time content on current events is increasingly proliferated through micro-blogging applications such as Twitter, deception analysis tools are all the more important.
Linked Data
The use of knowledge networks may represent a significant
step towards scalable computational fact-checking methods.
For certain data, false “factual statements” can represent a
form of deception since they can be extracted and examined
alongside findable statements about the known world. This
approach leverages an existing body of collective human
knowledge to assess the truth of new statements. The
method depends on querying existing knowledge networks,
or publicly available structured data, such as the DBpedia ontology or the Google Relation Extraction Corpus (GREC).
Such inherently structured data form a network of entities connected through predicate relationships. Fact checking can then be effectively reduced to a network analysis problem: the computation of the shortest path (see
Figure 1). Queries based on extracted fact statements are
assigned semantic proximity as a function of the transitive
relationship between subject and predicate via other nodes.
The closer the nodes, the higher the likelihood that a
particular subject-predicate-object statement is true.
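A hedged sketch of this computation follows, loosely after Ciampaglia et al. (2015) and Figure 1: paths that must traverse generic, high-degree entities are made "longer", so the resulting truth value is low. The toy graph and the degree-based weighting are simplifications, not the paper's exact semantic proximity measure.

```python
# A hedged sketch of a knowledge-network fact check: edges into generic,
# high-degree nodes cost more, so statements reachable only through them
# get low truth values. The tiny graph is an invented placeholder.
import math
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Barack Obama", "Hawaii"), ("Barack Obama", "United States"),
    ("United States", "Canada"), ("Canada", "Islam"),  # generic-hub path
    ("United States", "Christianity"), ("Barack Obama", "Christianity"),
])

def truth_value(subject, obj):
    # Penalize each step by log of the target node's degree, so paths
    # through generic entities score low; direct links score highest.
    weight = lambda u, v, d: math.log(1 + G.degree(v))
    cost = nx.shortest_path_length(G, subject, obj, weight=weight)
    return 1.0 / (1.0 + cost)

print(truth_value("Barack Obama", "Islam"))         # low: via generic hubs
print(truth_value("Barack Obama", "Christianity"))  # higher: direct edge
```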
There are several so-called ‘network effect’ variables that
are exploited to derive truth probabilities (Ciampaglia et al.,
2015), so the outlook for exploiting structured data
repositories for fact-checking remains promising. From the
short list of existing published work in this area, results
using sample facts from four different subject areas range
from 61% to 95%. Success was measured based on whether
the machine was able to assign higher true values to true
statements than to false ones (Ciampaglia et al., 2015). A
problem with this method, however, rests in the fact that
statements must reside in a pre-existing knowledge base.
Social Network Behavior
Authentication of identity on social media is paramount to
the notion of trust. The proliferation of news in the form of
current events through mass technologies like micro-blogs
invites ways of ascertaining the difference between fake
and genuine content. Outside of the analysis of content
comes the use of metadata and telltale behavior of
questionable sources (Chu, Gianvecchio, Wang & Jajodia,
2010). The recent use of Twitter in influencing political
perceptions (Cook et al., 2013) is one scenario where
certain data, namely the inclusion of hyperlinks or
associated metadata, can be compiled to establish veracity
assessments. Centering resonance analysis (CRA), a mode
of network-based text analysis, represents the content of
large sets of texts by identifying the most important words
that link other words in the network. This was employed by Papacharissi & Oliveira (2012) to identify content patterns in posts about Egypt's elections. Studies combining sentiment and behaviour have demonstrated the contention that sentiment-focused reviews from singleton contributors significantly affect online rankings (Wu, Greene, Smyth & Cunningham, 2010), and that this is an indicator of "shilling", the contribution of fake reviews to artificially distort a ranking.
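The singleton-contributor signal lends itself to a simple check. The following sketch measures how far ratings from accounts with exactly one review shift an item's average, in the spirit of Wu et al. (2010); the review tuples are invented placeholders.

```python
# A hedged sketch of the "shilling" signal described above: how much do
# ratings from singleton contributors (accounts with exactly one review)
# distort an item's average? The review tuples are invented placeholders.
from collections import Counter

reviews = [  # (user, item, rating)
    ("u1", "hotel_a", 2), ("u1", "hotel_b", 4), ("u2", "hotel_a", 3),
    ("s1", "hotel_a", 5), ("s2", "hotel_a", 5), ("s3", "hotel_a", 5),
]

counts = Counter(user for user, _, _ in reviews)

def distortion(item):
    # Mean rating with all reviews minus mean rating from repeat reviewers.
    all_r = [r for u, i, r in reviews if i == item]
    regular = [r for u, i, r in reviews if i == item and counts[u] > 1]
    return sum(all_r) / len(all_r) - sum(regular) / len(regular)

print(distortion("hotel_a"))  # positive: singletons inflate the ranking
```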
CONCLUSION
Linguistic and network-based approaches have shown high
accuracy results in classification tasks within limited
domains. This discussion drafts a basic typology of
methods available for further refinement and evaluation,
and provides a basis for the design of a comprehensive fake
news detection tool. Techniques arising from disparate approaches may be utilized together in a hybrid system, whose features are summarized below:

• Linguistic processing should be built on multiple layers, from word/lexical analysis to the highest discourse-level analysis, for maximum performance.

• As a viable alternative to strictly content-based approaches, network behavior should be combined with them to incorporate a 'trust' dimension by identifying credible sources.

• Tools should be designed to augment human judgement, not replace it. Relations between machine output and methods should be transparent.

• Contributions in the form of publicly available gold standard datasets should be in linked data format to assist in up-to-date fact checking.
ACKNOWLEDGMENTS
This research has been funded by the Government of
Canada Social Sciences and Humanities Research Council
(SSHRC) Insight Grant (#435-2015-0065) awarded to Dr.
Rubin for the project entitled Digital Deception Detection:
Identifying Deliberate Misinformation in Online News.
REFERENCES
Ciampaglia, G., Shiralkar, P., Rocha, L., Bollen, J., Menczer, F., & Flammini, A. (2015). Computational fact checking from knowledge networks. PLoS ONE, 10(6), e0128193.
Chen, Y., Conroy, N. J., & Rubin, V. L. (2015). News in an Online
World: The Need for an “Automatic Crap Detector”. In The
Proceedings of the Association for Information Science and
Technology Annual Meeting (ASIST2015), Nov. 6-10, St. Louis.
Chu, Z., Gianvecchio, S., Wang, H., & Jajodia, S. (2010). Who is tweeting on Twitter: Human, Bot, or Cyborg? In Proceedings of the 26th Annual Computer Security Applications Conference (ACSAC '10), pp. 21-30.
Cook, D., Waugh, B., Abdipanab, M., Hashemi, O., & Rahman, S. (2013). Twitter Deception and Influence: Issues of Identity, Slacktivism and Puppetry.
de Marneffe, M., MacCartney, B. & Manning, C. (2006).
Generating typed dependency parses from phrase structure
parses. In Proceedings of the 5th International Conference on
Language Resources and Evaluation (LREC 2006).
Hancock, J., Woodworth, M. & Porter, S. (2011). Hungry like a
wolf: A word pattern analysis of the language of psychopaths.
Legal and Criminological Psychology. 113.
Hancock, J. & Markowitz, D. (2014). Linguistic Traces of a Scientific Fraud: The Case of Diederik Stapel. PLoS ONE, 9(8).
Feng, V. & Hirst, G. (2013). Detecting deceptive opinions with profile compatibility. In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013).
Feng, S., Banerjee, R. & Choi, Y. (2012). Syntactic Stylometry for Deception Detection. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 171–175.
Larcker, D. & Zakolyukina, A. (2012). Detecting Deceptive Discussions in Conference Calls. Journal of Accounting Research, 50(2), 495–540.
Mihalcea, R. & Strapparava, C. (2009). The Lie Detector: Explorations in the Automatic Recognition of Deceptive Language. In Proceedings of the ACL-IJCNLP Conference Short Papers, pp. 309–312.
Oraby, S., Reed, L., Compton, R., Riloff, E., Walker, M. & Whittaker, S. (2015). And That's A Fact: Distinguishing Factual and Emotional Argumentation in Online Dialogue.
Ott, M., Cardie, C. & Hancock, J. (2013). Negative Deceptive Opinion Spam. In Proceedings of NAACL-HLT 2013, pp. 497–501.
Pang, B. & Lee, L. (2008). Opinion mining and sentiment
analysis. Foundations and Trends in Information Retrieval,
2(1-2), pp. 1–135.
Papacharissi, Z. & Oliveira, M. (2012). The Rhythms of News Storytelling on #Egypt. Journal of Communication, 62, 266–282.
Rahangdale, A. & Agrawa, A. (2014). Information extraction using
discourse analysis from newswires. International Journal of
Information Technology Convergence and Services. 4(3), pp.
21-30.
Rubin, V., Conroy, N. & Chen, Y. (2015a). Towards News Verification: Deception Detection Methods for News Discourse. Hawaii International Conference on System Sciences.
Rubin, V. L., Chen, Y., & Conroy, N. J. (2015b). Deception Detection for News: Three Types of Fakes. In The Proceedings of the Association for Information Science and Technology Annual Meeting (ASIST2015), Nov. 6-10, St. Louis.
Rubin, V. & Lukoianova, T. (2014). Truth and deception at the rhetorical structure level. Journal of the American Society for Information Science and Technology, 66(5). DOI: 10.1002/asi.23216.
Strehl, A., Ghosh, J. & Mooney, R. (2000). Impact of Similarity Measures on Web-page Clustering. AAAI Technical Report WS-00-01.
Wu, G., Greene, D., Smyth, B. & Cunningham, P. (2010). Distortion as a Validation Criterion in the Identification of Suspicious Reviews. 1st Workshop on Social Media Analytics.
Zhang, H., Fan, Z., Zeng, J. & Liu, Q. (2012). An Improving Deception Detection Method in Computer-Mediated Communication. Journal of Networks, 7(11).
This paper furthers the development of methods to distinguish truth from deception in textual data. We use rhetorical structure theory (RST) as the analytic framework to identify systematic differences between deceptive and truthful stories in terms of their coherence and structure. A sample of 36 elicited personal stories, self-ranked as truthful or deceptive, is manually analyzed by assigning RST discourse relations among each story's constituent parts. A vector space model (VSM) assesses each story's position in multidimensional RST space with respect to its distance from truthful and deceptive centers as measures of the story's level of deception and truthfulness. Ten human judges evaluate independently whether each story is deceptive and assign their confidence levels (360 evaluations total), producing measures of the expected human ability to recognize deception. As a robustness check, a test sample of 18 truthful stories (with 180 additional evaluations) is used to determine the reliability of our RST-VSM method in determining deception. The contribution is in demonstration of the discourse structure analysis as a significant method for automated deception detection and an effective complement to lexicosemantic analysis. The potential is in developing novel discourse-based tools to alert information users to potential deception in computer-mediated texts.