Article

Abstract

This paper furthers the development of methods to distinguish truth from deception in textual data. We use rhetorical structure theory (RST) as the analytic framework to identify systematic differences between deceptive and truthful stories in terms of their coherence and structure. A sample of 36 elicited personal stories, self-ranked as truthful or deceptive, is manually analyzed by assigning RST discourse relations among each story's constituent parts. A vector space model (VSM) assesses each story's position in multidimensional RST space with respect to its distance from truthful and deceptive centers as measures of the story's level of deception and truthfulness. Ten human judges evaluate independently whether each story is deceptive and assign their confidence levels (360 evaluations total), producing measures of the expected human ability to recognize deception. As a robustness check, a test sample of 18 truthful stories (with 180 additional evaluations) is used to determine the reliability of our RST-VSM method in determining deception. The contribution is in demonstration of the discourse structure analysis as a significant method for automated deception detection and an effective complement to lexicosemantic analysis. The potential is in developing novel discourse-based tools to alert information users to potential deception in computer-mediated texts.
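The RST-VSM procedure described in the abstract can be sketched in a few lines: each story becomes a vector of normalized RST relation frequencies, and the signed difference between its distances to the truthful and deceptive centroids serves as a deception score. This is a minimal sketch under assumptions — the relation inventory, the counts, and the use of Euclidean distance are illustrative, not the paper's actual data or metric.

```python
from collections import Counter
from math import sqrt

# Hypothetical RST relation inventory; the paper's actual relation set is larger.
RELATIONS = ["Elaboration", "Contrast", "Evidence", "Background", "Condition"]

def story_vector(relation_counts):
    """Normalize a story's RST relation counts into a frequency vector."""
    total = sum(relation_counts.get(r, 0) for r in RELATIONS)
    if total == 0:
        return [0.0] * len(RELATIONS)
    return [relation_counts.get(r, 0) / total for r in RELATIONS]

def centroid(vectors):
    """Component-wise mean of story vectors: a cluster center."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(RELATIONS))]

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def deception_score(story, truthful_center, deceptive_center):
    """Positive when the story lies closer to the deceptive center."""
    return euclidean(story, truthful_center) - euclidean(story, deceptive_center)
```

In the paper's setup the centroids would be built from the 36 manually annotated stories, and a held-out story is scored by its position relative to the two centers.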


... For RQ2, based on Rhetorical Structure Theory (RST) (Mann & Thompson, 1988), there are multiple types of rhetorical relations (e.g., Contrast and Elaboration) between EDUs. These functional relations describe the hierarchical discourse structure of the text and may reveal its underlying authenticity (Rubin & Lukoianova, 2015). In this paper, we explore the EDU structures from two views. ...
... Previous fake news detection methods have mostly ignored structural features when representing news text. Text structure can reflect latent patterns of fake news that are not easy for news forgers to discover and counteract (Rubin & Lukoianova, 2015). Among structure-based methods, Zhou, Jain, Phoha & Zafarani (2020) captured the writing style of fake news at the lexicon, syntax, semantics, and discourse levels. ...
Preprint
Since fake news poses a serious threat to society and individuals, numerous studies have addressed it by considering text, propagation, and user profiles. Owing to data-collection problems, methods based on propagation and user profiles are less applicable in the early stages. A good alternative is to detect news from its text as soon as it is released, and many text-based methods have been proposed, usually taking words, sentences, or paragraphs as basic units. But a word is too fine-grained a unit to express coherent information well, while a sentence or paragraph is too coarse to show specific information. Which granularity is better, and how to use it to enhance text representation for fake news detection, are two key problems. In this paper, we introduce the Elementary Discourse Unit (EDU), whose granularity lies between word and sentence, and propose a multi-EDU-structure awareness model, EDU4FD, to improve text representation for fake news detection. For multi-EDU-structure awareness, we build sequence-based and graph-based EDU representations. The former is obtained by modeling the coherence between consecutive EDUs with a TextCNN, reflecting semantic coherence. For the latter, we first extract rhetorical relations to build an EDU dependency graph, which can show the global narrative logic and help deliver the main idea truthfully; a Relation Graph Attention Network (RGAT) then produces the graph-based EDU representation. Finally, the two EDU representations are combined into an enhanced text representation for fake news detection, using a gated recursive unit with a global attention mechanism. Experiments on four cross-source fake news datasets show that our model outperforms state-of-the-art text-based methods.
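The sequence view described above can be illustrated crudely: segment text into pseudo-EDUs and measure the lexical coherence of consecutive pairs. The regex segmenter and bag-of-words cosine below are stand-ins (assumptions) for a trained discourse segmenter and the TextCNN, not EDU4FD's actual components.

```python
import re
from collections import Counter
from math import sqrt

def split_edus(text):
    """Crude EDU segmentation on punctuation and a few connectives;
    a stand-in for a trained discourse segmenter."""
    parts = re.split(r"[,;.]\s*|\b(?:because|although|while)\b", text)
    return [p.strip() for p in parts if p and p.strip()]

def bow(edu):
    return Counter(edu.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def coherence_profile(text):
    """Similarity of each consecutive EDU pair -- the 'sequence view'."""
    edus = split_edus(text)
    bows = [bow(e) for e in edus]
    return [cosine(bows[i], bows[i + 1]) for i in range(len(bows) - 1)]
```

A deceptive, disjointed story would tend to show low values somewhere in this profile.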
... Thematic analysis is useful here for two reasons. First, previous studies show that the coherence between units of discourse (such as sentences) in a document is useful for determining its veracity [6,7]. Second, analysis of thematic deviation can identify general characteristics of fake news that persist across multiple news domains. ...
... The coherence of a story may be indicative of its veracity. For example, [7] demonstrated this by applying Rhetorical Structure Theory [15] to study the discourse of deceptive stories posted online. They found that a major distinguishing characteristic of deceptive stories is that they are disjunctive. ...
Preprint
Full-text available
The spread of fake news remains a serious global issue; understanding and curtailing it is paramount. One way of differentiating between deceptive and truthful stories is by analyzing their coherence. This study explores the use of topic models to analyze the coherence of cross-domain news shared online. Experimental results on seven cross-domain datasets demonstrate that fake news shows a greater thematic deviation between its opening sentences and its remainder.
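The thematic-deviation idea can be approximated without a full topic model: compare the word distribution of the opening k sentences with that of the remainder. This is a hedged sketch — the study itself uses topic models, not the raw bag-of-words counts shown here.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def thematic_deviation(sentences, k=2):
    """1 - cosine between the word distribution of the first k sentences
    and that of the rest; higher values suggest a disjoint opening."""
    head = Counter(w for s in sentences[:k] for w in s.lower().split())
    tail = Counter(w for s in sentences[k:] for w in s.lower().split())
    return 1.0 - cosine(head, tail)
```

Under the study's finding, fake articles should score higher on this measure than genuine ones.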
... Following this strategy, Pérez-Rosas et al. [6], in addition to systematizing the process of creating a corpus for training fake news detection systems, propose features such as n-grams encoded as TF-IDF vectors, features extracted with the Linguistic Inquiry and Word Count (LIWC) [7] text analysis tool, syntactic features built from grammar production rules (also encoded as TF-IDF values), and readability features such as the number of complex or long words and the number of paragraphs, with a linear SVM as classifier. Other kinds of features used to detect misleading content compare the coherence and discourse structure of misleading and true narratives [8]. Sentiment-analysis scoring of news content or social media posts [9] is another method that can give clues about misleading content and suspicious authors, such as bots, which are frequently linked to the proliferation of fake news. ...
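The n-gram TF-IDF encoding mentioned above can be sketched in a few lines. This uses raw term frequency with a plain log IDF; real systems typically use smoothed or normalized variants, and the toy documents are illustrative.

```python
from collections import Counter
from math import log

def ngrams(tokens, n=2):
    """Contiguous word n-grams (bi-grams by default)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vectors(docs, n=2):
    """Encode each document as TF-IDF weights over word n-grams,
    the kind of sparse feature typically fed to a linear SVM."""
    grams = [Counter(ngrams(d.lower().split(), n)) for d in docs]
    df = Counter()
    for g in grams:
        df.update(set(g))          # document frequency of each n-gram
    N = len(docs)
    return [{t: tf * log(N / df[t]) for t, tf in g.items()} for g in grams]
```

N-grams occurring in many documents receive lower weight than document-specific ones.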
Conference Paper
Full-text available
This article describes the different approaches used by the NLPIR-UNED team in the CLEF2022 Check-That! Lab to tackle Task 3-English. The goal of this task is to determine the veracity of the main claim made in a news article. It is a multi-class classification problem with four possible values: true, partially false, false, and other. For this task, we evaluated three different approaches. The first was based on a Longformer transformer model, which supports larger input sequences than other transformer models such as BERT. The second approach uses transformer models with an extended training set. The last approach uses an ensemble classifier composed of a transformer model fed with the word sequence of the article to be evaluated, and a feed-forward neural network fed with features related, among other things, to the number of named entities in the article, together with features extracted using the LIWC text analysis tool. With this last approach we made our main submission, reaching second position among the twenty-five participating teams.
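The fusion step of such an ensemble can be as simple as a weighted average of each model's class probabilities. The class labels and equal weights below are illustrative assumptions, not the team's actual configuration.

```python
def soft_vote(prob_dists, weights=None):
    """Combine per-model class-probability dicts by weighted averaging --
    a minimal stand-in for an ensemble's fusion step."""
    weights = weights or [1.0] * len(prob_dists)
    total = sum(weights)
    classes = prob_dists[0].keys()
    return {c: sum(w * p[c] for w, p in zip(weights, prob_dists)) / total
            for c in classes}
```

The predicted label is then the argmax of the fused distribution.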
... Son et al. (2018) conduct causal-explanation analysis in physical and mental health building on discourse parsing; Zakharov et al. (2020) develop a novel discourse annotation schema reflecting a hierarchy of discursive strategies for analyzing contentious and polarizing discussions. Discourse parsing is also leveraged to perform sentiment analysis (Bhatia et al., 2015), identify authorship (Ferracane et al., 2017), and detect deception (Rubin and Lukoianova, 2015). In this paper, we build a new, comparable discourse parsing tool as a contribution to social media content analysis at the discourse level. ...
... Other methods of detecting fake news involve statistical analysis of speech (Hancock et al. 2013), which identifies the speech of specific groups. Rhetorical structure theory (Rubin–Lukoianova 2015), or RST, is used to identify differences between valid and fake texts with the help of the vector space model (Venkat–Amogh 2018). ...
Article
Full-text available
Fake news, deceptive information, and conspiracy theories are part of our everyday life. It is really hard to distinguish between false and valid information. As contemporary people receive the majority of information from electronic publications, in many cases fake information can seriously harm people’s health or economic status. This article will analyze the question of how up-to-date information technology can help detect false information. Our proposition is that today we do not have a perfect solution to identify fake news. There are quite a few methods employed for the discrimination of fake and valid information, but none of them is perfect. In our opinion, the reason is not in the weaknesses of the algorithms, but in the underlying human and social aspects.
... Existing studies have investigated the textual information in a news article for fake news detection by analysing either linguistic styles or lexical features [19]. Some have also experimented with the sentences in the text using a BiLSTM [12] and a vector space model [22]. In addition to text, a news article also comprises visual information that can be leveraged for fake news detection. ...
Conference Paper
Full-text available
The paradigm shift in the consumption of news via online platforms has cultivated the growth of digital journalism. Contrary to traditional media, the lowering of entry barriers and the fact that anyone can take part in content creation have disabled the concept of centralized gatekeeping in digital journalism. This in turn has triggered the production of fake news. Current studies have made a significant effort towards multimodal fake news detection, with less emphasis on exploring the discordance between the different multimedia present in a news article. We hypothesize that fabrication of either modality will lead to dissonance between the modalities, resulting in misrepresented, misinterpreted, and misleading news. In this paper, we inspect the authenticity of news coming from online media outlets by exploiting the relationship (discordance) between textual and multiple visual cues. We develop an inter-modality discordance based fake news detection framework to achieve this goal. The modality-specific discriminative features are learned using the cross-entropy loss and a modified version of the contrastive loss that explores the inter-modality discordance. To the best of our knowledge, this is the first work that leverages information from the different components of a news article (i.e., headline, body, and multiple images) for multimodal fake news detection. We conduct extensive experiments on real-world datasets to show that our approach outperforms the state-of-the-art by an average F1-score of 6.3%.
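The core discordance signal can be illustrated as one minus the cosine similarity between a text embedding and an image embedding. The embeddings here are plain lists supplied by the caller; in the paper they come from learned modality-specific encoders and the score is shaped by a modified contrastive loss, so this is only a sketch of the idea.

```python
from math import sqrt

def discordance(text_emb, image_emb):
    """1 - cosine similarity between two modality embeddings;
    high values flag a text/image mismatch."""
    dot = sum(t * i for t, i in zip(text_emb, image_emb))
    nt = sqrt(sum(t * t for t in text_emb))
    ni = sqrt(sum(i * i for i in image_emb))
    return 1.0 - (dot / (nt * ni) if nt and ni else 0.0)
```

Fabricating one modality should push this score up for the affected article.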
... In the content-based approach, researchers model the content of the news, such as the headline or body text. Some content-based approaches utilize stylometry, psycholinguistic properties, and rhetorical relations of the text [30,31,33]. Besides linguistic features, visual features are also employed, usually in approaches classified as multi-modal [17,22,44]. ...
Preprint
Full-text available
Fake news, false or misleading information presented as news, has a great impact on many aspects of society, such as politics and healthcare. To handle this emerging problem, many fake news detection methods have been proposed, applying Natural Language Processing (NLP) techniques to the article text. Considering that even people cannot easily distinguish fake news by news content, these text-based solutions are insufficient. To further improve fake news detection, researchers suggested graph-based solutions, utilizing social context information such as user engagement or publisher information. However, existing graph-based methods still suffer from four major drawbacks: 1) expensive computational cost due to the large number of user nodes in the graph, 2) errors in sub-tasks such as textual encoding or stance detection, 3) loss of rich social context due to the homogeneous representation of news graphs, and 4) the absence of temporal information utilization. To overcome these issues, we propose a novel social context aware fake news detection method, Hetero-SCAN, based on a heterogeneous graph neural network. Hetero-SCAN learns the news representation from the heterogeneous graph of news in an end-to-end manner. We demonstrate that Hetero-SCAN yields significant improvement over state-of-the-art text-based and graph-based fake news detection methods in terms of performance and efficiency.
... In computational-linguistics approaches, the information is subjected to n-gram statistics [6]. Sentences are transformed into more advanced forms of information representation (such as decision trees), the probabilities of identifying anomalies are analyzed [3], a semantic test is performed [2], and the relations between linguistic elements are determined in this context, all of which contribute to detecting truth or deception [7]. In addition, SVM classifiers, Naïve Bayes classifiers [8], and neural networks [9] can be used. ...
Article
The article contains details on technologies for assessing the credibility of information on the Web. Special attention is paid to social networks and to the most important aspects of the distribution of incredible information on the Internet. The paper analyzes the basic features of several tools for verifying the credibility of the Web sources. Given that Web tools mostly check the content of sites, but not whether the Web address of the site is real, Web address verification technologies have been researched. Necessary suggestions were made in checking the site before you start reading the information on the Web.
... A variety of fake news identification techniques, which use machine learning technology, have been implemented and evaluated. Researchers have demonstrated the capability to identify fake news articles based on features such as the word choice used in the text [64], rhetorical structure [65], and the sentiment revealed by the text [66]. Sentiment analysis, sometimes referred to as opinion mining, attempts to characterize the writer's view of the article's subject. ...
Preprint
Full-text available
Expert systems have been used to enable computers to make recommendations and decisions. This paper presents the use of a machine learning trained expert system (MLES) for phishing site detection and fake news detection. Both topics share a similar goal: to design a rule-fact network that allows a computer to make explainable decisions like domain experts in each respective area. The phishing website detection study uses a MLES to detect potential phishing websites by analyzing site properties (like URL length and expiration time). The fake news detection study uses a MLES rule-fact network to gauge news story truthfulness based on factors such as emotion, the speaker's political affiliation status, and job. The two studies use different MLES network implementations, which are presented and compared herein. The fake news study utilized a more linear design while the phishing project utilized a more complex connection structure. Both networks' inputs are based on commonly available data sets.
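A rule-fact network of the kind described can be evaluated by forward chaining: rules fire when their antecedent facts hold, adding consequent facts until nothing changes. The rule and fact names below are invented for illustration, and the booleans are a simplification — the actual MLES works with machine-learned weights rather than crisp facts.

```python
def evaluate(rules, facts):
    """Forward-chain over a rule-fact network until a fixed point.
    Each rule maps a frozenset of antecedent facts to a consequent fact."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if antecedents <= facts and consequent not in facts:
                facts.add(consequent)
                changed = True
    return facts
```

Because every fired rule is inspectable, the decision path stays explainable, which is the point of the MLES design.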
... An NLP technique was used on data collected from 54 news stories [109]. ...
Article
Sometimes, unverified information is disseminated on social media sites as if it were true. Often it goes viral and affects people's beliefs and emotions. Rumors and fake news are the most popular forms of false and unconfirmed information. Such news must be identified quickly to prevent its negative impact on society. In the last decade, operational procedures for rumor and fake news detection have come into existence. This paper provides a holistic view of the different web waves, from web 1.0 to web 5.0, and their usages. A taxonomy then describes various malicious information contents at different stages. The paper discusses the features used for classification, publicly available datasets, the rumor detection methods proposed during the web 1.0 through web 5.0 periods, and a comprehensive analysis of the various methods and techniques. Numerous research gaps and future directions are illustrated to make online information more trustworthy for knowledge-sharing and decision-making purposes.
... Two observations, reported across different studies, primarily drove their work towards extracting hierarchical structures for analyzing news: (1) fake news is typically generated by combining separate, disjoint news parts, and (2) hierarchical structures for document representation produce better results in prediction tasks (where the entire text is treated as the main predictor variable). Their work was found to outperform baseline models such as N-grams, RST [15], BiGRNN-CNN (Bidirectional Gated Recurrent Neural Network–Convolutional Neural Network), and LSTM (Long Short-Term Memory) in accuracy. HDSF is based on the BiLSTM (Bidirectional Long Short-Term Memory) model [5,16]. ...
Preprint
Full-text available
The growing prevalence of counterfeit stories on the internet has fostered significant interest in fast and scalable fake news detection in the machine learning community. While several machine learning techniques for this purpose have emerged, we observe a need to evaluate the impact of noise on these techniques' performance, where noise constitutes news articles mistakenly labeled as fake (or real). This work takes a step in that direction: we examine the impact of noise on a state-of-the-art structural model based on a BiLSTM (Bidirectional Long Short-Term Memory), Hierarchical Discourse-level Structure for Fake News Detection by Karimi and Tang (Reference no. 9).
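Injecting label noise of the sort this study examines is straightforward: flip a fixed fraction of binary labels at random. This is a sketch of the general protocol; the study's exact noise model may differ.

```python
import random

def flip_labels(labels, noise_rate, seed=0):
    """Mislabel a fraction of binary (0/1) labels at random to simulate
    annotation noise."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    flipped = list(labels)
    idx = rng.sample(range(len(labels)), int(noise_rate * len(labels)))
    for i in idx:
        flipped[i] = 1 - flipped[i]
    return flipped
```

A robustness curve is then obtained by retraining or re-evaluating the detector at increasing noise rates.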
... Typical detection techniques use either text-based linguistic features (Potthast et al., 2018) or visual features (Gupta et al., 2013). Overall, fake news detection methods fall into two groups: knowledge-based models that fact-check news articles against external sources (WuYou et al., 2014), and style-based models, which leverage linguistic features capturing writing style (Rubin and Lukoianova, 2015). Many studies, such as (Wang, 2017), (Mitra and Gilbert, 2015), and (Shu et al., 2020a), incorporate publicly available datasets, providing a basis for detailed analysis of fake news and detection methods. ...
Preprint
The proliferation of fake news, i.e., news intentionally spread for misinformation, poses a threat to individuals and society. Despite various fact-checking websites such as PolitiFact, robust detection techniques are required to deal with the increase in fake news. Several deep learning models show promising results for fake news classification, however, their black-box nature makes it difficult to explain their classification decisions and quality-assure the models. We here address this problem by proposing a novel interpretable fake news detection framework based on the recently introduced Tsetlin Machine (TM). In brief, we utilize the conjunctive clauses of the TM to capture lexical and semantic properties of both true and fake news text. Further, we use the clause ensembles to calculate the credibility of fake news. For evaluation, we conduct experiments on two publicly available datasets, PolitiFact and GossipCop, and demonstrate that the TM framework significantly outperforms previously published baselines by at least $5\%$ in terms of accuracy, with the added benefit of an interpretable logic-based representation. Further, our approach provides higher F1-score than BERT and XLNet, however, we obtain slightly lower accuracy. We finally present a case study on our model's explainability, demonstrating how it decomposes into meaningful words and their negations.
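The TM's credibility calculation can be caricatured as clause voting: each conjunctive clause fires when all of its literals hold (a plain word must be present, a '~'-prefixed word must be absent), and positive minus negative firings yields a score. The clauses below are toy examples, not learned Tsetlin Machine clauses.

```python
def clause_score(text_words, positive_clauses, negative_clauses):
    """Sum clause 'votes' over a bag of words; positive score leans
    credible, negative leans fake (sign convention is illustrative)."""
    words = set(text_words)

    def fires(clause):
        # A clause fires only when every literal is satisfied.
        return all((w[1:] not in words) if w.startswith("~") else (w in words)
                   for w in clause)

    return (sum(fires(c) for c in positive_clauses)
            - sum(fires(c) for c in negative_clauses))
```

Because every vote traces back to concrete words and negations, the score decomposes into human-readable evidence, which is the interpretability claim of the abstract.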
... Some methods extract the writing style and use it to measure the credibility of news. [34] employs rhetorical structure theory to evaluate authenticity at the discourse level. [25], [26] capture the sentiment and readability of the news content to assess the extent of falsehood. ...
Preprint
Full-text available
The explosive growth of fake news along with destructive effects on politics, economy, and public safety has increased the demand for fake news detection. Fake news on social media does not exist independently in the form of an article. Many other entities, such as news creators, news subjects, and so on, exist on social media and have relationships with news articles. Different entities and relationships can be modeled as a heterogeneous information network (HIN). In this paper, we attempt to solve the fake news detection problem with the support of a news-oriented HIN. We propose a novel fake news detection framework, namely Adversarial Active Learning-based Heterogeneous Graph Neural Network (AA-HGNN) which employs a novel hierarchical attention mechanism to perform node representation learning in the HIN. AA-HGNN utilizes an active learning framework to enhance learning performance, especially when facing the paucity of labeled data. An adversarial selector will be trained to query high-value candidates for the active learning framework. When the adversarial active learning is completed, AA-HGNN detects fake news by classifying news article nodes. Experiments with two real-world fake news datasets show that our model can outperform text-based models and other graph-based models when using less labeled data benefiting from the adversarial active learning. As a model with generalizability, AA-HGNN also has the ability to be widely used in other node classification-related applications on heterogeneous graphs.
... (1) Content-based methods, which often rely on unique writing styles or language features in news content (e.g., lexical features, syntactic features, and topic features) [97,98]. For example, Castillo et al. [99] calculate a series of linguistic features to evaluate the credibility of tweets, including the average number of words, URL links, the number of positive words, etc. Potthast et al. [100] propose a meta-learning model to detect fake news, which utilizes differences in writing style between truthful and fake news. ...
Article
Full-text available
The widespread fake news in social networks is posing threats to social stability, economic development, political democracy, etc. Numerous studies have explored effective detection approaches for online fake news, while few works study the intrinsic propagation and cognition mechanisms of fake news. Since the development of cognitive science paves a promising way for the prevention of fake news, we present a new research area called Cognition Security (CogSec), which studies the potential impacts of fake news on human cognition — ranging from misperception, untrusted knowledge acquisition, and targeted opinion/attitude formation to biased decision making — and investigates effective ways of debunking fake news. CogSec is a multidisciplinary research field that leverages knowledge from social science, psychology, cognitive science, neuroscience, AI, and computer science. We first propose related definitions to characterize CogSec and review the literature history. We further investigate the key research challenges and techniques of CogSec, including human-content cognition mechanisms, social influence and opinion diffusion, fake news detection, and malicious bot detection. Finally, we summarize the open issues and future research directions, such as the cognition mechanism of fake news, influence maximization of fact-checking information, early detection of fake news, fast refutation of fake news, and so on.
... An RST discourse parser outputs a discourse tree for a document, which can potentially benefit document-level NLP tasks [132]. In addition to deriving features or rules from discourse trees [133][134][135], we are interested in incorporating discourse tree structure into representation learning. ...
Article
Full-text available
Neural network based deep learning methods aim to learn representations of data and have produced state-of-the-art results in many natural language processing (NLP) tasks. Discourse parsing is an important research topic in discourse analysis, aiming to infer the discourse structure and model the coherence of a given text. This survey covers text-level discourse parsing, shallow discourse parsing, and coherence assessment. We first introduce the basic concepts and traditional approaches, and then focus on recent advances in discourse structure oriented representation learning. We also introduce the trend of discourse structure aware representation learning, namely exploiting discourse structures or discourse objectives to learn representations of sentences and documents, either for specific applications or for general purposes. Finally, we present a brief summary of the progress and discuss several future directions.
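A discourse tree of the kind these parsers output can be represented minimally as nested nucleus-satellite nodes, with features derived by walking the relation labels. The node layout and relation names here are illustrative, not any particular parser's output format.

```python
from dataclasses import dataclass

@dataclass
class RSTNode:
    """A node in an RST discourse tree: leaves hold EDU text, internal
    nodes hold a relation between a nucleus and a satellite."""
    text: str = ""
    relation: str = ""
    nucleus: "RSTNode" = None
    satellite: "RSTNode" = None

def relations(node):
    """Collect relation labels top-down -- a simple tree-derived feature."""
    if node.relation == "":
        return []  # leaf (an EDU) carries no relation
    return [node.relation] + relations(node.nucleus) + relations(node.satellite)
```

Representation-learning approaches would instead encode the tree directly, e.g. by recursively composing node embeddings.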
... information from the content [19]. For example, [20] proposed the LIWC text analysis tool, based on linguistic inquiry and word count, and in the same year sentence-relevance research based on rhetorical structure theory (RST) was proposed [21]. In 2018, it was shown that convolutional neural networks (CNN) [22,23] can also automatically learn content features to detect rumors. ...
Preprint
Disinformation has long been regarded as a severe social problem, of which fake news is one of the most representative issues. Worse, today's highly developed social media spreads fake news at incredible speed, bringing substantial harm to many aspects of human life. Yet the popularity of social media also provides opportunities to better detect fake news. Unlike conventional approaches that focus merely on either content or user comments, effective collaboration of heterogeneous social media information — including content and context factors of news, users' comments, and the engagement of social media with users — will hopefully give rise to better detection of fake news. Motivated by these observations, a novel detection framework, the graph comment-user advanced learning framework (GCAL), is proposed in this paper. User-comment information is crucial but not well studied in fake news detection. Thus, we model user-comment context through network representation learning based on a heterogeneous graph neural network. We conduct experiments on two real-world datasets, which demonstrate that the proposed joint model outperforms 8 state-of-the-art baseline methods for fake news detection (by at least 4% in accuracy, 7% in recall, and 5% in F1). Moreover, the proposed method is also explainable.
... In computational-linguistics approaches, the information is subjected to n-gram statistics [6]. Sentences are transformed into more advanced forms of information representation (such as decision trees), the probabilities of identifying anomalies are analyzed [3], a semantic test is performed [2], and the relations between linguistic elements are determined in this context, all of which contribute to detecting truth or deception [7]. In addition, SVM classifiers, Naïve Bayes classifiers [8], and neural networks [9] can be used. ...
Article
Full-text available
INTRODUCTION. With the development of online media and the advent of the Internet, considered a democratic space, it has become much easier for anyone to express themselves freely, at any time and in any way. Beyond press institutions, each with its own editorial policies and its various gatekeepers who decide whether an event can be turned into news, many sites have appeared that present themselves online as media products, posting content whose credibility is questionable. This is why, when we are at the point of accepting or rejecting new information, we should ask about the origin and reputation of the source. In the so-called "reputation age", critical appraisal should be directed not at the content of the information but rather at the social network that shaped that content and gave it a certain deserved or undeserved position in our system of knowledge. Yet not every user is able to analyze content and classify it as credible or not credible. Social media networks bear primary responsibility for the content distributed and redistributed daily by users without verification, often creating feelings of panic and revolt. This article investigates certain aspects of assessing the credibility of information on the Web. In line with this aim, the paper is structured in several sections. The article first presents the most relevant aspects of social networks, the most common tools for distributing information on the Internet. Using social networks as the running example, the notion of information credibility is then defined and several of its aspects are discussed. Next, several tools for verifying the credibility of Web sources are presented.
Given that these tools mostly verify the content of sites, but not whether the site's Web address is real, Web-address verification technologies were also researched. Verifying Web addresses is all the more important because processing the information on a site, most often unstructured, requires considerable effort. THE SPECIFICS OF SOCIAL MEDIA NETWORKS. Interest in social networks grows continuously. Most often they serve as platforms for writing and distributing textual information (thoughts, opinions, experiences, etc.) and other content. TOOLS FOR VERIFICATION OF THE FALSE INFORMATION DISTRIBUTED ON WEB. Summary. The article contains details on technologies for assessing the credibility of information on the Web. Special attention is paid to social networks and to the most important aspects of the distribution of incredible information on the Internet. The paper analyzes the basic features of several tools for verifying the credibility of the Web sources. Given that Web tools mostly check the content of sites, but not whether the Web address of the site is real, Web address verification technologies have been researched. Necessary suggestions were made in checking the site before you start reading the information on the Web.
... Feng and Hirst [5] used semantic analysis, examining 'object: descriptor' pairs to find contradictions in the text, on top of Feng's earlier deep syntax model. Rubin and Lukoianova [6] analyzed discourse structure using a vector space model with similar success. Ciampaglia et al. [7] use language-pattern similarity networks that require a pre-existing knowledge base. ...
Conference Paper
Full-text available
This paper discusses the use of natural language processing techniques to identify 'fake news', that is, misleading stories from unreliable or non-reputable sources. Using data obtained from Signal Media and a list of sources from OpenSources.co, we apply term frequency-inverse document frequency (TF-IDF) of bi-grams and probabilistic context-free grammar (PCFG) detection to a corpus of approximately 11,000 articles. We test our dataset on multiple classification algorithms: Support Vector Machines, Stochastic Gradient Descent, Gradient Boosting, Bounded Decision Trees, and Random Forests. We find that TF-IDF of bi-grams fed into a Stochastic Gradient Descent model identifies unreliable sources with 77.2% accuracy, with PCFGs having little effect on recall.
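The feature side of this pipeline (TF-IDF over bi-grams) can be sketched without any ML library; this pure-Python toy stands in for what a vectorizer restricted to bi-grams would produce, and the two example documents are invented:

```python
import math
from collections import Counter

def bigrams(text):
    toks = text.lower().split()
    return [" ".join(pair) for pair in zip(toks, toks[1:])]

def tfidf_vectors(docs):
    """Raw-count TF times smoothed IDF over bi-grams for a tiny corpus."""
    counts = [Counter(bigrams(d)) for d in docs]
    df = Counter(g for c in counts for g in set(c))  # document frequency
    vocab = sorted(df)
    idf = {g: math.log(len(docs) / df[g]) + 1.0 for g in vocab}
    return vocab, [[c[g] * idf[g] for g in vocab] for c in counts]

docs = ["shocking secret cure revealed", "city council approves budget plan"]
vocab, vecs = tfidf_vectors(docs)
```

In the paper's setup, vectors like these would then be fed to a linear classifier trained with stochastic gradient descent.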
... At the intersection of the semantic and linguistic approaches, some studies use rhetoric-based detection. Rubin used rhetorical structure theory (RST) as the analytic framework to identify systematic differences between deceptive and truthful stories in terms of their coherence and structure [19]. Accordingly, in a study focused on the discourse level, rhetorical structures were used as inputs to vector space modelling for predicting whether a report is truthful or deceptive for English news [20]. ...
Preprint
Full-text available
With the digitization of media, an immense amount of news data has been generated by online sources, including mainstream media outlets as well as social networks. However, the ease of production and distribution has resulted in the circulation of fake news alongside credible, authentic news. The pervasive dissemination of fake news has extremely negative impacts on individuals and society. Therefore, fake news detection has recently become an emerging interdisciplinary research field that is attracting significant attention from many research disciplines, including social sciences and linguistics. In this study, we propose a method primarily based on lexicons, including a scoring system, to facilitate the detection of fake news in Turkish. We contribute to the literature by collecting a novel, large-scale, and credible dataset of Turkish news, and by constructing the first fake news detection lexicon for Turkish.
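A lexicon with a scoring system, as described here, reduces at its core to a weighted cue count; the cue words and weights below are hypothetical English placeholders, not entries from the Turkish lexicon the study builds:

```python
def lexicon_score(text, lexicon):
    """Sum the weights of lexicon cues appearing in the text."""
    return sum(lexicon.get(tok, 0.0) for tok in text.lower().split())

# Hypothetical cue weights: positive = deceptive cue, negative = credibility cue.
cues = {"shocking": 1.0, "unbelievable": 1.0, "official": -0.5}
score = lexicon_score("Shocking unbelievable claim from official source", cues)
label = "suspect" if score > 0 else "credible"
```

A real lexicon would also handle morphology (important for Turkish) and calibrate the decision threshold on labeled data.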
... can distinguish the validity of news [31]. Deep network models, such as convolutional neural networks (CNNs), have also been applied in the field of fake news classification [32]. ...
Article
Full-text available
With the rapid development of the Internet, social media has become a convenient online platform for users to obtain information, express opinions, and communicate with each other. Users are keen to participate in discussions on hot topics and exchange opinions on social media, and a lot of fake news arises in these discussions. However, existing fake news detection methods rely too heavily on textual features. Textual features are easy to tamper with in order to deceive the detector; thus, it is difficult to distinguish fake news by relying on textual features alone. To address this challenge, we propose a fake news detection method based on the diffusion growth rate (Delta-G). To distinguish real from fake news, Delta-G uses graph convolutional networks to extract diffusion structure features and then adopts long short-term memory networks to extract growth-rate features over time series. In the experiments, Delta-G is verified on two news datasets, Twitter and Weibo. Compared with three detection methods (a decision tree classifier, support vector machines with a propagation tree kernel, and RvNN), the accuracy of Delta-G on the two datasets is improved by an average of 5% or more, outperforming all the baselines.
... They used an SVM classifier to obtain high accuracy on the UNBiased (UNB) dataset. Rubin and Lukoianova (2015) used rhetorical structure theory (RST) for the classification of real and fake stories. 3. Propagation-based: this approach utilizes the propagation pattern to classify news as real or fake (Monti et al. 2019). ...
Article
Full-text available
Online Social Networks (OSNs) are among the biggest platforms spreading real and fake news. Many OSN users spread malicious data, fake news, and hoaxes using fake or social bot accounts for business, political, and entertainment purposes. These accounts are also used to spread malicious URLs, viruses, and malware. This paper proposes the UCred (User Credibility) model to classify user accounts as fake or real. The model combines the results of RoBERTa (Robustly Optimized BERT), Bi-LSTM (Bidirectional LSTM), and RF (Random Forest) for profile classification. The output generated by all three techniques is fed into a voting classifier to improve classification accuracy over state-of-the-art approaches. The proposed UCred model achieves 98.96% accuracy, notably higher than the state-of-the-art model.
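The voting step that combines the three component models is simple majority aggregation; the per-model labels below are hypothetical outputs, not results from the paper:

```python
def majority_vote(predictions):
    """Return 'fake' when more than half of the model outputs say fake."""
    fake_votes = sum(1 for p in predictions if p == "fake")
    return "fake" if fake_votes > len(predictions) / 2 else "real"

# Hypothetical outputs of the three classifiers for one account.
decision = majority_vote(["fake", "real", "fake"])
```

Weighted or soft (probability-averaging) voting is a common refinement when the component models differ in reliability.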
... The theory has been used on news to identify content such as discourse description and detection of fake news (Della Vedova et al., 2018; Shu et al., 2017). Some of these studies used machine learning as a research technique (Han & Mehta, 2019; Kraus & Feuerriegel, 2019; Prasanna, 2019; Rubin & Lukoianova, 2014; Rădescu, 2020; Rubin et al., 2015). Rhetorical Structure Theory has also been used for understanding deception in customer complaints (Pisarevskaya et al., 2019) and for detection of fake online reviews (Popoola, 2017). In this respect, Information Processing Theory, Interpersonal Deception Theory and Warranting Theory address research questions one and two, while Rhetorical Structure Theory addresses research question three of this study. ...
Article
Against the uncertainty caused by information overload in the online world, consumers can benefit greatly from reading online product reviews before making their purchases. However, some reviews are written deceptively to manipulate purchasing decisions. The purpose of the present study is to determine which feature combination is most effective in fake review detection among sentiment scores, topic distributions, cluster distributions and bag of words. In this study, feature combinations beyond sentiment analysis are examined to address the critical problem of fake reviews written to influence the decision-making process, using reviews from an amazon.com dataset. The results indicate that behavior-related features play an important role in fake review classification when used jointly with text-related features. Verified purchase is the only behavior-related feature used comparatively with the other text-related features.
... In both cases, the problem of detection is crucial, and addressable only by considering the epiphenomenon: the explicit expression of false information (Allcott and Gentzkow 2017, p. 213; Shu et al. 2017, p. 23) or hatred through the most visible and prototypical indicators (Aldwairi and Alwahedi 2018; Jin et al. 2016), such as false titles, the relationship between title and body text (Shu et al. 2017), or the use of slurs (Davidson et al. 2017). However, one of the greatest vulnerabilities acknowledged in automatic detection systems is the lack of correspondence between expression and meaning (Pisarevskaya 2017; Rubin and Lukoianova 2015). In the case of "fake news" and hate speech, the detection instruments used in computational linguistics are insufficient for capturing the complexity of the phenomenon. ...
Article
Full-text available
An argumentation profile is defined as a methodological instrument for analyzing argumentative discourse considering distinct and interrelated dimensions: the types of argument used, their quality, and the emotions triggered. Walton’s theoretical contributions are developed as a coherent analytical and multifaceted toolbox for capturing these aspects. Argumentation schemes are used to detect and quantify the types of argument. Fallacy analysis and the assessment of the implicit premises retrieved through the schemes allow evaluating arguments. Finally, the frequency of emotive words signals the most common emotions aroused. This method is illustrated through a corpus of argumentative tweets of three politicians.
... Linguistic approaches consider analysis based on the content of the article. At early stages, predictive deception cues in the content are captured by a bag-of-words approach (Ott et al., 2013), semantic analysis (Feng & Hirst, 2013), and rhetorical analysis (Rubin & Lukoianova, 2015); these cues are often used to train ML models such as support vector machines (SVM) (Zhang et al., 2012). However, it is difficult to improve prediction accuracy solely based on linguistic features (Ruchansky et al., 2017). ...
... The first type uses news article content features. Early studies found that bag-of-words and syntactic features are straightforward and effective; see Rubin and Lukoianova (2015), Hassan et al. (2015), and Afroz et al. (2012). Then, with the development of neural networks in natural language processing, semantic features were utilized to improve classification performance, as in Song et al. (2018) and Zubiaga et al. (2018). ...
Article
Full-text available
The detection of fake news has become essential in recent years. This paper presents a new technique that is highly effective in identifying fake news articles. We assume a scenario where the relationship between a news article and a statement has already been classified as agreeing with the statement, disagreeing with it, being uncertain about it, or being unrelated to it. Using this information, we focus on selecting the news articles that are most likely to be fake. We propose two models: the first uses only the agree and disagree classifications; the second is a subjective-opinion-based model that can also handle the uncertain cases. Our experiments on a real-world dataset (the Fake News Challenge 1 dataset) and a simulated dataset validate that both proposed models achieve state-of-the-art performance. Furthermore, we show which model to use in different scenarios to get the best performance.
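The subjective-opinion idea can be sketched as mapping stance counts to a (belief, disbelief, uncertainty) triple and thresholding the expected truth probability; this is a simplified reading of subjective logic, not the paper's exact model, and the counts are invented:

```python
def opinion(agree, disagree, uncertain):
    """Normalize stance counts into (belief, disbelief, uncertainty)."""
    total = agree + disagree + uncertain
    return agree / total, disagree / total, uncertain / total

def expected_truth(belief, uncertainty, base_rate=0.5):
    """Expected probability the article is true; the uncertainty mass is
    allocated according to the prior base rate."""
    return belief + base_rate * uncertainty

# Hypothetical stance counts for one article.
b, d, u = opinion(agree=1, disagree=6, uncertain=3)
p_true = expected_truth(b, u)
flag_fake = p_true < 0.5
```

The agree/disagree-only model of the paper corresponds to dropping the uncertainty term and comparing belief against disbelief directly.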
... Two approaches have been employed so far for building style-based fake news detection models: (i) the deception-based approach, which detects misleading assertions or claims in news content by using either deep network models such as CNNs to capture deep syntax [64] or rhetorical structure theory to capture rhetorical structure [65] in order to check news veracity; and (ii) the objectivity-based approach, which identifies style signals, such as hyperpartisan styles and yellow journalism (e.g., click-bait), that may suggest a deterioration in the objectivity of news material intended to mislead users. ...
Preprint
Full-text available
Social media platforms have become widely popular among netizens for sharing views and news. This can be attributed to the easy affordability of digital devices, low-cost Internet, and free subscription to social media platforms. Individuals find social media platforms quite appealing, as they can find like-minded people with whom to exchange views and news. Studies have shown that a less credible person is more likely to propagate fake news in order to fulfill objectives of any form: seeking attention, gaining financial benefits, or influencing political views. Hence, fake news detection on social media has become one of the most envisaged research topics in recent times. Fake news detection on social media can be approached through various methods based on sources, transmission, styles, and knowledge. The major contribution of this survey paper is to provide a comparative study of major computational methods for fake news detection on social media. The objective of this study is to help researchers carry out further research using the details and the research gaps presented in the paper.
Article
Full-text available
Nowadays social media is one of the important mediums for sharing the thoughts and opinions of individuals due to its easy access, and it also provides an opportunity for malicious users to post deliberately fabricated false content in order to create controversies, play with public emotions, etc. The spread of contaminated information, such as rumours, hoaxes, and accidental misinformation, over the web is becoming an emergency that can have a very harmful impact on society and individuals. In this paper, we have developed an automated system, "Hoax-News Inspector", for the detection of fake news that propagates through the web and social media in the form of text. To distinguish fake and real reports at an early stage, we identified prominent features by exploring two sets of attributes that lead to information spread, article/post content-based features and sentiment-based features, plus the mixture of both, called hybrid features. The proposed algorithm is trained and tested on a self-generated dataset as well as one of the popular existing datasets, Liar. The proposed algorithm gives the best results using the Random Forest classifier, with an accuracy of 95% when considering all sets of features. Detecting and verifying news has many practical applications for business markets, news consumers, and time-sensitive services, and generally helps to minimize the spread of false information. Our proposed system, Hoax-News Inspector, can automatically collect fabricated news data and classify it into the binary classes Fake or Real, which benefits further research on predicting and understanding fake news.
Article
The problem of fake news detection is becoming increasingly interesting for several research fields. Different approaches have been proposed, based either on the content of the news itself or on the context and properties of its spread over time, specifically on social media. In the literature, there is no widely accepted general-purpose dataset for fake news detection, due to the complexity of the task and the increasing ability to produce fake news that appears credible at particular moments. In this paper, we propose a methodology to collect and label news pertinent to specific topics and subjects. Our methodology focuses on collecting data from social media about real-world events which are known to trigger fake news. We propose a labelling method based on crowdsourcing that is fast, reliable, and able to approximate expert human annotation. The proposed method exploits both the content of the data (i.e., the texts) and contextual information about fake news for a particular real-world event. The methodology is applied to collect and annotate the Notre-Dame Fire Dataset and to annotate part of the PHEME dataset. Evaluation is performed with fake news classifiers based on Transformers and fine-tuning. Results show that context-based annotation outperforms traditional crowdsourcing out-of-context annotation.
Chapter
Automated detection of text with misrepresentations, such as fake reviews, is an important task for online reputation management. We form the Ultimate Deception Dataset, which consists of customer complaints: emotionally charged texts that include descriptions of problems customers experienced with certain businesses. Typically, in customer complaints, either the customer describes a company representative lying, or the customer lies themselves. The Ultimate Deception Dataset includes almost 3,000 complaints in the personal finance domain and provides clear ground truth based on available factual knowledge about the financial domain. Among them, four hundred texts were manually tagged. Experiments were performed to explore the links between implicit cues of the rhetorical structure of texts and the validity of arguments, and also how truthful or deceptive these texts are. We confirmed that communicative discourse trees are essential for detecting various forms of misrepresentation in text, achieving 76% F1 on the Ultimate Deception Dataset. We believe that this accuracy is sufficient to assist manual curation of a CRM environment towards having high-quality, trusted content. Recognizing hypocrisy in customer communication concerning their impression of the company, or hypocrisy in customer attitude, is fairly important for properly handling and retaining customers. We collect a dataset of sentences with hypocrisy and learn to detect it relying on syntactic, semantic and discourse-level features, and also on web mining to correlate contrasting entities. The sources are customer complaints, samples of texts with hypocrisy on the web, and tweets tagged as hypocritical. We propose an iterative procedure to grow the training dataset and achieve a detection F1 above 80%, which is expected to be satisfactory for integration into a CRM platform. We conclude this section with the detection of rumors and misinformation in web documents, where discourse analysis is also helpful.
Article
This article sets out to offer an overview and a review of the latest linguistic research into fake news. To this end, the authors put forward a critical discussion of the paradigms and instruments deployed over the past decade to analyze and identify this textual (micro)genre, from natural language processing techniques to critical discourse analysis. The conclusion of our study is that a proper understanding of the fake news phenomenon can only be achieved by bringing together qualitative and quantitative methods.
Chapter
Chapter 1 frames the problem of deceptive, inaccurate, and misleading information in the digital media content and information technologies as an infodemic. Mis- and disinformation proliferate online, yet the solution remains elusive and many of us run the risk of being woefully misinformed in many aspects of our lives including health, finances, and politics. Chapter 1 untangles key research concepts—infodemic, mis- and disinformation, deception, “fake news,” false news, and various types of digital “fakes.” A conceptual infodemiological framework, the Rubin (2019) Misinformation and Disinformation Triangle, posits three minimal interacting factors that cause the problem—susceptible hosts, virulent pathogens, and conducive environments. Disrupting interactions of these factors requires greater efforts in educating susceptible minds, detecting virulent fakes, and regulating toxic environments. Given the scale of the problem, technological assistance is inevitable. Human intelligence can and should be, at least in part, enhanced with an artificial one. We require systematic analyses that can reliably and accurately sift through large volumes of data. Such assistance comes from artificial intelligence (AI) applications that use natural language processing (NLP) and machine learning (ML). These fields are briefly introduced and AI-enabled tasks for detecting various “fakes” are laid out. While AI can assist us, the ultimate decisions are obviously in our own minds. An immediate starting point is to verify suspicious information with simple digital literacy steps as exemplified here. Societal interventions and countermeasures that help curtail the spread of mis- and disinformation online are discussed throughout this book.
Article
Since fake news poses a serious threat to society and individuals, numerous studies have considered text, propagation, and user profiles. Due to data collection problems, methods based on propagation and user profiles are less applicable in the early stages. A good alternative is to detect news based on text as soon as it is released, and many text-based methods have been proposed, usually taking words, sentences or paragraphs as basic units. But a word is too fine-grained a unit to express coherent information well, while a sentence or paragraph is too coarse to convey specific information. Which granularity is better, and how to use it to enhance text representation for fake news detection, are two key problems. In this paper, we introduce the Elementary Discourse Unit (EDU), whose granularity lies between word and sentence, and propose a multi-EDU-structure awareness model, EDU4FD, to improve text representation for fake news detection. For multi-EDU-structure awareness, we build both sequence-based and graph-based EDU representations. The former is obtained by modeling the coherence between consecutive EDUs with TextCNN, reflecting semantic coherence. For the latter, we first extract rhetorical relations to build the EDU dependency graph, which captures the global narrative logic and helps deliver the main idea truthfully. A Relation Graph Attention Network (RGAT) is then used to obtain the graph-based EDU representation. Finally, the two EDU representations are incorporated as the enhanced text representation for fake news detection, using a gated recursive unit combined with a global attention mechanism. Experiments on four cross-source fake news datasets show that our model outperforms the state-of-the-art text-based methods. Our results suggest that considering EDUs and their structural features can enhance text representation for fake news detection.
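Elementary discourse units are clause-level spans; real systems use trained segmenters, but a crude illustration of the granularity can be had by splitting on commas that precede discourse connectives (the connective list is an arbitrary sample):

```python
import re

CONNECTIVES = ("because", "although", "however", "but", "while")

def segment_edus(sentence):
    """Split at commas followed by a discourse connective -- a rough
    stand-in for a trained EDU segmenter."""
    pattern = r",\s*(?=(?:%s)\b)" % "|".join(CONNECTIVES)
    return [part.strip() for part in re.split(pattern, sentence)]

edus = segment_edus(
    "The claim spread quickly, because the headline was alarming, "
    "although no source was cited"
)
```

Each resulting span is one EDU; rhetorical relations (e.g., Cause, Concession) would then be assigned between spans to build the dependency graph the model attends over.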
Chapter
Social media has greatly enabled people to participate in online activities at an unprecedented rate. However, this unrestricted access also exacerbates the spread of misinformation and fake news, which cause confusion and chaos if not detected in a timely manner. Given the rapidly evolving nature of news events and the limited amount of annotated data, state-of-the-art systems for fake news detection face challenges in early detection. In this work, we exploit multiple weak signals from different sources of user engagement with content (referred to as weak social supervision), and their complementary utilities, to detect fake news. We jointly leverage a limited amount of clean data along with weak signals from social engagements to train a fake news detector in a meta-learning framework which estimates the quality of different weak instances. Experiments on real-world datasets demonstrate that the proposed framework outperforms state-of-the-art baselines for early detection of fake news without using any user engagements at prediction time.
Article
Recently, the term “fake news” has been broadly and extensively utilized for disinformation, misinformation, hoaxes, propaganda, satire, rumors, click-bait, and junk news. It has become a serious problem around the world. We present a new system, FaNDS, that detects fake news efficiently. The system is based on several concepts used in some previous works but in a different context. There are two main concepts: an Inconsistency Graph and Energy Flow. The Inconsistency Graph contains news items as nodes and inconsistent opinions between them for edges. Energy Flow assigns each node an initial energy and then some energy is propagated along the edges until the energy distribution on all nodes converges. To illustrate FaNDS we use the original data from the Fake News Challenge (FNC-1). First, the data has to be reconstructed in order to generate the Inconsistency Graph. The graph contains various subgraphs with well-defined shapes that represent different types of connections between the news items. Then the Energy Flow method is applied. The nodes with high energy are the candidates for being fake news. In our experiments, all these were indeed fake news as we checked each using several reliable web sites. We compared FaNDS to several other fake news detection methods and found it to be more sensitive in discovering fake news items.
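The Energy Flow idea can be sketched as a PageRank-style iteration over the Inconsistency Graph; this toy version and its four-node example graph are illustrative assumptions, not the paper's exact update rule:

```python
def energy_flow(nodes, edges, steps=100, damping=0.5):
    """Spread energy over an inconsistency graph; nodes that accumulate
    the most energy are the fake-news candidates."""
    neighbors = {n: [] for n in nodes}
    for a, b in edges:            # inconsistency edges are undirected
        neighbors[a].append(b)
        neighbors[b].append(a)
    energy = {n: 1.0 for n in nodes}
    for _ in range(steps):
        energy = {
            n: (1 - damping)
            + damping * sum(energy[m] / len(neighbors[m]) for m in neighbors[n])
            for n in nodes
        }
    return energy

# Hypothetical graph: item "A" contradicts three otherwise consistent items.
e = energy_flow(["A", "B", "C", "D"], [("A", "B"), ("A", "C"), ("A", "D")])
candidate = max(e, key=e.get)
```

The item that many others disagree with accumulates the most energy at convergence, matching the intuition that densely contradicted news items are the prime fake-news candidates.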
Article
With the widespread use of online social media, we have witnessed that fake news causes enormous distress and inconvenience to people's social life. Although previous studies have proposed rich machine learning methods for identifying fake news in social media, the task of detecting fake news in emerging news events/domains remains a challenging problem due to the wide range of news topics on social media as well as the evolution and variation of fake news contents in the web. In this study, we propose an approach which we term “domain-adversarial and graph-attention neural network” (DAGA-NN) model to address the challenge. Its main advantage is that, in a text environment with multiple events/domains, only partial domain sample data are needed to train the model to achieve accurate cross-domain fake news detection in those domains with few (or even no) samples, which makes up for the limitations of traditional machine learning in fake news detection tasks due to news content evolution or cross-domain identification (where there is no sample data). Extensive experiments were conducted on two multimedia datasets of Twitter and Weibo, and the results showed that the proposed model was very effective in detecting fake news across events/domains.
Chapter
Fact-checking is the task of capturing the relation between a claim and evidence (premise) to decide the claim's truth. Detecting the factuality of a claim, as in fake news, based only on news knowledge (e.g., evidence text) is generally inadequate, since fake news is intentionally written to mislead readers. Most previous models for this task take the claim and evidence argument as input, and sometimes fail to detect the relation, particularly for ambiguous information. This study aims to improve the fact-checking task by incorporating a warrant as a bridge between the claim and the evidence, illustrating why the evidence supports the claim: if the warrant links the claim and the evidence, the relation is supporting; if not, it is either irrelevant or attacking, so warrants apply only to supporting the claim. To address the semantic gap between claim-evidence pairs, a model is developed that detects the relation based on warrants extracted from structured data. For warrant selection, knowledge-based and style-based prediction models are merged to capture more helpful information about which warrant best bridges claim and evidence. Picking a reasonable warrant can help alleviate the evidence-ambiguity problem when the proper relation cannot be detected. Experimental results show that incorporating the best warrant into the fact-checking model improves its performance.
Article
Legislative communication frames how constituents perceive politicians' successes. However, most research on legislative communication focuses on Congressional or Senatorial email correspondence, without considering the importance of presidential emails or the way politicians frame their failures. Existing work on legislative communication also tends to analyze the documents in isolation, leaving open the opportunity to analyze the networked effect of information flows. To fill this gap, we analyze a year of 1600 Daily content, the official White House email-style newsletter created by the Obama administration and subsequently adopted after the Trump administration took office. In doing so, we identify the central frames the Trump White House relied on leading up to the 2020 election and the media sources used to legitimize these claims. Drawing on frequency counts, structural topic modeling, and qualitative content analysis, our data reveal the important role electoral communication plays in framing current events and the extent to which email is an essential node in the right-wing media ecosystem.
Chapter
The advent of numerous social networking websites in the twenty-first century has provided an easy outlet for people across the globe through widely available devices such as smartphones. While this has empowered people from different walks of society to post content on topics ranging from current affairs to history, it is not easy to ascertain the content's veracity. Traditional news media employ domain experts with the ability to fact-check the content presented in the news. However, given the enormous number of social media posts every day, an average person exposed to the content has difficulty differentiating false information from real. This has drawn researchers' interest to the automated detection of fake news. In this chapter, we discuss the features used to identify fake news and the different categories of fake news detection techniques. We also outline the datasets available for fake news detection and provide directions for further reading.
Book
This book presents new and innovative discoveries in social networking that contribute substantial knowledge to the research community. The book includes chapters presenting research advances in social network analysis and the issues that have emerged with diverse social media data. It also presents applications of the theoretical algorithms and network models to analyze real-world large-scale social networks and the data emanating from them, as well as to characterize the topology and behavior of these networks. Furthermore, the book covers intensely debated topics, surveys, future trends, issues, and challenges.
Chapter
Lately, on account of the thriving development of online social networks, fake news created for various business and political purposes has been appearing in tremendous numbers and spreading widely in the online world. The growth of deceptive information in commonly accessed news sources, such as social media channels, news blogs, and online newspapers, has made it challenging to identify reliable news sources, thus increasing the need for computational tools able to provide insights into the trustworthiness of online content ( Poovaraghan et al. (2019) Fake news accuracy using naive bayes classifier. Int J Recent Technol Eng ). This paper reviews past and current systems for fake news identification in text-based formats while examining how and why fake news exists in the first place.
Article
This study tackles the fake news phenomenon during the pandemic from a critical thinking perspective. It addresses the lack of systematic criteria by which to fact-check the grey area of misinformation. As a preliminary step, drawing from fallacy theory, we define which types of fake news convey misinformation. Through a data-driven approach, we then identify 10 fallacious strategies that flag misinformation, and we provide a deterministic analysis method by which to recognize them. An annotation study of over 220 news articles about COVID-19 fact-checked by Snopes shows that (i) the strategies work as indicators of misinformation, (ii) they are related to digital media affordances, and (iii) they can be used as the backbone of more informative fact-checkers' ratings. The results of this study are meant to help citizens become their own fact-checkers through critical thinking and digital activism.
Article
Journalism has always remained a vital constituent of our society, and journalists play a key role in making people aware of the happenings and developments in society. This spread of information shapes the ideologies, orientations and thoughts of individuals as well as of society. Conversely, the spread of misinformation or fake news leads to detrimental consequences. With the advent of social media, the menace of fake news has become grievous due to the unrestrained propagation of information and the difficulty of tracking the numerous accounts operated by humans or bots. This menace can be mitigated through data science approaches, combining artificial intelligence with statistics and domain-based knowledge. In this paper, a survey of works aimed at the characterization, feature extraction and subsequent detection of fake news has been conducted from a data science perspective. Along with it, an analysis of 8 renowned fake news detection repositories is presented. Furthermore, through a case study on tweets related to the COVID-19 pandemic, the factors behind the spread of misinformation during critical times, the distinction between factual and emotional tweets, and viable approaches to restraining fake news are enunciated.
Article
Full-text available
The detection of hate speech and fake news in political discourse is at the same time a crucial necessity for democratic societies and a challenge for several areas of study. However, most of the studies have focused on what is explicitly stated: false article information, language expressing hatred, derogatory expressions. This paper argues that the explicit dimension of manipulation is only one, and the least problematic, of the risks of political discourse. The language of the unsaid is much more dangerous and incomparably more difficult to detect, hidden in different types of fallacies and inappropriate uses of emotive language. Through a coding scheme developed by integrating instruments drawn from argumentation theory and pragmatics, a corpus of argumentative tweets published by 4 politicians (Matteo Salvini, Donald Trump, Jair Bolsonaro, and Joseph Biden) within 6 months of their taking office is analyzed, detecting the types of argument, the fallacies, and the uses and misuses of "emotive words." This coding results in the argumentation profiles of the speakers, which are compared statistically to show their different implicit strategies and deceptive tactics.
Article
This article narrows the gap between physical sensing systems that measure physical signals and social sensing systems that measure information signals by (i) defining a novel algorithm for extracting information signals (building on results from text embedding) and (ii) showing that it increases the accuracy of truth discovery—the separation of true information from false/manipulated information. The work is applied in the context of separating true and false facts on social media, such as Twitter and Reddit, where users post predominantly short microblogs. The new algorithm decides how to aggregate the signal across words in the microblog for purposes of clustering the microblogs in the latent information signal space, where it is easier to separate true and false posts. Although previous literature extensively studied the problem of short text embedding/representation, this article improves previous work in three important respects: (1) Our work constitutes unsupervised truth discovery, requiring no labeled input or prior training. (2) We propose a new distance metric for efficient short text similarity estimation, which we call Semantic Subset Matching, that improves our ability to meaningfully cluster microblog posts in the latent information signal space. (3) We introduce an iterative framework that jointly improves microblog clustering and truth discovery. The evaluation shows that the approach improves the accuracy of truth discovery by 6.3%, 2.5%, and 3.8% (constituting a 38.9%, 14.2%, and 18.7% reduction in error, respectively) in three real Twitter data traces.
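The abstract names a "Semantic Subset Matching" metric without defining it here, so the sketch below is only one plausible reading: for each word in the shorter text, take its best cosine match in the other text and average the scores. The toy 2-d vectors and the function name are illustrative assumptions, not the authors' implementation.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def subset_match_similarity(words_a, words_b, emb):
    """One plausible subset-matching score: average, over the shorter text,
    of each word's best cosine match in the other text."""
    short, long_ = sorted((words_a, words_b), key=len)
    scores = [max(cosine(emb[w], emb[x]) for x in long_) for w in short]
    return sum(scores) / len(scores)

# Toy 2-d "embeddings" (hypothetical; real systems use trained word vectors)
emb = {"vote": (1.0, 0.0), "election": (0.9, 0.1),
       "ballot": (0.8, 0.2), "pizza": (0.0, 1.0)}
sim = subset_match_similarity(["vote", "ballot"], ["election", "pizza"], emb)
print(round(sim, 3))  # high: each voting word finds a close match
```

Clustering microblogs under such a metric then groups posts whose word sets overlap semantically rather than lexically.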
Data
Full-text available
Information Manipulation is an umbrella term we use for a variety of distortions that occur in the process of transmitting information in the information channel (between human agents via artifacts and various presentation formats). Extending the classical Shannon-Weaver model of information transmission, we consider alternative outcomes of the transmission, namely loss of fidelity of the information on the receiver's end. We distinguish twelve salient factors by which manipulation varieties differ (such as intentionality to deceive, accuracy, and social acceptability) to provide an abstract framework and conceptualize various permutations.
Article
Full-text available
Recent improvements in effectiveness and accuracy of the emerging field of automated deception detection and the associated potential of language technologies have triggered increased interest in mass media and the general public. Computational tools capable of alerting users to potentially deceptive content in computer-mediated messages are invaluable for supporting undisrupted computer-mediated communication and information practices, credibility assessment and decision-making. The goal of this ongoing research is to inform the creation of such automated capabilities. In this study we elicit a sample of 90 computer-mediated personal stories with varying levels of deception. Each story has 10 associated human deception level judgments, confidence scores, and explanations. In total, 990 unique respondents participated in the study. Three approaches are taken to the data analysis of the sample: human judges, linguistic detection cues, and machine learning. Comparable to previous research results, human judgments achieve 50–63 percent success rates, depending on what is considered deceptive. Actual deception levels negatively correlate with their confident judgment as being deceptive (r = -0.35, df = 88, p = 0.008). The highest-performing machine learning algorithms reach 65 percent accuracy. Linguistic cues are extracted, calculated, and modeled with logistic regression, but are found not to be significant predictors of deception level, confidence score, or an author's ability to fool a reader. We address the associated challenges with error analysis. The respondents' stories and explanations are manually content-analyzed and result in a faceted deception classification (theme, centrality, realism, essence, self-distancing) and a stated perceived cue typology. Deception detection remains novel, challenging, and important in natural language processing, machine learning, and the broader library and information science and technology community.
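The reported negative correlation (r = -0.35, df = 88) is an ordinary Pearson coefficient between deception level and confidence. A minimal pure-Python computation, on toy data rather than the study's sample:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: deception levels vs. judges' confidence, perfectly inverse
levels = [1, 2, 3, 4, 5]
confidence = [5, 4, 3, 2, 1]
print(pearson_r(levels, confidence))  # close to -1.0 for inverse ranking
```

With df = n - 2 = 88, the study's sample of 90 stories matches the reported degrees of freedom.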
Conference Paper
Full-text available
In this paper, we develop an RST-style text-level discourse parser, based on the HILDA discourse parser (Hernault et al., 2010b). We significantly improve its tree-building step by incorporating our own rich linguistic features. We also analyze the difficulty of extending traditional sentence-level discourse parsing to text-level parsing by comparing discourse-parsing performance under different discourse conditions.
Article
Full-text available
This paper argues that big data can possess different characteristics, which affect its quality. Depending on its origin, data processing technologies, and methodologies used for data collection and scientific discoveries, big data can have biases, ambiguities, and inaccuracies which need to be identified and accounted for to reduce inference errors and improve the accuracy of generated insights. Big data veracity is now being recognized as a necessary property for its utilization, complementing the three previously established quality dimensions (volume, variety, and velocity), but there has been little discussion of the concept of veracity thus far. This paper provides a roadmap for theoretical and empirical definitions of veracity along with its practical implications. We explore veracity across three main dimensions: 1) objectivity/subjectivity, 2) truthfulness/deception, 3) credibility/implausibility – and propose to operationalize each of these dimensions with either existing computational tools or potential ones, relevant particularly to textual data analytics. We combine the measures of veracity dimensions into one composite index – the big data veracity index. This newly developed veracity index provides a useful way of assessing systematic variations in big data quality across datasets with textual information. The paper contributes to big data research by categorizing the range of existing tools to measure the suggested dimensions, and to Library and Information Science (LIS) by proposing to account for the heterogeneity of diverse big data and to identify the information quality dimensions important for each big data type.
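The composite index the abstract proposes combines the three veracity dimension scores into one value. A minimal sketch, assuming each dimension is already scored in [0, 1] and that the combination is a simple weighted average (the weighting scheme is an assumption, not taken from the paper):

```python
def veracity_index(objectivity, truthfulness, credibility,
                   weights=(1 / 3, 1 / 3, 1 / 3)):
    """Composite big data veracity index: weighted average of the three
    dimension scores, each assumed to lie in [0, 1]."""
    scores = (objectivity, truthfulness, credibility)
    assert all(0.0 <= s <= 1.0 for s in scores), "scores must be in [0, 1]"
    return sum(w * s for w, s in zip(weights, scores))

# Equal weights: the index is simply the mean of the three dimension scores
print(veracity_index(0.8, 0.6, 0.7))
```

Unequal weights would let an application emphasize, say, truthfulness/deception over the other two dimensions.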
Article
Full-text available
The Information Manipulation Classification Theory offers a systematic approach to understanding the differences and similarities among various types of information manipulation (such as falsification, exaggeration, concealment, misinformation or hoax). We distinguish twelve salient factors that manipulation varieties differ by (such as intentionality to deceive, accuracy, and social acceptability) to provide an abstract framework and conceptualize various permutations. Each variety then is represented as a set of features in the twelve-dimensional space. Our contributions are two-fold. In Library and Information Science (LIS) literature, a nuanced understanding of information manipulation varieties and their inter-relation lends greater awareness and sophistication to the ways we think about information and information literacy. For Natural Language Processing (NLP), the model identifies salient features for each manipulation variety, creates a potential for automated recognition and adaptability from deception detection technology to identification of other information manipulation varieties based on similarities.
Article
Full-text available
Rhetorical Structure Theory (RST) is a theory of text organization that has led to areas of application beyond discourse analysis and text generation, its original goals. In this article, we review the most important applications in several areas: discourse analysis, theoretical linguistics, psycholinguistics, and computational linguistics. We also provide a list of resources useful for work within the RST framework. The present article is a complement to our review of the theoretical aspects of the theory (Taboada and Mann, 2006).
Article
Full-text available
This study investigated changes in both the liar's and the conversational partner's linguistic style across truthful and deceptive dyadic communication in a synchronous text-based setting. An analysis of 242 transcripts revealed that liars produced more words, more sense-based words (e.g., seeing, touching), and used fewer self-oriented but more other-oriented pronouns when lying than when telling the truth. In addition, motivated liars avoided causal terms when lying, whereas unmotivated liars tended to increase their use of negations. Conversational partners also changed their behavior during deceptive conversations, despite being blind to the deception manipulation. Partners asked more questions with shorter sentences when they were being deceived, and matched the liar's linguistic style along several dimensions. The linguistic patterns in both the liar and the partner's language use were not related to deception detection, suggesting that partners were unable to use this linguistic information to improve their deception detection accuracy.
Article
Full-text available
Interpersonal deception theory (IDT) represents a merger of interpersonal communication and deception principles designed to better account for deception in interactive contexts. At the same time, it has the potential to enlighten theories related to (a) credibility and truthful communication and (b) interpersonal communication. Presented here are key definitions, assumptions related to the critical attributes and key features of interpersonal communication and deception, and 18 general propositions from which specific testable hypotheses can be derived. Research findings relevant to the propositions are also summarized.
Article
Full-text available
Rhetorical Structure Theory is a descriptive theory of a major aspect of the organization of natural text. It is a linguistically useful method for describing natural texts, characterizing their structure primarily in terms of relations that hold between parts of the text. This paper establishes a new definitional foundation for RST. The paper also examines three claims of RST: the predominance of nucleus/satellite structural patterns, the functional basis of hierarchy, and the communicative role of text structure.
Article
Full-text available
Clustering algorithms are exploratory data analysis tools that have proved to be essential for gaining valuable insights on various aspects and relationships of the underlying systems. In this paper we present gCLUTO, a stand-alone clustering software package which serves as an easy-to-use platform that combines clustering algorithms along with a number of analysis, reporting, and visualization tools to aid in interactive exploration and clustering-driven analysis of large datasets. gCLUTO provides a wide range of algorithms that operate either directly on the original feature-based representation of the objects or on the object-to-object similarity graphs and are capable of analyzing different types of datasets and finding clusters with different characteristics. In addition, gCLUTO implements a project-oriented work-flow that eases the process of data analysis.
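gCLUTO bundles many algorithm families; as a minimal illustration of the similarity-based agglomerative family the abstract mentions (and not gCLUTO's own code), a single-link merge loop can be written in a few lines:

```python
def euclid(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerative(points, k):
    """Single-link agglomerative clustering: repeatedly merge the two
    closest clusters (closest pair of members) until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclid(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair of clusters
    return clusters

# Two well-separated groups of 2-d points collapse into two clusters
data = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(agglomerative(data, 2))
```

Production packages replace this O(n³) loop with priority queues and offer partitional alternatives as well, which is where a dedicated toolkit earns its keep.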
Article
Full-text available
The detection of deception is a promising but challenging task. A systematic discussion of automated Linguistics Based Cues (LBC) to deception has rarely been touched before. The experiment studied the effectiveness of automated LBC in the context of text-based asynchronous computer mediated communication (TA-CMC). Twenty-seven cues either extracted from the prior research or created for this study were clustered into nine linguistics constructs: quantity, diversity, complexity, specificity, expressivity, informality, affect, uncertainty, and nonimmediacy. A test of the selected LBC in a simulated TA-CMC experiment showed that: (1) a systematic analysis of linguistic information could be useful in the detection of deception; (2) some existing LBC were effective as expected, while some others turned out in the opposite direction to the prediction of the prior research; and (3) some newly discovered linguistic constructs and their component LBC were helpful in differentiating deception from truth.
Article
Full-text available
We examined the hypothesis that reliable verbal indicators of deception exist in the interrogation context. Participants were recruited for a study addressing security effectiveness and either committed a theft to test the effectiveness of a new security guard or carried out a similar but innocuous task. They then provided either (1) a truthful alibi, (2) a partially deceptive account, (3) a completely false alibi, or (4) a truthful confession regarding the theft to an interrogator hired for the purpose of investigating thefts with a monetary incentive for convincing the interrogator of their truthfulness. Results indicated that only 3 out of the 18 (16.7%) clues tested significantly differentiated the truthful and deceptive accounts. All 3 clues were derived from the Statement Validity Analysis (SVA) technique (amount of detail reported, coherence, and admissions of lack of memory). Implications for credibility assessment in forensic interrogations are discussed.
Conference Paper
Full-text available
Our goal is to use natural language processing to identify deceptive and non-deceptive passages in transcribed narratives. We begin by motivating an analysis of language-based deception that relies on specific linguistic indicators to discover deceptive statements. The indicator tags are assigned to a document using a mix of automated and manual methods. Once the tags are assigned, an interpreter automatically discriminates between deceptive and truthful statements based on tag densities. The texts used in our study come entirely from "real world" sources: criminal statements, police interrogations and legal testimony. The corpus was hand-tagged for the truth value of all propositions that could be externally verified as true or false. Classification and Regression Tree techniques suggest that the approach is feasible, with the model able to identify 74.9% of the T/F propositions correctly. Implementation of an automatic tagger with a large subset of tags performed well on test data, producing an average score of 68.6% recall and 85.3% precision.
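The tag-density decision rule the abstract outlines is straightforward to sketch. The indicator word set and the threshold below are hypothetical placeholders, not the study's hand-built tag inventory:

```python
def tag_density(tokens, indicator_tags):
    """Fraction of tokens carrying a deception-indicator tag."""
    hits = sum(1 for t in tokens if t in indicator_tags)
    return hits / len(tokens)

DECEPTION_TAGS = {"never", "nobody", "swear"}  # hypothetical indicator set
THRESHOLD = 0.10                               # hypothetical density cutoff

passage = "i swear i never saw anyone there nobody can say otherwise".split()
density = tag_density(passage, DECEPTION_TAGS)
# A passage whose indicator density exceeds the cutoff is flagged deceptive
print(density, "deceptive" if density > THRESHOLD else "truthful")
```

The real system assigns many tag types (partly by hand) and lets a trained classifier, rather than a single cutoff, interpret the densities.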
Article
Full-text available
In 1992, DeLone and McLean suggested that the dependent variable for information systems (IS) research is IS Success. Their research resulted in the widely cited DeLone and McLean (D&M) IS Success Model, in which System Quality, Information Quality, Use, User Satisfaction, Individual Impact, and Organizational Impact are distinct, but related dimensions of IS success. Since the original IS Success Model was published, research has developed a better understanding of IS success. Meanwhile, comprehensive and integrative research on the variables that influence IS success has been lacking. Therefore, we examine the literature on the independent variables that affect IS success. After examining over 600 articles, we focused our attention on integrating the findings of over 140 studies. In this research, we identify 43 specific variables posited to influence the different dimensions of IS success, and we organize these success factors into five categories based on the Leavitt Diamond of Organizational Change: task characteristics, user characteristics, social characteristics, project characteristics, and organizational characteristics. Next, we identify 15 success factors that have consistently been found to influence IS success: Enjoyment, Trust, User Expectations, Extrinsic Motivation, IT Infrastructure, Task Compatibility, Task Difficulty, Attitudes Toward Technology, Organizational Role, User Involvement, Relationship with Developers, Domain Expert Knowledge, Management Support, Management Processes, and Organizational Competence. Finally, we highlight gaps in our knowledge of success factors and propose a road map for future research.
Article
Full-text available
Discourse structures have a central role in several computational tasks, such as question-answering or dialogue generation. In particular, the framework of the Rhetorical Structure Theory (RST) offers a sound formalism for hierarchical text organization. In this article, we present HILDA, an implemented discourse parser based on RST and Support Vector Machine (SVM) classification. SVM classifiers are trained and applied to discourse segmentation and relation labeling. By combining labeling with a greedy bottom-up tree building approach, we are able to create accurate discourse trees in linear time complexity. Importantly, our parser can parse entire texts, whereas the publicly available parser SPADE (Soricut and Marcu 2003) is limited to sentence-level analysis. HILDA outperforms other discourse parsers for tree structure construction and discourse relation labeling. For the discourse parsing task, our system reaches 78.3% of the performance level of human annotators. Compared to a state-of-the-art rule-based discourse parser, our system achieves a performance increase of 11.6%.
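The greedy bottom-up tree building that HILDA-style parsers use (merge the highest-scoring adjacent pair of spans until one tree remains) can be sketched with a stand-in scorer in place of the trained SVM:

```python
def greedy_tree(units, score):
    """Greedy bottom-up tree building: at each step, merge the adjacent
    pair of spans the scorer rates highest, until one tree remains.
    `score(left, right)` stands in for a trained relation classifier."""
    nodes = list(units)
    while len(nodes) > 1:
        i = max(range(len(nodes) - 1),
                key=lambda j: score(nodes[j], nodes[j + 1]))
        nodes[i:i + 2] = [(nodes[i], nodes[i + 1])]  # merge into a subtree
    return nodes[0]

# Hypothetical scorer: prefer merging the smallest adjacent spans first
def toy_score(a, b):
    return -(str(a).count("EDU") + str(b).count("EDU"))

print(greedy_tree(["EDU1", "EDU2", "EDU3", "EDU4"], toy_score))
# → (('EDU1', 'EDU2'), ('EDU3', 'EDU4'))
```

Because each step removes one node and only adjacent pairs are rescored, the whole build stays near-linear in the number of elementary discourse units.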
Article
This entry defines the concepts of information credibility and cognitive authority, introduces the key terms and dimensions of each, and discusses major theoretical frameworks tested and proposed in library and information science (LIS) and related fields. It also lays out the fundamental notions of credibility and cognitive authority in historical contexts to trace the evolution of the understanding and enhancement of the two concepts. This entry contends that the assessment of information credibility and cognitive authority is a ubiquitous human activity given that people constantly make decisions and selections based on values of information in a variety of information seeking and use contexts. It further contends that information credibility and cognitive authority assessment can be seen as an ongoing and iterative process rather than a discrete information evaluation event. The judgments made in assessment processes are highly subjective given their dependence on individuals' accumulated beliefs, existing knowledge, and prior experiences. The conclusion of this entry suggests the need for more research by emphasizing the contributions that credibility and cognitive authority research can make to the field of LIS.
Conference Paper
In this paper, we present initial experiments in the recognition of deceptive language. We introduce three data sets of true and lying texts collected for this purpose, and we show that automatic classification is a viable technique to distinguish between truth and falsehood as expressed in language. We also introduce a method for class-based feature analysis, which sheds some light on the features that are characteristic for deceptive text.
Article
Deception is a reasonably common part of daily life that society sometimes demonstrates a degree of acceptance of, and occasionally people are very willing to be deceived. But can a computer identify deception and distinguish it from that which is not deceptive? We explore deception in various guises, differentiating it from lies, and highlighting the influence of medium and message in both deception and its detection. Our investigations to date have uncovered disagreements relating to the measurements of such cues, and variations in interpretations, as could be problematic in building a deception detection system.
Deception detection remains novel, challenging, and important in natural language processing, machine learning, and the broader LIS community. Computational tools capable of alerting users to potentially deceptive content in computer-mediated messages are invaluable for supporting undisrupted, computer-mediated communication, information seeking, credibility assessment and decision making. The goal of this ongoing research is to inform creation of such automated capabilities. In this study we elicit a sample of 90 computer-mediated personal stories with varying levels of deception. Each story has 10 associated human judgments, confidence scores, and explanations. In total, 990 unique respondents participated in the study. Three analytical approaches are applied: human judgment accuracy, linguistic cue detection, and machine learning. Comparable to previous research results, human judges achieve 50–63% success rates. Actual deception levels negatively correlate with their confident judgments as being deceptive (r= −0.35, df=88, p=0.008). The best-performing machine learning algorithms reach 65% accuracy. Linguistic cues are extracted, calculated, and modeled with logistic regression, but are found not to be significant predictors of deception level or confidence score. We address the associated challenges with error analysis of the respondents' stories, and propose a faceted deception classification (theme, centrality, realism, essence, distancing) as well as a typology for stated perceived cues for deception detection (world knowledge, logical contradiction, linguistic evidence, and intuitive sense).
This paper extends information quality (IQ) assessment methodology by arguing that veracity/deception should be one of the components of intrinsic IQ dimensions. Since veracity/deception differs contextually from accuracy and other well-studied components of intrinsic IQ, the inclusion of veracity/deception in the set of IQ dimensions has its own contribution to the assessment and improvement of IQ. Recently developed software to detect deception in textual information represents the ready-to-use IQ assessment (IQA) instruments. The focus of the paper is on the specific IQ problem related to deceptive messages and affected information activities as well as IQA instruments (or tools) of detecting deception to improve IQ. In particular, the methodology of automated deception detection in written communication provides the basis for measuring veracity/deception dimension and demonstrates no overlap with other intrinsic IQ dimensions. Considering several known deception types (such as falsification, concealment and equivocation), we emphasize that the IQA deception tools are primarily suitable for falsification. Certain types of deception strategies cannot be spotted automatically with the existing IQA instruments based on underlying linguistic differences between truth-tellers and liars. We propose the potential avenues for the future development of the automated instruments to detect deception taking into account the theoretical, methodological and practical aspects and needs. Blending multidisciplinary research on Deception Detection with the one on IQ in Library and Information Science (LIS) and Management Information Systems (MIS), the paper contributes to IQA and its improvement by adding one more dimension, veracity/deception, to intrinsic IQ.
Article
An empirical study is described that derives the dimensionality of the concept of information. The resulting information structure was found to be in agreement with the structures suggested in the literature. Additionally, subject evaluations of three distinct report formats were determined using the derived dimensions of information. A graphical format was found to be preferred over both a tabular format and a bar chart format.
Deception in computer-mediated communication is defined as a message knowingly and intentionally transmitted by a sender to foster a false belief or conclusion by the perceiver. Stated beliefs about deception and deceptive messages or incidents are content analyzed in a sample of 324 computer-mediated communications. Relevant stated beliefs are obtained through systematic sampling and querying of the blogosphere based on 80 English words commonly used to describe deceptive incidents. Deception is conceptualized more broadly than lying and includes a variety of deceptive strategies: falsification, concealment (omitting material facts) and equivocation (dodging or skirting issues). The stated beliefs are argued to be valuable toward the creation of a unified multi-faceted ontology of deception, stratified along several classificatory facets such as (1) contextual domain (e.g., personal relations, politics, finances & insurance), (2) deception content (e.g., events, time, place, abstract notions), (3) message format (e.g., a complaint: they lied to us, a victim story: I was lied to or tricked, or a direct accusation: you're lying), and (4) deception variety, each tied to particular verbal cues (e.g., misinforming, scheming, misrepresenting, or cheating). The paper positions automated deception detection within the field of library and information science (LIS), as a feasible natural language processing (NLP) task. Key findings and important constructs in deception research from interpersonal communication, psychology, criminology, and language technology studies are synthesized into an overview. Deception research is juxtaposed to several benevolent constructs in LIS research: trust, credibility, certainty, and authority.
Article
Organizations spend millions of dollars on information systems to improve organizational or individual performance, but objective measures of system success are extremely difficult to achieve. For this reason, many MIS researchers (and potentially MIS practitioners) rely on user evaluations of systems as a surrogate for MIS success. However, these measures have been strongly criticized as lacking strong theoretical underpinnings. Furthermore, empirical evidence of their efficacy is surprisingly weak. Part of the explanation for the theoretical and empirical problems with user evaluations is that they are really a measurement technique rather than a single theoretical construct. User evaluations are elicited beliefs or attitudes about something, and they have been used to measure a variety of different "somethings." What is needed for user evaluations to be an effective measure of IS success is the identification of some specific user evaluation construct, defined within a theoretical perspective that can usefully link underlying systems to their relevant impacts. We propose task-technology fit (TTF) as such a user evaluation construct. The TTF perspective views technology as a means by which a goal-directed individual performs tasks. TTF focuses on the degree to which systems characteristics match user task needs. We posit that higher task-technology fit will result in better performance. Further, we posit that users can successfully evaluate task-technology fit. This latter proposition is strongly supported in a survey of 259 users in 9 companies.
Article
We examined accuracy in detecting the truths and lies of 10 videotaped students who offered their opinions on the death penalty or smoking in public. Student lie detectors were randomly assigned to either the individual condition, where they reported their veracity judgments and confidence independently, or the small group condition, where they recorded their judgments privately and then deliberated with 5 other students before making a consensus judgment of lie, truth, or hung. Results indicated that small group judgments were more accurate than individual judgments when judging deceptive but not truthful communication. Small group individuals also reported greater confidence in their abilities after the task. Finally, groups with a greater number of hung judgments were more accurate, likely due to their employing hung judgments for the most difficult to judge stimulus communicators. These results raise implications for real life group judgments, particularly in light of the increasing availability of technology.
Article
Deception detection is an essential skill in careers such as law enforcement and must be accomplished accurately. However, humans are not very competent at determining veracity without aid. This study examined automated text-based deception detection which attempts to overcome the shortcomings of previous credibility assessment methods. A real-world, high-stakes sample of statements was collected and analyzed. Several different sets of linguistic-based cues were used as inputs for classification models. Overall accuracy rates of up to 74% were achieved, suggesting that automated deception detection systems can be an invaluable tool for those who must assess the credibility of text.
Article
Hierarchic document clustering has been widely applied to information retrieval (IR) on the grounds of its potential improved effectiveness over inverted file search (IFS). However, previous research has been inconclusive as to whether clustering does bring improvements. In this paper we take the view that if hierarchic clustering is applied to search results (query-specific clustering), then it has the potential to increase the retrieval effectiveness compared both to that of static clustering and of conventional IFS. We conducted a number of experiments using five document collections and four hierarchic clustering methods. Our results show that the effectiveness of query-specific clustering is indeed higher, and suggest that there is scope for its application to IR.
Conference Paper
We present an implicit discourse relation classifier in the Penn Discourse Treebank (PDTB). Our classifier considers the context of the two arguments, word pair information, as well as the arguments' internal constituent and dependency parses. Our results on the PDTB yield a significant 14.1% improvement over the baseline. In our error analysis, we discuss four challenges in recognizing implicit relations in the PDTB.
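Word-pair features of the kind the abstract mentions are typically the cross-product of tokens from the two arguments, each pair becoming one binary feature. A minimal sketch (the feature-naming convention is illustrative, not the paper's):

```python
def word_pair_features(arg1_tokens, arg2_tokens):
    """Binary word-pair features for implicit relation classification:
    one feature per (w1, w2) with w1 drawn from Arg1 and w2 from Arg2."""
    return {f"{w1}|{w2}" for w1 in arg1_tokens for w2 in arg2_tokens}

# "it rained" / "we stayed in": pairs like rained|stayed can cue a
# Cause-style relation even without an explicit connective
feats = word_pair_features(["it", "rained"], ["we", "stayed", "in"])
print(sorted(feats))
```

A classifier then weights these sparse features alongside constituent- and dependency-parse features of each argument.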
Article
Poor data quality (DQ) can have substantial social and economic impacts. Although firms are improving data quality with practical approaches and tools, their improvement efforts tend to focus narrowly on accuracy. We believe that data consumers have a much broader data quality conceptualization than IS professionals realize. The purpose of this paper is to develop a framework that captures the aspects of data quality that are important to data consumers. A two-stage survey and a two-phase sorting study were conducted to develop a hierarchical framework for organizing data quality dimensions. This framework captures dimensions of data quality that are important to data consumers. Intrinsic DQ denotes that data have quality in their own right. Contextual DQ highlights the requirement that data quality must be considered within the context of the task at hand. Representational DQ and accessibility DQ emphasize the importance of the role of systems. These findings are consistent with our understanding that high-quality data should be intrinsically good, contextually appropriate for the task, clearly represented, and accessible to the data consumer. Our framework has been used effectively in industry and government. Using this framework, IS managers were able to better understand and meet their data consumers' data quality needs. The salient feature of this research study is that quality attributes of data are collected from data consumers instead of being defined theoretically or based on researchers' experience. Although exploratory, this research provides a basis for future studies that measure data quality along the dimensions of this framework.
Article
It is well known, of course, that the assessment of this month's economic activity will improve with the passage of time. The same situation exists for many of the inputs to managerial and strategic decision processes. Information regarding some situation or activity at a fixed point in time becomes better with the passage of time. However, as a consequence of the dynamic nature of many environments, the information also becomes less relevant over time. This balance between using current but inaccurate information or accurate but outdated information we call the accuracy-timeliness tradeoff. Through analysis of a generic family of environments, procedures are suggested for reducing the negative consequences of this tradeoff. In many of these situations, rather general knowledge concerning relative weights and shapes of functions is sufficient to determine optimizing strategies. Copyright © 1995 Institute for Operations Research and the Management Sciences.