Table 1 - uploaded by James W. Pennebaker
Contexts in source publication
Context 1
... data record includes the file name, 17 standard linguistic dimensions (e.g., word count, percentage of pronouns, articles), 25 word categories tapping psychological constructs (e.g., affect, cognition), 10 dimensions related to "relativity" (time, space, motion), and 19 personal concern categories (e.g., work, home, leisure activities). A complete list of the standard LIWC2001 scales is included in Table 1. ...
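To make this record layout concrete, here is a minimal sketch in Python; the four field groups mirror the blocks listed above, while all names and values are invented for illustration and are not the program's actual output schema.

```python
# A minimal sketch of one LIWC2001 output record; field names and values are
# illustrative assumptions, not LIWC's actual output format.
from dataclasses import dataclass, field

@dataclass
class LIWCRecord:
    file_name: str
    linguistic: dict = field(default_factory=dict)     # 17 standard dimensions (word count, % pronouns, ...)
    psychological: dict = field(default_factory=dict)  # 25 psychological categories (affect, cognition, ...)
    relativity: dict = field(default_factory=dict)     # 10 relativity dimensions (time, space, motion)
    personal: dict = field(default_factory=dict)       # 19 personal-concern categories (work, home, leisure, ...)

record = LIWCRecord(
    file_name="essay_001.txt",
    linguistic={"word_count": 312, "pronoun": 14.2, "article": 6.8},
    psychological={"posemo": 2.1, "negemo": 1.4},
    relativity={"time": 4.0, "space": 3.2, "motion": 1.1},
    personal={"work": 1.8, "home": 0.6, "leisure": 0.9},
)
```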
Context 2
... of the 74 default LIWC2001 categories is composed of a list of dictionary words that define that scale. Table 1 provides a comprehensive list of the default LIWC2001 dictionary categories, scales, sample scale words, and relevant scale word counts. ...
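The scale mechanics can be sketched with a toy two-scale dictionary; real LIWC2001 scales contain far more entries, and LIWC dictionary entries may end in "*" to match every word sharing that stem.

```python
# A minimal sketch of LIWC-style dictionary scoring over a toy dictionary.
import re

TOY_DICTIONARY = {
    "posemo": ["happ*", "love", "nice"],
    "negemo": ["hate", "sad*", "worr*"],
}

def scale_percentages(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = {scale: 0 for scale in TOY_DICTIONARY}
    for token in tokens:
        for scale, entries in TOY_DICTIONARY.items():
            # Trailing "*" entries are prefix matches; everything else is exact.
            if any(token.startswith(e[:-1]) if e.endswith("*") else token == e
                   for e in entries):
                counts[scale] += 1
    total = max(len(tokens), 1)
    return {scale: 100.0 * n / total for scale, n in counts.items()}

print(scale_percentages("I was so happy, though a little worried."))
# {'posemo': 12.5, 'negemo': 12.5}
```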
Context 3
... LIWC output and judges' ratings, Pearson correlational analyses were performed to test LIWC's external validity. Results, presented in Table 1, reveal that the LIWC scales and judges' ratings are highly correlated. These findings suggest that LIWC successfully measures positive and negative emotions, a number of cognitive strategies, several types of thematic content, and various language composition elements. ...
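The validity test described here is a per-scale Pearson correlation; a minimal sketch follows, with invented numbers standing in for the real essay data.

```python
# A minimal sketch of the external-validity check: correlate a LIWC scale with
# judges' ratings of the same essays. Both arrays are invented illustration data.
from scipy.stats import pearsonr

liwc_posemo  = [2.1, 3.4, 1.2, 4.0, 2.8, 0.9]   # % positive-emotion words per essay
judge_rating = [3.0, 4.5, 2.0, 5.0, 3.5, 1.5]   # judge's rating of the same essays

r, p = pearsonr(liwc_posemo, judge_rating)
print(f"r = {r:.2f}, p = {p:.3f}")
```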
Context 4
... findings suggest that LIWC successfully measures positive and negative emotions, a number of cognitive strategies, several types of thematic content, and various language composition elements. As can be seen in Table 1, two LIWC-judge correlations are presented. The first, Judge 1, is based on overall ratings of the entire essay set (210 total essays across conditions). ...
Citations
... For example, people who score high on extraversion generally use more social words, show more positive emotions, and tend to write more words but fewer large words [6,15]. Linguistic Inquiry and Word Count (LIWC) [16] has been one of the most widely used tools for analyzing word use. Emotional dictionaries are also frequently mentioned [17,18], as emotional experience has proven to be a key factor in personality analysis [19]. ...
... Mairesse Features comprise LIWC [16] features, MRC [47] features, and prosodic and utterance-type features. The LIWC dictionary annotates a word or a prefix with multiple categories involving parts of speech, emotions, society, and the environment, while the MRC psycholinguistic database provides unique annotations based on syntactic and psychological information. ...
Detecting personalities in social media content is an important application of personality psychology. Most early studies apply a coherent piece of writing to personality detection, but today, the challenge is to identify dominant personality traits from a series of short, noisy social media posts. To this end, recent studies have attempted to individually encode the deep semantics of posts, often using attention-based methods, and then relate them, or directly assemble them into graph structures. However, due to the inherently disjointed and noisy nature of social media content, constructing meaningful connections remains challenging. While such methods rely on well-defined relationships between posts, effectively capturing these connections in fragmented and sparse content is non-trivial, particularly under limited supervision or noisy input. To tackle this, we draw inspiration from the scanning reading technique—commonly recommended for efficiently processing large volumes of information—and propose an index attention mechanism as a solution. This mechanism leverages prior psycholinguistic knowledge as an “index” to guide attention, thereby enabling more effective information fusion across scattered semantic signals. Building on this idea, we introduce the Index Attention Network (IAN)—a novel framework designed to infer personality labels by performing targeted information fusion over deep semantic representations of individual posts. Through a series of experiments, IAN achieved state-of-the-art performance on the Kaggle dataset and performance comparable to graph convolutional networks (GCN) on the Pandora dataset. Notably, IAN delivered an average improvement of 13% in terms of macro-F1 scores with the Kaggle dataset. The code for IAN is available at GitHub: https://github.com/Once2gain/IAN.
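The linked repository holds the authoritative implementation; as a rough illustration of the core idea of using psycholinguistic priors as an attention "index", here is a minimal sketch (the shapes, wiring, and dot-product form are all assumptions, not IAN's actual design).

```python
# A minimal sketch of index-guided attention in the spirit of IAN: fixed
# psycholinguistic "index" vectors, rather than learned queries, drive the
# attention weights that fuse scattered post representations.
import torch
import torch.nn.functional as F

def index_attention(post_reprs, index_vectors):
    """post_reprs: (n_posts, d) post encodings; index_vectors: (n_index, d) priors."""
    scores = index_vectors @ post_reprs.T   # relevance of each post to each index entry
    weights = F.softmax(scores, dim=-1)     # attention over posts, one row per index entry
    return weights @ post_reprs             # (n_index, d) fused representations

fused = index_attention(torch.randn(30, 64), torch.randn(8, 64))
print(fused.shape)   # torch.Size([8, 64])
```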
... To assess the accuracy of the tool, the authors built a dataset consisting of a mixed textual corpus of more than 2 million words spanning 4,500 documents. They further computed Pearson correlations between the results produced by Empath and those obtained with Linguistic Inquiry and Word Count (LIWC), which has been extensively validated in the literature [53]. We proceed to label the posts and comments retrieved from each subreddit by adopting the following protocol: first, we check whether Empath recognizes in the text any of the emotions in F; if not, the text is discarded; otherwise, we take as label(s) the union of the emotions detected by both Empath and Google-T5, then intersect the result with F. This choice is justified by the fact that Google-T5 always outputs one of the above-mentioned emotions, thus leading to potential false positives (in other words, a text could express none of the emotions in F, but one would be produced anyway). ...
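A minimal sketch of this labeling protocol follows; `empath_emotions` and `t5_emotions` are hypothetical stand-ins for the real Empath and Google-T5 calls, faked here so the sketch runs end to end.

```python
# A minimal sketch of the labeling protocol quoted above.
F = {"anger", "fear", "joy", "sadness", "surprise", "love"}   # target emotion set

def empath_emotions(text):
    # Hypothetical stand-in for the real Empath call.
    return {"joy"} if "glad" in text else set()

def t5_emotions(text):
    # Hypothetical stand-in: Google-T5 always emits exactly one label, so false
    # positives are possible; hence the final intersection with F.
    return {"joy"}

def label_text(text):
    empath_found = empath_emotions(text)
    if not (empath_found & F):
        return None                                  # discard: Empath finds no target emotion
    return (empath_found | t5_emotions(text)) & F    # union of detectors, restricted to F

print(label_text("I am glad this worked."))   # {'joy'}
print(label_text("Neutral sentence."))        # None
```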
In the present online social landscape, while misogyny is a well-established issue, misandry remains significantly underexplored. In an effort to rectify this discrepancy and better understand the phenomenon of gendered hate speech, we analyze four openly declared misogynistic and misandric Reddit communities, examining their characteristics at a linguistic, emotional, and structural level. We investigate whether substantial and systematic discrepancies between misogynistic and misandric groups can be identified when heterogeneous factors are taken into account. Our experimental evaluation shows that no systematic differences can be observed when a double perspective, both male-to-female and female-to-male, is adopted, suggesting that gendered hate speech is not exacerbated by the perpetrators' gender and is indeed a common factor of noxious communities.
... These features were selected based on their effectiveness in prior studies (Gravanis et al. 2019; Hamed et al. 2023; Verma et al. 2021; Choudhary and Arora 2021) and their relevance to linguistic, syntactic, and semantic patterns. The feature extraction process was performed in multiple steps using a combination of self-programmed Python scripts, the LIWC (Linguistic Inquiry and Word Count) dictionary (Pennebaker et al. 2001), and the NLTK, TextStat, and TextBlob libraries. First, quantity-based features such as character count, word count, sentence count, and paragraph count were computed to analyze the structural complexity of news articles. ...
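A minimal sketch of that first step (quantity-based features) using only the standard library; the splitting rules below are simple assumptions, not the paper's exact preprocessing, and the cited pipeline additionally drew on LIWC, NLTK, TextStat, and TextBlob.

```python
# A minimal sketch of the quantity-based features named above.
import re

def quantity_features(article):
    paragraphs = [p for p in article.split("\n\n") if p.strip()]
    sentences  = [s for s in re.split(r"[.!?]+", article) if s.strip()]
    words      = article.split()
    return {
        "char_count": len(article),
        "word_count": len(words),
        "sentence_count": len(sentences),
        "paragraph_count": len(paragraphs),
    }

print(quantity_features("One sentence here. Another one!\n\nNew paragraph."))
```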
Fake news, deliberately incorrect or misleading information presented as news, spreads through diverse platforms like online forums and traditional media. Its potential to harm lives, stir panic, and disrupt democratic systems makes it a pressing issue. Detecting fake news automatically is crucial to mitigate its impact. However, previous research often struggles with feature redundancy, suboptimal feature selection, and inefficient classification models, leading to lower accuracy and poor generalizability across datasets. To overcome these limitations, the present work introduces a methodology consisting of four distinct phases: data preparation, feature extraction, feature selection, and classification. Data preprocessing eliminates ambiguities from the considered datasets, Linguistic Features (LFs) are extracted during the feature extraction phase, and feature selection is then performed using the Harris Hawks Optimization (HHO) algorithm. Furthermore, a CNN-BiLSTM hybrid model is employed for classification, leveraging CNN's spatial feature extraction and BiLSTM's sequential learning capabilities for improved accuracy. In this work, we performed experiments on four publicly available datasets named ISOT, Kaggle, ConFake, and McIntire, achieving accuracies of 98.89%, 98.25%, 98.56%, and 90.26%, respectively. Remarkably, this approach surpasses state-of-the-art methods by achieving improved accuracy using HHO feature selection and the CNN-BiLSTM hybrid model for fake news detection.
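A rough PyTorch sketch of a CNN-BiLSTM hybrid of the kind described: the convolution extracts local n-gram features and the BiLSTM models their sequence. All sizes and layer choices below are assumptions, not the paper's exact configuration.

```python
# A minimal sketch of a CNN-BiLSTM text classifier.
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=100, n_filters=64, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)  # local n-gram features
        self.lstm = nn.LSTM(n_filters, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)                                   # fake-vs-real logit

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)        # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x)).transpose(1, 2)   # (batch, seq_len, n_filters)
        _, (h, _) = self.lstm(x)                       # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=1)             # final states of both directions
        return self.fc(h)

logits = CNNBiLSTM()(torch.randint(0, 20000, (8, 120)))
print(logits.shape)   # torch.Size([8, 1])
```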
... Next, for Aim 2, we analyzed responses to SI posts through psycholinguistic analyses (using Linguistic Inquiry and Word Count [110]) and content analyses (using Sparse Additive Generative Model [50]), identifying key linguistic characteristics of responses to different kinds of SI posts. Our analyses revealed that responses to thwarted belongingness contained more negative affect and tentative language. ...
... Psycholinguistic research has underscored the importance of language in seeking and providing social support, highlighting how linguistic markers impact mental health outcomes [34,109,110]. Research in psychotherapeutic settings has emphasized the significance of factors such as empathy, warmth, congruence, and therapeutic alliance in predicting care-seekers' outcomes [6,82,83,102]. Translating these principles to digital mental health interventions, studies have explored the potential of computational methods to infer psychosocial dynamics from social media language, offering insights into distress detection and support mechanisms [26,27,44,64,123]. ...
... Categorical Dynamic Index (CDI) is a bipolar linguistic measure that assesses writing style on a spectrum from categorical to dynamic [110]. We calculated the CDI of each response by obtaining the part-of-speech occurrences as per LIWC [144]. ...
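A minimal sketch of a CDI computation from LIWC part-of-speech percentages, following the categorical-minus-dynamic weighting reported in the CDI literature; treat the exact constant and category list as assumptions to verify against the cited source [110].

```python
# A minimal sketch of a Categorical Dynamic Index computation; the constant and
# category weights follow Pennebaker et al. (2014) but should be checked.
def categorical_dynamic_index(liwc):
    categorical = liwc["article"] + liwc["preposition"]
    dynamic = (liwc["ppron"] + liwc["ipron"] + liwc["auxverb"]
               + liwc["conj"] + liwc["adverb"] + liwc["negation"])
    return 30 + categorical - dynamic   # higher = more categorical, lower = more dynamic

print(categorical_dynamic_index({
    "article": 8.1, "preposition": 13.2, "ppron": 9.4, "ipron": 5.0,
    "auxverb": 8.7, "conj": 6.1, "adverb": 4.9, "negation": 1.2,
}))
```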
Suicide is a critical global public health issue, with millions experiencing suicidal ideation (SI) each year. Online spaces enable individuals to express SI and seek peer support. While prior research has revealed the potential of detecting SI using machine learning and natural language analysis, a key limitation is the lack of a theoretical framework to understand the underlying factors affecting high-risk suicidal intent. To bridge this gap, we adopted the Interpersonal Theory of Suicide (IPTS) as an analytic lens to analyze 59,607 posts from Reddit's r/SuicideWatch, categorizing them into SI dimensions (Loneliness, Lack of Reciprocal Love, Self Hate, and Liability) and risk factors (Thwarted Belongingness, Perceived Burdensomeness, and Acquired Capability of Suicide). We found that high-risk SI posts express planning and attempts, methods and tools, and weaknesses and pain. In addition, we also examined the language of supportive responses through psycholinguistic and content analyses to find that individuals respond differently to different stages of Suicidal Ideation (SI) posts. Finally, we explored the role of AI chatbots in providing effective supportive responses to suicidal ideation posts. We found that although AI improved structural coherence, expert evaluations highlight persistent shortcomings in providing dynamic, personalized, and deeply empathetic support. These findings underscore the need for careful reflection and deeper understanding in both the development and consideration of AI-driven interventions for effective mental health support.
... Of the more recent attempts, Schultheiss (2013a) scored PSE stories based on the closed dictionary Linguistic Inquiry and Word Count (LIWC; Pennebaker et al., 2001), whose psychology-derived dictionaries have been used in several research areas (Boyd et al., 2022); previous researchers, including Hogenraad (2005) and Pennebaker and King (1999), used a similar approach. The dictionaries range from "first person plural pronouns" to "swear words" and "positive emotions." ...
Implicit motives, nonconscious needs that influence individuals’ behaviors and shape their emotions, have been part of personality research for nearly a century but differ from personality traits. The implicit motive assessment is very resource-intensive, involving expert coding of individuals’ written stories about ambiguous pictures, and has hampered implicit motive research. Using large language models and machine learning techniques, we aimed to create high-quality implicit motive models that are easy for researchers to use. We trained models to code the need for power, achievement, and affiliation (N = 85,028 sentences). The person-level assessments converged strongly with the holdout data, intraclass correlation coefficient, ICC(1,1) = .85, .87, and .89 for achievement, power, and affiliation, respectively. We demonstrated causal validity by reproducing two classical experimental studies that aroused implicit motives. We let three coders recode sentences where our models and the original coders strongly disagreed. We found that the new coders agreed with our models in 85% of the cases (p < .001, ϕ = .69). Using topic and word embedding analyses, we found specific language associated with each motive to have a high face validity. We argue that these models can be used in addition to, or instead of, human coders. We provide a free, user-friendly framework in the established R-package text and a tutorial for researchers to apply the models to their data, as these models reduce the coding time by over 99% and require no cognitive effort for coding. We hope this coding automation will facilitate a historical implicit motive research renaissance.
... Nodes represent editing users and Wikipedia pages, while edges represent individual edit requests. Each edge is timestamped and includes Linguistic Inquiry and Word Count (LIWC) [17] feature vectors characterizing the textual content of the edit. • Reddit [11]: This dataset comprises one month of Reddit posting logs. ...
Aggregating temporal signals from historic interactions is a key step in future link prediction on dynamic graphs. However, incorporating long histories is resource-intensive. Hence, temporal graph neural networks (TGNNs) often rely on historical-neighbor sampling heuristics such as uniform sampling or recent-neighbor selection. These heuristics are static and fail to adapt to the underlying graph structure. We introduce FLASH, a learnable and graph-adaptive neighborhood selection mechanism that generalizes existing heuristics. FLASH integrates seamlessly into TGNNs and is trained end-to-end using a self-supervised ranking loss. We provide theoretical evidence that commonly used heuristics hinder TGNN performance, motivating our design. Extensive experiments across multiple benchmarks demonstrate consistent and significant performance improvements for TGNNs equipped with FLASH.
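For concreteness, the two static heuristics FLASH generalizes can be sketched as below; this is a toy version, and the learned FLASH selector replaces this fixed rule.

```python
# A minimal sketch of static neighbor-sampling heuristics for a TGNN:
# uniform sampling vs. most-recent selection over a node's interaction history.
import random

def sample_neighbors(history, k, mode="recent"):
    """history: list of (neighbor_id, timestamp) pairs, oldest first."""
    if mode == "recent":
        return history[-k:]                              # k most recent interactions
    return random.sample(history, min(k, len(history)))  # uniform, without replacement

hist = [(7, 1.0), (3, 2.5), (7, 4.2), (9, 5.0)]
print(sample_neighbors(hist, 2))                  # [(7, 4.2), (9, 5.0)]
print(sample_neighbors(hist, 2, mode="uniform"))  # two random interactions
```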
... In recent years, various text analysis tools have emerged [23][24][25][26][27][28][29], ranging from basic approaches like counting word frequencies to more advanced methods, such as leveraging large language models (LLMs), each with its own set of advantages and limitations. For example, closed-vocabulary programs (i.e., dictionary-based approaches) [1], such as the Linguistic Inquiry and Word Count (LIWC) [30], use predefined dictionaries to categorize words. While highly interpretable, transparent, and efficient at summarizing concepts, these methods often neglect context, leading to potential misinterpretations [25]. ...
Tracking emotion fluctuations in adolescents' daily lives is essential for understanding mood dynamics and identifying early markers of affective disorders. This study examines the potential of text-based approaches for emotion prediction by comparing nomothetic (group-level) and idiographic (individualized) models in predicting adolescents' daily negative affect (NA) from text features. Additionally, we evaluate different Natural Language Processing (NLP) techniques for capturing within-person emotion fluctuations. We analyzed ecological momentary assessment (EMA) text responses from 97 adolescents (ages 14-18, 77.3% female, 22.7% male, N_EMA = 7,680). Text features were extracted using a dictionary-based approach, topic modeling, and GPT-derived emotion ratings. Random Forest and Elastic Net Regression models predicted NA from these text features, comparing nomothetic and idiographic approaches. All key findings, interactive visualizations, and model comparisons are available via a companion web app: https://emotracknlp.streamlit.app/. Idiographic models combining text features from different NLP approaches exhibited the best performance: they performed comparably to nomothetic models in R² but yielded lower prediction error (Root Mean Squared Error), improving within-person precision. Importantly, there were substantial between-person differences in model performance and predictive linguistic features. When selecting the best-performing model for each participant, significant correlations between predicted and observed emotion scores were found for 90.7–94.8% of participants. Our findings suggest that while nomothetic models offer initial scalability, idiographic models may provide greater predictive precision with sufficient within-person data. A flexible, personalized approach that selects the optimal model for each individual may enhance emotion monitoring, while leveraging text data to provide contextual insights that could inform appropriate interventions.
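The nomothetic-vs-idiographic contrast can be made concrete with a short scikit-learn sketch: one pooled model versus one model per participant. All data below are random stand-ins, not the study's features or outcomes.

```python
# A minimal sketch of nomothetic (pooled) vs. idiographic (per-person) models.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))      # text features: 5 participants x 100 EMA prompts
y = rng.normal(size=500)            # daily negative affect scores
person = np.repeat(np.arange(5), 100)

# Nomothetic: a single model trained on everyone's data pooled together.
nomothetic = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Idiographic: a separate model per participant, trained only on their own data.
idiographic = {
    p: RandomForestRegressor(n_estimators=50, random_state=0)
         .fit(X[person == p], y[person == p])
    for p in range(5)
}
```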
... Liu et al. [13] presented a semi-supervised cognitive engagement classification method called B-LIWC-UDA. This method incorporated dual feature embeddings from BERT [10] and the LIWC [14] cognitive lexicon. ...
... The most widely used general-purpose lexicons are the General Inquirer [27], LIWC [28], the opinion lexicon of Hu and Liu (2004a), and the MPQA Subjectivity Lexicon [29]. However, the efficacy of general-purpose lexicons for domain-specific tasks like financial and economic textual analysis is questionable. ...
This chapter is concerned with textual and sentiment analysis in the agricultural commodities market using natural language processing (NLP) methods. There is extensive research on textual and sentiment analysis in financial markets; however, most of it focuses on the equity market, with only a minority addressing other commodities such as energy. Therefore, this chapter first reviews research on textual and sentiment analysis in the agriculture market in general. It then presents textual analysis methods that can be used to study the effect of textual data and sentiment in the agriculture market. Finally, it presents an example of implementing a topic modelling task and textual regression for forecasting the realized volatility of corn returns. To the best of the author's knowledge, there is no study focusing on textual regression in the agriculture market, and studies conducting textual sentiment analysis are very limited. In this spirit, this study tries to fill the gap by introducing both well-established and new textual and sentiment analysis methods to the agricultural research community. The limited experiment carried out with these methods in the present research testifies to the superiority of text-based models in explaining future movements of corn's volatility. More specifically, the results of the one-month-ahead realized volatility regression indicate statistically significant superior performance of both direct textual regression and sentiment regression compared to traditional methods like HAR and ARIMA. In addition, as the most accurate method, textual regression's accuracy stands above that of the sentiment regression model.
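A toy numpy sketch of the HAR baseline mentioned above: regress next-period realized volatility on its daily value and on weekly and monthly averages. This is a one-step-ahead version on invented data, rather than the chapter's one-month horizon.

```python
# A minimal sketch of a HAR (heterogeneous autoregressive) volatility regression.
import numpy as np

rv = np.abs(np.random.default_rng(1).normal(size=300))           # toy realized-volatility series
d = rv[21:-1]                                                    # daily lag rv[t]
w = np.array([rv[i - 5:i].mean() for i in range(22, len(rv))])   # weekly average
m = np.array([rv[i - 22:i].mean() for i in range(22, len(rv))])  # monthly average
y = rv[22:]                                                      # one-step-ahead target rv[t+1]

X = np.column_stack([np.ones_like(d), d, w, m])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # [intercept, beta_daily, beta_weekly, beta_monthly]
```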
... In terms of vocabulary, a sequence of words or sentences that expresses social interaction or positive emotions can indicate extraversion, whereas a sequence that expresses negative emotions can indicate neuroticism. To capture the nuance of a language, there is a manually maintained lexicon method, known as Linguistic Inquiry and Word Count (LIWC) [8], that relates words to psychological classifications. This procedure offers advantages but also poses some challenges. ...
... Text data is an immediate expression of people's thoughts, ideas, feelings, and emotions, making it a valuable aid in research on personality prediction. Previous studies on personality detection, primarily based on linguistic features such as those from the Linguistic Inquiry and Word Count (LIWC) [8], helped researchers identify linguistic markers related to personality traits. The prominent Big Five psychological model [2], which has been extensively utilized in computational personality prediction techniques, permits researchers to associate personality traits with textual data. ...
Personality prediction via different techniques is an established and trending topic in psychology. The advancement of machine learning algorithms in multiple fields also attracted the attention of Automatic Personality Prediction (APP). This research proposes a novel TraitBertGCN method with a data fusion technique for predicting personality traits. Initially, this work integrates a pre-trained language model, Bidirectional Encoder Representations from Transformers (BERT), with a three-layer Graph Convolutional Network (GCN) to leverage large-scale language understanding and graph-based learning for personality prediction. This study fuses the two datasets (essays and myPersonality) to overcome the bias and generalize the model across different domains. We fine-tuned our TraitBertGCN model on the fused dataset and then evaluated it on both datasets individually to assess its adaptability and accuracy in varied contexts. We compared the proposed model’s results with previous studies; our model achieved better performance in personality trait prediction across multiple datasets, with an average accuracy of 77.42% on the essays dataset and 87.59% on the myPersonality dataset.
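A minimal sketch of the BERT-embedding-plus-three-layer-GCN idea described here; the sizes, normalization, toy graph, and wiring are all assumptions, not the paper's exact TraitBertGCN model.

```python
# A minimal sketch of a three-layer GCN over BERT document embeddings.
import torch
import torch.nn as nn

class ThreeLayerGCN(nn.Module):
    def __init__(self, in_dim=768, hidden=128, n_traits=5):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden)
        self.w2 = nn.Linear(hidden, hidden)
        self.w3 = nn.Linear(hidden, n_traits)

    def forward(self, x, adj_norm):
        # adj_norm: normalized adjacency with self-loops, e.g. D^{-1/2}(A+I)D^{-1/2}
        x = torch.relu(adj_norm @ self.w1(x))
        x = torch.relu(adj_norm @ self.w2(x))
        return adj_norm @ self.w3(x)       # per-node Big Five logits

n = 10
adj = torch.eye(n)                         # toy graph: self-loops only
x = torch.randn(n, 768)                    # stand-ins for BERT [CLS] embeddings per document
print(ThreeLayerGCN()(x, adj).shape)       # torch.Size([10, 5])
```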