Article

A Machine Learning Approach to Twitter User Classification


Abstract

This paper addresses the task of user classification in social media, with an application to Twitter. We automatically infer the values of user attributes such as political orientation or ethnicity by leveraging observable information such as the user behavior, network structure and the linguistic content of the user’s Twitter feed. We employ a machine learning approach which relies on a comprehensive set of features derived from such user information. We report encouraging experimental results on 3 tasks with different characteristics: political affiliation detection, ethnicity identification and detecting affinity for a particular business. Finally, our analysis shows that rich linguistic features prove consistently valuable across the 3 tasks and show great promise for additional user classification needs.
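As a rough illustration of the feature-based setup described in the abstract, the sketch below builds one feature dictionary per user from the three kinds of observable information it names (user behavior, network structure, linguistic content). The `User` record, the individual features, and the example vocabulary are hypothetical and chosen only for illustration; the abstract does not list the paper's actual features or its learning algorithm.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class User:
    """Hypothetical user record; the field names are illustrative assumptions."""
    tweets: list[str]
    followers: set[str] = field(default_factory=set)
    friends: set[str] = field(default_factory=set)


def behavior_features(user: User) -> dict[str, float]:
    """Behavior: how much the user tweets and how often they retweet."""
    n = len(user.tweets)
    retweets = sum(t.startswith("RT ") for t in user.tweets)
    return {"n_tweets": float(n), "retweet_frac": retweets / n if n else 0.0}


def network_features(user: User) -> dict[str, float]:
    """Network structure: sizes of the follower and friend sets."""
    n_followers, n_friends = len(user.followers), len(user.friends)
    ratio = n_followers / n_friends if n_friends else 0.0
    return {
        "n_followers": float(n_followers),
        "n_friends": float(n_friends),
        "follower_friend_ratio": ratio,
    }


def linguistic_features(user: User, vocabulary: list[str]) -> dict[str, float]:
    """Linguistic content: relative frequency of each vocabulary word."""
    counts = Counter(
        word.lower().strip(".,!?") for t in user.tweets for word in t.split()
    )
    total = sum(counts.values())
    return {f"w_{w}": (counts[w] / total if total else 0.0) for w in vocabulary}


def feature_vector(user: User, vocabulary: list[str]) -> dict[str, float]:
    """Concatenate the three feature groups into a single mapping."""
    feats: dict[str, float] = {}
    feats.update(behavior_features(user))
    feats.update(network_features(user))
    feats.update(linguistic_features(user, vocabulary))
    return feats


# Made-up example user.
example = User(
    tweets=["RT vote today", "Coffee time!"],
    followers={"a", "b", "c"},
    friends={"x", "y"},
)
vector = feature_vector(example, ["vote", "coffee"])
```

A mapping like `vector` could then be passed to any standard supervised classifier, with one model trained per task (for example, political affiliation or ethnicity).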


... Relevant events determine the number and content of Twitter publications [19,20,21]. To evaluate the impact of scientific progress on public opinion about neurofeedback, we selected the time points of a few scientific and non-scientific events and measured the reaction of the public discourse on neurofeedback. ...
... Twitter profile descriptions consist of a 160-character short description of the user, are considered an expression of aspects of the user's social identity [31], and can help to understand users' interests in and motivations for interacting with Twitter content. In combination with other metadata, Twitter short bios convey information on several user characteristics [20] as well as mental status [32]. After identifying user groups, we analyze publication activity among the different profile categories as well as the number of followers. ...
... Using the criterion of 10 tweets on neurofeedback, less than 1% of the users were classified as productive users. We employed their Twitter short bios as a container of information on user characteristics [20]. Twitter profile descriptions are considered an expression of aspects of users' social identity [31] and can help to understand users' interests in, and motivations for interacting with, Twitter content [52]. ...
Article
Neurofeedback is a popular technique for inducing neuroplasticity, but it has a controversial reputation. The public discourse on neurofeedback, as a therapeutic and neuroenhancement technique, encompasses scientific communication, therapeutic expectations and outcomes, as well as complementary and alternative practices. We investigated Twitter publications from 2010 to 2022 on the keyword "neurofeedback". A total of over 138 k tweets were obtained, which originated from over 42 k different users. The communication flow in the neurofeedback community is mainly unidirectional and non-interactive. Analysis of hashtags revealed application fields, therapy providers and neuroenhancement to be the most popular contents in neurofeedback communication. A group of 1221 productive users was identified, in which clinicians, entrepreneurs, broadcasters, and scientists contribute. We identified reactions to critical publications in the Twitter traffic and an increase in the number of tweets by academic users, which suggest an increase in the interest in the scientific credibility of neurofeedback. More intense scientific communication on neurofeedback on Twitter may contribute to promoting a more realistic view of challenges and advances regarding good scientific practice of neurofeedback.
... They connect differently with others, have different tweeting habits, and differ in style and linguistic content. Studying the conversational connections between Twitter users and text mining their tweets can help classify users based on their characteristics and identify different types of users [34][35][36][37][38]. ...
... Overview Rao et al [38] and Pennacchiotti and Popescu [36,37] showed that Twitter users' demographics and political views could be distinguished by considering 3 types of user classification features: behavioral features (features extracted from the user's activity on Twitter), linguistic features (features extracted from the content of the user's tweets), and social structure features (features describing the user's social network). We followed their work and adapted these types to our different domains of distinguishing patients with IBD from others who talk about the disease. ...
... We used 2 types of linguistic features. On the basis of previous research [36][37][38] and our data's nature, we extracted several features from the text that we believed would help the classification. ...
Article
Background: Patients use social media as an alternative information source, where they share information and provide social support. Although large amounts of health-related data are posted on Twitter and other social networking platforms each day, research using social media data to understand chronic conditions and patients’ lifestyles is limited.

Objective: In this study, we contributed to closing this gap by providing a framework for identifying patients with inflammatory bowel disease (IBD) on Twitter and learning from their personal experiences. We enabled the analysis of patients’ tweets by building a classifier of Twitter users that distinguishes patients from other entities. This study aimed to uncover the potential of using Twitter data to promote the well-being of patients with IBD by relying on the wisdom of the crowd to identify healthy lifestyles. We sought to leverage posts describing patients’ daily activities and their influence on their well-being to characterize lifestyle-related treatments.

Methods: In the first stage of the study, a machine learning method combining social network analysis and natural language processing was used to automatically classify users as patients or not. We considered 3 types of features: the user’s behavior on Twitter, the content of the user’s tweets, and the social structure of the user’s network. We compared the performances of several classification algorithms within 2 classification approaches. One classified each tweet and deduced the user’s class from their tweet-level classification. The other aggregated tweet-level features to user-level features and classified the users themselves. Different classification algorithms were examined and compared using 4 measures: precision, recall, F1 score, and the area under the receiver operating characteristic curve. In the second stage, a classifier from the first stage was used to collect patients' tweets describing the different lifestyles patients adopt to deal with their disease. Using IBM Watson Service for entity sentiment analysis, we calculated the average sentiment of 420 lifestyle-related words that patients with IBD use when describing their daily routine.

Results: Both classification approaches showed promising results. Although the precision rates were slightly higher for the tweet-level approach, the recall and area under the receiver operating characteristic curve of the user-level approach were significantly better. Sentiment analysis of tweets written by patients with IBD identified frequently mentioned lifestyles and their influence on patients’ well-being. The findings reinforced what is known about suitable nutrition for IBD as several foods known to cause inflammation were pointed out in negative sentiment, whereas relaxing activities and anti-inflammatory foods surfaced in a positive context.

Conclusions: This study suggests a pipeline for identifying patients with IBD on Twitter and collecting their tweets to analyze the experimental knowledge they share. These methods can be adapted to other diseases and enhance medical research on chronic conditions.
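The two classification approaches described in the Methods can be contrasted in a small sketch, together with three of the four evaluation measures (precision, recall, F1). Everything concrete here is an assumption made for illustration: `tweet_has_cue` is a toy keyword rule standing in for a trained tweet-level model, and the cue phrases and the 0.3 threshold are invented; none of them come from the study.

```python
from statistics import mean

# Hypothetical cue phrases; a real system would use learned features instead.
PATIENT_CUES = ("my flare", "my crohn", "my colitis")


def tweet_has_cue(tweet: str) -> float:
    """Toy tweet-level feature: 1.0 if any cue phrase appears, else 0.0."""
    text = tweet.lower()
    return 1.0 if any(cue in text for cue in PATIENT_CUES) else 0.0


def classify_by_tweet_votes(tweets: list[str]) -> bool:
    """Approach 1: classify each tweet, then deduce the user's class by majority vote."""
    tweet_labels = [tweet_has_cue(t) == 1.0 for t in tweets]
    return sum(tweet_labels) > len(tweet_labels) / 2


def classify_by_user_features(tweets: list[str], threshold: float = 0.3) -> bool:
    """Approach 2: aggregate tweet-level features to one user-level feature, then classify."""
    user_feature = mean(tweet_has_cue(t) for t in tweets) if tweets else 0.0
    return user_feature >= threshold


def precision_recall_f1(y_true: list[bool], y_pred: list[bool]) -> tuple[float, float, float]:
    """Precision, recall and F1 for the positive class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up user who mentions their condition in one of three tweets.
tweets = ["My flare is back", "Nice weather", "Lunch"]
by_votes = classify_by_tweet_votes(tweets)
by_user = classify_by_user_features(tweets)
scores = precision_recall_f1([True, True, False, False], [True, False, True, False])
```

In this toy case the majority vote misses the user while the aggregated user-level feature does not. The example only shows how the two approaches can disagree and says nothing about their relative performance on real data.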
... Social media users frequently watch videos made by people they find interesting, which can lead to interactions with random people and, although less frequently, celebrities (Murthy, 2012). Pennacchiotti & Popescu (2011) stated that millions of users now regularly use popular micro-blogging platforms such as Twitter. Micro-blogging platforms are utilised as forums for information sharing, real-time news sources, recommendation services, and keeping in touch with friends, family, and strangers. ...
... Information concerning a user's demographic characteristics or personal interests and those of other site users might significantly enhance their experience with a micro-blogging service. Such data might enable tailored suggestions for individuals to follow or articles to be read by users; moreover, it could emphasise events and subjects of interest to specific groups (Pennacchiotti & Popescu, 2011). ...
Article
Cyberbullying among K-pop fans in Malaysia is still unclear, but the consequences of this behavior are becoming a growing problem. The problem of cyberbullying cannot be simplified with Malaysia's current laws. There is an urgent need for K-pop fans on Twitter to understand cyberbullying and find ways to combat it. In this study, the researcher highlights the most common types of cyberbullying among K-pop fans on Twitter. The aim of this study is to understand cyberbullying on Twitter among K-pop fans in Malaysia. It specifically examines the community of K-pop fans on Twitter. There are 3 research objectives in this study, namely (i) to discover the most common types of cyberbullying among K-pop fans on Twitter, (ii) to understand the social activities that facilitate cyberbullying among K-pop fans on Twitter, and (iii) to determine whether K-pop fans on Twitter are aware of ways to combat cyberbullying. This research is based on data obtained through in-depth interviews with mass communication students. The data were analyzed using thematic analysis to determine the most common types of cyberbullying and ways to combat cyberbullying. The study found that K-pop fans know what behavior is considered cyberbullying, what types of cyberbullying to expect and what can be done about it. The results of the study also suggest that a different approach needs to be taken to better understand how to combat cyberbullying. It is hoped that the study can help to improve the prevention of cyberbullying on Twitter among K-pop fans. Keywords: Cyberbullying, Fans, K-pop, Twitter
... You and your tea-publicans had this planned for over 3 yrs. You own it #JustVote Therefore, we argue that the problem of inferring message impartiality is distinct from the task of inferring author affiliation-a task which was extensively studied in prior work (Conover et al. 2011a;Pennacchiotti and Popescu 2011;Zamal, Liu, and Ruths 2012;Wong et al. 2013). ...
... Another line of research has focused on detecting the political leaning of individual users in social media (Conover et al. 2011a;Pennacchiotti and Popescu 2011;Zamal, Liu, and Ruths 2012;Wong et al. 2013). As discussed in the introduction, the problem of measuring the impartiality of an individual message is crucially different from that of detecting its author affiliation. ...
Article
Discourse on social media platforms is often plagued by acute polarization, with different camps promoting different perspectives on the issue at hand—compare, for example, the differences in the liberal and conservative discourse on the U.S. immigration debate. A large body of research has studied this phenomenon by focusing on the affiliation of groups and individuals. We propose a new finer-grained perspective: studying the impartiality of individual messages. While the notion of message impartiality is quite intuitive, the lack of an objective definition and of a way to measure it directly has largely obstructed scientific examination. In this work we operationalize message impartiality in terms of how discernible the affiliation of its author is, and introduce a methodology for quantifying it automatically. Unlike a supervised machine learning approach, our method can be used in the context of emerging events where impartiality labels are not immediately available. Our framework enables us to study the effects of (im)partiality on social media discussions at scale. We show that this phenomenon is highly consequential, with partial messages being twice more likely to spread than impartial ones, even after controlling for author and topic. By taking this fine-grained approach to polarization, we also provide new insights into the temporal evolution of online discussions centered around major political and sporting events.
... Boutet, Kim, and Yoneki (2012) proposed an algorithm designed to identify users' political leanings relating to their political parties, and Gómez-Suta, Echeverry-Correa, and Soto-Mejía (2023) incorporated topic modeling for stance detection. However, Pennacchiotti and Popescu (2011b) leveraged machine learning techniques based on network features, and Rao, Yarowsky, Shreevats, and Gupta (2010) implemented a stacked SVM-based classification algorithm. Finally, Zotova, Agerri, and Rigau (2021) contributed to the field by semi-automatically creating labeled datasets for stance detection on Twitter. ...
Article
Social media platforms play a significant role in political discourse, often serving as tools for political actors to disseminate partisan narratives, frequently encapsulated in concise slogans presented as hashtags. In this paper, we present a novel systematic framework leveraging network science tools and clustering algorithms to discern the political orientations of posts through their associated hashtags, that can be used in the context of opinion dynamics. Our results show that by applying this framework within the context of the 2022 Italian Elections, we successfully quantify the online activity of political coalitions and their supporters pre and post-election. By analyzing labeled posts derived from this framework we find a surge in user activity leading up to the election, followed by a pronounced decline afterward. Moreover, we note a remarkable shift in engagement toward the winning coalition post-election. Interestingly, at the coalition level, our findings reveal an inverse correlation between posting activity and the level of engagement received on social media platforms. Finally, a rank-size analysis of publication patterns among supporters during the pre-election period highlighted comparable trends in content generation across coalitions.
... As such, a parallel equivalence with content classification in social media was made. Social media content classification is typically used for sentiment analysis [78], user analysis [79] and topic classification [80]. Topical classification was selected for this research since it deals with the grouping of social media content and is most aligned with records classification. ...
Thesis
The rise of social media is changing accepted customs and institutional practices in news media, politics, law enforcement, advertising, commerce, and human-cultural interaction in general. Much of the impact of social media is due to the content provided by users. The volume, variety and unstructured nature of Social Media Content make it difficult to manage for the public users. The absence of proper management of social media content leaves the society vulnerable to service provider over-exploitation for profit, long-term surveillance, cyber crime, fake news and mental health problems, amongst other challenges. The research detailed in this document proposes a Records Management (RM) approach to the management of social media content. The RM approach fundamentally consists of three elements: the selection of high value content as records, classification of the selected records and timely disposition of content when no longer useful. The research thesis addresses the first two elements of Record Selection and Record Classification, and presents a rationale for deferring the last element, Record Retention and Disposition, to a future study. Record Selection is based on X social media posting references already curated as potential records by news media sources. For Record Classification, the thesis employed a host of supervised, semi-supervised and unsupervised machine learning classification methods as part of a novel theoretical framework named Grounded Text Mining (GTxM). The GTxM framework provides a quality controlled, theoretically prudent approach to the selection and classification of social media records from a collection of social media content. The framework consists of four sub-frameworks: Data Collection, Supervised Text Classification, Computational Grounded Theory, and Classify-Cluster-Label. Data Collection was based on X postings from news articles published on the British Broadcasting Corporation and New York Times websites.
Initial exploration of the research data revealed the presence of Ground Truth Data (GTD) which were properly labeled, pre-classified data by the news publishers. The initial GTD was used to set up the GTxM Classifier in the Supervised Text Classification sub-framework, while the rest of the data was sent to the Computational Grounded Theory sub-framework for inductive discovery of new and emergent classes from the data. Newly discovered classes were verified through an intercoder reliability scheme in the Classify-Cluster-Label sub-framework as a quality control conduit prior to acceptance and onboarding into the GTD for incremental training of the GTxM Classifier. Illustrated through six continuum passes, the research showed that inductive, qualitative analysis of textual data using machine learning and grounded theory can produce a better understanding of social media content as records. At the most fundamental level of record appraisal, the research concluded that social media content contains both records and non-records, an essential condition for record selection. Concretely, the study produced nine social media record classifications and one non-record classification from the research dataset, with the GTxM Classifier demonstrating over 90% cross-validated accuracy and F1 scores in its prediction of records. The research proved that by applying computing methods to the problem of selection and classification of records, it is possible to adopt the RM approach as a solution to the challenges of social media content.
... Past work in the area of network classification has primarily focused on distinguishing networks from different categories using two different broad classes of approaches. In the first approach, network classification is carried out by examining certain specific structural features and investigating whether networks belonging to the same category are similar across one or more dimensions as defined by these features [5,6,7,8]. In other words, in this approach the investigator manually chooses the structural characteristics of interest and more or less manually (informally) determines the regions of the feature space that correspond to different classes. ...
Preprint
Network representations of systems from various scientific and societal domains are neither completely random nor fully regular, but instead appear to contain recurring structural building blocks. These features tend to be shared by networks belonging to the same broad class, such as the class of social networks or the class of biological networks. At a finer scale of classification within each such class, networks describing more similar systems tend to have more similar features. This occurs presumably because networks representing similar purposes or constructions would be expected to be generated by a shared set of domain specific mechanisms, and it should therefore be possible to classify these networks into categories based on their features at various structural levels. Here we describe and demonstrate a new, hybrid approach that combines manual selection of features of potential interest with existing automated classification methods. In particular, selecting well-known and well-studied features that have been used throughout social network analysis and network science and then classifying with methods such as random forests that are of special utility in the presence of feature collinearity, we find that we achieve higher accuracy, in shorter computation time, with greater interpretability of the network classification results.
... Another study showed that applying machine learning techniques to classify political leanings on Twitter based on political party messages can reveal partisanship among users [7]. A number of studies present a comparison between the predictive power of the users' social connections and their content sharing patterns for inferring political affiliation, ethnicity identification and detecting affinity for a particular business [23,24]. ...
Preprint
In this paper, we are interested in understanding the interrelationships between mainstream and social media in forming public opinion during mass crises, specifically with regard to how events are framed in the mainstream news and on social networks and how the language used in those frames may allow political slant and partisanship to be inferred. We study the linguistic choices for political agenda setting in mainstream and social media by analyzing a dataset of more than 40M tweets and more than 4M news articles from the mass protests in Ukraine during 2013-2014 - known as "Euromaidan" - and the post-Euromaidan conflict between Russian, pro-Russian and Ukrainian forces in eastern Ukraine and Crimea. We design a natural language processing algorithm to analyze at scale the linguistic markers which point to a particular political leaning in online media and show that political slant in news articles and Twitter posts can be inferred with a high level of accuracy. These findings allow us to better understand the dynamics of partisan opinion formation during mass crises and the interplay between mainstream and social media in such circumstances.
... To get the demographic information for individuals, extant works have either used their profile images, names, and other metadata [30][31][32] or have imputed these variables. This raises critical privacy (scraping metadata from users' profiles to get demographic information) and ethical (is it correct to do so without explicit consent) questions [33,34], as well as questions about accuracy: when pre-trained image models are used to infer gender, for example, they are less accurate for dark-skinned individuals [35,36]. ...
Article
Around seven-in-ten Americans use social media (SM) to connect and engage, making these platforms excellent sources of information to understand human behavior and other problems relevant to social sciences. While the presence of a behavior can be detected, it is unclear who generated the behavior or under what circumstances. Despite the large sample sizes of SM datasets, they almost always come with significant biases, some of which have been studied before. Here, we hypothesize the presence of a largely unrecognized form of bias on SM platforms, called participation bias, that is distinct from selection bias. It is defined as the skew in the demographics of the participants who opt in to discussions of the topic, compared to the demographics of the underlying SM platform. To infer the participants’ demographics, we propose a novel generative probabilistic framework that links surveys and SM data at the granularity of demographic subgroups (and not individuals). Our method is distinct from existing approaches that elicit such information at the individual level using their profile name, images, and other metadata, thus infringing upon their privacy. We design a statistical simulation to simulate multiple SM platforms and a diverse range of topics to validate the model’s estimates in different scenarios. We use Twitter data as a case study to demonstrate participation bias on the topic of gun violence delineated by political party affiliation and gender. Although Twitter’s user population leans Democratic and has an equal number of men and women according to Pew, our model’s estimates point to the presence of participation bias on the topic of gun control in the opposite direction, with slightly more Republicans than Democrats, and more men compared to women.
Our study cautions that in the rush to use digital data for decision-making and understanding public opinions, we must account for the biases inherent in how SM data are produced, lest we may also arrive at biased inferences about the public.
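The definition of participation bias given above (the skew between the demographics of topic participants and those of the underlying platform) can be illustrated with simple arithmetic. The sketch below compares each subgroup's share among participants with its share on the platform. The group names and counts are made up, and the share difference is only an illustrative measure chosen for this example; it is not the generative probabilistic framework the abstract proposes.

```python
def shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw subgroup counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def participation_skew(
    platform_counts: dict[str, int], participant_counts: dict[str, int]
) -> dict[str, float]:
    """Participant share minus platform share, per subgroup.

    Positive values mean the subgroup is over-represented in the topic
    discussion relative to the platform as a whole.
    """
    platform = shares(platform_counts)
    participants = shares(participant_counts)
    return {g: participants.get(g, 0.0) - platform[g] for g in platform}


# Made-up counts: group_a is the majority on the platform but the
# minority among users who discuss the topic.
platform = {"group_a": 600, "group_b": 400}
participants = {"group_a": 45, "group_b": 55}
skew = participation_skew(platform, participants)
```

Here `skew` is negative for `group_a` and positive for `group_b`, meaning the topic discussion leans in the opposite direction from the platform's overall population.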
... location, while the rest provide either general locations (e.g., states/provinces, countries) or nonexistent places. Pennacchiotti and Popescu (2011) conducted a pilot study of a similar nature to assess direct use of public profile information, such as gender and ethnicity, from Twitter. In a corpus of 14M active users in April 2010, they found 48% of users provided a short bio and 80% a location. ...
... Analyzing users and their behavior on online social networks has been the subject of many previous works [7,8,9]. The particular domain of the Twitter microblogging service has not been an exception. ...
Article
People use microblogging platforms like Twitter to engage with other users around a wide range of interests and practices. Twitter profiles are run by different types of users such as humans, bots, spammers, businesses and professionals. This research uses a treemap visualization to identify different user profiles on Twitter. For this purpose, we exploit users' profile and tweeting behavior information. We evaluate our approach by visualizing the different Twitter profiles. This treemap visualization technique can be used to easily identify the different user profiles among a wide range of users. We focus just on user activity, ignoring the content of messages. We take into consideration both social interactions and tweeting patterns, which allow us to profile users according to their activity patterns using treemaps.
... Online behavior is representative of many aspects of a user's demographics [Rao et al., 2010, Pennacchiotti and Popescu, 2011]. ...
Thesis
Online Social Networks (OSN) are full of personal information such as gender, age, relationship status. The popularity and growth of OSN have rendered their platforms vulnerable to malicious activities and increased user privacy concerns. The privacy settings available in OSN do not prevent users from attribute inference attacks where an attacker seeks to illegitimately obtain their personal attributes (such as the gender attribute) from publicly available information. Disclosure of personal information can have serious outcomes such as personal spam, bullying, profile cloning for malicious activities, or sexual harassment. Existing inference techniques are either based on the target user behavior analysis through their liked pages and group memberships or based on the target user friend list. However, in real cases, the amount of available information to an attacker is small since users have realized the vulnerability of standard attribute inference attacks and concealed their generated information. To increase awareness of OSN users about threats to their privacy, in this thesis, we introduce a new class of attribute inference attacks against OSN users. We show the feasibility of these attacks from a very limited amount of data. They are applicable even when users hide all their profile information and their own comments. Our proposed methodology is to analyze Facebook picture metadata, namely (i) alt-text generated by Facebook to describe picture contents, and (ii) commenters’ words and emojis preferences while commenting underneath the picture, to infer sensitive attributes of the picture owner. We show how to launch these inference attacks on any Facebook user by i) handling online newly discovered vocabulary using a retrofitting process to enrich a core vocabulary that was built during offline training and ii) computing several embeddings for textual units (e.g., word, emoji), each one depending on a specific attribute value. 
Finally, we introduce ProPic, a protection mechanism that selects comments to be hidden in a computationally efficient way while minimizing utility loss according to a semantic measure. The proposed mechanism can help end-users to check their vulnerability to inference attacks and suggests comments to be hidden in order to mitigate the attacks. We have determined the success of the attacks and the protection mechanism by experiments on real data.
Article
The layout and site selection strategy of commercial facilities are crucial for both enterprise performance and market image, while also significantly impacting the overall planning of urban commercial environments. However, conventional methods of choosing sites sometimes depend on outdated management information systems or static statistical models, which may not take into account all relevant factors and have poor data quality. By utilizing geographical big data and geographical artificial intelligence, this study improves the viability of commercial layout and site selection methods. This study utilizes mobile phone signaling data from Beijing combined with point-of-interest (POI) data from within the Sixth Ring Road of Beijing to identify user behaviors using algorithms. Through a combination of BiLSTM-RF and reinforcement learning algorithms, a population location prediction algorithm is constructed to address the issues of inaccurate and outdated population flow data in commercial site selection. The forecast distribution has a high level of accuracy, with a prediction accuracy rate of 73.2%. Additionally, based on geographical big data, the urban landscape is reconstructed to create a 3D model of Beijing. An immersive interactive commercial site selection system is implemented using the Unreal Engine.
Chapter
Communication and information sharing between individuals have changed dramatically as a result of the internet and social media platforms combined with explosive growth. Due to the development and expansion of online technology, a vast amount of data is generated as well as being available on the web for internet users. There has been a noticeable rise in the usage of foul language in user comments. Furthermore, Twitter is a well-known medium where users are allowed to voice any opinion. The proliferation of antisocial activities, including hate speech, cyberbullying, and the use of harsh language, is a direct result of overuse of social media. The prevalence of hostile comments on social media platforms calls for the development of practical and efficient solutions. Consequently, numerous intriguing algorithms have been developed to identify these kinds of languages. The objective of this chapter is to introduce strategies, equations, and techniques that could be used to evaluate Twitter tweets to identify instances of abusive language.
Article
Political polarization is commonly observed in democratic countries. While it allows individual citizens to freely choose sides, it also causes the problem of separation and isolation. Especially in information-seeking behaviors, echo chambers and filter bubbles are observed. In this paper, we present a political sentiment dictionary for analyzing political polarization and increasing information heterogeneity. It takes advantage of large-scale social media data and is thus superior in accuracy and coverage compared to manually crafted dictionaries. Generated from Japanese tweets, more than 50k words in this dictionary cover aspects ranging from political parties and public entities to foods and personal hobbies. We describe in detail the method to construct this dictionary, which can be replicated for other languages and countries. We demonstrate the use of this dictionary in the application of recommendation diversification. We show with real-world e-commerce data that the use of the dictionary can generally increase the diversity in product recommendations, effectively mitigating the filter bubbles.
Conference Paper
This article proposes a new approach to partitioning categorical data for applying differential privacy to Gradient Boosting Decision Trees. We study improvements in the handling of categorical attributes and the random selection of split points while offering differential privacy guarantees. Our approach defines a new gain function for these attributes and determines the sensitivity bounds of that function. In addition, we carry out an empirical analysis on 6 real datasets, showing that the proposed approach achieves error rates lower than or equal to those of the baseline models.
Chapter
This chapter presents an insight into how collectives are built around the use of online apps and social networks. The chapter explores how large collectives emerge as the accidental result of individuals’ quest for self-affirmation and social recognition among a relatively limited group of contacts. The individual user of social media apps shares their daily activities and achievements with a small community of followers in order to receive their appreciation in the form of comments, likes and shares. Whilst the user and their community operate under the presumption that such data are limited and restricted to the boundaries of their own network, the software that runs the entire digital system harvests and manipulates their data, aggregating them with other networks to generate trends and improve the system on offer. Such machine learning systems operate as overarching agencies working across communities, contacts and individuals, constructing larger, interrelated and anonymous collectives that are not visible or accessible to the individuals who form them. The software is the only entity with utter control of such collectives. Through a number of case studies, this chapter explains how such systems work and operate silently in the background of our daily activities. It provides an account of how the private and the public life of individuals are becoming increasingly mediated by software without their full awareness. This chapter presents an analysis of some of the algorithmic logics that run the transition between private and public lives of individuals and construct large software-driven collectives. This analysis aims at exposing and commenting on some of the mechanisms that underpin the silent making of the public life as controlled by machines.
Article
Social movements form coalitions to gain leverage and achieve mutual goals; however, little is known about how coalitions work, especially in the realm of social media. In this paper we examine the 2020 #StopHateForProfit coalition, which pressured corporations to pull their advertising spending from Facebook because of its permissive content moderation policies toward disinformation and hate. From the digital traces of the campaign on Twitter, we explain the participation differentials among coalition social movement organisation (SMO) partners and their followers. The findings show that the coalition's centrality to movement agenda, the ideological homogeneity of followership, and the SMO partners and their followership's central positions in the communication network led to the highest and most time-persistent participation rates. Our counter-intuitive findings extend the literature on social movement coalitions by suggesting that multi-issue, “big tent” movements with ideological breadth may find invoking the core of their large followership rather challenging despite the ease of participation afforded by social media.
Article
Full-text available
This study proposes a quantitative method to assess the pertinence of political language on national issues, addressing the complexity of analyzing political discourse and its relevance to citizens’ concerns. Using word embeddings and linguistic models trained on Wikipedia, a “pertinence score” was developed to measure the relevance of political discourse in contexts such as the economy and health. The method was applied to the 2018 Colombian presidential election, revealing significant differences in thematic pertinence between candidates. Survey validation confirmed the correlation between automatic and human scores, highlighting the model’s ability to discriminate ideological positions through lexical analysis.
Article
Full-text available
Our social identities determine how we interact and engage with the world surrounding us. In online settings, individuals can make these identities explicit by including them in their public biography, possibly signaling a change in what is important to them and how they should be viewed. While there is evidence suggesting the impact of intentional identity disclosure in online social platforms, its actual effect on engagement activities at the user level has yet to be explored. Here, we perform the first large-scale study on Twitter that examines behavioral changes following identity disclosure on Twitter profiles. Combining social networks with methods from natural language processing and quasi-experimental analyses, we discover that after disclosing an identity on their profiles, users (1) tweet and retweet more in a way that aligns with their respective identities, and (2) connect more with users that disclose similar identities. We also examine whether disclosing the identity increases the chance of being targeted for offensive comments and find that in fact (3) the combined effect of disclosing identity via both tweets and profiles is associated with a reduced number of offensive replies from others. Our findings highlight that the decision to disclose one’s identity in online spaces can lead to substantial changes in how they express themselves or forge connections, with a lesser degree of negative consequences than anticipated.
Article
Full-text available
This paper presents a systematic review to identify research combining artificial intelligence (AI) algorithms with open source intelligence (OSINT) applications and practices. Currently, there is a lack of compilation of these approaches in the research domain, and similar systematic reviews do not include research that postdates the year 2019. This systematic review attempts to fill this gap by identifying recent research. The review used the preferred reporting items for systematic reviews and meta-analyses and identified 163 research articles focusing on OSINT applications leveraging AI algorithms. This systematic review outlines several research questions concerning meta-analysis of the included research and seeks to identify research limitations and future directions in this area. The review identifies that research gaps exist in the following areas: incorporation of pre-existing OSINT tools with AI, the creation of AI-based OSINT models that apply to penetration testing, underutilisation of alternate data sources and the incorporation of dissemination functionality. The review additionally identifies future research directions in AI-based OSINT research in the following areas: multi-lingual support, incorporation of additional data sources, improved model robustness against data poisoning, integration with live applications, real-world use, the addition of alert generation for dissemination purposes and incorporation of algorithms for use in planning.
Article
Full-text available
This study aims to examine the demographics of participants engaged in scholarly communication on Twitter, which has been rebranded as X. Firstly, based on a dataset of tweets citing COVID-19 publications, it proposes a more precise classification system consisting of eleven user categories for individuals who tweeted academic publications. Secondly, it explores the effectiveness of graph neural network models (GNNs) in combination with a transformer-based text classification model (specifically, BERT) to classify these newly defined user categories. The findings of this research highlight that GNNs can effectively interpret the social networks within scholarly communication, and complement text classification models in characterizing user types. The best-performing model achieved an accuracy rate of 84.05 percent in classifying user categories for a dataset of 10,048 labeled users. Subsequently, this model was employed to analyze 393,030 tweeters in our dataset. The analysis revealed that relevant scholarly discussion on Twitter was dominated by members of the general public (over 71 percent). Academic researchers and institutions constituted 12.48 percent, while health science professionals and institutions made up 7.35 percent of the contributors to relevant scholarly discussions on Twitter. Notably, academic publishers and research feed accounts exhibited aggressive tweeting behaviors and were responsible for the highest volume of tweets on average. This study also demonstrates the active involvement of various non-academic members, including commercial businesses, mass media outlets, public authorities, politicians, and civil society organizations, in Twitter scholarly communication.
Chapter
Users with diverse spiritual moods and morals react differently to online content. Highly toxic submissions may have no noticeable impact on some users, while even mildly toxic content may provoke another group to stop participating in social media. The same is true of interventions by platform holders: one punishing moderation may lead a user to respect the community rules to a certain point, while the same action may provoke a violent and offensive reaction from another user. In that regard, the moderation interventions of the platforms should be fine-grained and based on user behavior. They should also try to protect the communities the user cares about the most, as long as the user, in turn, respects the content policies. The aim of the current study is to classify users into various behavioral groups, which can potentially provide the chance to adopt more efficient moderation measures to protect the community and give the user the feeling that he really deserves the moderation intervention that he has experienced. Thus, the behavior of the core users of an already-banned controversial subreddit was taken into consideration, and a machine learning-based classification strategy was applied to their activity level and their submission toxicity scores. Results have revealed interesting behavioral differences between Reddit users in their response to the moderation actions taken and have indicated the necessity of adopting fine-level measures for protecting the platform, as well as the different behavioral groups.
Chapter
In this paper, we study the behavior of users on Online Social Networks in the context of Covid-19 vaccines in Italy. We identify two main polarized communities: Provax and Novax. We find that Novax users are more active, more clustered in the network, and share less reliable information compared to the Provax users. On average, Novax are more toxic than Provax. However, starting from June 2021, the Provax became more toxic than the Novax. We show that the change in trend is explained by the aggregation of some contagion effects and the change in the activity level within communities. In fact, we establish that Provax users who increase their intensity of activity after May 2021 are significantly more toxic than the other users, shifting the toxicity up within the Provax community. Our study suggests that users presenting a spiky activity pattern tend to be more toxic.
Article
Predicting the demographics of Twitter users has become a problem with a large interest in computational social sciences. However, the limited amount of public datasets with ground truth labels and the tremendous costs of hand-labeling make this task particularly challenging. Recently, programmatic weak supervision has emerged as a new framework to train classifiers on noisy data with minimal human labeling effort. In this paper, demographic prediction is framed for the first time as a programmatic weak supervision problem. A new three-step methodology for gender, age category, and location prediction is provided, which outperforms traditional programmatic weak supervision and is competitive with the state-of-the-art deep learning model. The study is performed in Flanders, a small Dutch-speaking European region, characterized by a limited number of user profiles and tweets. An evaluation conducted on an independent hand-labeled test set shows that the proposed methodology can be generalized to unseen users within the geographic area of interest.
Article
Full-text available
The COVID-19 pandemic demonstrated the importance of social distancing practices to stem the spread of the virus. However, compliance with public health guidelines was mixed. Understanding what factors are associated with differences in compliance can improve public health messaging since messages could be targeted and tailored to different population segments. We utilize Twitter data on social mobility during COVID-19 to reveal which populations practiced social distancing and what factors correlated with this practice. We analyze correlations between demographic and political affiliation with reductions in physical mobility measured by public geolocation tweets. We find significant differences in mobility reduction between these groups in the United States. We observe that males, Asian and Latinx individuals, older individuals, Democrats, and people from higher population density states exhibited larger reductions in movement. Furthermore, our study also unveils meaningful insights into the interactions between different groups. We hope these findings will provide evidence to support public health policy-making.
Article
Full-text available
Artificial intelligence (AI) and machine learning (ML) have revolutionized the way health organizations approach social media. The sheer volume of data generated through social media can be overwhelming, but AI and ML can help organizations effectively manage this information to improve telehealth, remote patient monitoring, and the well-being of individuals and communities. Previous research has revealed several trends in AI–ML adoption: First, AI can be used to enhance social media marketing. Drawing on sentiment analysis and related tools, social media is an effective way to increase brand awareness and customer engagement. Second, social media can become a very useful data collection tool when integrated with new AI–ML technologies. Using this function well requires researchers and practitioners to protect users’ privacy carefully, such as through the deployment of privacy-enhancing technologies (PETs). Third, AI–ML enables organizations to maintain a long-term relationship with stakeholders. Chatbots and related tools can increase users’ ability to receive personalized content. The review in this paper identifies research gaps in the literature. In view of these gaps, the paper proposes a conceptual framework that highlights essential components for better utilizing AI and ML. Additionally, it enables researchers and practitioners to better design social media platforms that minimize the spread of misinformation and address ethical concerns more readily. It also provides insights into the adoption of AI and ML in the context of remote patient monitoring and telehealth within social media platforms.
Chapter
Author profiling (AP) is a research field that can be involved in many applications, such as information retrieval, social network security, and recommender systems. This paper presents an in-depth literature review on AP techniques, concentrating on text mining approaches. Text mining-based AP techniques can be categorized into three main classes: linguistic-based AP, statistical-based AP and a hybrid approach that combines both linguistic and statistical methods. The literature review also shows the extensive use of classical Machine Learning and Deep Learning in this field. In addition, we discuss the presented models and the main challenges and trends in the AP domain. Keywords: Author profiling, Text Mining, Machine Learning
Article
Full-text available
The adaptive social learning paradigm helps model how networked agents are able to form opinions on a state of nature and track its drifts in a changing environment. In this framework, the agents repeatedly update their beliefs based on private observations and exchange the beliefs with their neighbors. In this work, it is shown how the sequence of publicly exchanged beliefs over time allows users to discover rich information about the underlying network topology and about the flow of information over the graph. In particular, it is shown that it is possible (i) to identify the influence of each individual agent to the objective of truth learning, (ii) to discover how well-informed each agent is, (iii) to quantify the pairwise influences between agents, and (iv) to learn the underlying network topology. The algorithm derived herein is also able to work under non-stationary environments where either the true state of nature or the graph topology are allowed to drift over time. We apply the proposed algorithm to different subnetworks of Twitter users, and identify the most influential and central agents by using their public tweets (posts).
Article
Full-text available
Interest in healthcare has grown significantly worldwide, especially since the Covid-19 outbreak. Digitalisation has allowed users to interact on social networks through platforms like Twitter, collecting user interactions over time, resulting in the proliferation of fake news. This research aims to analyse, evaluate and classify the predictive potential of Twitter analytics in healthcare, identifying the latent knowledge insights and distinguishing them from related rumours and fake news. Thus, a systematic literature review (SLR) is carried out to identify and analyse the existing academic research and applications in Twitter in predicting healthcare. The most important predictive applications are detecting mental health issues and public health emergencies. Covid-19 has been the main topic of most of the studies linked to fake news and misinformation, and this research provides a practical contribution to the use of unstructured data from Twitter and raises awareness of the importance of this content applied to healthcare. Therefore, it is pertinent to focus on the advances offered by these data as a predictive tool in healthcare since it is essential, to this end, to evaluate the veracity of the information shared on Twitter.
Article
Full-text available
Democracies around the world face the threat of manipulation of their electorates via coordinated online influence campaigns. Researchers have responded by developing valuable methods for finding automated accounts and identifying false information, but these valiant efforts often fall into a cat-and-mouse game with perpetrators who constantly change their behavior. This has forced several researchers to go beyond the detection of individual malicious actors by instead identifying the coordinated activity that propels potent information operations. In this vein, we provide rigorous quantitative evidence for the notion that sudden increases in Twitter account creations may provide early warnings of online information operations. Analysis of fourteen months of tweets discussing the 2020 U.S. elections revealed that accounts created during bursts exhibited more similar behavior, showed more agreement on mail-in voting and mask wearing, and were more likely to be bots and share links to low-credibility sites. In concert with other techniques for detecting nefarious activity, social media platforms could temporarily limit the influence of accounts created during these bursts. Given the advantages of combining multiple anti-misinformation methods, we join others in presenting a case for the need to develop more integrable methods for countering online influence campaigns. Supplementary information: The online version contains supplementary material available at 10.1186/s40537-023-00695-7.
Article
Full-text available
This paper presents a novel author profiling method specially aimed at classifying social network users into the multidimensional perspectives for social business intelligence (SBI) applications. In this scenario, since the user profiles are defined on demand for each particular SBI application, we cannot assume the existence of labelled datasets for training purposes. Thus, we propose an unsupervised method to obtain the required labelled datasets for training the profile classifiers. Contrary to other author profiling approaches in the literature, we only make use of the users' descriptions, which are usually part of the post metadata. We exhaustively evaluated the proposed method under four different tasks for multidimensional author profiling along with state-of-the-art text classifiers. We achieved F1 scores of around 88% and 98% on a gold standard and a silver standard dataset, respectively. Additionally, we compare our results to other supervised approaches previously proposed for two of our tasks, obtaining very close performance despite using an unsupervised method. To the best of our knowledge, this is the first method designed to label user profiles in an unsupervised way for training profile classifiers with a similar performance to fully supervised ones.
Article
Full-text available
Most studies analyzing political traffic on social networks focus on a single platform, while campaigns and reactions to political events produce interactions across different social media. Ignoring such cross-platform traffic may lead to analytical errors, missing important interactions across social media that, for example, explain the cause of trending or viral discussions. This work links Twitter and YouTube social networks using cross-postings of video URLs on Twitter to discover the main tendencies and preferences of the electorate, distinguish users and communities’ favouritism towards an ideology or candidate, study the sentiment towards candidates and political events, and measure political homophily. This study shows that Twitter communities correlate with YouTube comment communities: that is, Twitter users belonging to the same community in the Retweet graph tend to post YouTube video links with comments from YouTube users belonging to the same community in the YouTube Comment graph. Specifically, we identify Twitter and YouTube communities, we measure their similarity and differences and show the interactions and the correlation between the largest communities on YouTube and Twitter. To achieve that, we have gathered a dataset of approximately 20M tweets and the comments of 29K YouTube videos; we present the volume, the sentiment, and the communities formed in YouTube and Twitter graphs, and publish a representative sample of the dataset, as allowed by the corresponding Twitter policy restrictions.
Article
Full-text available
Ideological homophily on social media has been receiving increased scholarly interest, as it is associated with the formation of filter bubbles, echo chambers, and increased ideological polarization. And yet, no linkage necessarily exists between ideological homophily, echo chambers, and polarization. Despite political interactions on social media taking place to a large extent between like-minded individuals, cross-cutting interactions are also frequent. Using Twitter data, we investigated the extent to which ideological homophily, echo chambers, and polarization occur together and characterize the network of political Twitter users during the 2017 election in Norway. Despite the presence of some degree of ideological homophily, we did not find evidence of echo chambers in the Norwegian political Twittersphere during the 2017 election. And yet, the retweet network is characterized by a significant degree of polarization across ideological blocs. Our findings support the thesis according to which polarization on social media may have drivers other than the technological deterministic effect of social media affordances enhancing the formation of online echo chambers.
Article
Full-text available
The focus of the paper is to use a single weibo from a user to predict whether the user account is verified, referred to as verified account prediction, on Sina Weibo. To the best of our knowledge, verified account prediction on Sina Weibo has not been studied. For better understanding of the prediction problem, a comprehensive data analysis of weibos related to verified accounts is conducted first. Then, verified account prediction is formulated as a sequence learning problem. Specifically, a weibo from a user is represented as a sequence of feature values by feature hashing and whether the user account is verified is the corresponding label to predict. A deep learning approach is proposed for solving verified account prediction in this formulation. The proposed approach significantly outperforms the shallow learning methods in the comparisons in terms of accuracy and F1 by large margins in the experiments.
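The abstract above represents a post as a sequence of feature values obtained by feature hashing. A minimal sketch of that idea follows, assuming the post has already been tokenized; the function names, the MD5 hash, the bucket count and the padding scheme are all illustrative assumptions, not details taken from the paper.

```python
import hashlib


def hash_token(token: str, num_buckets: int = 1024) -> int:
    """Map a token to a stable bucket index in [0, num_buckets).

    MD5 is an assumed choice here; any stable hash would do.
    """
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def hash_sequence(tokens: list[str], num_buckets: int = 1024,
                  max_len: int = 8) -> list[int]:
    """Turn a tokenized post into a fixed-length sequence of hashed ids.

    Ids are shifted by 1 so that 0 can serve as the padding value.
    Sequences longer than max_len are truncated.
    """
    ids = [hash_token(t, num_buckets) + 1 for t in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    # Hypothetical, pre-tokenized post.
    print(hash_sequence(["verified", "account", "news"]))
```

Because the hash is deterministic, no vocabulary needs to be stored, at the cost of occasional collisions between unrelated tokens. The resulting integer sequence could then be fed to a sequence model, with the verified flag as the label.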
Article
Full-text available
Social media platforms have proved to be vital sources of information to support disaster response and recovery. A key issue, though, is that social media conversation about disasters tends to tail off after the immediate disaster response phase, potentially limiting the extent to which social media can be relied on to support recovery. This situation motivates the present study of social media usage patterns, including who contributes to social media around disaster recovery, which recovery activities they contribute to, and how well that participation is sustained over time. Utilising Twitter data from the 2019–20 Australian bushfires, we statistically examined the participation of different groups (citizens, emergency agencies, politicians and others) across categories of disaster recovery activity such as donations & financial support or mental health & emotional support, and observed variations over time. The results showed that user groups differed in how much they contributed on Twitter around different recovery activities, and their levels of participation varied with time. Recovery-related topics also varied significantly with time. These findings are valuable because they increase our understanding of which aspects of disaster recovery currently benefit most from social media and which are relatively neglected, indicating where to focus resources and recovery effort.
Article
Although previous studies described the role of ethnic factors in international trade, due to the difficulty of estimating the ethnicity of trade entities, such analysis was limited to a few ethnic groups, and the differences in the strength of factors across ethnic groups were not identified. By estimating corporate ethnicity using the large-scale surname data of corporate managers, we quantitatively compare and analyze the dependence of various ethnic groups on ethnicity. Asian and Middle Eastern ethnic groups have strong ethnic homophily. An analysis of ethnic factors in commerce between same-language countries using the gravity model suggests that the effect of ethnicity is significant even when language barriers are removed.
Article
Full-text available
Humans are naturally endowed with the ability to write in a particular style. They can, for instance, rephrase a formal letter in an informal way, convey a literal message with the use of figures of speech or edit a novel by mimicking the style of some well-known authors. Automating this form of creativity constitutes the goal of style transfer. As a natural language generation task, style transfer aims at rewriting existing texts, and specifically, it creates paraphrases that exhibit some desired stylistic attributes. From a practical perspective, it envisions beneficial applications, like chatbots that modulate their communicative style to appear empathetic, or systems that automatically simplify technical articles for a non-expert audience. Several style-aware paraphrasing methods have attempted to tackle style transfer. A handful of surveys give a methodological overview of the field, but they do not support researchers to focus on specific styles. With this paper, we aim at providing a comprehensive discussion of the styles that have received attention in the transfer task. We organize them in a hierarchy, highlighting the challenges for the definition of each of them and pointing out gaps in the current research landscape. The hierarchy comprises two main groups. One encompasses styles that people modulate arbitrarily, along the lines of registers and genres. The other group corresponds to unintentionally expressed styles, due to an author’s personal characteristics. Hence, our review shows how these groups relate to one another and where specific styles, including some that have not yet been explored, belong in the hierarchy. Moreover, we summarize the methods employed for different stylistic families, hinting researchers towards those that would be the most fitting for future research.
Article
In the social network, each user has attributes for self-description called user attributes which are semantically hierarchical. Attribute inference has become an essential way for social platforms to realize user classifications and targeted recommendations. Most existing approaches mainly focus on the flat inference problem neglecting the semantic hierarchy of user attributes which will cause serious inconsistency in multi-level tasks. In this article, we propose a multi-level model MLI, where information propagation part collects attribute information by mining the global graph structure, and the attribute correction part realizes the mutual correction between different levels of attributes. Further, we put forward the concept of generalized semantic tree, a way of representing the hierarchical structure of user attributes, whose nodes are allowed to have multiple parent nodes unlike the regular tree. Both regular and generalized semantic tree are commonly used in practice, and can be handled by our model. Besides, by making the inference start from sub-networks with sufficient attribute information, we design a “Ripple” algorithm to improve the efficiency and effectiveness of our model. For evaluation purposes, we conduct extensive verification experiments on DBLP datasets. The experimental results show the superior effect of MLI, compared with the state-of-the-art methods.
Chapter
People use microblogging platforms like Twitter to engage with other users across a wide range of interests and practices. Twitter profiles are run by different types of users such as humans, bots, spammers, businesses and professionals. This research uses a treemap visualization to identify different user profiles on Twitter. For this purpose, we exploit users’ profile and tweeting behavior information. We evaluate our approach by visualizing the different Twitter profiles. We focus just on user activity, ignoring the content of messages. We take into consideration both social interactions and tweeting patterns, which allow us to profile users according to their activity patterns using treemaps.
Article
Full-text available
We are entering an era in which online personalities and personas will grow faster and faster. People are tending to use the Internet, and social media especially, more frequently and for a wider variety of purposes. In parallel, a number of cultural spaces have already decided to invest in marketing and message spreading through the web and the media. Growing their audience, or locating the appropriate group of people to share their information, remains a tedious task within the chaotic environment of the Internet. The investment is mainly financial—usually large—and directed to advertisements. Still, there is much space for research and investment in analytics that can provide evidence considering the spreading of the word and finding groups of people interested in specific information or trending topics and influencers. In this paper, we present a part of a national project that aims to perform an analysis of Twitter’s trending topics. The main scope of the analysis is to provide a basic ordering on the topics based on their “importance”. Based on this, we clarify how cultural institutions can benefit from such an analysis in order to empower their online presence.
Conference Paper
We propose and evaluate a probabilistic framework for estimating a Twitter user's city-level location based purely on the content of the user's tweets, even in the absence of any other geospatial cues. By augmenting the massive human-powered sensing capabilities of Twitter and related microblogging services with content-derived location information, this framework can overcome the sparsity of geo-enabled features in these services and enable new location-based personalized information services, the targeting of regional advertisements, and so on. Three of the key features of the proposed approach are: (i) its reliance purely on tweet content, meaning no need for user IP information, private login information, or external knowledge bases; (ii) a classification component for automatically identifying words in tweets with a strong local geo-scope; and (iii) a lattice-based neighborhood smoothing model for refining a user's location estimate. The system estimates k possible locations for each user in descending order of confidence. On average we find that the location estimates converge quickly (needing just 100s of tweets), placing 51% of Twitter users within 100 miles of their actual location.
Conference Paper
How does the web search behavior of "rich" and "poor" people differ? Do men and women tend to click on different results for the same query? What are some queries almost exclusively issued by African Americans? These are some of the questions we address in this study. Our research combines three data sources: the query log of a major US-based web search engine, profile information provided by 28 million of its users (birth year, gender and ZIP code), and US-census information including detailed demographic information aggregated at the level of ZIP code. Through this combination we can annotate each query with, e.g., the average per-capita income in the ZIP code it originated from. Though conceptually simple, this combination immediately creates a powerful user modeling tool. The main contributions of this work are the following. First, we provide a demographic description of a large sample of search engine users in the US and show that it agrees well with the distribution of the US population. Second, we describe how different segments of the population differ in their search behavior, e.g. with respect to the queries they formulate or the URLs they click. Third, we explore applications of our methodology to improve web search relevance and to provide better query suggestions. These results enable a wide range of applications including improving web search and advertising where, for instance, targeted advertisements for "family vacations" could be adapted to the (expected) income.
Conference Paper
Mashups showing the geographic location of the authors of social media content are popular. They generally depend on the authors reporting their own location. For blogs, automated geolocation strategies using IP address and domain name are not adequate for determining an author's location. Instead, we detail textual geolocation techniques suitable for tagging social media data, facilitating development of geographic mashups and spatial reasoning tools.
Conference Paper
Accurate prediction of blogger age from evidence in the text and metadata of blog entries would be valuable for marketing, privacy, and law enforcement concerns. This paper offers an initial exploratory data analysis of candidate features for blogger age prediction.
Article
Prediction involves estimating the unknown value of an attribute of a system under study given the values of other measured attributes. In predictive (machine) learning the prediction rule is derived from data consisting of previously solved cases. Most methods for predictive learning originated many years ago at the dawn of the computer age. Recently two new techniques have emerged that have revitalized the field. These are support vector machines and boosted decision trees. This paper provides an introduction to these two new methods, tracing their respective ancestral roots to standard kernel methods and ordinary decision trees.
Article
Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent "boosting" paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such "TreeBoost" models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire and Friedman, Hastie and Tibshirani are discussed.
Article
A relationship among language, gender, and discourse genre has previously been observed in informal, spoken interaction and formal, written texts. This study investigates the language/gender/genre relationship in weblogs, a popular new mode of computer-mediated communication (CMC). Taking as the dependent variables stylistic features identified in machine learning research and popularized in a Web interface called the Gender Genie, a multivariate analysis was conducted of entries from random weblogs in a sample balanced for author gender and weblog sub-genre (diary or filter). The results show that the diary entries contained more ‘female’ stylistic features, and the filter entries more ‘male’ stylistic features, independent of author gender. These findings problematize the characterization of the stylistic features as gendered, and suggest a need for more fine-grained genre analysis in CMC research. At the same time, it is observed that conventional associations of gender with certain spoken and written genres are reproduced in weblogs, along with their societal valuations.
Conference Paper
Despite differences in the way that men and women experience goods and communicate their perspectives, online review communities typically do not provide participants' gender. We propose to infer author gender, given a set of reviews of a particular item, and experiment on reviews posted at the Internet Movie Database (IMDb). Using logistic regression, we explore the contribution of three types of information: 1) style, 2) content, and 3) metadata (e.g. review age, social feedback). Our results concur with previous research, in that there are salient differences in writing style and content between reviews authored by men versus women. However, in comparison to literary or scientific texts, to which classification tasks are often applied, reviews are brief and occur within the context of an ongoing discourse. Therefore, to compensate for the brevity of reviews, content and stylistic features can be augmented with metadata. We find in particular that the perceived utility of a review is an important correlate of gender. The model incorporating all features has a classification accuracy of 73.7% and is not as sensitive to review length as are those based only on stylistic or content features.
Conference Paper
We investigate the subtle cues to user identity that may be exploited in attacks on the privacy of users in web search query logs. We study the application of simple classifiers to map a sequence of queries into the gender, age, and location of the user issuing the queries. We then show how these classifiers may be carefully combined at multiple granularities to map a sequence of queries into a set of candidate users that is 300-600 times smaller than random chance would allow. We show that this approach remains surprisingly accurate even after removing personally identifiable information such as names/numbers or limiting the size of the query log. We also present a new attack in which a real-world acquaintance of a user attempts to identify that user in a large query log, using personal information. We show that combinations of small pieces of information about terms a user would probably search for can be highly effective in identifying the sessions of that user. We conclude that known schemes to release even heavily scrubbed query logs that contain session information have significant privacy risks. Categories and Subject Descriptors: H.3.3 (Information Storage and Retrieval): Information Search and Retrieval
Conference Paper
As microblogging grows in popularity, services like Twitter are coming to support information gathering needs above and beyond their traditional roles as social networks. But most users' interaction with Twitter is still primarily focused on their social graphs, forcing the often inappropriate conflation of "people I follow" with "stuff I want to read." We characterize some information needs that the current Twitter interface fails to support, and argue for better representations of content for solving these challenges. We present a scalable implementation of a partially supervised learning model (Labeled LDA) that maps the content of the Twitter feed into dimensions. These dimensions correspond roughly to substance, style, status, and social characteristics of posts. We characterize users and tweets using this model, and present results on two information consumption oriented tasks.
Conference Paper
This paper presents and evaluates several original techniques for the latent classification of biographic attributes such as gender, age and native language, in diverse genres (conversation transcripts, email) and languages (Arabic, English). First, we present a novel partner-sensitive model for extracting biographic attributes in conversations, given the differences in lexical usage and discourse style such as observed between same-gender and mixed-gender conversations. Then, we explore a rich variety of novel sociolinguistic and discourse-based features, including mean utterance length, passive/active usage, percentage domination of the conversation, speaking rate and filler word usage. Cumulatively up to 20% error reduction is achieved relative to the standard Boulis and Ostendorf (2005) algorithm for classifying individual conversations on Switchboard, and accuracy for gender detection on the Switchboard corpus (aggregate) and Gulf Arabic corpus exceeds 95%.
Conference Paper
Within the larger area of automatic acquisition of knowledge from the Web, we introduce a method for extracting relevant attributes, or quantifiable properties, for various classes of objects. The method extracts attributes such as capital city and President for the class Country, or cost, manufacturer and side effects for the class Drug, without relying on any expensive language resources or complex processing tools. In a departure from previous approaches to large-scale information extraction, we explore the role of Web query logs, rather than Web documents, as an alternative source of class attributes. The quality of the extracted attributes recommends query logs as a valuable, albeit little explored, resource for information extraction.
Article
This paper describes a corpus annotation project to study issues in the manual annotation of opinions, emotions, sentiments, speculations, evaluations and other private states in language. The resulting corpus annotation scheme is described, as well as examples of its use. In addition, the manual annotation process and the results of an inter-annotator agreement study on a 10,000-sentence corpus of articles drawn from the world press are presented.
Article
Recent research on gender differences in language has mostly addressed cognitive differences. These differences have been observed on different cognitive verbal and nonverbal tasks and conclusions on the variability in language production and comprehension have been drawn from their results. In this paper, a different approach is presented. This pilot study examines lexical richness measures in conversational speech across a total of thirty subjects. All subjects were recorded and transcribed in a conversational setting. Their transcribed speech was analyzed using a set of lexical richness measures based on word-frequencies. On the basis of these measurements, statistical discriminant analysis is able to classify the two groups with 90% (74% with leave-one-out cross-validation) correct prediction rate at a statistically significant level (p = .03). The results are discussed in detail including correlation and principal components analysis. The paper concludes that there are interesting...
Blei, D.; Ng, A.; and Jordan, M. 2002. Latent Dirichlet allocation. JMLR (3):993-1022.
Burson-Marsteller. 2010. Press Releases Archives. In Archive of Sept 10, 2010.
Java, A.; Song, X.; Finin, T.; and Tseng, B. 2007. Why we twitter: understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007.
Smola, A., and Narayanamurthy, S. 2010. An architecture for parallel topic models. In Proceedings of VLDB.
Thomas, M.; Pang, B.; and Lee, L. 2006. Get out the vote: determining support or opposition from congressional floor-debate transcripts. In Proceedings of EMNLP.
Twitter. 2010. Twitter API documentation. In http://dev.twitter.com/doc.