Figure 4 - uploaded by Jukka-Pekka Onnela

Per-county area cartograms of Twitter over- and undersampling rates of Caucasian, African-American, Asian, and Hispanic users, relative to the 2000 U.S. Census. Only counties with more than 500 Twitter users with inferred race/ethnicity are shown. Blue regions correspond to undersampling; red regions to oversampling.
Source publication
Every second, the thoughts and feelings of millions of people across the world are recorded in the form of 140-character tweets using Twitter. However, despite the enormous potential presented by this remarkable data source, we still do not have an understanding of the Twitter population itself: Who are the Twitter users? How representative of the...
Context in source publication
Context 1
... example, if we observed that 25% of Twitter users in a county were predicted to be Hispanic, and the 2000 U.S. Census counted 23% of people in that county as being Hispanic, we would consider Twitter to be oversampling the Hispanic users in that county. Figure 4 plots the per-county race/ethnicity distribution, relative to the 2000 U.S. Census, for all counties in which we observed more than 500 Twitter users with identifiable last names. A number of geographic trends are visible, such as the undersampling of Hispanic users in the southwest; the undersampling of African-American users in the south and midwest; and the oversampling of Caucasian users in many major cities. ...
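The comparison described in this context can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the record type `CountyGroupShare`, its field names, and the use of a simple difference of shares (rather than, say, a ratio) are assumptions made here. Only the 500-user cutoff and the 25% versus 23% example come from the text.

```python
from dataclasses import dataclass
from typing import Optional

MIN_USERS = 500  # from the text: only counties with more than 500 users are shown


@dataclass
class CountyGroupShare:
    """Hypothetical record for one county and one race/ethnicity group."""
    county: str            # illustrative county identifier
    n_twitter_users: int   # Twitter users with an inferred race/ethnicity
    twitter_share: float   # fraction of those users predicted to be in the group
    census_share: float    # fraction of the county population in the group (Census)


def sampling_difference(row: CountyGroupShare) -> float:
    """Twitter share minus Census share (assumed measure; positive = oversampling)."""
    return row.twitter_share - row.census_share


def classify(row: CountyGroupShare) -> Optional[str]:
    """Label a county/group pair, or return None if the county is below the cutoff."""
    if row.n_twitter_users <= MIN_USERS:
        return None
    diff = sampling_difference(row)
    if diff > 0:
        return "oversampled"
    if diff < 0:
        return "undersampled"
    return "matched"


# The worked example from the text: 25% on Twitter vs. 23% in the Census.
# The user count of 600 is made up so that the county passes the cutoff.
example = CountyGroupShare("example county", 600, 0.25, 0.23)
print(classify(example))  # oversampled
```

A cartogram such as Figure 4 would then color each county by the sign and size of `sampling_difference`, with blue for negative values and red for positive ones.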
Similar publications
User generated content on Twitter (produced at an enormous rate of 340 million tweets per day) provides a rich source for gleaning people's emotions, which is necessary for deeper understanding of people's behaviors and actions. Extant studies on emotion identification lack comprehensive coverage of "emotional situations" because they use relativel...
We present a computational framework for understanding the social aspects of emotions in Twitter conversations. Using unannotated data and semisupervised machine learning, we look at emotional transitions, emotional influences among the conversation partners, and patterns in the overall emotional exchanges. We find that conversational partners...
Various studies seek to test Twitter's predictive capacity and influence in electoral processes. Within this framework, this article offers a comparative analysis of the Twitter activity of political parties and candidates during the most recent regional elections held in Galicia (5,353 tweets analyzed) with the aim...
Although adding emotions to applications has been shown to enhance the user experience, emotion recognition applications are still neither widely available nor widely used. In this paper, emotion recognition is performed on tweets using six emotion classification algorithms that are compared on precision and timing. The paper shows that precision can be enh...
“Emotional states of individuals, also known as moods, are central to the expression of thoughts, ideas and opinions, and in turn impact attitudes and behavior”. In this paper we propose a method that detects the emotion or mood of a tweet and classifies the Twitter message under the appropriate emotional category. Our approach is a two-step app...
Citations
... Nonetheless, social media data have shown more potential in filling these gaps. First, these data have more user-generated information (e.g., surnames), which can be used to infer an individual's race/ethnicity, including for members of minority groups (e.g., Chang, Rosenn, Backstrom, & Marlow, 2010; Mislove, Lehmann, Ahn, Onnela, & Rosenquist, 2011). Moreover, previous studies have demonstrated that geo-tagged social media data can be used to infer individual economic status, by predicting the individual's home location via trajectory mining and attaching the economic statistics by census aggregated units to the predicted home location of each individual (e.g., Huang & Wong, 2016; Wu & Huang, 2022). ...
... A major part of our analysis is heavily focused on determining and comparing the toxic characteristics of tweets shared by male and female users who participated in the #MeToo movement. Since it is not mandatory for users to disclose their gender on Twitter, we utilized the methodology introduced by Nilizadeh et al. [54] for identifying perceived gender using the Face++ API [48] and 92,626 unique first names and their gender profiles in the U.S. from 1990 to 2013 [51]. Also, users have increasingly adopted the use of gender-identifiable pronouns on their social media profiles including Twitter [38,80], which provided us with a great opportunity to identify their gender as self-disclosed by the user. ...
... Our process uses the following tools: Firstly, we checked if a particular tweet shared by a user was talking about women/men by using spaCy [67], an advanced open-source library for natural language processing tasks. spaCy uses a syntactic parser and pre-trained word embeddings to recognize the subject/object tokens of a sentence, which for our purposes were female/male pronouns and names collected from the US Census database [51]. Secondly, since toxic tweets do not always contain offensive words [21], we use a combination of NLP-based tools to recognize the sentiment/emotion of the tweets: NLTK's sentiment analysis [55], Linguistic Inquiry and Word Count (LIWC)'s [40] 'Anger' score metric, and Perspective API's [3] 'Severe toxicity' score. ...
The #MeToo movement has catalyzed widespread public discourse surrounding sexual harassment and assault, empowering survivors to share their stories and holding perpetrators accountable. While the movement has had a substantial and largely positive influence, this study aims to examine the potential negative consequences in the form of increased hostility against women and men on the social media platform Twitter. By analyzing tweets shared between October 2017 and January 2020 by more than 47.1k individuals who had either disclosed their own sexual abuse experiences on Twitter or engaged in discussions about the movement, we identify the overall increase in gender-based hostility towards both women and men since the start of the movement. We also monitor 16 pivotal real-life events that shaped the #MeToo movement to identify how these events may have amplified negative discussions targeting the opposite gender on Twitter. Furthermore, we conduct a thematic content analysis of a subset of gender-based hostile tweets, which helps us identify recurring themes and underlying motivations driving the expressions of anger and resentment from both men and women concerning the #MeToo movement. This study highlights the need for a nuanced understanding of the impact of social movements on online discourse and underscores the importance of addressing gender-based hostility in the digital sphere.
... They show that these variables do not follow the normal distribution and on average, users in the recommended dataset have more friends and reviews than those in the non-recommended dataset. Extraction of users' gender: We extracted the gender of Yelp users in our recommended and not-recommended datasets by using the methodology employed by Mislove et al. [81]. This gender detection algorithm computes the probability of usernames being a specific gender by using the names obtained from Census data from the years 1900 to 2013. ...
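The name-based gender inference summarized in this context can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the cited method: the function names, the 0.9 confidence threshold, and the toy counts in `TOY_NAME_COUNTS` are all invented here, and a real application would load per-name counts from a census-derived name list instead.

```python
from typing import Dict, Optional

# Made-up counts for illustration only; NOT real census figures.
TOY_NAME_COUNTS: Dict[str, Dict[str, int]] = {
    "maria": {"female": 990, "male": 10},
    "jordan": {"female": 400, "male": 600},
}


def gender_probabilities(first_name: str,
                         name_counts: Dict[str, Dict[str, int]]
                         ) -> Optional[Dict[str, float]]:
    """Return the share of each gender among people with this first name."""
    counts = name_counts.get(first_name.strip().lower())
    if not counts:
        return None  # name not present in the list
    total = sum(counts.values())
    if total == 0:
        return None
    return {gender: n / total for gender, n in counts.items()}


def infer_gender(first_name: str,
                 name_counts: Dict[str, Dict[str, int]],
                 threshold: float = 0.9) -> Optional[str]:
    """Return the most likely gender, or None if the name is unknown or ambiguous.

    The threshold is an assumed parameter, not a value taken from the source.
    """
    probs = gender_probabilities(first_name, name_counts)
    if probs is None:
        return None
    gender, p = max(probs.items(), key=lambda item: item[1])
    return gender if p >= threshold else None


print(infer_gender("Maria", TOY_NAME_COUNTS))   # female
print(infer_gender("Jordan", TOY_NAME_COUNTS))  # None (ambiguous in the toy data)
```

Returning `None` for unknown or ambiguous names keeps the sketch conservative: users whose names do not clearly indicate a gender are left unlabeled rather than guessed.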
Web 2.0 recommendation systems, such as Yelp, connect users and businesses so that users can identify new businesses and simultaneously express their experiences in the form of reviews. Yelp recommendation software moderates user-provided content by categorizing them into recommended and not-recommended sections. Due to Yelp's substantial popularity and its high impact on local businesses' success, understanding the fairness of its algorithms is crucial. However, with no access to the training data and the algorithms used by such black-box systems, studying their fairness is not trivial, requiring a tremendous effort to minimize bias in data collection and consider the confounding factors in the analysis. This large-scale data-driven study, for the first time, investigates Yelp's business ranking and review recommendation system through the lens of fairness. We define and examine 4 hypotheses to examine if Yelp's recommendation software shows bias and if Yelp's business ranking algorithm shows bias against restaurants located in specific neighborhoods. Our findings show that reviews of female and less-established users are disproportionately categorized as recommended. We also find a positive association between restaurants being located in hotspot regions and their average exposure. Furthermore, we observed some cases of severe disparity bias in cities where the hotspots are in neighborhoods with less demographic diversity or areas with higher affluence and education levels. Indeed, biases introduced by data-driven systems, including our findings in this paper, are (almost) always implicit and through proxy attributes. Still, the authors believe such implicit biases should be detected and resolved as those can create cycles of discrimination that keep increasing the social gaps between different groups even further.
... Existing literature has identified multiple challenges related to the "representativeness" of Twitter data. Twitter users tend to overrepresent young, male, educated, and urbanized populations (Barberá and Rivero, 2015; Mellon and Prosser, 2017; Mislove et al., 2011). The other type of bias is model bias, possibly introduced by analysts during the word choice for the analysis (e.g., damage indicators and "false" damage patterns to filter in damage-related tweets) and the annotation process. ...
... Meanwhile, we expect similar correlations when focusing on urban population. Since most of our Twitter users are concentrated in urban areas, we expect more linguistic and socioeconomic heterogeneity at these places rather than in rural areas (27). We therefore consider the eight largest metropolitan areas in England as listed earlier. ...
The socioeconomic background of people and how they use standard forms of language are not independent, as demonstrated in various sociolinguistic studies. However, the extent to which these correlations may be influenced by the mixing of people from different socioeconomic classes remains relatively unexplored from a quantitative perspective. In this work we leverage geotagged tweets and transferable computational methods to map deviations from standard English on a large scale, in seven thousand administrative areas of England and Wales. We combine these data with high-resolution income maps to assign a proxy socioeconomic indicator to home-located users. Strikingly, across eight metropolitan areas we find a consistent pattern suggesting that the more different socioeconomic classes mix, the less interdependent the frequency of their departures from standard grammar and their income become. Further, we propose an agent-based model of linguistic variety adoption that sheds light on the mechanisms that produce the observations seen in the data.
... Networks represent the structure of complex systems as sets of nodes connected by edges [1,2,3] and are ubiquitous across diverse domains, including social sciences [4,5], transportation [6,7], finance [8,9], science of science [10,11], neuroscience [12,13], and biology [14,15,16]. Networks are complex, high-dimensional, and discrete objects, making it highly non-trivial to obtain useful representations of their structure. ...
... Nevertheless, as we shall see, our numerical simulations show that node2vec performs well even if the average degree is small. ...
... Another list $D_{\mathrm{rand}}$ includes the node pairs $(i, j')$ consisting of a center node $i$ sampled from the given sequence and a random node $j'$ sampled from a random distribution $P_0(j')$. We use a typical random distribution, i.e., we use the long-term probability $P(x^{(t)} = j')$ of random walks as $P_0(j)$ [4]. Then, the skip-gram word2vec model estimates the probability $P((i, j) \in D_{\mathrm{data}})$ that a given pair $(i, j)$ comes from $D_{\mathrm{data}}$ by ...
Recent advances in machine learning research have produced powerful neural graph embedding methods, which learn useful, low-dimensional vector representations of network data. These neural methods for graph embedding excel in graph machine learning tasks and are now widely adopted. However, how and why these methods work -- particularly how network structure gets encoded in the embedding -- remain largely unexplained. Here, we show that shallow neural graph embedding methods encode community structure as well as, or even better than, spectral embedding methods for both dense and sparse networks, with and without degree and community size heterogeneity. Our results provide the foundations for the design of novel effective community detection methods as well as theoretical studies that bridge network science and machine learning.
... However, this has raised concerns about privacy and the potential risks associated with the collection and use of personal data on social media. Research has shown that social media platforms collect a wide range of metadata, including timestamps, geolocation, device information, and user interactions, which can be used to infer sensitive information about users' identities, preferences, and behaviors [1]. This has led to increasing concerns about privacy, surveillance, and the potential misuse of personal data by third parties, such as advertisers, marketers, and malicious actors [2]. ...
Machine learning algorithms, such as KNN, SVM, MLP, RF, and MLR, are used to extract valuable information from shared digital data on social media platforms through their APIs in an effort to identify anonymous publishers or online users. This can leave these anonymous publishers vulnerable to privacy-related attacks, as identifying information can be revealed. Twitter is an example of such a platform where identifying anonymous users/publishers is made possible by using machine learning techniques. To provide these anonymous users with stronger protection, we have examined the effectiveness of these techniques when critical fields in the metadata are masked or encrypted using tweets (text and images) from Twitter. Our results show that SVM achieved the highest accuracy rate of 95.81% without using data masking or encryption, while SVM achieved the highest identity recognition rate of 50.24% when using data masking and AES encryption algorithm. This indicates that data masking and encryption of metadata of tweets (text and images) can provide promising protection for the anonymity of users’ identities.
... Previous research in this field used supervised learning, relying on labeled user data and deriving features from user metadata, text and network information (Preoţiuc-Pietro, Lampos, and Aletras 2015; Chakraborty et al. 2017; Preoţiuc-Pietro and Ungar 2018; Wood-Doughty et al. 2018; Pan et al. 2019; Wang et al. 2019). When labeled user data is unavailable, external data sources, occasionally combined with rule-based methods, have successfully been employed to infer demographics (Mohammady and Culotta 2014; Culotta, Kumar, and Cutler 2015; Mislove et al. 2011; Liu and Ruths 2013; Karimi et al. 2016). These methods have been applied at scale to map targeted characteristics of social media populations (Mislove et al. 2011; Sloan et al. 2015; Mellon and Prosser 2017). ...
... When labeled user data is unavailable, external data sources, occasionally combined with rule-based methods, have successfully been employed to infer demographics (Mohammady and Culotta 2014; Culotta, Kumar, and Cutler 2015; Mislove et al. 2011; Liu and Ruths 2013; Karimi et al. 2016). These methods have been applied at scale to map targeted characteristics of social media populations (Mislove et al. 2011; Sloan et al. 2015; Mellon and Prosser 2017). This vast literature relies on a wide array of models and input features; yet, it is unclear which combination generalizes best for population-level descriptions. ...
... Recent work has also shed light on the informational potential of network features (Li, Xu, and Lu 2015; Aletras and Chamberlain 2018; Pan et al. 2019) as well as data combination and transfer learning (Liu and Singh 2023). In the absence of labeled user data, researchers have resorted to rule-based methods as well as external data sources, such as county demographics (Mohammady and Culotta 2014), website traffic data (Culotta, Kumar, and Cutler 2015) and labeled name dictionaries (Mislove et al. 2011; Liu and Ruths 2013; Karimi et al. 2016). These methods have been applied at scale and provided valuable insights on the demographic composition of social media populations (Mislove et al. 2011; Mellon and Prosser 2017; Sloan et al. 2015). ...
Characterizing the demographics of social media users enables a diversity of applications, from improved targeting of policy interventions to the derivation of representative population estimates of social phenomena. Achieving high performance with supervised learning, however, can be challenging as labelled data is often scarce. Alternatively, rule-based matching strategies provide well-grounded information but only offer partial coverage over users. It is unclear, therefore, what features and models are best suited to maximize coverage over a large set of users while maintaining high performance. In this paper, we develop a cost-effective strategy for large-scale demographic inference by relying on minimal labelling efforts. We combine a name-matching strategy with graph-based methods to map the demographics of 1.8 million Nigerian Twitter users. Specifically, we compare a purely graph-based propagation model, namely Label Propagation (LP), with Graph Convolutional Networks (GCN), a graph model that also incorporates node features based on user content. We find that both models largely outperform supervised learning approaches based purely on user content that lack graph information. Notably, we find that LP achieves comparable performance to the state-of-the-art GCN while providing greater interpretability at a lower computing cost. Moreover, performance does not significantly improve with the addition of user-specific features, such as textual representations of user tweets and user geolocation. Leveraging our data collection effort, we describe the demographic composition of Nigerian Twitter and find that it is a highly non-uniform sample of the general Nigerian population.
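To make the idea of label propagation from name-matched seeds concrete, here is a small self-contained Python sketch. It is a generic textbook-style variant written for illustration, not the implementation used in the paper: the function name `label_propagation`, the synchronous neighbor-averaging update, the fixed iteration count, and the placeholder labels "A" and "B" are all assumptions. The GCN comparison from the abstract is not covered.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Optional, Tuple


def label_propagation(edges: Iterable[Tuple[Hashable, Hashable]],
                      seeds: Dict[Hashable, str],
                      n_iter: int = 20) -> Dict[Hashable, Optional[str]]:
    """Spread seed labels over an undirected graph by neighbor averaging.

    edges: (u, v) pairs, e.g. follower or mention links between users.
    seeds: node -> known label, e.g. obtained from name matching.
    Returns node -> predicted label, or None if no seed label reached the node.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    labels = sorted(set(seeds.values()))
    nodes = set(neighbors) | set(seeds)
    # Every node holds one score per label; seeds put all mass on their label.
    scores = {n: {lab: 0.0 for lab in labels} for n in nodes}
    for node, lab in seeds.items():
        scores[node][lab] = 1.0

    for _ in range(n_iter):
        updated = {}
        for node in nodes:
            if node in seeds:
                updated[node] = scores[node]  # seed labels stay clamped
                continue
            nbrs = neighbors[node]  # non-seed nodes always have >= 1 neighbor
            updated[node] = {
                lab: sum(scores[m][lab] for m in nbrs) / len(nbrs)
                for lab in labels
            }
        scores = updated

    result = {}
    for node, node_scores in scores.items():
        best = max(labels, key=lambda lab: node_scores[lab])
        result[node] = best if node_scores[best] > 0 else None
    return result


# Toy graph: two labeled users, three unlabeled ones, and a pair with no seed nearby.
edges = [("u1", "u2"), ("u2", "u3"), ("u4", "u5"), ("u6", "u7")]
seeds = {"u1": "A", "u5": "B"}
predicted = label_propagation(edges, seeds)
print(predicted["u3"], predicted["u4"], predicted["u6"])  # A B None
```

Because each prediction is just a weighted vote among a user's neighbors, it is easy to trace why a label was assigned, which fits the abstract's remark that LP offers greater interpretability at a lower computing cost.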
... On most platforms, edges represent people's activity rather than their social connectivity, or may not cover important connections such as close family. As a result, most large-scale social network datasets are non-random samples of an underlying social structure [22, 39, 42]. It is therefore hard to derive generically applicable actionable insights from such data, or to generalize findings across studies [38]. ...
Large-scale human social network structure is typically inferred from digital trace samples of online social media platforms or mobile communication data. Instead, here we investigate the social network structure of a complete population, where people are connected by high-quality links sourced from administrative registers of family, household, work, school, and next-door neighbors. We examine this multilayer social opportunity structure through three common concepts in network analysis: degree, closure, and distance. Findings present how particular network layers contribute to presumably universal scale-free and small-world properties of networks. Furthermore, we suggest a novel measure of excess closure and apply this in a life-course perspective to show how the social opportunity structure of individuals varies along age, socio-economic status, and education level.