Understanding the Demographics of Twitter Users
Alan Mislove, Sune Lehmann, Yong-Yeol Ahn, Jukka-Pekka Onnela, and J. Niels Rosenquist
Northeastern University · Technical University of Denmark · Harvard Medical School
Abstract
Every second, the thoughts and feelings of millions of people
across the world are recorded in the form of 140-character
tweets using Twitter. However, despite the enormous poten-
tial presented by this remarkable data source, we still do not
have an understanding of the Twitter population itself: Who
are the Twitter users? How representative of the overall pop-
ulation are they? In this paper, we take the first steps towards
answering these questions by analyzing data on a set of Twit-
ter users representing over 1% of the U.S. population. We
develop techniques that allow us to compare the Twitter pop-
ulation to the U.S. population along three axes (geography,
gender, and race/ethnicity), and find that the Twitter popula-
tion is a highly non-uniform sample of the population.
Introduction
Online social networks are now a popular way for users to
connect, communicate, and share content; many serve as the
de-facto Internet portal for millions of users. Because of the
massive popularity of these sites, data about the users and
their communication offers unprecedented opportunities to
examine how human society functions at scale. However,
concerns over user privacy often force service providers to
keep such data private. Twitter represents an exception: Over
91% of Twitter users choose to make their profile and com-
munication history publicly visible, allowing researchers
access to the vast majority of the site. Twitter, therefore,
presents a unique opportunity to examine the public com-
munication of a large fraction of the population.
In fact, researchers have recently begun to use the con-
tent of Twitter messages to measure and predict real-world
phenomena, including movie box office returns (Asur and
Huberman 2010), elections (O’Connor et al. 2010), and the
stock market (Bollen, Mao, and Zeng 2010). While these
studies show remarkable promise, one heretofore unan-
swered question is: Are Twitter users a representative sam-
ple of society? If not, which demographics are over- or un-
derrepresented in the Twitter population? Because existing
studies generally treat Twitter as a “black box,” shedding
light on the characteristics of the Twitter population is likely
to lead to improvements in existing prediction and measure-
ment methods. Moreover, understanding the characteristics
of the Twitter population is crucial to move towards more
advanced observations and predictions, since such an un-
derstanding will help us determine what predictions can be
made and what other data is necessary to correct for any bi-
ases.
In this paper, we take a first look at the demographics of
Twitter users, aiming to answer these questions. To do
so, we use a data set of over 1,755,925,520 Twitter messages
sent by 54,981,152 users between March 2006 and August
2009 (Cha et al. 2010). We focus on users whose identified
location is within the United States, because the plurality of users at the time of data collection were in the U.S., and because we have detailed demographic data for the U.S. population. Even with the location constraint, our dataset covers
over three million users, representing more than 1% of the
entire U.S. population.
Ideally, when comparing the Twitter population to society
as a whole, we would like to compare properties including
socio-economic status, education level, and type of employ-
ment. However, we are restricted to only using the data that
is (optionally) self-reported and made visible by the Twitter
users, including their name, location, and the text of their
tweets. We develop techniques to examine the properties
of the Twitter population along three separate but interre-
lated axes, based on the feasibility of comparison. First, we
compare the geographic distribution of users to the popu-
lation as a whole using U.S. Census data. We demonstrate
that Twitter users are more likely to live within populous
counties than would be expected from the Census data, and
that sparsely populated regions of the U.S. are significantly
underrepresented. Second, we infer the gender of Twitter
users and demonstrate that a significant male bias exists,
although the bias is becoming less pronounced over time.
Third, we examine the race/ethnicity of Twitter users and
demonstrate that the distribution of race/ethnicity is highly
geographically-dependent.
Geographic distribution
Detecting location using self-reported data
To determine geographic information about users, we use
the self-reported location field in the user profile. The loca-
tion is an optional self-reported string; we found that 75.3%
of the publicly visible users listed a location. In order to turn
the user-provided string into a mappable location, we use
the Google Maps API. Beginning with the most popular lo-
cation strings (i.e., the strings provided by the most users),
we query Google Maps with each location string. If Google
Maps is able to interpret a string as a location, we receive a
latitude and longitude as a response. We restrict our scope to
users in the U.S. by only considering response latitudes and
longitudes that are within the U.S. In total, we find map-
pings to a U.S. longitude and latitude for 246,015 unique
strings, covering 3,279,425 users (representing 8.8% of the
users who list a location).
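To make this pipeline concrete, the sketch below shows the geocoding step under stated assumptions: it uses the current Google Geocoding API endpoint (the paper's queries predate it) and an illustrative continental-U.S. bounding box, since the paper does not specify exactly how responses were filtered to the U.S.

```python
import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"
# Rough bounding box for the continental U.S. -- an illustrative simplification;
# the paper only states that responses outside the U.S. were discarded.
US_BBOX = (24.5, -125.0, 49.5, -66.9)  # (min_lat, min_lng, max_lat, max_lng)

def geocode(location_string, api_key):
    """Return (lat, lng) for a free-form location string, or None."""
    resp = requests.get(GEOCODE_URL,
                        params={"address": location_string, "key": api_key})
    results = resp.json().get("results", [])
    if not results:
        return None  # the geocoder could not interpret the string as a location
    loc = results[0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

def in_us(lat, lng):
    """Crude U.S. filter via the bounding box above."""
    min_lat, min_lng, max_lat, max_lng = US_BBOX
    return min_lat <= lat <= max_lat and min_lng <= lng <= max_lng
```

Caching one response per string keeps the number of queries proportional to the number of unique location strings (246,015) rather than the number of users (3.3 million).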
To compare our Twitter data to the 2000 U.S. Census, it
is necessary to aggregate the users into U.S. counties. Using
data from the U.S. National Atlas and the U.S. Geological
Survey, we map each of the 246,015 latitudes and longitudes
into their respective U.S. county. Unless otherwise stated,
our analysis for the remainder of this paper is at the U.S.
county level.
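The county-aggregation step can be sketched as a point-in-polygon spatial join; the shapefile name and FIPS column below are placeholders standing in for the National Atlas/USGS boundary data.

```python
import geopandas as gpd
from shapely.geometry import Point

# County polygons; "us_counties.shp" and the "FIPS" column are placeholders
# for whatever boundary file (National Atlas, USGS, Census TIGER) is used.
counties = gpd.read_file("us_counties.shp")

def latlng_to_county(points, counties):
    """Map (lat, lng) pairs to county FIPS codes via point-in-polygon lookup."""
    pts = gpd.GeoDataFrame(
        geometry=[Point(lng, lat) for lat, lng in points],
        crs=counties.crs)
    # Spatial join: each point inherits the attributes of its containing county.
    joined = gpd.sjoin(pts, counties, how="left", predicate="within")
    return joined["FIPS"]
```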
Limitations We now briefly discuss potential limitations
of our location inference methodology. First, it is worth not-
ing that Google Maps will also interpret locations that are
at a granularity coarser than a U.S. county (e.g., “Texas”).
We manually removed these, including the mappings of all
50 states, as well as “United States” and “Earth.” Second,
users may lie about their location, or may list an out-of-date
location. Third, since the location is per-user (rather than
per-tweet), a user who moves from one city to another (and
updates his location) will have all of his tweets considered
as being from the latter location.
Geographic distribution of Twitter users
We begin by examining the geographic distribution of Twit-
ter users, and comparing it to the entire U.S. population.
Overall, the 3,279,425 Twitter users who we are able to geo-
locate represent 1.15% of the entire population (at the time
of the 2000 Census). However, if we examine the distribu-
tion of Twitter users per county, we observe a highly non-
uniform distribution.
Figure 1 presents this analysis, with the county popula-
tion along the x axis and the fraction of this population we observe in Twitter along the y axis. We see that, as the popu-
lation of the county increases, the Twitter representation rate
(simply the number of Twitter users in that county divided
by the number of people in that county in the 2000 U.S.
Census) increases as well. For example, consider the median
per-county Twitter representation rate of 0.324%. We ob-
serve that 93.5% of the counties with over 100,000 residents
have a higher Twitter representation rate than the median,
compared to only 40.8% of the counties with fewer than
100,000 residents (were Twitter users a truly random pop-
ulation sample, we would expect these percentages to both
be 50%). Thus, the Twitter users significantly overrepresent
populous counties, a fact underscored by the difference between the median per-county Twitter representation rate (0.324%) and the overall population sample of 1.15%.
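As a minimal sketch of this analysis (file and column names are assumptions, not the paper's code):

```python
import pandas as pd

# One row per county; assumed columns: "twitter_users" (geolocated Twitter
# users in the county) and "census_pop" (2000 Census population).
df = pd.read_csv("county_counts.csv")

df["rep_rate"] = df["twitter_users"] / df["census_pop"]
median_rate = df["rep_rate"].median()          # 0.324% in the paper's data

large = df["census_pop"] > 100_000
above = df["rep_rate"] > median_rate
print("large counties above median:", above[large].mean())   # ~93.5% reported
print("small counties above median:", above[~large].mean())  # ~40.8% reported
```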
The overrepresentation of populous counties in and of it-
self may not come as a surprise, due to the patterns of so-
cial media adoption across different regions. However, the
Figure 1: Scatterplot of U.S. county population (x axis) versus the Twitter representation rate in that county (y axis). The dark line represents the aggregated median, and the dashed black line represents the overall median (0.324%). There is a clear overrepresentation of more populous counties.
magnitude of the difference is striking: We observe an or-
der of magnitude difference in median per-county Twitter
representation rate between counties with 1,000 people and
counties with 1,000,000 people. This indicates a bias in the
Twitter population (relative to the U.S. population) and sug-
gests that entire regions of the U.S. may be significantly un-
derrepresented.
Distribution across counties We now examine which re-
gions of the U.S. contain these over- and underrepresented
counties. To do so, we plot a map of the U.S. based on the
Twitter representation rate, relative to the median rate of
0.324%. Figure 2 presents this data, using both a normal rep-
resentation and an area cartogram representation (Gastner
and Newman 2004). In this figure, the counties are colored
according to the level of over- or underrepresentation, with
blue colors representing underrepresentation and red colors
representing overrepresentation, relative to the median rate
of 0.324%. Thus, the same number of counties will be col-
ored red as blue.
These two maps lead to a number of interesting conclu-
sions: First, as evident in the normal representation, much of
the mid-west is significantly underrepresented in the Twit-
ter user base in this time period. Second, as evident in the
significantly red hue of the area cartogram, more populous
counties are consistently oversampled. However, the level of
oversampling does not appear to be dependent upon geogra-
phy: Both east coast and west coast cities are clearly visible
(e.g., San Francisco and Boston), as well as mid-west and
southern cities (e.g, Dallas, Chicago, and Atlanta).
Gender
Detecting gender using first names
As we have very limited information available on each user,
we rely on using the self-reported name available in each
user’s profile in order to detect gender. To do so, we first ob-
tain the most popular 1,000 male and female names for ba-
bies born in the U.S. for each year 1900–2009, as reported
by the U.S. Social Security Administration (Social Secu-
rity Administration 2010). We then aggregate the names to-
gether, calculating the total frequency of each of the result-
ing 3,034 male and 3,643 female names. As certain names
occurred in both lists, we remove the 241 names that were
Figure 2: Per-county over- and underrepresentation of U.S. population in Twitter, relative to the median per-county representation rate of 0.324%, presented in both (a) a normal layout and (b) an area cartogram based on the 2000 Census population. Blue colors indicate underrepresentation, while red colors represent overrepresentation. The intensity of the color corresponds to the log of the over- or underrepresentation rate. Clear trends are visible, such as the underrepresentation of the mid-west and overrepresentation of populous counties.
less than 95% predictive (e.g., the name Avery was observed
to correspond to male babies only 56.8% of the time; it was
therefore removed). The result is a list of 5,836 names that
we use to infer gender.
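A sketch of the list construction, assuming the SSA files have been parsed into a mapping from (name, sex) to aggregate frequency; the 95% predictivity threshold is the one described above.

```python
from collections import defaultdict

def build_gender_list(ssa_counts, threshold=0.95):
    """ssa_counts: dict mapping (name, sex) -> total frequency, sex in {"M", "F"},
    aggregated over the 1900-2009 SSA baby-name files.
    Returns dict name -> "M" or "F" for names at least 95% predictive."""
    totals = defaultdict(lambda: {"M": 0, "F": 0})
    for (name, sex), count in ssa_counts.items():
        totals[name.lower()][sex] += count

    gender_of = {}
    for name, c in totals.items():
        n = c["M"] + c["F"]
        if c["M"] / n >= threshold:
            gender_of[name] = "M"
        elif c["F"] / n >= threshold:
            gender_of[name] = "F"
        # names like "Avery" (56.8% male) fall below the threshold; dropped
    return gender_of
```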
Limitations Clearly, this approach to detecting gender is
subject to a number of potential limitations. First, users may
misrepresent their name, leading to an incorrect gender in-
ference. Second, there may be differences in choosing to re-
veal one’s name between genders, leading us to believe that
fewer users of one gender are present. Third, the name lists
above may cover different fractions of the male and female
populations.
Gender of Twitter users
We first determine the number of the 3,279,425 U.S.-based
users who we could infer a gender for, based on their name
and the list previously described. We do so by comparing
the first word of their self-reported name to the gender list.
We observe that there exists a match for 64.2% of the users.
Moreover, we find a strong bias towards male users: Fully
71.8% of the users who we find a name match for had a
male name.
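Matching then reduces to a dictionary lookup on the first whitespace-delimited token of the self-reported name; a minimal sketch:

```python
def infer_gender(profile_name, gender_of):
    """Infer gender from the first word of a self-reported profile name.

    gender_of: the name -> "M"/"F" dict built from the SSA lists above.
    Returns "M", "F", or None when there is no match (35.8% of users)."""
    tokens = profile_name.strip().split()
    if not tokens:
        return None
    return gender_of.get(tokens[0].lower())
```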
Figure 3: Gender of joining users over time (fraction of joining users who are male), binned into groups of 10,000 joining users (note that the join rate increases substantially). The bias towards male users is observed to be decreasing over time.
To further explore this trend, we examine the historic gen-
der bias. To do so, we use the join date of each user (avail-
able in the user’s profile). Figure 3 plots the average fraction
of joining users who are male over time. From this plot, it
is clear that while the male gender bias was significantly
stronger among the early Twitter adopters, the bias has been diminishing over time.
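The binning behind Figure 3 can be sketched as follows (file and column names are assumptions):

```python
import pandas as pd

# One row per user with an inferred gender; "join_date" and "is_male" (0/1)
# are assumed column names.
users = pd.read_csv("users_with_gender.csv", parse_dates=["join_date"])
users = users.sort_values("join_date").reset_index(drop=True)
users["bin"] = users.index // 10_000             # groups of 10,000 joining users
trend = users.groupby("bin")["is_male"].mean()   # fraction of joining users male
```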
Race/ethnicity
Detecting race/ethnicity using last names
Again, since we have very limited information available
on each Twitter user, we resort to inferring race/ethnicity
using self-reported last name. We examine the last name
of users, and correlate the last name with data from the
U.S. 2000 Census (U.S. Census 2000). In more detail, for
each last name with over 100 individuals in the U.S. dur-
ing the 2000 Census, the Census releases the distribution of
race/ethnicity for that last name. For example, the last name
“Myers” was observed to correspond to Caucasians 86% of
the time, African-Americans 9.7%, Asians 0.4%, and His-
panics 1.4%.
Race/ethnicity distribution of Twitter users
We first determined the number of U.S.-based users for
whom we could infer the race/ethnicity by comparing the
last word of their self-reported name to the U.S. Census
last name list. We observed that we found a match for
71.8% of the users. We then determined the distribution of race/ethnicity in each county by taking the race/ethnicity distribution in the Census list, weighted by the frequency of each name occurring in Twitter users in that county.¹
Due to the large amount of ambiguity in the last name-to-
race/ethnicity list (in particular, the last name list is more
than 95% predictive for only 18.5% of the users), we are unable to compare the Twitter race/ethnicity distribution directly to the race/ethnicity distribution in the U.S. Census.
¹This is effectively the census.model approach discussed in prior work (Chang et al. 2010).
Figure 4: Per-county area cartograms of Twitter over- and undersampling rates of (a) Caucasian (non-Hispanic), (b) African-American, (c) Asian or Pacific Islander, and (d) Hispanic users, relative to the 2000 U.S. Census. Only counties with more than 500 Twitter users with inferred race/ethnicity are shown. Blue regions correspond to undersampling; red regions to oversampling.
However, we are able to make relative comparisons between Twitter users in different geographic regions, allowing us to
explore geographic trends in the race/ethnicity distribution.
Thus, we examine the per-county race/ethnicity distribution
of Twitter users.
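A sketch of this weighting, in the spirit of the census.model approach: each matched user contributes their surname's Census race/ethnicity distribution, and the county-level distribution is the average over matched users. The mapping and group labels below are illustrative assumptions.

```python
from collections import Counter, defaultdict

GROUPS = ["white", "black", "asian", "hispanic"]

def county_race_distribution(users, surname_dist):
    """users: iterable of (county, last_name) pairs.
    surname_dist: dict mapping a last name to its Census distribution, e.g.
    {"myers": {"white": 0.86, "black": 0.097, "asian": 0.004, "hispanic": 0.014}}.
    Returns county -> race/ethnicity distribution averaged over matched users."""
    sums = defaultdict(Counter)
    counts = Counter()
    for county, last_name in users:
        dist = surname_dist.get(last_name.lower())
        if dist is None:
            continue  # surname not in the Census list (no match for 28.2%)
        sums[county].update(dist)  # element-wise accumulation of fractions
        counts[county] += 1
    return {c: {g: sums[c][g] / counts[c] for g in GROUPS} for c in sums}
```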
In order to account for the uneven distribution of
race/ethnicity across the U.S., we examine the per-county
race/ethnicity distribution relative to the distribution from
the overall U.S. Census. For example, if we observed that
25% of Twitter users in a county were predicted to be His-
panic, and the 2000 U.S. Census counted 23% of people in that
county as being Hispanic, we would consider Twitter to be
oversampling the Hispanic users in that county. Figure 4
plots the per-county race/ethnicity distribution, relative to
the 2000 U.S. Census, per all counties in which we observed
more than 500 Twitter users with identifiable last names.
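In the hypothetical example above, the relative rate is 25/23 ≈ 1.09, i.e., mild oversampling; the map's color intensity corresponds to the log of this ratio. A minimal sketch:

```python
import math

def relative_rate(twitter_frac, census_frac):
    """Log of the over-/undersampling ratio: >0 oversampled, <0 undersampled."""
    return math.log(twitter_frac / census_frac)

relative_rate(0.25, 0.23)  # ~0.083 > 0: Hispanic users oversampled in this county
```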
A number of geographic trends are visible, such as the undersampling of Hispanic users in the southwest; the undersampling of African-American users in the south and mid-
west; and the oversampling of Caucasian users in many ma-
jor cities.
Related work
A few other studies have examined the demographics of so-
cial network users. For example, recent studies have exam-
ined the ethnicity of Facebook users (Chang et al. 2010),
general demographics of Facebook users (Corbett 2010),
and differences in online behavior on Facebook and MyS-
pace by gender (Strayhorn 2009). However, studies of gen-
eral social networking sites are able to leverage the broad
nature of the profiles available; in contrast, on Twitter, users
self-report only a minimal set of information, making calcu-
lating demographics significantly more difficult.
Conclusion
Twitter has received significant research interest lately as a
means for understanding, monitoring, and even predicting
real-world phenomena. However, most existing work does
not address the sampling bias, simply applying machine
learning and data mining algorithms without an understand-
ing of the Twitter user population. In this paper, we took
a first look at the user population itself, and exam-
ined the population along the axes of geography, gender, and
race/ethnicity. Overall, we found that Twitter users signif-
icantly overrepresent the densely populated regions of the
U.S., are predominantly male, and represent a highly non-
random sample of the overall race/ethnicity distribution.
Going forward, our study sets the foundation for future
work upon Twitter data. Existing approaches could imme-
diately use our analysis to improve predictions or measure-
ments. By enabling post-hoc corrections, our work is a first
step towards turning Twitter into a tool that can make infer-
ences about the population as a whole. More nuanced analyses of the biases in the Twitter population will enhance the ability to use Twitter as a sophisticated inference tool.
Acknowledgements
We thank Fabricio Benevenuto and Meeyoung Cha for their
assistance in gathering the Twitter data used in this study.
We also thank Jim Bagrow for valuable discussions and his
collection of geographic data from Google Maps. This re-
search was supported in part by NSF grant IIS-0964465 and
an Amazon Web Services in Education Grant.
References
Asur, S., and Huberman, B. 2010. Predicting the future with social media. http://arxiv.org/abs/1003.5699.
Bollen, J.; Mao, H.; and Zeng, X.-J. 2010. Twitter mood predicts
the stock market. In ICWSM.
Cha, M.; Haddadi, H.; Benevenuto, F.; and Gummadi, K. 2010.
Measuring user influence in Twitter: The million follower fallacy.
In ICWSM.
Chang, J.; Rosenn, I.; Backstrom, L.; and Marlow, C. 2010.
ePluribus: Ethnicity on social networks. In ICWSM.
Corbett, P. 2010. Facebook demographics and statistics report 2010. http://www.istrategylabs.com/2010/01/facebook-demographics-and-statistics-report-2010-145-growth-in-1-year.
Gastner, M. T., and Newman, M. E. J. 2004. Diffusion-based
method for producing density-equalizing maps. PNAS 101.
O’Connor, B.; Balasubramanyan, R.; Routledge, B.; and Smith, N.
2010. From tweets to polls: Linking text sentiment to public opin-
ion time series. In ICWSM.
Social Security Administration. 2010. Most popular baby names.
http://www.ssa.gov/oact/babynames.
Strayhorn, T. 2009. Sex differences in use of Facebook and MySpace among first-year college students. Stud. Affairs 10(2).
U.S. Census. 2000. Genealogy data: Frequently occurring surnames from census. http://www.census.gov/genealogy/www/data/2000surnames.
... The dataset, or rather DameGender (the software used to create the dataset), could be applied in articles across the various disciplines where name-based gender detection tools operate (Sun et al. 2019): software engineering (Vasilescu, Capiluppi, and Serebrenik 2012), linguistics (Hutson 2016) and (Al-Zumor 2009), bibliometrics (Holman, Stuart-Fox, and Hauser 2018), journalism (Mislove et al. 2011), (Niemi 2017) and (Sainz de Baranda 2014), ... with the aim of many of them qualifying as reproducible science (Peng 2011), provided the proper precautions of the definition are taken. The structure of this article is as follows: ...
... This type of study makes it possible to analyze the demographics of the Twitter population (Mislove et al. 2011), with sex being a key variable in any demographic study. In this case, ...
Article
Gender equality is the fifth Sustainable Development Goal (SDG) of the United Nations. This equality can be achieved by measuring, analyzing data, and creating good policies from the results. Many gender studies count men and women to explain possible inequality, for example, in research articles, jobs, streets, etc. The traditional research method is to use commercial APIs with proprietary data and no insight into how the data was collected. Data can also be collected from Wikipedia, linguistic studies, scientific sites, or statistical offices. This approach is based on collecting Open Datasets that include name, gender, and frequency from many statistical institutions. Thus, the tasks addressed are based on unifying formats, processing data, and creating tests to measure the accuracy of the new datasets. The dataset used covers more than 20 countries in the Western world, providing thousands of names with an accuracy above 90%. This will allow students and academics interested in the phenomenon to measure the gender gap at no cost and in a reproducible way, and more people will be contributing to eliminating the gender gap. Free Software and the data provided by statistical institutions make it possible to produce peer-reproducible research.
... A major part of our analysis is heavily focused on determining and comparing the toxic characteristics of tweets shared by male and female users who participated in the #MeToo movement. Since it is not mandatory for users to disclose their gender on Twitter, we utilized the methodology introduced by Nilizadeh et al. [54] for identifying perceived gender using the Face++ API [48] and 92,626 unique first names and their gender profiles in the U.S. from 1990 to 2013 [51]. Also, users have increasingly adopted the use of gender-identifiable pronouns on their social media profiles including Twitter [38,80], which provided us with a great opportunity to identify their gender as self-disclosed by the user. ...
... Our process uses the following tools: Firstly, we checked if a particular tweet shared by a user was talking about women/ men by using Spacy [67], an advanced open-source library for natural language processing tasks. Spacy uses a syntactic parser and pre-trained word embedding to recognize the subject/object tokens of a sentence, which for our purposes were female/ male pronouns and names collected from the US Census database [51]. Secondly, since toxic tweets do not always contain offensive words [21], we use a combination of NLP-based tools to recognize the sentiment/ emotion of the tweets, which are -NLTK's sentiment analysis [55], Linguistic Inquiry and Word Count (LIWC)'s [40] 'Anger' score metric and Perspective API's [3] 'Severe toxicity' score. ...
Preprint
The #MeToo movement has catalyzed widespread public discourse surrounding sexual harassment and assault, empowering survivors to share their stories and holding perpetrators accountable. While the movement has had a substantial and largely positive influence, this study aims to examine the potential negative consequences in the form of increased hostility against women and men on the social media platform Twitter. By analyzing tweets shared between October 2017 and January 2020 by more than 47.1k individuals who had either disclosed their own sexual abuse experiences on Twitter or engaged in discussions about the movement, we identify the overall increase in gender-based hostility towards both women and men since the start of the movement. We also monitor 16 pivotal real-life events that shaped the #MeToo movement to identify how these events may have amplified negative discussions targeting the opposite gender on Twitter. Furthermore, we conduct a thematic content analysis of a subset of gender-based hostile tweets, which helps us identify recurring themes and underlying motivations driving the expressions of anger and resentment from both men and women concerning the #MeToo movement. This study highlights the need for a nuanced understanding of the impact of social movements on online discourse and underscores the importance of addressing gender-based hostility in the digital sphere.
... However, it is also important to note that simply removing features that represent bias, such as gender and age, is not sufficient. This is because bias can be deeply embedded in the data and manifest indirectly through other correlated features [72] [73]. ...
Article
Sentiment analysis (SA) and text emotion detection (TED) are two computer techniques used to analyze text. SA categorizes text into positive, negative, or neutral opinions, while TED can identify a wide array of emotional states, allowing an automated agent to respond appropriately. These techniques can be helpful in areas such as employee and customer management, online support, and customer loyalty, where identifying human emotions is crucial. Among other approaches, research has been conducted using machine learning (ML) algorithms, and labeled datasets have been created to train these models. Current state-of-the-art research for supervised ML algorithms reports good performance for TED (approximately 80% accuracy) and even better results for SA (above 90%). After conducting an extensive review of 30 survey articles, the primary objective of this manuscript is to highlight the disproportionate emphasis placed on comparing computational approaches, as evidenced by the share of articles surveyed that feature algorithmic aspects in their summaries (94%), versus the corpora utilized for training (30%) and the data source employed during analysis and evaluation (20%). The lack of standardization across these essential elements presents a significant challenge when performing meaningful performance comparisons among algorithms. Consequently, the absence of a unified framework for comparison hampers the practical implementation of SA and TED techniques within real-world mission-critical scenarios.
... Nonetheless, social media data have been shown more potential in filling these gaps. First, these data have more user-generated information (e.g., surnames), which can be used to infer an individual's raceethnicity, including those minority groups (e.g., Chang, Rosenn, Backstrom, & Marlow, 2010;Mislove, Lehmann, Ahn, Onnela, & Rosenquist, 2011). Moreover, previous studies have demonstrated that geo-tagged social media data can be used to infer individual economic status, by predicting the individual's home location via trajectory mining and attaching the economic statistics by census aggregated units to the predicted home location of each individual (e.g., Huang & Wong, 2016;Wu & Huang, 2022). ...
... They show that these variables do not follow the normal distribution and on average, users in the recommended dataset have more friends and reviews than those in the non-recommended dataset. Extraction of users' gender: We extracted the gender of Yelp users in our recommended and not-recommended datasets by using the methodology employed by Mislove et al. [81]. This gender detection algorithm computes the probability of usernames being a specific gender by using the names obtained from Census data from the years 1900 to 2013. ...
Preprint
Web 2.0 recommendation systems, such as Yelp, connect users and businesses so that users can identify new businesses and simultaneously express their experiences in the form of reviews. Yelp recommendation software moderates user-provided content by categorizing them into recommended and not-recommended sections. Due to Yelp's substantial popularity and its high impact on local businesses' success, understanding the fairness of its algorithms is crucial. However, with no access to the training data and the algorithms used by such black-box systems, studying their fairness is not trivial, requiring a tremendous effort to minimize bias in data collection and consider the confounding factors in the analysis. This large-scale data-driven study, for the first time, investigates Yelp's business ranking and review recommendation system through the lens of fairness. We define and examine 4 hypotheses to examine if Yelp's recommendation software shows bias and if Yelp's business ranking algorithm shows bias against restaurants located in specific neighborhoods. Our findings show that reviews of female and less-established users are disproportionately categorized as recommended. We also find a positive association between restaurants being located in hotspot regions and their average exposure. Furthermore, we observed some cases of severe disparity bias in cities where the hotspots are in neighborhoods with less demographic diversity or areas with higher affluence and education levels. Indeed, biases introduced by data-driven systems, including our findings in this paper, are (almost) always implicit and through proxy attributes. Still, the authors believe such implicit biases should be detected and resolved as those can create cycles of discrimination that keep increasing the social gaps between different groups even further.
... Existing literature has identified multiple challenges related to the "representativeness" of Twitter data. Twitter users tend to be an overrepresentation of young, male, educated, and urbanized populations (Barberá and Rivero, 2015; Mellon and Prosser, 2017; Mislove et al., 2011). The other source of bias is model bias, possibly introduced by analysts during the word choice for the analysis (e.g., damage indicators and "false" damage patterns to filter in damage-related tweets) and the annotation process. ...
... Meanwhile, we expect similar correlations when focusing on urban population. Since most of our Twitter users are concentrated in urban areas, we expect more linguistic and socioeconomic heterogeneity at these places rather than in rural areas (27). We therefore consider the eight largest metropolitan areas in England as listed earlier. ...
Preprint
Full-text available
The socioeconomic background of people and how they use standard forms of language are not independent, as demonstrated in various sociolinguistic studies. However, the extent to which these correlations may be influenced by the mixing of people from different socioeconomic classes remains relatively unexplored from a quantitative perspective. In this work we leverage geotagged tweets and transferable computational methods to map deviations from standard English on a large scale, in seven thousand administrative areas of England and Wales. We combine these data with high-resolution income maps to assign a proxy socioeconomic indicator to home-located users. Strikingly, across eight metropolitan areas we find a consistent pattern suggesting that the more different socioeconomic classes mix, the less interdependent the frequency of their departures from standard grammar and their income become. Further, we propose an agent-based model of linguistic variety adoption that sheds light on the mechanisms that produce the observations seen in the data.
... With an operational Metaverse, bridges are built between distant entities, allowing for real-time interaction regardless of physical location [5]. started to redefine traditional marketing norms and calls for a new conceptual understanding and strategic approach [11]. ...
Article
This paper explores the radical impact of the Metaverse on modern marketing strategies, identifying new opportunities while acknowledging its unique challenges. As the Metaverse rapidly evolves as the next significant frontier in digital engagement, understanding its implications for marketing professionals is crucial. Drawing on both theoretical and empirical assessments, the paper delves into the core differences between traditional digital marketing and Metaverse marketing, bringing to light the vast opportunities inherent in enhanced customer engagement, personalized marketing, immersive shopping experiences, and more. Concurrently, it analyses potential challenges, such as data privacy issues, technology adoption hurdles and regulatory complications. Grounded in real-world case studies, the paper illuminates successful Metaverse marketing strategies while forecasting future trends and challenges. The concluding section offers a snapshot of the Metaverse's future development and provides strategies for marketers to navigate this intricate yet promising digital landscape. This study not only provides a comprehensive understanding of Metaverse marketing but also acts as a strategic guide for practitioners venturing into the Metaverse.
Article
We present an indicator of job loss derived from Twitter data, based on a fine-tuned neural network with transfer learning to classify if a tweet is job-loss related or not. We show that our Twitter-based measure of job loss is well-correlated with and predictive of other measures of unemployment available in the official statistics and with the added benefits of real-time availability and daily frequency. These findings are especially strong for the period of the Pandemic Recession, when our Twitter indicator continues to track job loss well but where other real-time measures like unemployment insurance claims provided an imperfect signal of job loss. Additionally, we find that our Twitter job loss indicator provides incremental information in predicting official unemployment flows in a given month beyond what weekly unemployment insurance claims offer.
Conference Paper
Directed links in social media could represent anything from intimate friendships to common interests, or even a passion for breaking news or celebrity gossip. Such directed links determine the flow of information and hence indicate a user's influence on others—a concept that is crucial in sociology and viral marketing. In this paper, using a large amount of data collected from Twitter, we present an in-depth comparison of three measures of influence: indegree, retweets, and mentions. Based on these measures, we investigate the dynamics of user influence across topics and time. We make several interesting observations. First, popular users who have high indegree are not necessarily influential in terms of spawning retweets or mentions. Second, most influential users can hold significant influence over a variety of topics. Third, influence is not gained spontaneously or accidentally, but through concerted effort such as limiting tweets to a single topic. We believe that these findings provide new insights for viral marketing and suggest that topological measures such as indegree alone reveal very little about the influence of a user.
Conference Paper
We propose an approach to determine the ethnic breakdown of a population based solely on people's names and data provided by the U.S. Census Bureau. We demonstrate that our approach is able to predict the ethnicities of individuals as well as the ethnicity of an entire population better than natural alternatives. We apply our technique to the population of U.S. Facebook users and uncover the demographic characteristics of ethnicities and how they relate. We also discover that while Facebook has always been diverse, diversity has increased over time leading to a population that today looks very similar to the overall U.S. population. We also find that different ethnic groups relate to one another in an assortative manner, and that these groups have different profiles across demographics, beliefs, and usage of site features.
Article
In recent years, social media has become ubiquitous and important for social networking and content sharing. And yet, the content that is generated from these websites remains largely untapped. In this paper, we demonstrate how social media content can be used to predict real-world outcomes. In particular, we use the chatter from Twitter.com to forecast box-office revenues for movies. We show that a simple model built from the rate at which tweets are created about particular topics can outperform market-based predictors. We further demonstrate how sentiments extracted from Twitter can be further utilized to improve the forecasting power of social media.
Article
Map makers have for many years searched for a way to construct cartograms, maps in which the sizes of geographic regions such as countries or provinces appear in proportion to their population or some other analogous property. Such maps are invaluable for the representation of census results, election returns, disease incidence, and many other kinds of human data. Unfortunately, to scale regions and still have them fit together, one is normally forced to distort the regions' shapes, potentially resulting in maps that are difficult to read. Many methods for making cartograms have been proposed, some of them are extremely complex, but all suffer either from this lack of readability or from other pathologies, like overlapping regions or strong dependence on the choice of coordinate axes. Here, we present a technique based on ideas borrowed from elementary physics that suffers none of these drawbacks. Our method is conceptually simple and produces useful, elegant, and easily readable maps. We illustrate the method with applications to the results of the 2000 U.S. presidential election, lung cancer cases in the State of New York, and the geographical distribution of stories appearing in the news.
Article
Behavioral economics tells us that emotions can profoundly affect individual behavior and decision-making. Does this also apply to societies at large, i.e., can societies experience mood states that affect their collective decision making? By extension is the public mood correlated or even predictive of economic indicators? Here we investigate whether measurements of collective mood states derived from large-scale Twitter feeds are correlated to the value of the Dow Jones Industrial Average (DJIA) over time. We analyze the text content of daily Twitter feeds by two mood tracking tools, namely OpinionFinder that measures positive vs. negative mood and Google-Profile of Mood States (GPOMS) that measures mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy). We cross-validate the resulting mood time series by comparing their ability to detect the public's response to the presidential election and Thanksgiving day in 2008. A Granger causality analysis and a Self-Organizing Fuzzy Neural Network are then used to investigate the hypothesis that public mood states, as measured by the OpinionFinder and GPOMS mood time series, are predictive of changes in DJIA closing values. Our results indicate that the accuracy of DJIA predictions can be significantly improved by the inclusion of specific public mood dimensions but not others. We find an accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA and a reduction of the Mean Average Percentage Error by more than 6%.