Understanding the Demographics of Twitter Users

Alan Mislove, Sune Lehmann, Yong-Yeol Ahn, Jukka-Pekka Onnela, J. Niels Rosenquist
Northeastern University, Technical University of Denmark, Harvard Medical School
Abstract
Every second, the thoughts and feelings of millions of people
across the world are recorded in the form of 140-character
tweets using Twitter. However, despite the enormous poten-
tial presented by this remarkable data source, we still do not
have an understanding of the Twitter population itself: Who
are the Twitter users? How representative of the overall pop-
ulation are they? In this paper, we take the first steps towards
answering these questions by analyzing data on a set of Twit-
ter users representing over 1% of the U.S. population. We
develop techniques that allow us to compare the Twitter pop-
ulation to the U.S. population along three axes (geography,
gender, and race/ethnicity), and find that the Twitter popula-
tion is a highly non-uniform sample of the population.
Introduction
Online social networks are now a popular way for users to
connect, communicate, and share content; many serve as the
de-facto Internet portal for millions of users. Because of the
massive popularity of these sites, data about the users and
their communication offers unprecedented opportunities to
examine how human society functions at scale. However,
concerns over user privacy often force service providers to
keep such data private. Twitter represents an exception: Over
91% of Twitter users choose to make their profile and com-
munication history publicly visible, allowing researchers
access to the vast majority of the site. Twitter, therefore,
presents a unique opportunity to examine the public com-
munication of a large fraction of the population.
In fact, researchers have recently begun to use the con-
tent of Twitter messages to measure and predict real-world
phenomena, including movie box office returns (Asur and
Huberman 2010), elections (O’Connor et al. 2010), and the
stock market (Bollen, Mao, and Zeng 2010). While these
studies show remarkable promise, one heretofore unan-
swered question is: Are Twitter users a representative sam-
ple of society? If not, which demographics are over- or un-
derrepresented in the Twitter population? Because existing
studies generally treat Twitter as a “black box,” shedding
light on the characteristics of the Twitter population is likely
to lead to improvements in existing prediction and measurement
methods. Moreover, understanding the characteristics
of the Twitter population is crucial to move towards more
advanced observations and predictions, since such an understanding
will help us determine what predictions can be
made and what other data is necessary to correct for any biases.
Copyright © 2011, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
In this paper, we take a first look at the demographics of
the Twitter users, aiming to answer these questions. To do
so, we use a data set of over 1,755,925,520 Twitter messages
sent by 54,981,152 users between March 2006 and August
2009 (Cha et al. 2010). We focus on users whose identified
location is within the United States, because the plurality of
users at the time of the data collection are in the U.S., and because
we have detailed demographic data for the U.S. population.
Even with the location constraint, our dataset covers
over three million users, representing more than 1% of the
entire U.S. population.
Ideally, when comparing the Twitter population to society
as a whole, we would like to compare properties including
socio-economic status, education level, and type of employ-
ment. However, we are restricted to only using the data that
is (optionally) self-reported and made visible by the Twitter
users, including their name, location, and the text of their
tweets. We develop techniques to examine the properties
of the Twitter population along three separate but interre-
lated axes, based on the feasibility of comparison. First, we
compare the geographic distribution of users to the popu-
lation as a whole using U.S. Census data. We demonstrate
that Twitter users are more likely to live within populous
counties than would be expected from the Census data, and
that sparsely populated regions of the U.S. are significantly
underrepresented. Second, we infer the gender of Twitter
users and demonstrate that a significant male bias exists,
although the bias is becoming less pronounced over time.
Third, we examine the race/ethnicity of Twitter users and
demonstrate that the distribution of race/ethnicity is highly
geographically-dependent.
Geographic distribution
Detecting location using self-reported data
To determine geographic information about users, we use
the self-reported location field in the user profile. The loca-
tion is an optional self-reported string; we found that 75.3%
of the publicly visible users listed a location. In order to turn
the user-provided string into a mappable location, we use
the Google Maps API. Beginning with the most popular
location strings (i.e., the strings provided by the most users),
we query Google Maps with each location string. If Google
Maps is able to interpret a string as a location, we receive a
latitude and longitude as a response. We restrict our scope to
users in the U.S. by only considering response latitudes and
longitudes that are within the U.S. In total, we find
mappings to a U.S. longitude and latitude for 246,015 unique
strings, covering 3,279,425 users (representing 8.8% of the
users who list a location).
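The lookup loop described above can be sketched as follows. This is a minimal illustration rather than the code used in this study: `geocode` is a hypothetical stand-in for the Google Maps query (here a small in-memory table), and the bounding box in `in_us` is a crude assumption that covers only the contiguous states.

```python
from collections import Counter

# Hypothetical stand-in for a geocoding service: maps a location string to
# (latitude, longitude), or returns None when the string is not a place.
_FAKE_GEOCODER = {
    "boston, ma": (42.36, -71.06),
    "san francisco": (37.77, -122.42),
    "london": (51.51, -0.13),
}


def geocode(location):
    return _FAKE_GEOCODER.get(location.strip().lower())


def in_us(lat, lon):
    # Assumption: rough bounding box of the contiguous U.S. only.
    return 24.5 <= lat <= 49.5 and -125.0 <= lon <= -66.9


def locate_users(profile_locations):
    """Resolve location strings, visiting them from most to least popular.

    Returns ({string: (lat, lon)} for strings inside the U.S. box,
             Counter of how many users gave each non-empty string).
    """
    counts = Counter(s for s in profile_locations if s)
    mapped = {}
    for string, _num_users in counts.most_common():
        coords = geocode(string)
        if coords is not None and in_us(*coords):
            mapped[string] = coords
    return mapped, counts


profiles = ["Boston, MA", "Boston, MA", "London", "my couch", "", "san francisco"]
mapped, counts = locate_users(profiles)
covered = sum(counts[s] for s in mapped)   # users with a U.S. mapping
listed = sum(counts.values())              # users who listed any location
```

The ratio `covered / listed` plays the role of the coverage figure reported above; a real pipeline would replace the lookup table with actual geocoder calls and a proper national boundary.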
To compare our Twitter data to the 2000 U.S. Census, it
is necessary to aggregate the users into U.S. counties. Using
data from the U.S. National Atlas and the U.S. Geological
Survey, we map each of the 246,015 latitudes and longitudes
into their respective U.S. county. Unless otherwise stated,
our analysis for the remainder of this paper is at the U.S.
county level.
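One common way to assign a coordinate to a county is a point-in-polygon test against county boundary polygons. The sketch below uses ray casting and is only an assumption about how such a mapping could be done; the text does not specify the method, and the two square "counties" are toy boundaries rather than National Atlas data.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def county_of(lon, lat, counties):
    """Return the name of the first county polygon containing the point."""
    for name, polygon in counties.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Toy square "counties" (hypothetical boundaries, not real data).
counties = {
    "County A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "County B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
```

For example, `county_of(0.5, 0.5, counties)` falls in the first square, while a point outside both squares yields `None`.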
Limitations We now briefly discuss potential limitations
of our location inference methodology. First, it is worth not-
ing that Google Maps will also interpret locations that are
at a granularity coarser than a U.S. county (e.g., “Texas”).
We manually removed these, including the mappings of all
50 states, as well as “United States” and “Earth.” Second,
users may lie about their location, or may list an out-of-date
location. Third, since the location is per-user (rather than
per-tweet), a user who moves from one city to another (and
updates his location) will have all of his tweets considered
as being from the latter location.
Geographic distribution of Twitter users
We begin by examining the geographic distribution of Twitter
users, and comparing it to the entire U.S. population.
Overall, the 3,279,425 Twitter users who we are able to geo-
locate represent 1.15% of the entire population (at the time
of the 2000 Census). However, if we examine the distribu-
tion of Twitter users per county, we observe a highly non-
uniform distribution.
Figure 1 presents this analysis, with the county population
along the x-axis and the fraction of this population we
observe in Twitter along the y-axis. We see that, as the population
of the county increases, the Twitter representation rate
(simply the number of Twitter users in that county divided
by the number of people in that county in the 2000 U.S.
Census) increases as well. For example, consider the median
per-county Twitter representation rate of 0.324%. We ob-
serve that 93.5% of the counties with over 100,000 residents
have a higher Twitter representation rate than the median,
compared to only 40.8% of the counties with fewer than
100,000 residents (were Twitter users a truly random pop-
ulation sample, we would expect these percentages to both
be 50%). Thus, the Twitter users significantly overrepresent
populous counties, a fact underscored by the difference between
the median per-county Twitter representation rate (0.324%)
and the overall population sample of 1.15%.
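The quantities used in this comparison can be written down compactly. The following is an illustrative sketch with invented toy counts (not our data); the function and variable names are our own choices for the example.

```python
from statistics import median


def representation_rates(twitter_users, census_population):
    """Twitter users divided by Census population, per county."""
    return {
        county: twitter_users.get(county, 0) / pop
        for county, pop in census_population.items()
        if pop > 0
    }


def share_above(rates, counties, cutoff):
    """Fraction of the given counties whose rate exceeds the cutoff."""
    counties = list(counties)
    return sum(rates[c] > cutoff for c in counties) / len(counties)


# Toy numbers, invented for illustration only.
population = {"A": 2_000_000, "B": 500_000, "C": 40_000, "D": 8_000, "E": 1_500}
users = {"A": 40_000, "B": 6_000, "C": 120, "D": 16, "E": 6}

rates = representation_rates(users, population)
median_rate = median(rates.values())

large = [c for c in rates if population[c] > 100_000]
small = [c for c in rates if population[c] <= 100_000]
large_share = share_above(rates, large, median_rate)
small_share = share_above(rates, small, median_rate)

# Pooled rate over all counties, analogous to the overall sample fraction.
overall_rate = sum(users.values()) / sum(population.values())
```

In this toy input every large county lies above the median and no small county does, so the pooled rate sits well above the per-county median, which is the same qualitative pattern described above.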
The overrepresentation of populous counties in and of itself
may not come as a surprise, due to the patterns of social
media adoption across different regions. However, the
magnitude of the difference is striking: We observe an order
of magnitude difference in median per-county Twitter
representation rate between counties with 1,000 people and
counties with 1,000,000 people. This indicates a bias in the
Twitter population (relative to the U.S. population) and suggests
that entire regions of the U.S. may be significantly
underrepresented.
Figure 1: Scatterplot of US county population versus Twitter
representation rate in that county. The dark line represents
the aggregated median, and the dashed black line represents
the overall median (0.324%). There is a clear overrepresentation
of more populous counties. [Axes: County Population, 10^3 to 10^7 (x);
Twitter Representation Rate, 0.01% to 10.00% (y).]
Distribution across counties We now examine which regions
of the U.S. contain these over- and underrepresented
counties. To do so, we plot a map of the U.S. based on the
Twitter representation rate, relative to the median rate of
0.324%. Figure 2 presents this data, using both a normal
representation and an area cartogram representation (Gastner
and Newman 2004). In this figure, the counties are colored
according to the level of over- or underrepresentation, with
blue colors representing underrepresentation and red colors
representing overrepresentation, relative to the median rate
of 0.324%. Thus, the same number of counties will be col-
ored red as blue.
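The coloring rule can be sketched as a signed log ratio against the median rate. This is a minimal illustration with toy rates; skipping counties with no observed users (where the logarithm is undefined) is our assumption for the example, not something stated in the text.

```python
import math
from statistics import median


def log_relative_rate(rates):
    """log10(rate / median rate) per county.

    Positive values indicate overrepresentation, negative values
    underrepresentation. Counties with a zero rate are skipped here.
    """
    med = median(rates.values())
    return {c: math.log10(r / med) for c, r in rates.items() if r > 0}


def color_of(score):
    # Red for overrepresentation, blue for underrepresentation.
    if score > 0:
        return "red"
    if score < 0:
        return "blue"
    return "neutral"


# Toy rates chosen so the median equals the middle county's rate.
rates = {"A": 0.0324, "B": 0.00324, "C": 0.000324}
scores = log_relative_rate(rates)
colors = {c: color_of(s) for c, s in scores.items()}
```

The magnitude of each score could then drive the color intensity, so that a county ten times above the median is as saturated as one ten times below it.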
These two maps lead to a number of interesting conclu-
sions: First, as evident in the normal representation, much of
the mid-west is significantly underrepresented in the Twit-
ter user base in this time period. Second, as evident in the
significantly red hue of the area cartogram, more populous
counties are consistently oversampled. However, the level of
oversampling does not appear to be dependent upon geogra-
phy: Both east coast and west coast cities are clearly visible
(e.g., San Francisco and Boston), as well as mid-west and
southern cities (e.g., Dallas, Chicago, and Atlanta).
Figure 2: Per-county over- and underrepresentation of U.S. population in Twitter, relative to the median per-county representation
rate of 0.324%, presented in both (a) a normal layout and (b) an area cartogram based on the 2000 Census population.
Blue colors indicate underrepresentation, while red colors represent overrepresentation. The intensity of the color corresponds
to the log of the over- or underrepresentation rate. Clear trends are visible, such as the underrepresentation of mid-west and
overrepresentation of populous counties. [Panels: (a) Normal representation; (b) Area cartogram representation.]
Gender
Detecting gender using first names
As we have very limited information available on each user,
we rely on using the self-reported name available in each
user’s profile in order to detect gender. To do so, we first obtain
the most popular 1,000 male and female names for babies
born in the U.S. for each year 1900–2009, as reported
by the U.S. Social Security Administration (Social Security
Administration 2010). We then aggregate the names together,
calculating the total frequency of each of the resulting
3,034 male and 3,643 female names. As certain names
occurred in both lists, we remove the 241 names that were
less than 95% predictive (e.g., the name Avery was observed
to correspond to male babies only 56.8% of the time; it was
therefore removed). The result is a list of 5,836 names that
we use to infer gender.
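The aggregation and filtering of names can be sketched as below. The input format and the counts are invented for illustration (only the 56.8% male share for "Avery" echoes the example in the text), and `build_gender_list` is a hypothetical helper name.

```python
from collections import defaultdict


def build_gender_list(yearly_counts, min_predictive=0.95):
    """Build {name: 'M' or 'F'} from (name, sex, count) records.

    Counts are summed over all years; a name is kept only if at least
    `min_predictive` of its total count belongs to one sex.
    """
    totals = defaultdict(lambda: {"M": 0, "F": 0})
    for name, sex, count in yearly_counts:
        totals[name.lower()][sex] += count

    genders = {}
    for name, t in totals.items():
        share_male = t["M"] / (t["M"] + t["F"])
        if share_male >= min_predictive:
            genders[name] = "M"
        elif 1 - share_male >= min_predictive:
            genders[name] = "F"
        # Otherwise the name is ambiguous and is dropped.
    return genders


# Invented counts for illustration.
records = [
    ("James", "M", 990), ("James", "F", 10),
    ("Mary", "F", 1000),
    ("Avery", "M", 568), ("Avery", "F", 432),
]
gender_list = build_gender_list(records)
```

Here "Avery" is discarded because neither sex reaches the 95% threshold, while "James" and "Mary" are retained.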
Limitations Clearly, this approach to detecting gender is
subject to a number of potential limitations. First, users may
misrepresent their name, leading to an incorrect gender in-
ference. Second, there may be differences in choosing to re-
veal one’s name between genders, leading us to believe that
fewer users of one gender are present. Third, the name lists
above may cover different fractions of the male and female
populations.
Gender of Twitter users
We first determine the number of the 3,279,425 U.S.-based
users who we could infer a gender for, based on their name
and the list previously described. We do so by comparing
the first word of their self-reported name to the gender list.
We observe that there exists a match for 64.2% of the users.
Moreover, we find a strong bias towards male users: Fully
71.8% of the users who we find a name match for had a
male name.
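The matching step amounts to a dictionary lookup on the first word of each profile name. The sketch below uses a toy gender list and toy names; lowercasing the first word before the lookup is an assumption made for the example.

```python
def infer_gender(display_name, gender_list):
    """Look up the first whitespace-separated word of the profile name."""
    words = display_name.split()
    if not words:
        return None
    return gender_list.get(words[0].lower())


gender_list = {"james": "M", "mary": "F"}  # toy list, for illustration
names = ["James T. Kirk", "mary smith", "xX_dragon_Xx", "", "James Bond"]

inferred = [infer_gender(n, gender_list) for n in names]
matched = [g for g in inferred if g is not None]

match_rate = len(matched) / len(names)          # share of users with a match
male_share = matched.count("M") / len(matched)  # share of matches that are male
```

`match_rate` and `male_share` correspond to the two percentages reported above, computed here on five made-up profiles.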
Figure 3: Gender of joining users over time, binned into
groups of 10,000 joining users (note that the join rate increases
substantially). The bias towards male users is observed
to be decreasing over time. [Axes: Date, 2007-01 to 2009-07 (x);
Fraction of Joining Users who are Male, 0 to 1 (y).]
To further explore this trend, we examine the historic gender
bias. To do so, we use the join date of each user (available
in the user’s profile). Figure 3 plots the average fraction
of joining users who are male over time. From this plot, it
is clear that while the male gender bias was significantly
stronger among the early Twitter adopters, the bias is be-
coming reduced over time.
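The binning behind such a plot can be sketched as follows: users are ordered by join date, cut into consecutive groups of a fixed size, and the male fraction is computed per group. The data are invented, and a bin size of 4 is used only so the toy example stays small; letting the final bin be shorter than the others is our simplification.

```python
from datetime import date


def male_fraction_by_bin(users, bin_size=10_000):
    """users: iterable of (join_date, gender) with gender 'M' or 'F'.

    Returns a list of (first join date in bin, fraction male), one entry
    per consecutive group of `bin_size` users ordered by join date.
    The last bin may contain fewer users.
    """
    ordered = sorted(users, key=lambda u: u[0])
    points = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        males = sum(1 for _joined, gender in chunk if gender == "M")
        points.append((chunk[0][0], males / len(chunk)))
    return points


# Invented join dates and genders.
users = [
    (date(2007, 1, 5), "M"), (date(2007, 3, 9), "M"),
    (date(2008, 6, 1), "M"), (date(2008, 7, 2), "F"),
    (date(2009, 4, 3), "F"), (date(2009, 5, 4), "M"),
    (date(2009, 5, 30), "F"), (date(2009, 6, 1), "F"),
]
series = male_fraction_by_bin(users, bin_size=4)
```

Because bins hold a fixed number of users rather than a fixed time span, periods with a higher join rate produce more closely spaced points.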
Race/ethnicity
Detecting race/ethnicity using last names
Again, since we have very limited information available
on each Twitter user, we resort to inferring race/ethnicity
using self-reported last name. We examine the last name
of users, and correlate the last name with data from the
U.S. 2000 Census (U.S. Census 2000). In more detail, for
each last name with over 100 individuals in the U.S. dur-
ing the 2000 Census, the Census releases the distribution of
race/ethnicity for that last name. For example, the last name
“Myers” was observed to correspond to Caucasians 86% of
the time, African-Americans 9.7%, Asians 0.4%, and His-
panics 1.4%.
Race/ethnicity distribution of Twitter users
We first determined the number of U.S.-based users for
whom we could infer the race/ethnicity by comparing the
last word of their self-reported name to the U.S. Census
last name list. We observed that we found a match for
71.8% of the users. We then determined the distribution of
race/ethnicity in each county by taking the race/ethnicity
distribution in the Census list, weighted by the frequency
of each name occurring in Twitter users in that county.1
Due to the large amount of ambiguity in the last name-to-race/ethnicity
list (in particular, the last name list is more
than 95% predictive for only 18.5% of the users), we are unable
to compare the Twitter race/ethnicity distribution
directly to the race/ethnicity distribution in the U.S. Census.
However, we are able to make relative comparisons between
Twitter users in different geographic regions, allowing us to
explore geographic trends in the race/ethnicity distribution.
Thus, we examine the per-county race/ethnicity distribution
of Twitter users.
1 This is effectively the census.model approach discussed in
prior work (Chang et al. 2010).
Figure 4: Per-county area cartograms of Twitter over- and undersampling rates of Caucasian, African-American, Asian, and
Hispanic users, relative to the 2000 U.S. Census. Only counties with more than 500 Twitter users with inferred race/ethnicity
are shown. Blue regions correspond to undersampling; red regions to oversampling.
[Panels: (a) Caucasian (non-hispanic); (b) African-American; (c) Asian or Pacific Islander; (d) Hispanic.]
In order to account for the uneven distribution of
race/ethnicity across the U.S., we examine the per-county
race/ethnicity distribution relative to the distribution from
the overall U.S. Census. For example, if we observed that
25% of Twitter users in a county were predicted to be Hispanic,
and the 2000 U.S. Census counted 23% of people in that
county as being Hispanic, we would consider Twitter to be
oversampling the Hispanic users in that county. Figure 4
plots the per-county race/ethnicity distribution, relative to
the 2000 U.S. Census, for all counties in which we observed
more than 500 Twitter users with identifiable last names.
A number of geographic trends are visible, such as the
undersampling of Hispanic users in the southwest; the
undersampling of African-American users in the south and
mid-west; and the oversampling of Caucasian users in many
major cities.
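The two steps of this section, the name-weighted county distribution and its comparison with the Census, can be sketched as below. The "myers" row uses the percentages quoted earlier; the "garcia" row and the county Census shares are invented for illustration, and expressing over- or undersampling as a simple ratio is our assumption about one reasonable way to make the comparison.

```python
from collections import Counter

GROUPS = ("caucasian", "african_american", "asian", "hispanic")


def county_distribution(last_names, surname_table):
    """Average the per-surname distributions, weighted by how often each
    surname occurs among the county's users. Unmatched names are skipped."""
    counts = Counter(
        n.lower() for n in last_names if n.lower() in surname_table
    )
    total = sum(counts.values())
    if total == 0:
        return None
    return {
        g: sum(surname_table[n][g] * k for n, k in counts.items()) / total
        for g in GROUPS
    }


def sampling_ratio(twitter_dist, census_dist):
    """Ratio above 1 suggests oversampling of a group, below 1 undersampling."""
    return {
        g: twitter_dist[g] / census_dist[g]
        for g in GROUPS
        if census_dist[g] > 0
    }


surname_table = {
    # Values quoted in the text for "Myers".
    "myers": {"caucasian": 0.86, "african_american": 0.097,
              "asian": 0.004, "hispanic": 0.014},
    # Invented values, for illustration only.
    "garcia": {"caucasian": 0.05, "african_american": 0.005,
               "asian": 0.015, "hispanic": 0.92},
}
county_names = ["Myers", "Myers", "Garcia", "Zzyzx"]
# Invented Census shares for the same hypothetical county.
census = {"caucasian": 0.70, "african_american": 0.05,
          "asian": 0.02, "hispanic": 0.23}

dist = county_distribution(county_names, surname_table)
ratio = sampling_ratio(dist, census)
```

In this toy county the inferred Hispanic share exceeds the Census share, so the ratio is above 1 and the county would be drawn as oversampling Hispanic users.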
Related work
A few other studies have examined the demographics of social
network users. For example, recent studies have examined
the ethnicity of Facebook users (Chang et al. 2010),
general demographics of Facebook users (Corbett 2010),
and differences in online behavior on Facebook and MyS-
pace by gender (Strayhorn 2009). However, studies of gen-
eral social networking sites are able to leverage the broad
nature of the profiles available; in contrast, on Twitter, users
self-report only a minimal set of information, making calcu-
lating demographics significantly more difficult.
Conclusion
Twitter has received significant research interest lately as a
means for understanding, monitoring, and even predicting
real-world phenomena. However, most existing work does
not address the sampling bias, simply applying machine
learning and data mining algorithms without an understand-
ing of the Twitter user population. In this paper, we took
a first look at the user population itself, and examined
the population along the axes of geography, gender, and
race/ethnicity. Overall, we found that Twitter users significantly
overrepresent the densely populated regions of the
U.S., are predominantly male, and represent a highly non-
random sample of the overall race/ethnicity distribution.
Going forward, our study sets the foundation for future
work upon Twitter data. Existing approaches could imme-
diately use our analysis to improve predictions or measure-
ments. By enabling post-hoc corrections, our work is a first
step towards turning Twitter into a tool that can make infer-
ences about the population as a whole. More nuanced anal-
yses on the biases in the Twitter population will enhance
the ability for Twitter to be used as a sophisticated inference
tool.
Acknowledgements
We thank Fabricio Benevenuto and Meeyoung Cha for their
assistance in gathering the Twitter data used in this study.
We also thank Jim Bagrow for valuable discussions and his
collection of geographic data from Google Maps. This re-
search was supported in part by NSF grant IIS-0964465 and
an Amazon Web Services in Education Grant.
References
Asur, S., and Huberman, B. 2010. Predicting the future with social
media. http://arxiv.org/abs/1003.5699.
Bollen, J.; Mao, H.; and Zeng, X.-J. 2010. Twitter mood predicts
the stock market. In ICWSM.
Cha, M.; Haddadi, H.; Benevenuto, F.; and Gummadi, K. 2010.
Measuring user influence in twitter: The million follower fallacy.
In ICWSM.
Chang, J.; Rosenn, I.; Backstrom, L.; and Marlow, C. 2010.
ePluribus: Ethnicity on social networks. In ICWSM.
Corbett, P. 2010. Facebook demographics and statistics report
2010. http://www.istrategylabs.com/2010/01/facebook-demographics-and-statistics-report-2010-145-growth-in-1-year.
Gastner, M. T., and Newman, M. E. J. 2004. Diffusion-based
method for producing density-equalizing maps. PNAS 101.
O’Connor, B.; Balasubramanyan, R.; Routledge, B.; and Smith, N.
2010. From tweets to polls: Linking text sentiment to public opin-
ion time series. In ICWSM.
Social Security Administration. 2010. Most popular baby names.
http://www.ssa.gov/oact/babynames.
Strayhorn, T. 2009. Sex differences in use of facebook and mys-
pace among first-year college students. Stud. Affairs 10(2).
U.S. Census. 2000. Genealogy data: Frequently occurring
surnames from census. http://www.census.gov/genealogy/www/data/2000surnames.
... The data is fine-grained and allows for detailed temporal analysis, which is useful for tracking decision-making processes such as the Coal Commission, which span a period of time with different significant events occurring at different stages (Cody et al., 2015;Klašnja et al., 2018). This comes with the downside of not being representative of the population (Mislove et al., 2011;Mellon and Prosser, 2017). ...
... Furthermore, Twitter users are not representative of national populations, as is well documented (Duggan et al., 2015;Mislove et al., 2011;Malik et al., 2015;Mellon and Prosser, 2017;Fernández et al., 2014). In the United States, it has been found that the most populous counties are overrepresented (Mislove et al., 2011), and there are also significant biases towards younger users and users of higher income when comparing geotagged tweets and census data (Malik et al., 2015). ...
... Furthermore, Twitter users are not representative of national populations, as is well documented (Duggan et al., 2015;Mislove et al., 2011;Malik et al., 2015;Mellon and Prosser, 2017;Fernández et al., 2014). In the United States, it has been found that the most populous counties are overrepresented (Mislove et al., 2011), and there are also significant biases towards younger users and users of higher income when comparing geotagged tweets and census data (Malik et al., 2015). ...
Article
Phasing out coal is a prerequisite to achieving the Paris climate mitigation targets. In 2018, the German government established a multi-stakeholder commission with the mandate to negotiate a plan for the national coal phase-out, fueling a continued public debate over the future of coal. This study analyzes the German coal debate on Twitter before, during, and after the session of the so-called Coal Commission, over a period of three years. In particular, we investigate whether and how the work of the commission translated into shared perceptions and sentiments in the public debate on Twitter. We find that the sentiment of the German coal debate on Twitter becomes increasingly negative over time. In addition, the sentiment becomes more polarized over time due to an increase in the use of more negative and positive language. The analysis of retweet networks shows no increase in interactions between communities over time. These findings suggest that the Coal Commission did not further consensus in the coal debate on Twitter. While the debate on social media only represents a section of the national debate, it provides insights for policy-makers to evaluate the interaction of multi-stakeholder commissions and public debates.
... One reason for this relates to how the individuals' home locations are distributed across the study area. Previous studies have suggested that most active Twitter users live in urban areas [42], [40] and that using sparse geolocations of Twitter data for simulating travel demand is more suitable for urban residents than for the population as a whole. Another reason is that São Paulo has the smallest area but the greatest number of individuals and geolocations in the sparse traces. ...
Article
Full-text available
Knowing how much people travel is essential for transport planning. Empirical mobility traces collected from call detail records (CDRs), location-based social networks (LBSNs), and social media data have been used widely to study mobility patterns. However, these data suffer from sparsity, an issue that has largely been overlooked. In order to extend the use of these low-cost and accessible data, this study proposes a mobility model that fills the gaps in sparse mobility traces from which one can later synthesise travel demand. The proposed model extends the fundamental mechanisms of exploration and preferential return to synthesise mobility trips. The model is tested on sparse mobility traces from Twitter. We validate our model and find good agreement on origin-destination matrices and trip distance distributions for Sweden, the Netherlands, and São Paulo, Brazil, compared with a benchmark model using a heuristic method, especially for the most frequent trip distance range (1–40 km). Moreover, the learned model parameters are found to be transferable from one region to another. Using the proposed model, reasonable travel demand values can be synthesised from a dataset covering a large enough population of very sparse individual geolocations (around 1.5 geolocations per day covering 100 days on average).
... One of the main reasons is because this research only employs geotagged tweets which only account for around 1%-2% of all tweets. Furthermore, previous studies have found that Twitter users tend to live in populous counties and sparsely populated counties are significantly underrepresented [61]. Urban users are over-represented and provide more information than rural users [62]. ...
Article
Full-text available
The COVID-19 pandemic has been sweeping across the United States of America since early 2020. The whole world was waiting for vaccination to end this pandemic. Since the approval of the first vaccine by the U.S. CDC on 9 November 2020, nearly 67.5% of the US population have been fully vaccinated by 10 July 2022. While quite successful in controlling the spreading of COVID-19, there were voices against vaccines. Therefore, this research utilizes geo-tweets and Bayesian-based method to investigate public opinions towards vaccines based on (1) the spatiotemporal changes in public engagement and public sentiment; (2) how the public engagement and sentiment react to different vaccine-related topics; (3) how various races behave differently. We connected the phenomenon observed to real-time and historical events. We found that in general the public is positive towards COVID-19 vaccines. Public sentiment positivity went up as more people were vaccinated. Public sentiment on specific topics varied in different periods. African Americans’ sentiment toward vaccines was relatively lower than other races.
... Therefore, user portrait has been extensively investigated in recent years. Alan et al. [10] analyzed Twitter users from three perspectives, geographic location, gender, and belief, and found that the Internet users truly reflect the true population distribution in every area of the USA. Ruas et al. [11] identified different user behaviors through clustering methods based on the degree of interaction among Facebook users. ...
Article
Full-text available
Meeting users’ preferences and increasing business revenue is an ongoing challenge in the mobile service application. In this paper, we address these challenges by mining mobile user behavior patterns and propose an approach to construct a group user portrait by analyzing access data collected from the users of the WeChat Mini Program. We extract the attributes of mobile users considering their geographic information, online duration, and age group. Using Z -score standardized processing and K -means clustering algorithm, we then model the user portraits through three dimensions including daily average duration, interaction intensity, and access frequency. Our analysis has two important features. Firstly, the significant log data used in our experiments was collected from the production environment ensuring that the results reflect the real attributes of WeChat Mini Program users’ behavior. Secondly, we provide data-driven decision-making to help marketers enhance the quality of the product and improve user experience. The experimental results indicate that by distilling and analyzing the key factors from the log data, the characters of typical users can be properly profiled to help product owners better optimize the exact set of the features which need to sustain and further grow.
... It should be noticed that such labels disregard the fluidity of one's gender experience and performance, which would be better described along a spectrum (Eckert and McConnell-Ginet 2003), and they represent age as a chronological variable rather than a social one depending on peoples' personal experiences (Eckert 1997). This simplification is not made by style transfer specifically, but it is common to many studies focused on authors' traits, due to how the available datasets were constructed-e.g., in gender-centric resources, labels are inferred from the name of the texts' authors (Mislove et al. 2011). ...
Article
Full-text available
Humans are naturally endowed with the ability to write in a particular style. They can, for instance, rephrase a formal letter in an informal way, convey a literal message with the use of figures of speech or edit a novel by mimicking the style of some well-known authors. Automating this form of creativity constitutes the goal of style transfer. As a natural language generation task, style transfer aims at rewriting existing texts, and specifically, it creates paraphrases that exhibit some desired stylistic attributes. From a practical perspective, it envisions beneficial applications, like chatbots that modulate their communicative style to appear empathetic, or systems that automatically simplify technical articles for a non-expert audience. Several style-aware paraphrasing methods have attempted to tackle style transfer. A handful of surveys give a methodological overview of the field, but they do not support researchers to focus on specific styles. With this paper, we aim at providing a comprehensive discussion of the styles that have received attention in the transfer task. We organize them in a hierarchy, highlighting the challenges for the definition of each of them and pointing out gaps in the current research landscape. The hierarchy comprises two main groups. One encompasses styles that people modulate arbitrarily, along the lines of registers and genres. The other group corresponds to unintentionally expressed styles, due to an author’s personal characteristics. Hence, our review shows how these groups relate to one another and where specific styles, including some that have not yet been explored, belong in the hierarchy. Moreover, we summarize the methods employed for different stylistic families, hinting researchers towards those that would be the most fitting for future research.
... Biases in geo-located Twitter users have been previously analyzed [13][14][15]. Twitter users trend younger, wealthier and urban. However, under-aged individuals are under-represented and wealth is not a significant factor among mulitple industrialized cities [16]. ...
Article
Understanding and mapping the emergence and boundaries of cultural areas is a challenge for social sciences. In this paper, we present a method for analyzing the cultural composition of regions via Twitter hashtags. Cultures can be described as distinct combination of traits which we capture via principal component analysis (PCA). We investigate the top 8 PCA components of an area including France, Spain, and Portugal, in terms of the geographic distribution of their hashtag composition. We also discuss relationships between components and the insights those relationships can provide into the structure of a cultural space. Finally, we compare the spatial autocorrelation of PCA components in the Twitter data to similar components resulting from the Axelrod model. We conclude that properties of Twitter behavior can be framed in the discussion of cultural emergence and collective learning.
... Despite this wealth of research that has used large corpora to identify regional differences in the structure of language, we are aware of no research that has leveraged such datasets to identify regional variation in the content of language, much less to use this information to infer cultural regions. In general, big data corpora generated from microblogging platforms certainly present a number of biases: incomplete demographic representativeness [32], non-homogeneous spatio-temporal distribution [33] or severe topic differences with the offline world [34]. However, Twitter is the only variety of geotagged natural language data currently available in sufficient amounts to permit reliable analyses in an automatic way, and is a very popular social media platform used regularly by millions of people from across the US, mostly in interactive contexts [35]. ...
Preprint
Cultural areas represent a useful concept that cross-fertilizes diverse fields in social sciences. Knowledge of how humans organize and relate their ideas and behavior within a society helps to understand their actions and attitudes towards different issues. However, the selection of common traits that shape a cultural area is somewhat arbitrary. What is needed is a method that can leverage the massive amounts of data coming online, especially through social media, to identify cultural regions without ad-hoc assumptions, biases or prejudices. In this work, we take a crucial step in this direction by introducing a method to infer cultural regions based on the automatic analysis of large datasets from microblogging posts. Our approach is based on the principle that cultural affiliation can be inferred from the topics that people discuss among themselves. Specifically, we measure regional variations in written discourse generated in American social media. From the frequency distributions of content words in geotagged Tweets, we find the regional hotspots of word usage, and from there we derive principal components of regional variation. Through a hierarchical clustering of the data in this lower-dimensional space, our method yields clear cultural areas and the topics of discussion that define them. We obtain a manifest North-South separation, which is primarily influenced by the African American culture, and further contiguous (East-West) and non-contiguous (urban-rural) divisions that provide a comprehensive picture of today's cultural areas in the US.
... This situation can be affected by the difficulty for social scientists to retrieve socio-demographic variables such as gender, ethnicity, level of education and occupation. To underline this limitation, some authors propose to define these data as "datalight" (Gayo-Avello, 2012); other authors have instead questioned the reliability of Twitter, and other social media, as sources for social research (Gayo-Avello, 2012; Mislove et al., 2011). The lack of demographic data raises two critical questions: 1) how a phenomenon appears within a social stratum and/or a territory; 2) the representativeness of research results. ...
Article
In the data revolution era, new data and new sources allow researchers to find new ways to study society and its dynamics. Among these types of data, geo-located data enable better ways of producing social knowledge. The availability of data with geographic information has put the spatial dimension, initially ignored in social media analysis, at the centre of interest in digital and web studies. In addition, these data also make it possible to address the representativeness of big data innovatively. For this reason, we explore the territorial distribution of geo-located tweets with respect to some significant territorial socio-economic dimensions in Italy. Our main results show a concentration of users in specific macroareas, a direct proportionality between the size of the city and the number of tweets, and more users in the urban centre than in metropolitan suburbs. In conclusion, we try to identify the factors underlying these differences and their implications in terms of data analysis and representativeness of the results.
Chapter
Recently, the media sphere has been found to be regularly flooded with misinformation and disinformation. Some scholars refer to such outbreaks as infodemics (Bruns, A., Harrington, S., & Hurcombe, E., Media International Australia, 2020). We are at a point where we must challenge the assumption of online spaces as emblematic of democratic ideals where participation is assumed to solely foster healthy debate in an authentic marketplace of ideas. Prominent in this subverted media arena are bots, inauthentic participants in the exchange of information, and trolls, the “Ghostwriters,” flamethrowers of the internet, whether automated or not. This problem is felt acutely in Lithuania where there are fears that a Russian information war could turn into real war or other political disruption. It is, therefore, imperative to understand the process of this media manipulation. This chapter argues that researchers will have to adapt theories and methodologies to do so by linking the theoretical groundings of mass media influence to the concept of information warfare (Cronin, B., & Crawford, H., Information Society 15:257–263, 1999), where social media chaos can warp civic participation (Zelenkauskaite, A., Creating Chaos Online: Disinformation and Subverted Post-Publics. University of Michigan Press, 2022).
Chapter
Demographic and population modeling methods have been under investigation since the 1980s. Extrapolation, prediction, and theoretical computational analysis of exogenous variables are approaches to the forecasting of population processes. Such methods can be exploited to predict individual birth preferences or experts' views at the population level. Predicting demographic changes has been problematic, as its precision usually depends on the case or pattern; numerous methods have been explored, but so far there are no clear guidelines on which approach is appropriate. As in certain fields of industry and policy, planning is focused on projections for the future composition of the population, the potential creation of population sizes and institutions which are significant. In order to recognize potential social security issues as one determinant of overall macroeconomic growth, countries that have reduced mortality and low fertility, as is the case with some Asian nations, desperately require accurate demographic estimates. This introduction provides a stochastic cohort model that uses stochastic fertility, migration, and mortality modeling approaches to forecast the population by gender and literacy. This work focuses on the population and literacy ratio of India, as this nation holds the second largest population in the world. Our approach is based on an artificial neural network algorithm that can forecast the population literacy ratio and gender differences based on living states populations using social networks data. We concentrated primarily on quantifying future planning challenges, as previous research appeared to neglect potential risks. Our model is then used to forecast gender-wise population for each major state/city. The findings offer clear perspectives on the projected gender demographic composition, and our model yields high-precision results.
Conference Paper
Directed links in social media could represent anything from intimate friendships to common interests, or even a passion for breaking news or celebrity gossip. Such directed links determine the flow of information and hence indicate a user's influence on others, a concept that is crucial in sociology and viral marketing. In this paper, using a large amount of data collected from Twitter, we present an in-depth comparison of three measures of influence: indegree, retweets, and mentions. Based on these measures, we investigate the dynamics of user influence across topics and time. We make several interesting observations. First, popular users who have high indegree are not necessarily influential in terms of spawning retweets or mentions. Second, most influential users can hold significant influence over a variety of topics. Third, influence is not gained spontaneously or accidentally, but through concerted effort such as limiting tweets to a single topic. We believe that these findings provide new insights for viral marketing and suggest that topological measures such as indegree alone reveal very little about the influence of a user.
Conference Paper
We propose an approach to determine the ethnic breakdown of a population based solely on people's names and data provided by the U.S. Census Bureau. We demonstrate that our approach is able to predict the ethnicities of individuals as well as the ethnicity of an entire population better than natural alternatives. We apply our technique to the population of U.S. Facebook users and uncover the demographic characteristics of ethnicities and how they relate. We also discover that while Facebook has always been diverse, diversity has increased over time leading to a population that today looks very similar to the overall U.S. population. We also find that different ethnic groups relate to one another in an assortative manner, and that these groups have different profiles across demographics, beliefs, and usage of site features.
Article
In recent years, social media has become ubiquitous and important for social networking and content sharing. And yet, the content that is generated from these websites remains largely untapped. In this paper, we demonstrate how social media content can be used to predict real-world outcomes. In particular, we use the chatter from Twitter.com to forecast box-office revenues for movies. We show that a simple model built from the rate at which tweets are created about particular topics can outperform market-based predictors. We further demonstrate how sentiments extracted from Twitter can be further utilized to improve the forecasting power of social media.
Article
Map makers have for many years searched for a way to construct cartograms, maps in which the sizes of geographic regions such as countries or provinces appear in proportion to their population or some other analogous property. Such maps are invaluable for the representation of census results, election returns, disease incidence, and many other kinds of human data. Unfortunately, to scale regions and still have them fit together, one is normally forced to distort the regions' shapes, potentially resulting in maps that are difficult to read. Many methods for making cartograms have been proposed, some of them are extremely complex, but all suffer either from this lack of readability or from other pathologies, like overlapping regions or strong dependence on the choice of coordinate axes. Here, we present a technique based on ideas borrowed from elementary physics that suffers none of these drawbacks. Our method is conceptually simple and produces useful, elegant, and easily readable maps. We illustrate the method with applications to the results of the 2000 U.S. presidential election, lung cancer cases in the State of New York, and the geographical distribution of stories appearing in the news.
Article
Behavioral economics tells us that emotions can profoundly affect individual behavior and decision-making. Does this also apply to societies at large, i.e., can societies experience mood states that affect their collective decision making? By extension is the public mood correlated or even predictive of economic indicators? Here we investigate whether measurements of collective mood states derived from large-scale Twitter feeds are correlated to the value of the Dow Jones Industrial Average (DJIA) over time. We analyze the text content of daily Twitter feeds by two mood tracking tools, namely OpinionFinder that measures positive vs. negative mood and Google-Profile of Mood States (GPOMS) that measures mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy). We cross-validate the resulting mood time series by comparing their ability to detect the public's response to the presidential election and Thanksgiving day in 2008. A Granger causality analysis and a Self-Organizing Fuzzy Neural Network are then used to investigate the hypothesis that public mood states, as measured by the OpinionFinder and GPOMS mood time series, are predictive of changes in DJIA closing values. Our results indicate that the accuracy of DJIA predictions can be significantly improved by the inclusion of specific public mood dimensions but not others. We find an accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA and a reduction of the Mean Average Percentage Error by more than 6%.
Social Security Administration. 2010. Most popular baby names. http://www.ssa.gov/oact/babynames.
U.S. Census. 2000. Genealogy data: Frequently occurring surnames from census. http://www.census.gov/genealogy/www/data/2000surnames.
Corbett, P. 2010. Facebook demographics and statistics report 2010. http://www.istrategylabs.com/2010/01/facebook-demographics-and-statistics-report-2010-145-growth-in-1-year.