
Understanding the Demographics of Twitter Users

Alan Mislove, Sune Lehmann, Yong-Yeol Ahn, Jukka-Pekka Onnela, J. Niels Rosenquist
Northeastern University, Technical University of Denmark, Harvard Medical School
Abstract
Every second, the thoughts and feelings of millions of people
across the world are recorded in the form of 140-character
tweets using Twitter. However, despite the enormous poten-
tial presented by this remarkable data source, we still do not
have an understanding of the Twitter population itself: Who
are the Twitter users? How representative of the overall pop-
ulation are they? In this paper, we take the first steps towards
answering these questions by analyzing data on a set of Twit-
ter users representing over 1% of the U.S. population. We
develop techniques that allow us to compare the Twitter pop-
ulation to the U.S. population along three axes (geography,
gender, and race/ethnicity), and find that the Twitter popula-
tion is a highly non-uniform sample of the population.
Introduction
Online social networks are now a popular way for users to
connect, communicate, and share content; many serve as the
de-facto Internet portal for millions of users. Because of the
massive popularity of these sites, data about the users and
their communication offers unprecedented opportunities to
examine how human society functions at scale. However,
concerns over user privacy often force service providers to
keep such data private. Twitter represents an exception: Over
91% of Twitter users choose to make their profile and com-
munication history publicly visible, allowing researchers
access to the vast majority of the site. Twitter, therefore,
presents a unique opportunity to examine the public com-
munication of a large fraction of the population.
In fact, researchers have recently begun to use the con-
tent of Twitter messages to measure and predict real-world
phenomena, including movie box office returns (Asur and
Huberman 2010), elections (O’Connor et al. 2010), and the
stock market (Bollen, Mao, and Zeng 2010). While these
studies show remarkable promise, one heretofore unan-
swered question is: Are Twitter users a representative sam-
ple of society? If not, which demographics are over- or un-
derrepresented in the Twitter population? Because existing
studies generally treat Twitter as a “black box,” shedding
light on the characteristics of the Twitter population is likely
to lead to improvements in existing prediction and measurement
methods. Moreover, understanding the characteristics
of the Twitter population is crucial to move towards more
advanced observations and predictions, since such an understanding
will help us determine what predictions can be
made and what other data is necessary to correct for any biases.
Copyright © 2011, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
In this paper, we take a first look at the demographics of
the Twitter users, aiming to answer these questions. To do
so, we use a data set of 1,755,925,520 Twitter messages
sent by 54,981,152 users between March 2006 and August
2009 (Cha et al. 2010). We focus on users whose identified
location is within the United States, because the plurality of
users at the time of the data collection are in the U.S., and because
we have detailed demographic data for the U.S. population.
Even with the location constraint, our dataset covers
over three million users, representing more than 1% of the
entire U.S. population.
Ideally, when comparing the Twitter population to society
as a whole, we would like to compare properties including
socio-economic status, education level, and type of employ-
ment. However, we are restricted to only using the data that
is (optionally) self-reported and made visible by the Twitter
users, including their name, location, and the text of their
tweets. We develop techniques to examine the properties
of the Twitter population along three separate but interre-
lated axes, based on the feasibility of comparison. First, we
compare the geographic distribution of users to the popu-
lation as a whole using U.S. Census data. We demonstrate
that Twitter users are more likely to live within populous
counties than would be expected from the Census data, and
that sparsely populated regions of the U.S. are significantly
underrepresented. Second, we infer the gender of Twitter
users and demonstrate that a significant male bias exists,
although the bias is becoming less pronounced over time.
Third, we examine the race/ethnicity of Twitter users and
demonstrate that the distribution of race/ethnicity is highly
geographically-dependent.
Geographic distribution
Detecting location using self-reported data
To determine geographic information about users, we use
the self-reported location field in the user profile. The loca-
tion is an optional self-reported string; we found that 75.3%
of the publicly visible users listed a location. In order to turn
the user-provided string into a mappable location, we use
the Google Maps API. Beginning with the most popular location
strings (i.e., the strings provided by the most users),
we query Google Maps with each location string. If Google
Maps is able to interpret a string as a location, we receive a
latitude and longitude as a response. We restrict our scope to
users in the U.S. by only considering response latitudes and
longitudes that are within the U.S. In total, we find mappings
to a U.S. longitude and latitude for 246,015 unique
strings, covering 3,279,425 users (representing 8.8% of the
users who list a location).
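The procedure above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: `geocode` stands in for a call to a geocoding service (here a hypothetical dictionary lookup, since the exact Google Maps request is not given in the text), the lowercasing of strings is an added normalization step, and `US_BOX` is a crude bounding box for the contiguous U.S. rather than a real border test.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional, Tuple

LatLng = Tuple[float, float]

# Rough bounding box for the contiguous U.S. (an assumption for illustration;
# a real check would test against actual border polygons).
US_BOX = (24.5, 49.5, -125.0, -66.9)  # (min_lat, max_lat, min_lng, max_lng)


def in_us(point: LatLng) -> bool:
    lat, lng = point
    min_lat, max_lat, min_lng, max_lng = US_BOX
    return min_lat <= lat <= max_lat and min_lng <= lng <= max_lng


def map_locations(
    profile_locations: Iterable[str],
    geocode: Callable[[str], Optional[LatLng]],
) -> Tuple[Dict[str, LatLng], int]:
    """Geocode location strings, most popular first, keeping only U.S. results.

    Returns the string -> (lat, lng) mapping and the number of users covered.
    """
    # Count how many users supplied each (normalized, non-empty) string.
    counts = Counter(s.strip().lower() for s in profile_locations if s.strip())
    mapping: Dict[str, LatLng] = {}
    users_covered = 0
    for text, n_users in counts.most_common():
        point = geocode(text)  # None when the string is not a location
        if point is not None and in_us(point):
            mapping[text] = point
            users_covered += n_users
    return mapping, users_covered


# Hypothetical stand-in for the geocoding service.
FAKE_GEOCODER = {
    "boston, ma": (42.36, -71.06),
    "london": (51.51, -0.13),
}

mapping, covered = map_locations(
    ["Boston, MA", "boston, ma", "London", "in your dreams", ""],
    FAKE_GEOCODER.get,
)
```

In this toy run only the Boston string survives: the London result falls outside the box, and the free-text string is not interpreted as a location at all.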
To compare our Twitter data to the 2000 U.S. Census, it
is necessary to aggregate the users into U.S. counties. Using
data from the U.S. National Atlas and the U.S. Geological
Survey, we map each of the 246,015 latitudes and longitudes
into their respective U.S. county. Unless otherwise stated,
our analysis for the remainder of this paper is at the U.S.
county level.
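The text does not say how coordinates were assigned to counties. One common way to do it, assumed here purely for illustration, is a point-in-polygon test against county boundary polygons. The sketch below uses ray casting on two invented square "counties"; the identifiers and polygons are hypothetical, and real boundary data would need a proper GIS library and handling of multi-part shapes.

```python
def point_in_polygon(lat, lng, polygon):
    """Ray-casting test; polygon is a list of (lat, lng) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lng1 = polygon[i]
        lat2, lng2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (lng1 > lng) != (lng2 > lng):
            # Latitude at which the edge crosses that line.
            crossing_lat = lat1 + (lng - lng1) * (lat2 - lat1) / (lng2 - lng1)
            if lat < crossing_lat:
                inside = not inside
    return inside


def county_of(lat, lng, county_polygons):
    """Return the id of the first county whose polygon contains the point."""
    for county_id, polygon in county_polygons.items():
        if point_in_polygon(lat, lng, polygon):
            return county_id
    return None


# Two invented, adjacent square "counties" (not real boundary data).
county_polygons = {
    "county-1": [(0, 0), (0, 2), (2, 2), (2, 0)],
    "county-2": [(0, 2), (0, 4), (2, 4), (2, 2)],
}
```

For example, `county_of(1, 1, county_polygons)` falls in the first square and `county_of(1, 3, county_polygons)` in the second, while a point outside both yields `None`.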
Limitations We now briefly discuss potential limitations
of our location inference methodology. First, it is worth not-
ing that Google Maps will also interpret locations that are
at a granularity coarser than a U.S. county (e.g., “Texas”).
We manually removed these, including the mappings of all
50 states, as well as “United States” and “Earth.” Second,
users may lie about their location, or may list an out-of-date
location. Third, since the location is per-user (rather than
per-tweet), a user who moves from one city to another (and
updates his location) will have all of his tweets considered
as being from the latter location.
Geographic distribution of Twitter users
We begin by examining the geographic distribution of Twitter
users, and comparing it to the entire U.S. population.
Overall, the 3,279,425 Twitter users who we are able to geo-
locate represent 1.15% of the entire population (at the time
of the 2000 Census). However, if we examine the distribu-
tion of Twitter users per county, we observe a highly non-
uniform distribution.
Figure 1 presents this analysis, with the county population
along the x-axis and the fraction of this population we
observe in Twitter along the y-axis. We see that, as the population
of the county increases, the Twitter representation rate
(simply the number of Twitter users in that county divided
by the number of people in that county in the 2000 U.S.
Census) increases as well. For example, consider the median
per-county Twitter representation rate of 0.324%. We ob-
serve that 93.5% of the counties with over 100,000 residents
have a higher Twitter representation rate than the median,
compared to only 40.8% of the counties with fewer than
100,000 residents (were Twitter users a truly random pop-
ulation sample, we would expect these percentages to both
be 50%). Thus, the Twitter users significantly overrepresent
populous counties, a fact underscored by the difference be-
tween the median (0.324%) per-county Twitter representa-
tion rates and the overall population sample of 1.15%.
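As a worked illustration of the representation rate and the comparison against the median, the following sketch uses five invented counties (the numbers are not from the paper's data). It computes the per-county rate as Twitter users divided by Census population, then the share of large and small counties whose rate exceeds the median, with the 100,000-resident split taken from the text.

```python
from statistics import median

# county -> (census population, geolocated Twitter users); invented numbers.
counties = {
    "A": (1_200_000, 18_000),
    "B": (450_000, 4_000),
    "C": (80_000, 300),
    "D": (20_000, 40),
    "E": (5_000, 5),
}


def representation_rate(population, twitter_users):
    """Twitter users in a county divided by the county's census population."""
    return twitter_users / population


rates = {
    name: representation_rate(pop, users)
    for name, (pop, users) in counties.items()
}
median_rate = median(rates.values())


def share_above_median(names):
    """Fraction of the given counties whose rate exceeds the overall median."""
    names = list(names)
    return sum(rates[n] > median_rate for n in names) / len(names)


large = [n for n, (pop, _) in counties.items() if pop > 100_000]
small = [n for n, (pop, _) in counties.items() if pop <= 100_000]

# Overall sample rate: all Twitter users over the total population.
overall_rate = (
    sum(users for _, users in counties.values())
    / sum(pop for pop, _ in counties.values())
)
```

Because the populous counties contribute most of the users, `overall_rate` in this toy data sits well above `median_rate`, which mirrors the gap between 1.15% and 0.324% described above.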
The overrepresentation of populous counties in and of itself
may not come as a surprise, due to the patterns of social
media adoption across different regions. However, the
magnitude of the difference is striking: We observe an order
of magnitude difference in median per-county Twitter
representation rate between counties with 1,000 people and
counties with 1,000,000 people. This indicates a bias in the
Twitter population (relative to the U.S. population) and suggests
that entire regions of the U.S. may be significantly underrepresented.
[Figure 1 axes: County Population (x, 10^3 to 10^7) versus Twitter Representation Rate (y, 0.01% to 10.00%)]
Figure 1: Scatterplot of US county population versus Twitter
representation rate in that county. The dark line represents
the aggregated median, and the dashed black line represents
the overall median (0.324%). There is a clear overrepresentation
of more populous counties.
Distribution across counties We now examine which regions
of the U.S. contain these over- and underrepresented
counties. To do so, we plot a map of the U.S. based on the
Twitter representation rate, relative to the median rate of
0.324%. Figure 2 presents this data, using both a normal rep-
resentation and an area cartogram representation (Gastner
and Newman 2004). In this figure, the counties are colored
according to the level of over- or underrepresentation, with
blue colors representing underrepresentation and red colors
representing overrepresentation, relative to the median rate
of 0.324%. Thus, the same number of counties will be col-
ored red as blue.
These two maps lead to a number of interesting conclu-
sions: First, as evident in the normal representation, much of
the mid-west is significantly underrepresented in the Twit-
ter user base in this time period. Second, as evident in the
significantly red hue of the area cartogram, more populous
counties are consistently oversampled. However, the level of
oversampling does not appear to be dependent upon geogra-
phy: Both east coast and west coast cities are clearly visible
(e.g., San Francisco and Boston), as well as mid-west and
southern cities (e.g., Dallas, Chicago, and Atlanta).
Gender
Detecting gender using first names
As we have very limited information available on each user,
we rely on using the self-reported name available in each
user’s profile in order to detect gender. To do so, we first ob-
tain the most popular 1,000 male and female names for ba-
bies born in the U.S. for each year 1900–2009, as reported
by the U.S. Social Security Administration (Social Secu-
rity Administration 2010). We then aggregate the names to-
gether, calculating the total frequency of each of the result-
ing 3,034 male and 3,643 female names. As certain names
occurred in both lists, we remove the 241 names that were
less than 95% predictive (e.g., the name Avery was observed
to correspond to male babies only 56.8% of the time; it was
therefore removed). The result is a list of 5,836 names that
we use to infer gender.
[Figure 2 panels: (a) Normal representation, (b) Area cartogram representation]
Figure 2: Per-county over- and underrepresentation of U.S. population in Twitter, relative to the median per-county representation
rate of 0.324%, presented in both (a) a normal layout and (b) an area cartogram based on the 2000 Census population.
Blue colors indicate underrepresentation, while red colors represent overrepresentation. The intensity of the color corresponds
to the log of the over- or underrepresentation rate. Clear trends are visible, such as the underrepresentation of the mid-west and
overrepresentation of populous counties.
Limitations Clearly, this approach to detecting gender is
subject to a number of potential limitations. First, users may
misrepresent their name, leading to an incorrect gender in-
ference. Second, there may be differences in choosing to re-
veal one’s name between genders, leading us to believe that
fewer users of one gender are present. Third, the name lists
above may cover different fractions of the male and female
populations.
Gender of Twitter users
We first determine the number of the 3,279,425 U.S.-based
users who we could infer a gender for, based on their name
and the list previously described. We do so by comparing
the first word of their self-reported name to the gender list.
We observe that there exists a match for 64.2% of the users.
Moreover, we find a strong bias towards male users: Fully
71.8% of the users for whom we find a name match had a
male name.
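The name-based inference can be sketched as below. This is a minimal illustration, not the authors' implementation: the `(name, sex, count)` record layout and all counts are invented, names are lowercased as an added assumption, and only the 95% threshold and the "first word of the self-reported name" rule come from the text.

```python
from collections import defaultdict


def build_gender_list(records, threshold=0.95):
    """Aggregate (name, sex, count) records and keep only predictive names.

    sex is "M" or "F". A name is kept when its majority sex accounts for
    at least `threshold` of all babies given that name.
    """
    totals = defaultdict(lambda: {"M": 0, "F": 0})
    for name, sex, count in records:
        totals[name.lower()][sex] += count
    gender_of = {}
    for name, c in totals.items():
        total = c["M"] + c["F"]
        top = "M" if c["M"] >= c["F"] else "F"
        if c[top] / total >= threshold:
            gender_of[name] = top
    return gender_of


def infer_gender(display_name, gender_of):
    """Look up the first word of a self-reported name; None if no match."""
    words = display_name.split()
    if not words:
        return None
    return gender_of.get(words[0].lower())


# Invented counts; "Avery" is split roughly as in the example in the text.
records = [
    ("John", "M", 9900), ("John", "F", 100),
    ("Mary", "F", 5000),
    ("Avery", "M", 568), ("Avery", "F", 432),
]
gender_of = build_gender_list(records)

users = ["John Smith", "mary", "Avery Jones", "xX_gamer_Xx", ""]
inferred = [infer_gender(u, gender_of) for u in users]
matched = [g for g in inferred if g is not None]
match_rate = len(matched) / len(users)
male_share = matched.count("M") / len(matched)
```

Here "Avery" is dropped from the list because neither sex reaches the threshold, so users with that first name count as unmatched rather than being assigned a gender.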
[Figure 3 axes: Date (x, 2007-01 to 2009-07) versus Fraction of Joining Users who are Male (y, 0 to 1)]
Figure 3: Gender of joining users over time, binned into
groups of 10,000 joining users (note that the join rate increases
substantially). The bias towards male users is observed
to be decreasing over time.
To further explore this trend, we examine the historic gender
bias. To do so, we use the join date of each user (available
in the user’s profile). Figure 3 plots the average fraction
of joining users who are male over time. From this plot, it
is clear that while the male gender bias was significantly
stronger among the early Twitter adopters, the bias is
decreasing over time.
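A binning of this kind can be sketched as follows: sort users by join date, cut the sequence into fixed-size groups, and compute the male fraction in each group. The data is invented, and the sketch assumes that only users with an inferred gender are binned, which the text does not state explicitly.

```python
from datetime import date


def male_fraction_by_bin(users, bin_size=10_000):
    """users: iterable of (join_date, gender) with gender "M" or "F".

    Returns a list of (first join date in bin, fraction male), with bins
    formed from consecutive joiners. The final bin may be smaller.
    """
    ordered = sorted(users, key=lambda u: u[0])
    points = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        males = sum(1 for _, gender in chunk if gender == "M")
        points.append((chunk[0][0], males / len(chunk)))
    return points


# Invented sample, deliberately unsorted; bin_size=2 keeps the example small.
sample = [
    (date(2009, 3, 1), "F"),
    (date(2007, 1, 5), "M"),
    (date(2008, 7, 1), "F"),
    (date(2007, 3, 1), "M"),
    (date(2009, 2, 1), "F"),
    (date(2008, 6, 1), "M"),
]
trend = male_fraction_by_bin(sample, bin_size=2)
```

Binning by a fixed number of joiners rather than by calendar month keeps the statistical weight of each point constant even though the join rate grows over time.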
Race/ethnicity
Detecting race/ethnicity using last names
Again, since we have very limited information available
on each Twitter user, we resort to inferring race/ethnicity
using self-reported last name. We examine the last name
of users, and correlate the last name with data from the
U.S. 2000 Census (U.S. Census 2000). In more detail, for
each last name with over 100 individuals in the U.S. dur-
ing the 2000 Census, the Census releases the distribution of
race/ethnicity for that last name. For example, the last name
“Myers” was observed to correspond to Caucasians 86% of
the time, African-Americans 9.7%, Asians 0.4%, and His-
panics 1.4%.
Race/ethnicity distribution of Twitter users
We first determined the number of U.S.-based users for
whom we could infer the race/ethnicity by comparing the
last word of their self-reported name to the U.S. Census
last name list. We found a match for
71.8% of the users. We then determined the distribution of
race/ethnicity in each county by taking the race/ethnicity
distribution in the Census list, weighted by the frequency
of each name occurring in Twitter users in that county.1
Due to the large amount of ambiguity in the last name-to-race/ethnicity
list (in particular, the last name list is more
than 95% predictive for only 18.5% of the users), we are unable
to compare the Twitter race/ethnicity distribution
directly to the race/ethnicity distribution in the U.S. Census.
However, we are able to make relative comparisons between
Twitter users in different geographic regions, allowing us to
explore geographic trends in the race/ethnicity distribution.
Thus, we examine the per-county race/ethnicity distribution
of Twitter users.
1This is effectively the census.model approach discussed in
prior work (Chang et al. 2010).
[Figure 4 panels: (a) Caucasian (non-hispanic), (b) African-American, (c) Asian or Pacific Islander, (d) Hispanic; color scale from undersampling to oversampling]
Figure 4: Per-county area cartograms of Twitter over- and undersampling rates of Caucasian, African-American, Asian, and
Hispanic users, relative to the 2000 U.S. Census. Only counties with more than 500 Twitter users with inferred race/ethnicity
are shown. Blue regions correspond to undersampling; red regions to oversampling.
In order to account for the uneven distribution of
race/ethnicity across the U.S., we examine the per-county
race/ethnicity distribution relative to the distribution from
the overall U.S. Census. For example, if we observed that
25% of Twitter users in a county were predicted to be Hispanic,
and the 2000 U.S. Census counted 23% of people in that
county as being Hispanic, we would consider Twitter to be
oversampling the Hispanic users in that county. Figure 4
plots the per-county race/ethnicity distribution, relative to
the 2000 U.S. Census, for all counties in which we observed
more than 500 Twitter users with identifiable last names.
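The per-county estimate and the comparison against the Census can be sketched as below. This is a simplified illustration under stated assumptions: the "Myers" shares are the ones quoted in the text, the "garcia" row and the county Census shares are invented, and expressing over- or undersampling as a ratio of Twitter share to Census share is an assumed measure, since the text does not give the exact formula behind Figure 4.

```python
from collections import Counter

GROUPS = ("white", "black", "asian", "hispanic")

# surname -> share of each group among people with that surname.
# "myers" uses the figures quoted in the text; "garcia" is invented.
SURNAME_DIST = {
    "myers": {"white": 0.86, "black": 0.097, "asian": 0.004, "hispanic": 0.014},
    "garcia": {"white": 0.06, "black": 0.01, "asian": 0.01, "hispanic": 0.90},
}


def county_distribution(display_names, surname_dist=SURNAME_DIST, min_users=500):
    """Average the surname distributions of matched users in one county.

    Uses the last word of each self-reported name. Returns None unless more
    than `min_users` users have a surname on the list.
    """
    surnames = Counter()
    for name in display_names:
        words = name.split()
        if words and words[-1].lower() in surname_dist:
            surnames[words[-1].lower()] += 1
    matched = sum(surnames.values())
    if matched <= min_users:
        return None
    return {
        g: sum(n * surname_dist[s][g] for s, n in surnames.items()) / matched
        for g in GROUPS
    }


def sampling_ratio(twitter_share, census_share):
    """Ratio above 1 suggests oversampling, below 1 undersampling."""
    return {g: twitter_share[g] / census_share[g] for g in GROUPS}


names = ["Ann Myers", "Bob Myers", "Carla Garcia", "j0hn"]
twitter = county_distribution(names, min_users=0)  # threshold lowered for the toy data
census = {"white": 0.60, "black": 0.10, "asian": 0.05, "hispanic": 0.23}  # invented
ratios = sampling_ratio(twitter, census)
```

Each user contributes a full probability distribution rather than a single label, so ambiguous surnames are spread across groups instead of being forced into one.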
A number of geographic trends are visible, such as the undersampling
of Hispanic users in the southwest; the undersampling
of African-American users in the south and mid-west;
and the oversampling of Caucasian users in many major
cities.
Related work
A few other studies have examined the demographics of social
network users. For example, recent studies have examined
the ethnicity of Facebook users (Chang et al. 2010),
general demographics of Facebook users (Corbett 2010),
and differences in online behavior on Facebook and MyS-
pace by gender (Strayhorn 2009). However, studies of gen-
eral social networking sites are able to leverage the broad
nature of the profiles available; in contrast, on Twitter, users
self-report only a minimal set of information, making calcu-
lating demographics significantly more difficult.
Conclusion
Twitter has received significant research interest lately as a
means for understanding, monitoring, and even predicting
real-world phenomena. However, most existing work does
not address the sampling bias, simply applying machine
learning and data mining algorithms without an understand-
ing of the Twitter user population. In this paper, we took
a first look at the user population itself, and examined
the population along the axes of geography, gender, and
race/ethnicity. Overall, we found that Twitter users significantly
overrepresent the densely populated regions of the
U.S., are predominantly male, and represent a highly non-
random sample of the overall race/ethnicity distribution.
Going forward, our study sets the foundation for future
work upon Twitter data. Existing approaches could imme-
diately use our analysis to improve predictions or measure-
ments. By enabling post-hoc corrections, our work is a first
step towards turning Twitter into a tool that can make infer-
ences about the population as a whole. More nuanced anal-
yses on the biases in the Twitter population will enhance
the ability for Twitter to be used as a sophisticated inference
tool.
Acknowledgements
We thank Fabricio Benevenuto and Meeyoung Cha for their
assistance in gathering the Twitter data used in this study.
We also thank Jim Bagrow for valuable discussions and his
collection of geographic data from Google Maps. This re-
search was supported in part by NSF grant IIS-0964465 and
an Amazon Web Services in Education Grant.
References
Asur, S., and Huberman, B. 2010. Predicting the future with social
media. http://arxiv.org/abs/1003.5699.
Bollen, J.; Mao, H.; and Zeng, X.-J. 2010. Twitter mood predicts
the stock market. In ICWSM.
Cha, M.; Haddadi, H.; Benevenuto, F.; and Gummadi, K. 2010.
Measuring user influence in twitter: The million follower fallacy.
In ICWSM.
Chang, J.; Rosenn, I.; Backstrom, L.; and Marlow, C. 2010.
epluribus: Ethnicity on social networks. In ICWSM.
Corbett, P. 2010. Facebook demographics and statistics report 2010.
http://www.istrategylabs.com/2010/01/facebook-demographics-and-statistics-report-2010-145-growth-in-1-year.
Gastner, M. T., and Newman, M. E. J. 2004. Diffusion-based
method for producing density-equalizing maps. PNAS 101.
O’Connor, B.; Balasubramanyan, R.; Routledge, B.; and Smith, N.
2010. From tweets to polls: Linking text sentiment to public opin-
ion time series. In ICWSM.
Social Security Administration. 2010. Most popular baby names.
http://www.ssa.gov/oact/babynames.
Strayhorn, T. 2009. Sex differences in use of facebook and mys-
pace among first-year college students. Stud. Affairs 10(2).
U.S. Census. 2000. Genealogy data: Frequently occurring
surnames from census. http://www.census.gov/genealogy/www/data/2000surnames.
... Though the lack of representativeness of Twitter data is widely acknowledged in studies using this data, there are few studies that have systematically studied bias related to this data. One important study in this regard is [29] who study differences in the distribution between offline census data and Twitter users for gender, geography and race. They observed a male-dominated Twitter population, concentrated in urban areas, with geographic patterns varying as to which race is over-or underrepresented. ...
... Gender is usually considered one of the easier demographic dimensions to infer. The most basic approach is to use a gender-based dictionary, often based on census data [23,29] and there are web services that can be used for this purpose 1 . Tweet content has also been used, in particular for non-English languages where the form of adjectives can often reveal the gender of the speaker [13]. ...
... There are many other potential variables that could be inferred, but which we did not use for this study. These include political orientation [13,35,36], religious affiliation [32,11], or ethnicity and race [29,35,36]. ...
Preprint
Given the ever increasing amount of publicly available social media data, there is growing interest in using online data to study and quantify phenomena in the offline "real" world. As social media data can be obtained in near real-time and at low cost, it is often used for "now-casting" indices such as levels of flu activity or unemployment. The term "social sensing" is often used in this context to describe the idea that users act as "sensors", publicly reporting their health status or job losses. Sensor activity during a time period is then typically aggregated in a "one tweet, one vote" fashion by simply counting. At the same time, researchers readily admit that social media users are not a perfect representation of the actual population. Additionally, users differ in the amount of details of their personal lives that they reveal. Intuitively, it should be possible to improve now-casting by assigning different weights to different user groups. In this paper, we ask "How does social sensing actually work?" or, more precisely, "Whom should we sense--and whom not--for optimal results?". We investigate how different sampling strategies affect the performance of now-casting of two common offline indices: flu activity and unemployment rate. We show that now-casting can be improved by 1) applying user filtering techniques and 2) selecting users with complete profiles. We also find that, using the right type of user groups, now-casting performance does not degrade, even when drastically reducing the size of the dataset. More fundamentally, we describe which type of users contribute most to the accuracy by asking if "babblers are better". We conclude the paper by providing guidance on how to select better user groups for more accurate now-casting.
... Wake and Vredenburg [4] visualized global amphibian species diversity using the method. Other applications include the visualization of the democracies and autocracies of different countries [5], the race/ethnicity distribution of Twitter users in the United States [6], the rate of obesity for individuals in Canada [7], and the world citation network [8]. ...
... The connectivity of r 0 is kept unchanged; 4 Apply the reflection map g(z) = 1 z to the triangulated unit disk D T ; 5 Glue D T and g(D T ). Update r 0 by the glued result; 6 Remove all vertices and faces of r 0 outside {z : |z| > 5}; 7 Rescale r 0 to restore the size of the flattening map; ...
Preprint
In this paper, we are concerned with the problem of creating flattening maps of simply-connected open surfaces in R3\mathbb{R}^3. Using a natural principle of density diffusion in physics, we propose an effective algorithm for computing density-equalizing flattening maps with any prescribed density distribution. By varying the initial density distribution, a large variety of mappings with different properties can be achieved. For instance, area-preserving parameterizations of simply-connected open surfaces can be easily computed. Experimental results are presented to demonstrate the effectiveness of our proposed method. Applications to data visualization and surface remeshing are explored.
... One of the first efforts to extract and analyze demographic information presents a comparative study between the demographic distribution of gender/race of Twitter users and U.S. population [23]. After that, several efforts have arisen that investigate demographic information, in various social media, using different strategies for distinct purposes [4,5,19,29,18]. ...
... The field of demographic status is not mandatory when a user registers in Twitter and, thus, the direct retrieval of gender, race, or even age is not feasible. There are several studies related to demographic information in Twitter that attempt to infer the user's gender from the user name [4,19,21,23]. Also, some works use pattern based methodology to identify age [27] in Twitter profile description using regular expressions '25 yr old' or 'born in 1990'. ...
Preprint
The massive popularity of online social media provides a unique opportunity for researchers to study the linguistic characteristics and patterns of user's interactions. In this paper, we provide an in-depth characterization of language usage across demographic groups in Twitter. In particular, we extract the gender and race of Twitter users located in the U.S. using advanced image processing algorithms from Face++. Then, we investigate how demographic groups (i.e. male/female, Asian/Black/White) differ in terms of linguistic styles and also their interests. We extract linguistic features from 6 categories (affective attributes, cognitive attributes, lexical density and awareness, temporal references, social and personal concerns, and interpersonal focus), in order to identify the similarities and differences in particular writing set of attributes. In addition, we extract the absolute ranking difference of top phrases between demographic groups. As a dimension of diversity, we also use the topics of interest that we retrieve from each user. Our analysis unveils clear differences in the writing styles (and the topics of interest) of different demographic groups, with variation seen across both gender and race lines. We hope our effort can stimulate the development of new studies related to demographic information in the online space.
... Having Twitter as a new kind of data source, researchers have looked into the development of tools for real-time trend analytics [32], [56] or early detection of newsworthy events [51], as well as into analytical approaches for understanding the sentiment expressed by users towards a target [24], [26], [52], or public opinion on a specific topic [5]. However, Twitter data lacks reliable demographic details that would enable a representative sample of users to be collected and/or a focus on a specific user subgroup [36], or other specific applications such as helping establish the trustworthiness of information posted [34]. Automated inference of social media demographics would be useful, among others, to broaden demographically aware social media analyses that are conducted through surveys [16]. ...
... A growing body of research deals with the automated inference of demographic details of Twitter users [36]. Re-1. ...
Preprint
In contrast to much previous work that has focused on location classification of tweets restricted to a specific country, here we undertake the task in a broader context by classifying global tweets at the country level, which is so far unexplored in a real-time scenario. We analyse the extent to which a tweet's country of origin can be determined by making use of eight tweet-inherent features for classification. Furthermore, we use two datasets, collected a year apart from each other, to analyse the extent to which a model trained from historical tweets can still be leveraged for classification of new tweets. With classification experiments on all 217 countries in our datasets, as well as on the top 25 countries, we offer some insights into the best use of tweet-inherent features for an accurate country-level classification of tweets. We find that the use of a single feature, such as the use of tweet content alone -- the most widely used feature in previous work -- leaves much to be desired. Choosing an appropriate combination of both tweet content and metadata can actually lead to substantial improvements of between 20\% and 50\%. We observe that tweet content, the user's self-reported location and the user's real name, all of which are inherent in a tweet and available in a real-time scenario, are particularly useful to determine the country of origin. We also experiment on the applicability of a model trained on historical tweets to classify new tweets, finding that the choice of a particular combination of features whose utility does not fade over time can actually lead to comparable performance, avoiding the need to retrain. However, the difficulty of achieving accurate classification increases slightly for countries with multiple commonalities, especially for English and Spanish speaking countries.
... As a whole, mobile clients for microblogging platforms, social networking tools, and other "proxy" data of human activity collected in the web allow for the quantitative analysis of social systems at a scale that would have been unimaginable just a few years ago [3][4][5][6]. In particular, the possibility of using mobileenabled microblogging platforms, such as Twitter, as monitors of public opinion, social movements and as tools for the mapping of social communities has generated much interest in the literature [7][8][9][10][11][12][13][14]. At the same time it is crucial to understand to which extent the picture of socio-technical systems emerging from digital data proxies is a statistically sound and how well it does scale to a planetary dimension [15]. ...
... Our analysis is restricted to GPStagged tweets in order to preserve maximum level of geographical detail, taking into account both live GPS updates and device stored locations. The amount of geolocalized signal could in fact be increased by considering different kinds of metadata, like for example self reported locations [13], but these procedures would not allow us to reach the level of granularity and detail we aim to. Further details about the data collection and analysis procedures, as well as on the (live) GPS metadata, can be found in the Methods section. ...
Preprint
Large scale analysis and statistics of socio-technical systems that just a few short years ago would have required the use of consistent economic and human resources can nowadays be conveniently performed by mining the enormous amount of digital data produced by human activities. Although a characterization of several aspects of our societies is emerging from the data revolution, a number of questions concerning the reliability and the biases inherent to the big data "proxies" of social life are still open. Here, we survey worldwide linguistic indicators and trends through the analysis of a large-scale dataset of microblogging posts. We show that available data allow for the study of language geography at scales ranging from country-level aggregation to specific city neighborhoods. The high resolution and coverage of the data allows us to investigate different indicators such as the linguistic homogeneity of different countries, the touristic seasonal patterns within countries and the geographical distribution of different languages in multilingual regions. This work highlights the potential of geolocalized studies of open data sources to improve current analysis and develop indicators for major social phenomena in specific communities.
... Semertzidis et al. [1] analyzed the profile information in order to understand what Twitter users choose to expose about themselves in their profile information. Mislove et al. [2] used the profile information in order to compare the Twitter population to the U.S. population along three axes (geography, gender, and race/ethnicity). Alowibdi et al. [3] and Vicente et al. [4] used the profile information to conduct a gender classification on Twitter. ...
Preprint
Profile information such as name, description, and location helps us get to know a user on social media. However, this profile information is not always fixed: when there is a change in the user's life, the profile information changes as well. In this study, we focus on changes in users' profile information and analyze the timing of and reasons for these changes on Twitter. The results indicate that the peak of profile information change occurs in April among Japanese users, but no such trend was observed for English users throughout the year. Our analysis also shows that English users most frequently change their names on their birthdays, while Japanese users change their names as their Twitter engagement and activities decrease over time.
... Table 1 shows several user-level properties, including: pre-treatment survey responses, Twitter personas (average URL alignment, connection diversity, verified status, number of followers/followees), and self-reported demographics (gender, age, and political ideology). Respondents appear to heavily skew male, liberal, and between the ages of 25-44, likely influenced by the fact that large portions of Americans are under-represented on Twitter [19]. To test for the robustness of our randomization, we use a logistic regression to check whether there are any significant differences in the distribution of these properties across treatments. ...
Preprint
Homophily -- our tendency to surround ourselves with others who share our perspectives and opinions about the world -- is both a part of human nature and an organizing principle underpinning many of our digital social networks. However, when it comes to politics or culture, homophily can amplify tribal mindsets and produce "echo chambers" that degrade the quality, safety, and diversity of discourse online. While several studies have empirically proven this point, few have explored how making users aware of the extent and nature of their political echo chambers influences their subsequent beliefs and actions. In this paper, we introduce Social Mirror, a social network visualization tool that enables a sample of Twitter users to explore the politically-active parts of their social network. We use Social Mirror to recruit Twitter users with a prior history of political discourse to a randomized experiment where we evaluate the effects of different treatments on participants' i) beliefs about their network connections, ii) the political diversity of who they choose to follow, and iii) the political alignment of the URLs they choose to share. While we see no effects on average political alignment of shared URLs, we find that recommending accounts of the opposite political ideology to follow reduces participants' beliefs in the political homogeneity of their network connections but still enhances their connection diversity one week after treatment. Conversely, participants who enhance their belief in the political homogeneity of their Twitter connections have less diverse network connections 2-3 weeks after treatment. We explore the implications of these disconnects between beliefs and actions on future efforts to promote healthier exchanges in our digital public spheres.
... The demographics of Twitter users have been characterized by Mislove et al. (2011), who infer the geography, gender, and race of users based on self-reported locations and the names of the users. They find large deviations from the demographic distribution of the overall population. ...
Preprint
Understanding the demographics of app users is crucial, for example, for app developers, who wish to target their advertisements more effectively. Our work addresses this need by studying the predictability of user demographics based on the list of a user's apps which is readily available to many app developers. We extend previous work on the problem on three frontiers: (1) We predict new demographics (age, race, and income) and analyze the most informative apps for four demographic attributes included in our analysis. The most predictable attribute is gender (82.3 % accuracy), whereas the hardest to predict is income (60.3 % accuracy). (2) We compare several dimensionality reduction methods for high-dimensional app data, finding out that an unsupervised method yields superior results compared to aggregating the apps at the app category level, but the best results are obtained simply by the raw list of apps. (3) We look into the effect of the training set size and the number of apps on the predictability and show that both of these factors have a large impact on the prediction accuracy. The predictability increases, or in other words, a user's privacy decreases, the more apps the user has used, but somewhat surprisingly, after 100 apps, the prediction accuracy starts to decrease.
... We manually crafted a set of 52 features belonging to three different classes: user metadata, timing features, and network statistics, as detailed below. User metadata and activity features: User metadata have been proved pivotal to model classes of users in social media [12], [13]. We build user-based features leveraging the metadata provided by the Twitter API related to the author of each tweet, as well as the source of each retweet. ...
Preprint
We present a machine learning framework that leverages a mixture of metadata, network, and temporal features to detect extremist users, and predict content adopters and interaction reciprocity in social media. We exploit a unique dataset containing millions of tweets generated by more than 25 thousand users who have been manually identified, reported, and suspended by Twitter due to their involvement with extremist campaigns. We also leverage millions of tweets generated by a random sample of 25 thousand regular users who were exposed to, or consumed, extremist content. We carry out three forecasting tasks, (i) to detect extremist users, (ii) to estimate whether regular users will adopt extremist content, and finally (iii) to predict whether users will reciprocate contacts initiated by extremists. All forecasting tasks are set up in two scenarios: a post hoc (time independent) prediction task on aggregated data, and a simulated real-time prediction task. The performance of our framework is extremely promising, yielding in the different forecasting scenarios up to 93% AUC for extremist user detection, up to 80% AUC for content adoption prediction, and finally up to 72% AUC for interaction reciprocity forecasting. We conclude by providing a thorough feature analysis that helps determine which are the emerging signals that provide predictive power in different scenarios.
Preprint
In the face of rapid population growth, urbanisation, and accelerating climate change, the need for rapid and accurate disaster detection has become critical to minimising human and material losses. In this context, geo-social media data has proven to be a sensible data source for tracing disaster-related conversations, especially during flood events. However, current research often neglects the relationship between information from social media posts and their corresponding geographical context. In this paper, we examine the emergence of disaster-related social media topics in relation with hydrological and socio-environmental features on watershed level during the 2021 Western European flood, while focusing on transboundary river basins. Building upon an advanced machine learning-based topic modelling approach, we show the emergence of flood-related geo-social media topics both in river-basin specific and cross-basin contexts. Our analysis reveals distinct spatio-temporal dynamics in the public discourse, showing that timely topics describing heavy rains or flood damages were closely tied to immediate environmental conditions in upstream areas, while post-disaster topics about helping victims or volunteering were more prevalent in less affected areas located in both upstream and downstream areas. These findings highlight how social media responses to disasters differ spatially across watersheds and underscore the importance of integrating geo-social media analysis into disaster coordination efforts, opening new opportunities for transboundary collaborations and the coordination of emergency response along border-crossing rivers.
Conference Paper
Directed links in social media could represent anything from intimate friendships to common interests, or even a passion for breaking news or celebrity gossip. Such directed links determine the flow of information and hence indicate a user's influence on others, a concept that is crucial in sociology and viral marketing. In this paper, using a large amount of data collected from Twitter, we present an in-depth comparison of three measures of influence: indegree, retweets, and mentions. Based on these measures, we investigate the dynamics of user influence across topics and time. We make several interesting observations. First, popular users who have high indegree are not necessarily influential in terms of spawning retweets or mentions. Second, most influential users can hold significant influence over a variety of topics. Third, influence is not gained spontaneously or accidentally, but through concerted effort such as limiting tweets to a single topic. We believe that these findings provide new insights for viral marketing and suggest that topological measures such as indegree alone reveal very little about the influence of a user.
Conference Paper
We propose an approach to determine the ethnic breakdown of a population based solely on people's names and data provided by the U.S. Census Bureau. We demonstrate that our approach is able to predict the ethnicities of individuals as well as the ethnicity of an entire population better than natural alternatives. We apply our technique to the population of U.S. Facebook users and uncover the demographic characteristics of ethnicities and how they relate. We also discover that while Facebook has always been diverse, diversity has increased over time leading to a population that today looks very similar to the overall U.S. population. We also find that different ethnic groups relate to one another in an assortative manner, and that these groups have different profiles across demographics, beliefs, and usage of site features.
Article
In recent years, social media has become ubiquitous and important for social networking and content sharing. And yet, the content that is generated from these websites remains largely untapped. In this paper, we demonstrate how social media content can be used to predict real-world outcomes. In particular, we use the chatter from Twitter.com to forecast box-office revenues for movies. We show that a simple model built from the rate at which tweets are created about particular topics can outperform market-based predictors. We further demonstrate how sentiments extracted from Twitter can be further utilized to improve the forecasting power of social media.
Article
Map makers have for many years searched for a way to construct cartograms, maps in which the sizes of geographic regions such as countries or provinces appear in proportion to their population or some other analogous property. Such maps are invaluable for the representation of census results, election returns, disease incidence, and many other kinds of human data. Unfortunately, to scale regions and still have them fit together, one is normally forced to distort the regions' shapes, potentially resulting in maps that are difficult to read. Many methods for making cartograms have been proposed, some of them are extremely complex, but all suffer either from this lack of readability or from other pathologies, like overlapping regions or strong dependence on the choice of coordinate axes. Here, we present a technique based on ideas borrowed from elementary physics that suffers none of these drawbacks. Our method is conceptually simple and produces useful, elegant, and easily readable maps. We illustrate the method with applications to the results of the 2000 U.S. presidential election, lung cancer cases in the State of New York, and the geographical distribution of stories appearing in the news.
Article
Behavioral economics tells us that emotions can profoundly affect individual behavior and decision-making. Does this also apply to societies at large, i.e., can societies experience mood states that affect their collective decision making? By extension is the public mood correlated or even predictive of economic indicators? Here we investigate whether measurements of collective mood states derived from large-scale Twitter feeds are correlated to the value of the Dow Jones Industrial Average (DJIA) over time. We analyze the text content of daily Twitter feeds by two mood tracking tools, namely OpinionFinder that measures positive vs. negative mood and Google-Profile of Mood States (GPOMS) that measures mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy). We cross-validate the resulting mood time series by comparing their ability to detect the public's response to the presidential election and Thanksgiving day in 2008. A Granger causality analysis and a Self-Organizing Fuzzy Neural Network are then used to investigate the hypothesis that public mood states, as measured by the OpinionFinder and GPOMS mood time series, are predictive of changes in DJIA closing values. Our results indicate that the accuracy of DJIA predictions can be significantly improved by the inclusion of specific public mood dimensions but not others. We find an accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA and a reduction of the Mean Average Percentage Error by more than 6%.
Social Security Administration. 2010. Most popular baby names. http://www.ssa.gov/oact/babynames.
U.S. Census. 2000. Genealogy data: Frequently occurring surnames from census. http://www.census.gov/genealogy/www/data/2000surnames.
Corbett, P. 2010. Facebook demographics and statistics report 2010. http://www.istrategylabs.com/2010/01/facebook-demographics-and-statistics-report-2010-145-growth-in-1-year.