Conference Paper

Understanding the Demographics of Twitter Users

Alan Mislove, Sune Lehmann, Yong-Yeol Ahn, Jukka-Pekka Onnela, J. Niels Rosenquist
Northeastern University, Technical University of Denmark, Harvard Medical School
Abstract
Every second, the thoughts and feelings of millions of people
across the world are recorded in the form of 140-character
tweets using Twitter. However, despite the enormous poten-
tial presented by this remarkable data source, we still do not
have an understanding of the Twitter population itself: Who
are the Twitter users? How representative of the overall pop-
ulation are they? In this paper, we take the first steps towards
answering these questions by analyzing data on a set of Twit-
ter users representing over 1% of the U.S. population. We
develop techniques that allow us to compare the Twitter pop-
ulation to the U.S. population along three axes (geography,
gender, and race/ethnicity), and find that the Twitter popula-
tion is a highly non-uniform sample of the population.
Introduction
Online social networks are now a popular way for users to
connect, communicate, and share content; many serve as the
de-facto Internet portal for millions of users. Because of the
massive popularity of these sites, data about the users and
their communication offers unprecedented opportunities to
examine how human society functions at scale. However,
concerns over user privacy often force service providers to
keep such data private. Twitter represents an exception: Over
91% of Twitter users choose to make their profile and com-
munication history publicly visible, allowing researchers
access to the vast majority of the site. Twitter, therefore,
presents a unique opportunity to examine the public com-
munication of a large fraction of the population.
In fact, researchers have recently begun to use the con-
tent of Twitter messages to measure and predict real-world
phenomena, including movie box office returns (Asur and
Huberman 2010), elections (O’Connor et al. 2010), and the
stock market (Bollen, Mao, and Zeng 2010). While these
studies show remarkable promise, one heretofore unan-
swered question is: Are Twitter users a representative sam-
ple of society? If not, which demographics are over- or un-
derrepresented in the Twitter population? Because existing
studies generally treat Twitter as a “black box,” shedding
light on the characteristics of the Twitter population is likely
to lead to improvements in existing prediction and
measurement methods. Moreover, understanding the characteristics
of the Twitter population is crucial to move towards more
advanced observations and predictions, since such an
understanding will help us determine what predictions can be
made and what other data is necessary to correct for any
biases.
Copyright © 2011, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
In this paper, we take a first look at the demographics of
Twitter users, aiming to answer these questions. To do
so, we use a data set of 1,755,925,520 Twitter messages
sent by 54,981,152 users between March 2006 and August
2009 (Cha et al. 2010). We focus on users whose identified
location is within the United States, because the plurality of
users at the time of the data collection were in the U.S., and
because we have detailed demographic data for the U.S.
population. Even with the location constraint, our dataset covers
over three million users, representing more than 1% of the
entire U.S. population.
Ideally, when comparing the Twitter population to society
as a whole, we would like to compare properties including
socio-economic status, education level, and type of employ-
ment. However, we are restricted to only using the data that
is (optionally) self-reported and made visible by the Twitter
users, including their name, location, and the text of their
tweets. We develop techniques to examine the properties
of the Twitter population along three separate but interre-
lated axes, based on the feasibility of comparison. First, we
compare the geographic distribution of users to the popu-
lation as a whole using U.S. Census data. We demonstrate
that Twitter users are more likely to live within populous
counties than would be expected from the Census data, and
that sparsely populated regions of the U.S. are significantly
underrepresented. Second, we infer the gender of Twitter
users and demonstrate that a significant male bias exists,
although the bias is becoming less pronounced over time.
Third, we examine the race/ethnicity of Twitter users and
demonstrate that the distribution of race/ethnicity is highly
geographically-dependent.
Geographic distribution
Detecting location using self-reported data
To determine geographic information about users, we use
the self-reported location field in the user profile. The loca-
tion is an optional self-reported string; we found that 75.3%
of the publicly visible users listed a location. In order to turn
the user-provided string into a mappable location, we use
the Google Maps API. Beginning with the most popular
location strings (i.e., the strings provided by the most users),
we query Google Maps with each location string. If Google
Maps is able to interpret a string as a location, we receive a
latitude and longitude as a response. We restrict our scope to
users in the U.S. by only considering response latitudes and
longitudes that are within the U.S. In total, we find
mappings to a U.S. longitude and latitude for 246,015 unique
strings, covering 3,279,425 users (representing 8.8% of the
users who list a location).
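The lookup loop described above can be sketched as follows. This is only an illustration, not the authors' code: the `geocode` callable stands in for whatever geocoding service is used (it is a hypothetical parameter, not a Google Maps API call), and the bounding box is a crude assumption covering only the contiguous U.S.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

LatLon = tuple[float, float]

# Assumption: a rough bounding box for the contiguous U.S. A real pipeline
# would test against actual border polygons and include Alaska and Hawaii.
US_BOX = (24.5, 49.5, -125.0, -66.9)  # south, north, west, east


def in_us(point: LatLon) -> bool:
    """Return True if (lat, lon) falls inside the assumed U.S. bounding box."""
    lat, lon = point
    south, north, west, east = US_BOX
    return south <= lat <= north and west <= lon <= east


def geolocate_users(
    locations: Iterable[str],
    geocode: Callable[[str], Optional[LatLon]],
) -> tuple[dict[str, LatLon], int]:
    """Geocode location strings, most popular first.

    Returns the mapping of unique strings to U.S. coordinates and the
    number of users covered by those strings.
    """
    counts = Counter(s.strip() for s in locations if s and s.strip())
    mapped: dict[str, LatLon] = {}
    covered = 0
    for text, n_users in counts.most_common():
        point = geocode(text)  # hypothetical geocoder; None if not a location
        if point is not None and in_us(point):
            mapped[text] = point
            covered += n_users
    return mapped, covered
```

Processing strings in order of popularity means that, if the lookup budget runs out, the strings already resolved cover as many users as possible.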
To compare our Twitter data to the 2000 U.S. Census, it
is necessary to aggregate the users into U.S. counties. Using
data from the U.S. National Atlas and the U.S. Geological
Survey, we map each of the 246,015 latitudes and longitudes
into their respective U.S. county. Unless otherwise stated,
our analysis for the remainder of this paper is at the U.S.
county level.
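One common way to assign a coordinate to a county is a point-in-polygon test against county boundaries. The sketch below uses ray casting and assumes each county is a single simple polygon given as (longitude, latitude) vertices; the `counties` dictionary and its contents are hypothetical, and the paper does not state which method was actually used.

```python
from typing import Optional, Sequence

Point = tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray casting: count the polygon edges crossed by a ray heading east."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge straddles the horizontal line through the point.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def county_of(point: Point, counties: dict[str, Sequence[Point]]) -> Optional[str]:
    """Return the first county whose boundary contains the point, if any."""
    for name, boundary in counties.items():
        if point_in_polygon(point, boundary):
            return name
    return None
```

A linear scan over all counties is fine for a sketch; with 246,015 coordinates a spatial index would likely be preferable.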
Limitations We now briefly discuss potential limitations
of our location inference methodology. First, it is worth not-
ing that Google Maps will also interpret locations that are
at a granularity coarser than a U.S. county (e.g., “Texas”).
We manually removed these, including the mappings of all
50 states, as well as “United States” and “Earth.” Second,
users may lie about their location, or may list an out-of-date
location. Third, since the location is per-user (rather than
per-tweet), a user who moves from one city to another (and
updates his location) will have all of his tweets considered
as being from the latter location.
Geographic distribution of Twitter users
We begin by examining the geographic distribution of Twitter
users, and comparing it to the entire U.S. population.
Overall, the 3,279,425 Twitter users whom we are able to
geolocate represent 1.15% of the entire population (at the time
of the 2000 Census). However, if we examine the distribu-
tion of Twitter users per county, we observe a highly non-
uniform distribution.
Figure 1 presents this analysis, with the county population
along the x-axis and the fraction of this population we
observe in Twitter along the y-axis. We see that, as the
population of the county increases, the Twitter representation rate
(simply the number of Twitter users in that county divided
by the number of people in that county in the 2000 U.S.
Census) increases as well. For example, consider the median
per-county Twitter representation rate of 0.324%. We
observe that 93.5% of the counties with over 100,000 residents
have a higher Twitter representation rate than the median,
compared to only 40.8% of the counties with fewer than
100,000 residents (were Twitter users a truly random
population sample, we would expect these percentages to both
be 50%). Thus, Twitter users significantly overrepresent
populous counties, a fact underscored by the difference
between the median per-county Twitter representation rate
(0.324%) and the overall population sample of 1.15%.
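The representation rate and the above-median comparison can be written down in a few lines. This is a minimal sketch with hypothetical inputs (per-county user counts and Census populations keyed by a county identifier); how a county of exactly 100,000 residents is grouped is an assumption here, since the text only speaks of "over" and "fewer than".

```python
from statistics import median


def representation_rates(
    twitter_users: dict[str, int], population: dict[str, int]
) -> dict[str, float]:
    """Twitter users in a county divided by that county's Census population."""
    return {
        county: twitter_users.get(county, 0) / pop
        for county, pop in population.items()
        if pop > 0
    }


def share_above_median(
    rates: dict[str, float], population: dict[str, int], threshold: int = 100_000
) -> tuple[float, float, float]:
    """Return (median rate, share of large counties above it, share of small ones)."""
    overall = median(rates.values())
    large = [c for c in rates if population[c] > threshold]
    small = [c for c in rates if population[c] <= threshold]  # assumption: ties go here

    def share(group: list[str]) -> float:
        if not group:
            return float("nan")
        return sum(rates[c] > overall for c in group) / len(group)

    return overall, share(large), share(small)
```

If users were a uniform random sample of the population, both shares would be expected to sit near 0.5; the gap between them is the signal reported above.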
The overrepresentation of populous counties in and of itself
may not come as a surprise, due to the patterns of social
media adoption across different regions. However, the
magnitude of the difference is striking: We observe an
order of magnitude difference in median per-county Twitter
representation rate between counties with 1,000 people and
counties with 1,000,000 people. This indicates a bias in the
Twitter population (relative to the U.S. population) and
suggests that entire regions of the U.S. may be significantly
underrepresented.
Figure 1: Scatterplot of US county population versus Twitter
representation rate in that county. The dark line represents
the aggregated median, and the dashed black line represents
the overall median (0.324%). There is a clear
overrepresentation of more populous counties. Axes: County
Population (x, 10^3 to 10^7) versus Twitter Representation
Rate (y, 0.01% to 10.00%).
Distribution across counties We now examine which
regions of the U.S. contain these over- and underrepresented
counties. To do so, we plot a map of the U.S. based on the
Twitter representation rate, relative to the median rate of
0.324%. Figure 2 presents this data, using both a normal
representation and an area cartogram representation (Gastner
and Newman 2004). In this figure, the counties are colored
according to the level of over- or underrepresentation, with
blue colors representing underrepresentation and red colors
representing overrepresentation, relative to the median rate
of 0.324%. Thus, the same number of counties will be
colored red as blue.
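The coloring rule can be expressed as a signed score per county. The sketch below assumes a base-10 logarithm of the ratio to the median; the figure caption only says the intensity follows "the log" of the rate, so the base and the handling of counties with no users are assumptions.

```python
import math


def log_representation(rate: float, median_rate: float) -> float:
    """Signed score: positive means overrepresented (red), negative underrepresented (blue)."""
    if rate <= 0:
        # Assumption: counties with no observed users get the most negative score.
        return float("-inf")
    return math.log10(rate / median_rate)
```

Because half of the counties lie on each side of the median, half of the scores are negative and half positive, which is why equal numbers of counties end up blue and red.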
These two maps lead to a number of interesting conclu-
sions: First, as evident in the normal representation, much of
the mid-west is significantly underrepresented in the Twit-
ter user base in this time period. Second, as evident in the
significantly red hue of the area cartogram, more populous
counties are consistently oversampled. However, the level of
oversampling does not appear to be dependent upon geogra-
phy: Both east coast and west coast cities are clearly visible
(e.g., San Francisco and Boston), as well as mid-west and
southern cities (e.g., Dallas, Chicago, and Atlanta).
Gender
Detecting gender using first names
As we have very limited information available on each user,
we rely on using the self-reported name available in each
user’s profile in order to detect gender. To do so, we first ob-
tain the most popular 1,000 male and female names for ba-
bies born in the U.S. for each year 1900–2009, as reported
by the U.S. Social Security Administration (Social
Security Administration 2010). We then aggregate the names
together, calculating the total frequency of each of the
resulting 3,034 male and 3,643 female names. As certain names
occurred in both lists, we remove the 241 names that were
less than 95% predictive (e.g., the name Avery was observed
to correspond to male babies only 56.8% of the time; it was
therefore removed). The result is a list of 5,836 names that
we use to infer gender.
Figure 2: Per-county over- and underrepresentation of U.S.
population in Twitter, relative to the median per-county
representation rate of 0.324%, presented in both (a) a normal
layout and (b) an area cartogram based on the 2000 Census
population. Blue colors indicate underrepresentation, while
red colors represent overrepresentation. The intensity of the
color corresponds to the log of the over- or
underrepresentation rate. Clear trends are visible, such as the
underrepresentation of the mid-west and overrepresentation of
populous counties.
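The construction of the gender name list can be sketched as below. The input dictionaries (aggregated birth counts per name and sex) and the lowercase normalization are assumptions made for illustration; only the 95% threshold and the Avery example come from the text.

```python
def build_gender_list(
    male_counts: dict[str, int],
    female_counts: dict[str, int],
    min_predictive: float = 0.95,
) -> dict[str, str]:
    """Keep only names that point to one gender at least `min_predictive` of the time."""
    genders: dict[str, str] = {}
    for name in set(male_counts) | set(female_counts):
        males = male_counts.get(name, 0)
        females = female_counts.get(name, 0)
        total = males + females
        if total == 0:
            continue
        if males / total >= min_predictive:
            genders[name.lower()] = "male"
        elif females / total >= min_predictive:
            genders[name.lower()] = "female"
        # Otherwise the name is too ambiguous and is dropped.
    return genders
```

With illustrative counts of 568 male and 432 female births for "avery" (56.8% male), the name falls below the threshold for both genders and is excluded, matching the example in the text.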
Limitations Clearly, this approach to detecting gender is
subject to a number of potential limitations. First, users may
misrepresent their name, leading to an incorrect gender in-
ference. Second, there may be differences in choosing to re-
veal one’s name between genders, leading us to believe that
fewer users of one gender are present. Third, the name lists
above may cover different fractions of the male and female
populations.
Gender of Twitter users
We first determine the number of the 3,279,425 U.S.-based
users for whom we could infer a gender, based on their name
and the list previously described. We do so by comparing
the first word of their self-reported name to the gender list.
We observe that there exists a match for 64.2% of the users.
Moreover, we find a strong bias towards male users: Fully
71.8% of the users for whom we find a name match had a
male name.
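The matching step amounts to a dictionary lookup on the first word of the profile name. A minimal sketch follows; the case-insensitive comparison is an assumption, and `gender_list` is a hypothetical mapping from lowercase first names to "male" or "female".

```python
from typing import Optional


def infer_gender(display_name: str, gender_list: dict[str, str]) -> Optional[str]:
    """Look up the first word of a self-reported name in the gender list."""
    words = display_name.split()
    if not words:
        return None
    return gender_list.get(words[0].lower())


def match_and_male_share(
    names: list[str], gender_list: dict[str, str]
) -> tuple[float, float]:
    """Return (fraction of users matched, fraction of matched users who are male)."""
    matched = [g for g in (infer_gender(n, gender_list) for n in names) if g is not None]
    if not names or not matched:
        return 0.0, float("nan")
    males = sum(1 for g in matched if g == "male")
    return len(matched) / len(names), males / len(matched)
```

Note that the male share is computed over matched users only, so it says nothing about the users whose names are absent from the list.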
Figure 3: Gender of joining users over time, binned into
groups of 10,000 joining users (note that the join rate
increases substantially). The bias towards male users is
observed to be decreasing over time. Axes: Date (x, 2007-01
to 2009-07) versus Fraction of Joining Users who are Male
(y, 0 to 1).
To further explore this trend, we examine the historic
gender bias. To do so, we use the join date of each user
(available in the user's profile). Figure 3 plots the average fraction
of joining users who are male over time. From this plot, it
is clear that while the male gender bias was significantly
stronger among the early Twitter adopters, the bias is
decreasing over time.
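The binning behind Figure 3 can be sketched as follows: sort users by join date, cut them into consecutive groups of 10,000, and compute the male fraction per group. The input format (pairs of join date and inferred gender) is hypothetical, and keeping the final partial bin is an assumption.

```python
from datetime import date
from typing import Iterable


def male_fraction_by_bin(
    users: Iterable[tuple[date, str]], bin_size: int = 10_000
) -> list[tuple[date, float]]:
    """Return (first join date in bin, fraction male) for consecutive bins of joiners."""
    ordered = sorted(users, key=lambda user: user[0])
    result = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]  # the last bin may be smaller
        males = sum(1 for _, gender in chunk if gender == "male")
        result.append((chunk[0][0], males / len(chunk)))
    return result
```

Binning by a fixed number of joiners rather than by calendar period keeps every point equally noisy, even though the join rate grows substantially over the observation window.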
Race/ethnicity
Detecting race/ethnicity using last names
Again, since we have very limited information available
on each Twitter user, we resort to inferring race/ethnicity
using self-reported last name. We examine the last name
of users, and correlate the last name with data from the
U.S. 2000 Census (U.S. Census 2000). In more detail, for
each last name with over 100 individuals in the U.S. dur-
ing the 2000 Census, the Census releases the distribution of
race/ethnicity for that last name. For example, the last name
“Myers” was observed to correspond to Caucasians 86% of
the time, African-Americans 9.7%, Asians 0.4%, and His-
panics 1.4%.
Race/ethnicity distribution of Twitter users
We first determined the number of U.S.-based users for
whom we could infer the race/ethnicity by comparing the
last word of their self-reported name to the U.S. Census
last name list. We found a match for 71.8% of the users.
We then determined the distribution of
race/ethnicity in each county by taking the race/ethnicity
distribution in the Census list, weighted by the frequency
of each name occurring in Twitter users in that county.¹
Due to the large amount of ambiguity in the last name-to-
race/ethnicity list (in particular, the last name list is more
than 95% predictive for only 18.5% of the users), we are
unable to compare the Twitter race/ethnicity distribution
directly to the race/ethnicity distribution in the U.S. Census.
However, we are able to make relative comparisons between
Twitter users in different geographic regions, allowing us to
explore geographic trends in the race/ethnicity distribution.
Thus, we examine the per-county race/ethnicity distribution
of Twitter users.
¹ This is effectively the census.model approach discussed in
prior work (Chang et al. 2010).
Figure 4: Per-county area cartograms of Twitter over- and
undersampling rates of (a) Caucasian (non-Hispanic),
(b) African-American, (c) Asian or Pacific Islander, and
(d) Hispanic users, relative to the 2000 U.S. Census. Only
counties with more than 500 Twitter users with inferred
race/ethnicity are shown. Blue regions correspond to
undersampling; red regions to oversampling.
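The per-county weighting described above can be illustrated as a frequency-weighted average of the Census surname distributions. In this sketch the data structures are hypothetical and surnames are matched case-insensitively by assumption; the "myers" row uses the percentages quoted earlier in the text, which do not sum to 100% because the remaining Census categories are omitted.

```python
from collections import Counter


def county_distribution(
    surname_counts: dict[str, int],
    census: dict[str, dict[str, float]],
) -> dict[str, float]:
    """Average the Census distributions of matched surnames, weighted by user count."""
    totals: Counter = Counter()
    matched_users = 0
    for surname, n_users in surname_counts.items():
        dist = census.get(surname.lower())
        if dist is None:
            continue  # surname not in the Census list
        matched_users += n_users
        for group, share in dist.items():
            totals[group] += n_users * share
    if matched_users == 0:
        return {}
    return {group: total / matched_users for group, total in totals.items()}


# Example row taken from the text; any other rows would be needed from the Census list.
census_rows = {
    "myers": {"Caucasian": 0.86, "African-American": 0.097, "Asian": 0.004, "Hispanic": 0.014},
}
```

Each user therefore contributes a fractional count to every group instead of being assigned a single label, which is what makes the approach usable even though few surnames are more than 95% predictive.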
In order to account for the uneven distribution of
race/ethnicity across the U.S., we examine the per-county
race/ethnicity distribution relative to the distribution from
the overall U.S. Census. For example, if we observed that
25% of Twitter users in a county were predicted to be
Hispanic, and the 2000 U.S. Census counted 23% of people in that
county as being Hispanic, we would consider Twitter to be
oversampling the Hispanic users in that county. Figure 4
plots the per-county race/ethnicity distribution, relative to
the 2000 U.S. Census, for all counties in which we observed
more than 500 Twitter users with identifiable last names.
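The comparison against the Census can be sketched as a per-group ratio. Expressing over- and undersampling as a ratio is an assumption (the text does not say whether a ratio or a difference is plotted), and the dictionaries below are hypothetical inputs.

```python
def sampling_ratios(
    twitter_dist: dict[str, float], census_dist: dict[str, float]
) -> dict[str, float]:
    """Ratio of the inferred Twitter share to the Census share, per group."""
    return {
        group: share / census_dist[group]
        for group, share in twitter_dist.items()
        if census_dist.get(group, 0) > 0
    }


def label(ratio: float) -> str:
    """Classify a ratio as oversampled (> 1), undersampled (< 1), or proportional."""
    if ratio > 1:
        return "oversampled"
    if ratio < 1:
        return "undersampled"
    return "proportional"


# The example from the text: 25% predicted Hispanic on Twitter versus 23% in the Census.
example = sampling_ratios({"Hispanic": 0.25}, {"Hispanic": 0.23})
```

In a full analysis this would only be applied to counties with more than 500 Twitter users with identifiable last names, as stated above.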
A number of geographic trends are visible, such as the
undersampling of Hispanic users in the southwest; the
undersampling of African-American users in the south and
mid-west; and the oversampling of Caucasian users in many
major cities.
Related work
A few other studies have examined the demographics of
social network users. For example, recent studies have exam-
ined the ethnicity of Facebook users (Chang et al. 2010),
general demographics of Facebook users (Corbett 2010),
and differences in online behavior on Facebook and MyS-
pace by gender (Strayhorn 2009). However, studies of gen-
eral social networking sites are able to leverage the broad
nature of the profiles available; in contrast, on Twitter, users
self-report only a minimal set of information, making calcu-
lating demographics significantly more difficult.
Conclusion
Twitter has received significant research interest lately as a
means for understanding, monitoring, and even predicting
real-world phenomena. However, most existing work does
not address the sampling bias, simply applying machine
learning and data mining algorithms without an
understanding of the Twitter user population. In this paper, we took
a first look at the user population themselves, and
examined the population along the axes of geography, gender, and
race/ethnicity. Overall, we found that Twitter users
significantly overrepresent the densely populated regions of the
U.S., are predominantly male, and represent a highly
non-random sample of the overall race/ethnicity distribution.
Going forward, our study sets the foundation for future
work upon Twitter data. Existing approaches could imme-
diately use our analysis to improve predictions or measure-
ments. By enabling post-hoc corrections, our work is a first
step towards turning Twitter into a tool that can make infer-
ences about the population as a whole. More nuanced anal-
yses on the biases in the Twitter population will enhance
the ability for Twitter to be used as a sophisticated inference
tool.
Acknowledgements
We thank Fabricio Benevenuto and Meeyoung Cha for their
assistance in gathering the Twitter data used in this study.
We also thank Jim Bagrow for valuable discussions and his
collection of geographic data from Google Maps. This re-
search was supported in part by NSF grant IIS-0964465 and
an Amazon Web Services in Education Grant.
References
Asur, S., and Huberman, B. 2010. Predicting the future with social
media. http://arxiv.org/abs/1003.5699.
Bollen, J.; Mao, H.; and Zeng, X.-J. 2010. Twitter mood predicts
the stock market. In ICWSM.
Cha, M.; Haddadi, H.; Benevenuto, F.; and Gummadi, K. 2010.
Measuring user influence in twitter: The million follower fallacy.
In ICWSM.
Chang, J.; Rosenn, I.; Backstrom, L.; and Marlow, C. 2010.
ePluribus: Ethnicity on social networks. In ICWSM.
Corbett, P. 2010. Facebook demographics and statistics
report 2010.
http://www.istrategylabs.com/2010/01/facebook-demographics-and-statistics-report-2010-145-growth-in-1-year.
Gastner, M. T., and Newman, M. E. J. 2004. Diffusion-based
method for producing density-equalizing maps. PNAS 101.
O’Connor, B.; Balasubramanyan, R.; Routledge, B.; and Smith, N.
2010. From tweets to polls: Linking text sentiment to public opin-
ion time series. In ICWSM.
Social Security Administration. 2010. Most popular baby names.
http://www.ssa.gov/oact/babynames.
Strayhorn, T. 2009. Sex differences in use of facebook and mys-
pace among first-year college students. Stud. Affairs 10(2).
U.S. Census. 2000. Genealogy data: Frequently occurring
surnames from census.
http://www.census.gov/genealogy/www/data/2000surnames.
... Wang et al. (2016) indicated that affected citizens are likely to express their feelings and concerns regarding the crises on social media. Meanwhile, Mislove et al. (2011) andLuo et al. (2016) have demonstrated the feasibility of using Twitter users' names to infer their demographics such as race/ethnicity and gender. Hence, social media such as Twitter has the potential to support the analysis of different demographic groups' responses during disasters. ...
... In addition, Peddinti et al. (2014) have found that nearly 35% of their 100,000 randomly selected Twitter users were using real names. Accordingly, various studies have inferred Twitter users' race/ethnicity and gender by their full names such as Mislove et al. (2011) and Luo et al. (2016). However, studies on affected citizens' demographics during natural disasters are very limited (Mandel et al. 2012). ...
Conference Paper
Social groups are characterized by their demographic characters such as race/ethnicity and gender. Different demographic groups were found to have experienced significantly varying impacts of the same disasters. For instance, ethnic minorities were impacted more severely than the white during Hurricane Katrina. These varying impacts can be reflected in their different crisis responses. However, research on disaster response disparities among different demographic groups remains a critical challenge due to the lack of disaggregated-level data classified by demographic characters. To fill in this gap, this research takes the first step to investigate the demographics of affected citizens during Hurricane Florence. This paper infers social media users’ demographic characters from their users’ names. The results are used for analyzing social media activities in different demographic groups. This study found the white groups performed most active while the black group acted least active in talking about the hurricane event on social media. Moreover, the female affected citizens were found to be less active than the male affected citizens on social media during Hurricane Florence. The comparative results of demographic compositions among the affected and not-affected citizens have presented different distributions. Our findings can help the classification of Twitter data by demographic groups. The classified Twitter data can be further used for exploring the sentiment and concerns of different demographic groups. The understanding of varying sentiment and concerns of different demographic groups can help crisis response managers design and implement on-target response strategies.
... We build upon previous research in social media demographic inference that analyzes users' location, gender, race, and occupation using publicly available data [7,13]. Previous works on Twitter demographic inference differ from our approach in two ways: they either focus on a slim set of demographics (three or less) or only take a single aspect of a tweet object as input (only text, for example) [16]. In contrast, the recent advances in machine learning makes it possible to utilize a multimodal approach to demographic inference along a wide set of demographics. ...
... The fact that there are more Democrats within the collected data is unsurprising, given that existing Twitter demographic research shows that most Twitter users are identify as liberal [16]. It is notable that Republicans have sentiment that is more negative than the Democrats' sentiment is positive. ...
Preprint
In spite of a growing body of scientific evidence on the effectiveness of individual face mask usage for reducing transmission rates, individual face mask usage has become a highly polarized topic within the United States. A series of policy shifts by various governmental bodies have been speculated to have contributed to the polarization of face masks. A typical method to investigate the effects of these policy shifts is to use surveys. However, survey-based approaches have multiple limitations: biased responses, limited sample size, badly crafted questions may skew responses and inhibit insight, and responses may prove quickly irrelevant as opinions change in response to a dynamic topic. We propose a novel approach to 1) accurately gauge public sentiment towards face masks in the United States during COVID-19 using a multi-modal demographic inference framework with topic modeling and 2) determine whether face mask policy shifts contributed to polarization towards face masks using offline change point analysis on Twitter data. First, we infer several key demographics of individual Twitter users such as their age, gender, and whether they are a college student using a multi-modal demographic prediction framework and analyze the average sentiment for each respective demographic. Next, we conduct topic analysis using latent Dirichlet allocation (LDA). Finally, we conduct offline change point discovery on our sentiment time series data using the Pruned Exact Linear Time (PELT) search algorithm. Experimental results on a large corpus of Twitter data reveal multiple insights regarding demographic sentiment towards face masks that agree with existing surveys. Furthermore, we find two key policy-shift events contributed to statistically significant changes in sentiment for both Republicans and Democrats.
... Furthermore, the representativeness may vary geographically. Despite the attempts to improve the understanding of the demographics of Twitter users via profile scrutiny and tweets mining [27,59], the intrinsic biases in Twitter samples should be considered when the results of this study are interpreted. The problem of representativeness, however, exists in all digital services. ...
Article
Full-text available
The current COVID-19 pandemic raises concerns worldwide, leading to serious health, economic, and social challenges. The rapid spread of the virus at a global scale highlights the need for a more harmonized, less privacy-concerning, easily accessible approach to monitoring the human mobility that has proven to be associated with viral transmission. In this study, we analyzed over 580 million tweets worldwide to see how global collaborative efforts in reducing human mobility are reflected from the user-generated information at the global, country, and U.S. state scale. Considering the multifaceted nature of mobility, we propose two types of distance: the single-day distance and the cross-day distance. To quantify the responsiveness in certain geographic regions, we further propose a mobility-based responsive index (MRI) that captures the overall degree of mobility changes within a time window. The results suggest that mobility patterns obtained from Twitter data are amenable to quantitatively reflect the mobility dynamics. Globally, the proposed two distances had greatly deviated from their baselines after March 11, 2020, when WHO declared COVID-19 as a pandemic. The considerably less periodicity after the declaration suggests that the protection measures have obviously affected people's travel routines. The country scale comparisons reveal the discrepancies in responsiveness, evidenced by the contrasting mobility patterns in different epidemic phases. We find that the triggers of mobility changes correspond well with the national announcements of mitigation measures, proving that Twitter-based mobility implies the effectiveness of those measures. In the U.S., the influence of the COVID-19 pandemic on mobility is distinct. However, the impacts vary substantially among states.
... Twitter users are younger, better educated, and more liberal [1,10,14,19,20,33,34]. Furthermore, among Twitter users, rates of tweeting are extremely skewed [3]. ...
Chapter
Full-text available
Over a million tweets were analyzed using various methods in an attempt to predict the results of the Eurovision Song Contest televoting. Different methods of sentiment analysis (English, multilingual polarity lexicons and deep learning) and translating the focus language tweets into English were used to determine the method that produced the best prediction for the contest. Furthermore, we analyzed the effect of sampling tweets during different periods, namely during the performances and/or during the televoting phase of the competition. The quality of the predictions was assessed through correlations between the actual ranks of the televoting and the predicted ranks. The prediction was based on the application of an adjusted Eurovision televoting scoring system to the results of the sentiment analysis of tweets. A predicted rank for each performance resulted in a Spearman correlation coefficients of 0.62 and 0.74 during the televoting period for the lexicon sentiment-based and deep learning approaches, respectively.
... Er zijn altijd groepen mensen van wie de data niet regelmatig verzameld en geanalyseerd worden, omdat ze niet of onvoldoende participeren in het (online)gedrag dat als basis voor veel Big Data-analyses dient (Lerman 2013). Ook is Twitter geen willekeurige sample van de etnische samenstelling van de bevolking (Mislove et al. 2011). Vooral bij het combineren van databronnen en hergebruik kan het lastig zijn om te achterhalen hoe datasets tot stand zijn gekomen en dus wat de precieze bias is die in de data zit. ...
Book
Full-text available
Het gebruik van Big Data in het veiligheidsdomein vraagt om nieuwe kaders. Dat is nodig om de mogelijkheden van Big Data te benutten en tegelijkertijd de fundamentele rechten en vrijheden van burgers te waarborgen. Dat schrijft de WRR in zijn rapport Big Data in een vrije en veilige samenleving (rapport nr. 95, 2016).
... Activity captured on social media platforms may be solely indicative of interactions on that platform rather than generalizable inferences of social phenomena (Cowls and Schroeder, 2015). Extant scholarship considering social media as a measure of public interest also falls prey to dependence on non-representative samples (Blank, 2016;Mislove et al., 2011). We believe Wikipedia usership to be significantly more representative of the general public both because it is a widely used resource (Schroeder & Taylor, 2015) and since it is used by a variety of individuals (Head & Eisenberg, 2010;Göbel & Munzert, 2018;Messner & South, 2011). ...
Preprint
Election prediction has long been an evergreen in political science literature. Traditionally, such efforts included polling aggregates, economic indicators, partisan affiliation, and campaign effects to predict aggregate voting outcomes. With increasing secondary usage of online-generated data in social science, researchers have begun to consult metadata from widely used web-based platforms such as Facebook, Twitter, Google Trends and Wikipedia to calibrate forecasting models. Web-based platforms offer the means for voters to retrieve detailed campaign-related information, and for researchers to study the popularity of campaigns and public sentiment surrounding them. However, past contributions have often overlooked the interaction between conventional election variables and information-seeking behaviour patterns. In this work, we aim to unify traditional and novel methodology by considering how information retrieval differs between incumbent and challenger campaigns, as well as the effect of perceived candidate viability and media coverage on Wikipedia pageviews predictive ability. In order to test our hypotheses, we use election data from United States Congressional (Senate and House) elections between 2016 and 2018. We demonstrate that Wikipedia data, as a proxy for information-seeking behaviour patterns, is particularly useful for predicting the success of well-funded challengers who are relatively less covered in the media. In general, our findings underline the importance of a mixed-data approach to predictive analytics in computational social science.
... And whilst even the academic community has shown more than once that this platform is a good reflection of the political and social discourse, Twitter also has certain drawbacks from the perspective of diversity. In fact, several comparative studies have found that women are underrepresented on Twitter (Armstrong & Gao, 2011; Ausserhofer & Maireder, 2013; Bode et al., 2011; Mislove et al., 2011; Parmelee & Bichard, 2011) and that women are less likely to tweet publicly (Hargittai & Jennrich, 2016). Although this research does not analyze the gender of the Twitter users in the sample, several recent studies that analyzed the profile of Twitter users in Spain found that only 35%-40% of active Twitter users were women (Barberá & Rivero, 2015), that among the referent communicators women represented only 26.83% (Arrabal-Sánchez & De-Aguilera-Moyano, 2016), and that the average number of tweets published by female users was lower than that of men (Arrabal-Sánchez & De-Aguilera-Moyano, 2016). ...
Article
The instance of image-based abuse that ended in the victim’s suicide, known as the “Iveco case,” had an unprecedented social impact in Spain in 2019. This case provoked a great social reaction and went particularly viral on social networks such as Twitter. The present research investigates how this case was dealt with in Twitter discourse. In particular, this study aimed to identify the main elements that could explain how people engaged with the problem of nonconsensual sharing of sexually explicit images in general, and with this case in particular. In total, 1,895 tweets containing the word “Iveco” and written in Spain were collected through the streaming API, and their content was analyzed by lexical analysis using Iramuteq software (Reinert method). This software carries out an automatic lexical classification cluster analysis that groups the most significant words and text segments according to their co-occurrence. The results revealed that, on Twitter, it was stressed that the victim was a married woman with children who had practiced sexting. However, in response to this initial description, many voices also emerged that labelled this image-based abuse as gender-based online violence. Criticism was aimed at both the passivity of the company and the attitude of the hundreds of thousands of people who shared the sexting video in WhatsApp groups without permission. Consequently, several feminist mobilizations emerged, framing this case within a sexist and patriarchal society and asking for accountability. In contrast, countermovements such as #NotAllMen also emerged.
... Numerous approaches exist for demographic inference of social media including text-based [22,26,58,72,89,102] and image-based approaches [65,75]. It is widely acknowledged that both of these approaches have inherent biases and limitations. ...
Preprint
The #MeToo movement on Twitter has drawn attention to the pervasive nature of sexual harassment and violence. While #MeToo has been praised for providing support for self-disclosures of harassment or violence and shifting societal response, it has also been criticized for exemplifying how women of color have been discounted for their historical contributions to and excluded from feminist movements. Through an analysis of over 600,000 tweets from over 256,000 unique users, we examine online #MeToo conversations across gender and racial/ethnic identities and the topics that each demographic emphasized. We found that tweets authored by white women were overrepresented in the movement compared to other demographics, aligning with criticism of unequal representation. We found that intersected identities contributed differing narratives to frame the movement, co-opted the movement to raise visibility in parallel ongoing movements, employed the same hashtags both critically and supportively, and revived and created new hashtags in response to pivotal moments. Notably, tweets authored by black women often expressed emotional support and were critical about differential treatment in the justice system and by police. In comparison, tweets authored by white women and men often highlighted sexual harassment and violence by public figures and wove in more general political discussions. We discuss the implications of this work for digital activism research and design, including suggestions to raise visibility by those who were under-represented in this hashtag activism movement. Content warning: this article discusses issues of sexual harassment and violence.
... They also argued that microblogging, such as Twitter feeds, can be useful for personality research. However, there are many potential factors, including, for example, inherent or subconscious bias (González-Bailón et al., 2012; Mislove, 2011), posting credibility (Kang, O'Donovan, & Höllerer, 2012), or human impulsivity (Savci & Aysan, 2015), that could significantly affect and/or modify genuine emotions expressed by sports fans. While beyond the scope of the present study, future research in this area should focus on the cross-validation of the emotions expressed through the notes posted to Twitter with other data. ...
This study explored the chaotic properties of human emotions as expressed in social media and its implications for attainable forecasting horizons. Three human emotional states extracted from Twitter were analyzed using the nonlinear dynamics approach. The greatest positive Lyapunov exponent (LE) and 0-1 test methods were applied to a time series set consisting of over 25,000 data points reflecting the hourly recorded data of over 1.3 million tweets. The results suggest that the examined emotional time series data represent a nonlinear dynamical system with deterministic chaos properties. Therefore, by utilizing traditional linear methods of social media data analysis, one may not be able to fully understand and forecast critical transition trends over time or beyond a limited duration. It was concluded that the nonlinear dynamics approach is useful to determine a feasible forecasting horizon and to assess the prediction accuracy of social media data in general.
Chapter
This chapter examines one possible explanation for what social media data might be revealing; social media may well provide a window into public attention rather than attitudes and opinions. It describes an empirical test of the assumption that Twitter might provide a window into what Americans were thinking about in the run‐up to the 2016 US Presidential election. The chapter compares answers to a series of open‐ended survey questions about what Americans had heard about the candidates with the content of posts about those same candidates at the same time on Twitter. It examines whether the frequency of mentions of event‐related terms from these two types of data displayed similar patterns over time. Twitter data for the current analyses were collected using the Sysomos firehose access tool, which allows subscribing researchers to download a random sample of tweets related to any given keyword on a particular date.
Conference Paper
Directed links in social media could represent anything from intimate friendships to common interests, or even a passion for breaking news or celebrity gossip. Such directed links determine the flow of information and hence indicate a user's influence on others, a concept that is crucial in sociology and viral marketing. In this paper, using a large amount of data collected from Twitter, we present an in-depth comparison of three measures of influence: indegree, retweets, and mentions. Based on these measures, we investigate the dynamics of user influence across topics and time. We make several interesting observations. First, popular users who have high indegree are not necessarily influential in terms of spawning retweets or mentions. Second, most influential users can hold significant influence over a variety of topics. Third, influence is not gained spontaneously or accidentally, but through concerted effort such as limiting tweets to a single topic. We believe that these findings provide new insights for viral marketing and suggest that topological measures such as indegree alone reveal very little about the influence of a user.
Conference Paper
We propose an approach to determine the ethnic breakdown of a population based solely on people's names and data provided by the U.S. Census Bureau. We demonstrate that our approach is able to predict the ethnicities of individuals as well as the ethnicity of an entire population better than natural alternatives. We apply our technique to the population of U.S. Facebook users and uncover the demographic characteristics of ethnicities and how they relate. We also discover that while Facebook has always been diverse, diversity has increased over time leading to a population that today looks very similar to the overall U.S. population. We also find that different ethnic groups relate to one another in an assortative manner, and that these groups have different profiles across demographics, beliefs, and usage of site features.
Article
In recent years, social media has become ubiquitous and important for social networking and content sharing. And yet, the content that is generated from these websites remains largely untapped. In this paper, we demonstrate how social media content can be used to predict real-world outcomes. In particular, we use the chatter from Twitter.com to forecast box-office revenues for movies. We show that a simple model built from the rate at which tweets are created about particular topics can outperform market-based predictors. We further demonstrate how sentiments extracted from Twitter can be further utilized to improve the forecasting power of social media.
Article
Map makers have for many years searched for a way to construct cartograms, maps in which the sizes of geographic regions such as countries or provinces appear in proportion to their population or some other analogous property. Such maps are invaluable for the representation of census results, election returns, disease incidence, and many other kinds of human data. Unfortunately, to scale regions and still have them fit together, one is normally forced to distort the regions' shapes, potentially resulting in maps that are difficult to read. Many methods for making cartograms have been proposed, some of them extremely complex, but all suffer either from this lack of readability or from other pathologies, like overlapping regions or strong dependence on the choice of coordinate axes. Here, we present a technique based on ideas borrowed from elementary physics that suffers none of these drawbacks. Our method is conceptually simple and produces useful, elegant, and easily readable maps. We illustrate the method with applications to the results of the 2000 U.S. presidential election, lung cancer cases in the State of New York, and the geographical distribution of stories appearing in the news.
Article
Behavioral economics tells us that emotions can profoundly affect individual behavior and decision-making. Does this also apply to societies at large, i.e., can societies experience mood states that affect their collective decision making? By extension is the public mood correlated or even predictive of economic indicators? Here we investigate whether measurements of collective mood states derived from large-scale Twitter feeds are correlated to the value of the Dow Jones Industrial Average (DJIA) over time. We analyze the text content of daily Twitter feeds by two mood tracking tools, namely OpinionFinder that measures positive vs. negative mood and Google-Profile of Mood States (GPOMS) that measures mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy). We cross-validate the resulting mood time series by comparing their ability to detect the public's response to the presidential election and Thanksgiving day in 2008. A Granger causality analysis and a Self-Organizing Fuzzy Neural Network are then used to investigate the hypothesis that public mood states, as measured by the OpinionFinder and GPOMS mood time series, are predictive of changes in DJIA closing values. Our results indicate that the accuracy of DJIA predictions can be significantly improved by the inclusion of specific public mood dimensions but not others. We find an accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA and a reduction of the Mean Average Percentage Error by more than 6%.
Social Security Administration. 2010. Most popular baby names. http://www.ssa.gov/oact/babynames.
U.S. Census. 2000. Genealogy data: Frequently occurring surnames from census. http://www.census.gov/genealogy/www/data/2000surnames.
Corbett, P. 2010. Facebook demographics and statistics report 2010. http://www.istrategylabs.com/2010/01/facebook-demographics-and-statistics-report-2010-145-growth-in-1-year.