
The paradox of active users

Commentary
Patrick Park and Michael Macy
Introduction
Twitter users vary widely in their level of activity.
Active Twitter users not only tweet more often than
others but they also tend to mention other users with
higher frequency, and the set of time-stamped global
positioning system (GPS) locations in their tweets is
more complete. The time, location, and connection
data along with the text content in the active users’
tweets render a more complete picture of their social
connections, preferences, attitudes, interests, and
spatial mobility.
However, the activity levels of users are heavily
skewed; most Twitter content is produced by a small
fraction of highly active users (measured by the number
of followers, tweets, retweets, and mentions), while the
vast majority of registered users are passive observers.
For example, there were only 302 million monthly
active users in the first quarter of 2015 among approximately
one billion registered users (Twitter, 2015).
According to Twopcharts, a company monitoring
Twitter activity, 43% of the 550 million Twitter
accounts that had at least one tweet did not create a
single tweet in the previous year (Murphy, 2014).
In terms of content, 50% of the URLs consumed on
Twitter were generated by 0.05% of all users in 2011
(Wu et al., 2011).
This skewed distribution of overall activity poses
challenges for inferring unobserved user characteristics
(e.g. the representative geographic location of each
user). In particular, the methods devised to infer user
characteristics rely on and leverage central tendencies
in the data, treating highly active users as outliers
or aberrations. In effect, the users who emit the most
information and who are expected, in principle, to
be summarized and categorized most accurately, are
paradoxically the ones whose characteristics tend to
be discounted or misclassified.
In this study, we offer concrete examples to illustrate
this paradox and discuss how methods that do not
properly deal with the active users can increase
classification error and ultimately distort our understanding
of online social relationships at the micro
level and structural properties of the communication
network at the macro level.
Cornell University, Ithaca, NY, USA
Corresponding author:
Patrick Park, 323 Uris Hall, Cornell University, Ithaca, NY 14853, USA.
Email: pp286@cornell.edu
Big Data & Society
July–December 2015: 1–4
© The Author(s) 2015
Reprints and permissions:
sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/2053951715606164
bds.sagepub.com
Creative Commons Non Commercial CC-BY-NC: This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 3.0 License (http://www.creativecommons.org/licenses/by-nc/3.0/) which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).
Geo-location inference
Inferring the representative geographic location of
social media users is an actively developing area of
research with wide applicability in both applied and
basic research endeavors that use social media data.
Social media-based early detection and prediction studies
of seasonal flu surveillance or the prediction of
commercial movie success require knowledge about
users’ locations. Research in our lab also depends on
inference of user location, as in the study of diurnal
rhythms of affect using Twitter data (Golder and
Macy, 2011) or our on-going cross-national comparative
analysis of communication networks.
The state-of-the-art location inference method based
on label propagation has been shown to perform with
high coverage (90% of users geotagged) and accuracy
(median error of 6.38 km) (Compton et al., 2014). This
method relies on the fact that the majority of communication
partners, or network neighbors, in the Twitter
@user mention network are geographically proximate.
If a user’s Twitter neighbors turn on their GPS while
tweeting, it is possible to estimate the focal user’s latitude
and longitude based on the distribution of those
neighbors’ locations. This estimated location could
then be used to further estimate the unknown locations
of the focal user’s other network neighbors who did not
enable GPS tracking in their tweets, and so on. In technical
terms, the label-propagation algorithm attempts
to infer a candidate location of a given user based on
one of her neighbors’ locations that minimizes the sum
of distances to the rest of her neighbors’ locations
(i.e. L1-median location). To ensure robust results,
the algorithm incorporates a dispersion threshold
(e.g. 100 km) where the algorithm accepts the candidate
L1-median location as its final estimate only if the
median distance from that candidate location to all
the other known and inferred locations of the network
neighbors does not exceed the dispersion threshold. If
the candidate location satisfies this condition, the algorithm
assigns that location to the focal user as its best
estimate and propagates it to estimate other network
neighbors’ locations. For example, if a user’s candidate
location inferred by the algorithm happens to be in
New York, but her network neighbors are geographically
concentrated in both New York and San
Francisco, such that the median of the distances from
the candidate location in New York to all of her neighbors’
locations exceeds 100 km, then the algorithm does
not assign that candidate location in New York as the
estimated location but instead classifies the user’s location
as unidentifiable.
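The single estimation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of Compton et al. (2014): the function names, the haversine distance, and the (latitude, longitude) tuple format are all hypothetical choices, and the propagation of accepted estimates to further neighbors is omitted.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def estimate_location(neighbor_locations, dispersion_threshold_km=100.0):
    """Return the accepted candidate location, or None if unidentifiable.

    neighbor_locations: list of (lat, lon) tuples, one per network neighbor
    (hypothetical input format).
    """
    if not neighbor_locations:
        return None
    # Candidate: the neighbor location that minimizes the sum of distances
    # to all neighbor locations (the L1-median restricted to observed points).
    candidate = min(
        neighbor_locations,
        key=lambda c: sum(haversine_km(c, p) for p in neighbor_locations),
    )
    # Dispersion check: median distance from the candidate to the others.
    others = [p for p in neighbor_locations if p is not candidate]
    if others:
        spread = median(haversine_km(candidate, p) for p in others)
        if spread > dispersion_threshold_km:
            return None  # neighbors are too dispersed to trust the candidate
    return candidate


new_york = [(40.71, -74.00), (40.73, -73.99), (40.75, -73.98)]
san_francisco = [(37.77, -122.42), (37.78, -122.41), (37.76, -122.43)]
print(estimate_location(new_york))
print(estimate_location(new_york + san_francisco))
```

With neighbors concentrated in one city the candidate is accepted, whereas the bi-coastal neighbor set fails the dispersion check and the user is left unclassified.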
Despite the impressive performance in both coverage
and accuracy of this new method, we find that prediction
error is a U-shaped function of both user activity
level, measured as tweets per day (Compton et al.,
2014), and network degree. Furthermore, the candidate
locations inferred for the highly active users tend to
exceed the dispersion threshold, diminishing the proportion
of inferable cases (i.e. classification coverage).
The paradox of lower coverage and lower predictive
accuracy for active, high-degree users who provide disproportionately
large amounts of information about
themselves stems from the assumption that there is a
single geographical location that best characterizes a
user’s location. Although this may be true in principle,
the network neighbors from whom this single location
is inferred are constantly moving over time and
the time at which each of those neighbors coincided
with the focal user in space may be too long ago to
be relevant. The example of the user whose network
neighbors are concentrated in both New York and
San Francisco illustrates this case. This user may
work in New York but reside in San Francisco and
form geographically segregated personal and professional
networks. Alternatively, this user may have
grown up in San Francisco but moved to New York
for work. If the focal user is active on Twitter, her
work/social or past/present relationships will each be
given equal weight in the label-propagation algorithm.
For an active user who exhibits greater geographic
diversity in her network, the candidate location is less
likely to satisfy the dispersion threshold. The algorithm
might be improved by either assigning multiple plausible
locations (i.e. assuming multiple representative
locations) or by adding more constraints to what constitutes
a representative location (e.g. initializing the
label-propagation algorithm to start from GPS locations
in tweets created exclusively at night).
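As a rough illustration of the second suggestion, the sketch below keeps only geotagged tweets created at night and returns the most frequent location among them as a seed. Everything specific here is an assumption made for the example: the (timestamp, lat, lon) tuple format, timestamps already converted to the user's local time, the 22:00 to 06:00 night window, and the rounding of coordinates to a coarse grid.

```python
from collections import Counter
from datetime import datetime


def is_night(timestamp, start_hour=22, end_hour=6):
    """True if the local timestamp falls in the assumed night window."""
    return timestamp.hour >= start_hour or timestamp.hour < end_hour


def night_seed(geotagged_tweets, precision=2):
    """Most frequent night-time grid cell, or None if there are no night tweets.

    geotagged_tweets: iterable of (local_datetime, lat, lon) tuples
    (hypothetical format). Coordinates are rounded to `precision` decimals
    so that nearby GPS fixes fall into the same cell.
    """
    cells = Counter(
        (round(lat, precision), round(lon, precision))
        for ts, lat, lon in geotagged_tweets
        if is_night(ts)
    )
    return cells.most_common(1)[0][0] if cells else None


tweets = [
    (datetime(2015, 3, 2, 10, 0), 40.71, -74.00),   # daytime, New York
    (datetime(2015, 3, 2, 13, 0), 40.71, -74.00),   # daytime, New York
    (datetime(2015, 3, 2, 15, 0), 40.71, -74.00),   # daytime, New York
    (datetime(2015, 3, 2, 23, 30), 37.77, -122.42),  # night, San Francisco
    (datetime(2015, 3, 3, 2, 0), 37.77, -122.42),    # night, San Francisco
]
print(night_seed(tweets))
```

In this toy input the user tweets more often from New York during the day, but the night-only seed points to San Francisco, which is the kind of extra constraint on "representative location" the paragraph has in mind.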
Individual vs. group account classification
A broad range of prediction applications (Broniatowski
et al., 2013), behavioral modeling and social network
analyses using social media data of hundreds of millions
of users build models with implicit assumptions
about the users. An important source of heterogeneity
of the users that is often neglected is whether a user is
an individual or a group account (e.g. a company’s official
Twitter account). Often, multiple individuals such
as a PR team manage a single group account, leaving
quite different behavioral traces from individual
accounts managed by single owners. For example,
group accounts tend to possess more followers on
Twitter and the followers are arguably less related to
the group accounts. The communication ties between
group accounts and their followers are also not likely to
be “social ties” in the conventional sense. Furthermore,
the objectives, language, and topical interests of group
accounts differ from those of individuals.
Researchers who work primarily with traditional
survey data with accurate and well-defined sampling
frames do not have to deal with the distinction between
groups vs. individuals. However, computational social
scientists and data scientists who use social media data
inevitably face this problem. Failure to correctly classify
and filter out group accounts (or individual
accounts, depending on the research objective) could
lead to misleading characterizations and conclusions.
Imagine a network analysis that does not properly
filter out group accounts that tend to have higher connectivity
than individuals. Virtually all network metrics,
from clustering and degree distribution to mean
geodesic, will be affected by the presence of group
accounts in the data.
The method we developed to distinguish group
and individual accounts is based on the cognitive
constraints of individuals in forming and maintaining
communication ties, which arguably applies to a lesser
extent to group accounts managed by multiple individuals
(Park et al., 2015). These constraints are captured,
for example, in the ratio of in-degree to out-degree of
each node’s immediate neighbors in the communication
network as well as in the level of concentration of communication
volume across one’s network neighbors
(Saramäki et al., 2014).
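The two kinds of features mentioned above can be approximated as follows. This is a hedged sketch rather than the feature set of Park et al. (2015): the (sender, receiver, count) edge format and the function names are hypothetical, and the Herfindahl-style sum of squared shares is used here only as one simple stand-in for "concentration of communication volume".

```python
from collections import defaultdict


def build_neighbor_sets(mentions):
    """Return (in_neighbors, out_neighbors) from (sender, receiver, count) edges."""
    in_nbrs, out_nbrs = defaultdict(set), defaultdict(set)
    for sender, receiver, _count in mentions:
        out_nbrs[sender].add(receiver)
        in_nbrs[receiver].add(sender)
    return in_nbrs, out_nbrs


def neighbor_degree_ratio(user, mentions):
    """Mean in-degree/out-degree ratio over the user's immediate neighbors.

    Neighbors with zero out-degree are skipped to avoid division by zero.
    """
    in_nbrs, out_nbrs = build_neighbor_sets(mentions)
    neighbors = in_nbrs[user] | out_nbrs[user]
    ratios = [len(in_nbrs[n]) / len(out_nbrs[n]) for n in neighbors if out_nbrs[n]]
    return sum(ratios) / len(ratios) if ratios else None


def volume_concentration(user, mentions):
    """Sum of squared shares of the user's outgoing mention volume.

    Ranges from 1/k (volume spread evenly over k contacts) to 1.0
    (all volume directed at a single contact).
    """
    volume = defaultdict(int)
    for sender, receiver, count in mentions:
        if sender == user:
            volume[receiver] += count
    total = sum(volume.values())
    if total == 0:
        return None
    return sum((v / total) ** 2 for v in volume.values())


mentions = [
    ("a", "b", 8), ("a", "c", 1), ("a", "d", 1), ("b", "a", 2),   # individual-like
    ("g", "w", 5), ("g", "x", 5), ("g", "y", 5), ("g", "z", 5),   # group-like
]
print(volume_concentration("a", mentions))
print(volume_concentration("g", mentions))
print(neighbor_degree_ratio("a", mentions))
```

In the toy data, account "a" concentrates most of its mentions on one contact, while account "g" spreads them evenly, which is the contrast the cognitive-constraint argument relies on.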
This method shares with the methods for geo-location
estimation the problem that highly active individuals
may be misclassified due to their behavioral and
structural similarities to groups. If central individuals
(e.g. opinion leaders) are mistakenly filtered out,
the network would appear to be less clustered, more
fragmented, and with a longer mean geodesic.
A potential solution is to leverage the temporal
constraints on tweeting which individuals, but not
organizations, tend to exhibit (Tavares and Faisal,
2013). Each user’s inter-tweet delays and the temporal
distribution of tweets throughout the day can be used
to enhance the overall discriminatory power between
organizations and active individuals whose networks
look similar.
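Both temporal features named above are straightforward to compute from a user's tweet timestamps. The sketch below is illustrative only; the function names are hypothetical, timestamps are assumed to be Python datetime objects in a consistent time zone, and no claim is made about how Tavares and Faisal (2013) computed their features.

```python
from datetime import datetime


def inter_tweet_delays(timestamps):
    """Seconds between consecutive tweets, after sorting by time."""
    ordered = sorted(timestamps)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]


def hourly_distribution(timestamps):
    """Share of tweets posted in each hour of the day (list of 24 floats)."""
    counts = [0] * 24
    for ts in timestamps:
        counts[ts.hour] += 1
    total = len(timestamps)
    return [c / total for c in counts] if total else [0.0] * 24


timestamps = [
    datetime(2015, 1, 1, 9, 0),
    datetime(2015, 1, 1, 9, 30),
    datetime(2015, 1, 1, 21, 0),
]
print(inter_tweet_delays(timestamps))
print(hourly_distribution(timestamps)[9])
```

An individual would be expected to show long overnight gaps and an hourly profile that dips during sleep, while an account run by several people may not; these vectors could be fed to whatever classifier is already in use.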
Social vs. coworker vs. acquaintance tie classification
In the absence of respondent surveys or ethnographic
observation, the nature of a communication tie
(e.g. professional vs. friendship vs. acquaintance)
could be inferred from time-location records of
mobile phone logs (Eagle et al., 2009; Toole et al.,
2015). The intuition behind this method is that
coworkers or professionals tend to be co-located
during work hours (e.g. in the same office building on
a weekday afternoon) while friends are more likely to
be co-located during off-work hours (e.g. in a bar on a
Friday evening). Acquaintances are likely to have few
co-locations regardless of time. This approach, which
leverages time-location similarity (using cosine similarity
between hourly location occurrence vectors of two
individuals), yields accurate and convincing results with
mobile phone data that contain detailed time-location
records at regular time intervals for each individual
(i.e. whenever a mobile device communicates with a
cell-phone tower).
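The similarity measure mentioned in this paragraph can be sketched as below. The representation is an assumption made for the example: each record is a hypothetical (hour, location_id) pair, a user's vector counts how often each pair occurs, and the cosine is taken over these sparse counts. The cited studies may construct their vectors differently.

```python
import math
from collections import Counter


def occurrence_vector(records):
    """Sparse count vector keyed by (hour, location_id) pairs."""
    return Counter(records)


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two coworkers share an office in the afternoon but part ways in the evening.
alice = occurrence_vector([(14, "office"), (15, "office"), (20, "bar")])
bob = occurrence_vector([(14, "office"), (15, "office"), (20, "home")])
carol = occurrence_vector([(9, "cafe")])

print(cosine_similarity(alice, bob))
print(cosine_similarity(alice, carol))
```

Restricting the records to work hours or to off-work hours before building the vectors would give separate similarity scores for the professional and friendship cases described above.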
Nevertheless, blindly applying this method to
Twitter users with GPS tweet data could potentially
lead to biased results. Again, the active users will
appear to have high mobility with hundreds of GPS
locations captured in the data whereas the low activity
users may appear relatively immobile. Therefore, a tie
involving an active user is more likely to be classified as
either professional or friend than acquaintance.
Conclusion
Individuals traverse multiple locations in both physical
and network space. Twitter captures information about
these interactions and movements from which researchers
can infer attributes of individuals and their social
relationships. In this essay, we considered three examples
that are relevant to a broad range of academic and
practical applications. These examples highlight the
paradox of highly active users, those who generate
most Twitter content. Because they generate more
data points with which to measure their behavior,
highly active users are less vulnerable to random measurement
error, yet they are more vulnerable to systematic
misclassification when researchers make naive
assumptions about the distribution of user activity.
The paradox of highly active users can be addressed
by developing methods that handle the complexities
and multidimensionality of social life represented in
the data. The need to do so will intensify as an increasing
proportion of the population establishes their
online and social media presence with more complete
pictures of their lives painted in digital form.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest
with respect to the research, authorship, and/or publication
of this article.
Funding
The author(s) disclosed receipt of the following financial support
for the research, authorship, and/or publication of this
article: The authors would like to acknowledge the receipt of
research funding from MINERVA Initiative, Department of
Defense, National Science Foundation (SES-1357488, SES-1434164
and SES-1226483) and National Research
Foundation of Korea (NRF-2013S1A3A2055285).
References
Broniatowski DA, Paul MJ and Dredze M (2013) National
and local influenza surveillance through Twitter: An analysis
of the 2012–2013 influenza epidemic. PLoS ONE
8(12): e83672.
Compton R, Jurgens D and Allen D (2014) Geotagging one
hundred million twitter accounts with total variation minimization.
In: 3rd International Congress on Big Data,
October, IEEE BigData 2014, Washington, DC, pp.
393–401.
Eagle N, Pentland A and Lazer D (2009) Inferring friendship
network structure by using mobile phone data.
Proceedings of the National Academy of Sciences 106(36):
15274–15278.
Golder S and Macy M (2011) Diurnal and seasonal
mood vary with work, sleep, and daylength across diverse
cultures. Science 333(6051): 1878–1881.
Murphy D (2014) 44 percent of Twitter accounts have never
tweeted. PC Magazine.
Park P, Compton R and Lu T-C (2015) Network-based group
account classification. In: Agarwal N, et al. (eds) Lecture
notes in computer science: Social computing, behavioral-cultural
modeling and prediction, pp. 164–172.
Saramäki JEA, Leicht EL, Roberts SGB, et al. (2014)
Persistence of social signatures in human communication.
Proceedings of the National Academy of Sciences 111(3):
942–947.
Tavares G and Faisal A (2013) Scaling-laws of human broadcast
communication enable distinction between human,
corporate and robot Twitter users. PLoS ONE 8(7): e65774.
Toole J, Herra-Yague C, Schneider C, et al. (2015) Coupling
human mobility and social ties. arXiv:1502.00690.
Twitter (2015) Twitter Q1 2015 earnings report.
Wu S, Hofman J, Mason W, et al. (2011) Who says what to
whom on Twitter. In: Proceedings of the 20th international
conference on World Wide Web (WWW 2011).
This article is part of a special theme on Colloquium: Assumptions of Sociality. To see a full list of all
articles in this special theme, please click here: http://bds.sagepub.com/content/colloquium-assumptions-sociality.
4Big Data & Society
by guest on September 29, 2016Downloaded from
... User activity is measured as the sum of tweets, retweets, and replies posted by a user account. Active user accounts publish the largest amount of Twitter content and primarily drive discussions on a given topic due to the volume of tweets they publish (Park and Macy, 2015). Visibility is measured as the number of retweets and replies received by a user account and is an indicator of an account's ability to reach and attract other accounts (Chae, 2014). ...
Article
Purpose Automated social media messaging tactics can undermine trust in health institutions and public health advice. As such, we examine automated software programs (ASPs) and social bots in the Twitter anti-vaccine discourse before and after the release of COVID-19 vaccines. Design/methodology/approach We compare two Twitter datasets comprising user accounts and associated English-language tweets featuring the keywords “#antivaxx” or “anti-vaxx.” The first dataset, from 2018 (pre-COVID vaccine), includes 3,154 user accounts and 6,380 tweets. The second comprises 327,067 accounts and 545,268 tweets published during the 12 months following December 1, 2020 (post-COVID vaccine). Using Information Laundering Theory (ILT), the datasets were examined manually and through user analytics and machine learning to identify activity, visibility, verification status, vaccine position, and ASP or bot technology use. Findings The post-COVID vaccine dataset showed an increase in highly probable bot accounts (31.09%) and anti-vaccine accounts. However, both datasets were dominated by pro-vaccine accounts; most highly active (59%) and highly visible (50%) accounts classified as probable bots were pro-vaccine. Originality/value This research is the first to compare bot behaviors in the “#antivaxx” discourse before and after the release of COVID-19 vaccines. The prevalence of mostly benevolent probable bot accounts suggests a potential overstatement of the threat posed by anti-vaccine accounts using ASPs or bot technologies. By highlighting bots as intermediaries that disseminate both pro- and anti-vaccine content, we extend ILT by identifying a benevolent variant and offering insights into bots as “pathways” to generating mainstream information.
... Some social media activities may be especially useful to assess as betweenperson differences, while others would be more suitable to study as within-person or person-specific effects or may warrant a focus on a specific subset of users (e.g., those who post frequently). Most importantly, across a variety of activities only a few participants seem highly active while the majority are "likers" and "lurkers" (Park & Macy, 2015;Van Driel et al., 2019). Moreover, the DDPs showed that levels of activity may vary greatly over time, between and within individual users. ...
Article
Full-text available
Studies assessing the effects of social media use are largely based on measures of time spent on social media. In recent years, scholars increasingly ask for more insights in social media activities and content people engage with. Data Download Packages (DDPs), the archives of social media platforms that each European user has the right to download, provide a new and promising method to collect timestamped and content-based information about social media use. In this paper, we first detail the experiences and insights of a data collection of 110 Instagram DDPs gathered from 102 adolescents. We successively discuss the challenges and opportunities of collecting and analyzing DDPs to help future researchers in their consideration of whether and how to use DDPs. DDPs provide tremendous opportunities to get insight in the frequency, range, and content of social media activities, from browsing to searching and posting. Yet, collecting, processing, and analyzing DDPs is also complex and laborious, and demands numerous procedural and analytical choices and decisions. © 2022 The Author(s). Published with license by Taylor & Francis Group, LLC.
... Researchers can input key search terms or hashtags into the platform and retrieve specific data on topics. Subsequently, this large pool of data has been used extensively to research human behaviour on a global scale (Park & Macy, 2015 Limited studies exist using Twitter to understand SRC. Sullivan et al. (2012) used a prospective observational study design to examine SRC content on Twitter over a seven-day period in July 2010. ...
Article
Full-text available
The purpose of this study was to explore the cycling community’s online interactions with sports-related concussion within competitive cycling. Through an analysis of twitter data (n=196), this study examined the limited discourse related to the problem of concussion in cycling. The results found overall engagement and awareness of concussion in cycling was low but has been increasing year on year from 2008 to 2019. Thematic analysis of the data found three main themes within the online cycling community on Twitter: 1) Increasing awareness of concussion as a problem for the sport 2) A narrative of apathy in policy by governing bodies and 3) The need for better education as a result of misperceptions of concussion. Overall, these findings contribute to the limited research in the field of concussion in competitive cycling and outline the utility of social media as a platform to disseminate educational resources around the safe management of concussion in the sport.
... The other limitation of big data analytics is that despite its size, big data can still be biased; the agents, applications, and devices producing and collecting the data can themselves be either selective or manipulated. This points to the paradox that despite its name, big data is likely to be either "small," representing only a subset of social transactions among particular demographics and thereby capturing partial and/or fragmented information (McFarland & McFarland, 2015;O'Brien, 2016;Park & Macy, 2015;Shaw, 2015); or "artifactual," whereby social forces, including censorship, political robots, and system error manipulate the process of information production, leading to the proliferation of artifacts, errors, and anomalies (see Lazer & Radford, 2017). ...
... The other limitation of big data analytics is that despite its size, big data can still be biased; the agents, applications, and devices producing and collecting the data can themselves be either selective or manipulated. This points to the paradox that despite its name, big data is likely to be either "small," representing only a subset of social transactions among particular demographics and thereby capturing partial and/or fragmented information (McFarland & McFarland, 2015;O'Brien, 2016;Park & Macy, 2015;Shaw, 2015); or "artifactual," whereby social forces, including censorship, political robots, and system error manipulate the process of information production, leading to the proliferation of artifacts, errors, and anomalies (see Lazer & Radford, 2017). ...
... The other limitation of big data analytics is that despite its size, big data can still be biased; the agents, applications, and devices producing and collecting the data can themselves be either selective or manipulated. This points to the paradox that despite its name, big data is likely to be either "small," representing only a subset of social transactions among particular demographics and thereby capturing partial and/or fragmented information (McFarland & McFarland, 2015;O'Brien, 2016;Park & Macy, 2015;Shaw, 2015); or "artifactual," whereby social forces, including censorship, political robots, and system error manipulate the process of information production, leading to the proliferation of artifacts, errors, and anomalies (see Lazer & Radford, 2017). ...
Book
Chen, He and Yan present a range of applications of multiple-source big data to core areas of contemporary sociology, demonstrating how a theory-guided approach to macrosociology can help to understand social change in China, especially where traditional approaches are limited by constrained and biased data. In each chapter of the book, the authors highlight an application of theory-guided macrosociology that has the potential to reinvigorate an ambitious, open-minded and bold approach to sociological research. These include social stratification, social networks, medical care, and online behaviours among many others. This research approach focuses on macro-level social process and phenomena by using quantitative models to statistically test for associations and causalities suggested by a clearly hypothesised social theory. By deploying theory-oriented macrosociology where it can best assure macro-level robustness and reliability, big data applications can be more relevant to and guided by social theory. An essential read for sociologists with an interest in quantitative and macro-scale research methods, which also provides fascinating insights into Chinese society as a demonstration of the utility of its methodology.
... (2020), which found that PSMU explained 81% of the variance in total SMU, whereas ASMU explained 49% of this variance. It is also consistent with a study by Park and Macy (2015), who found that most Twitter content is produced by a small percentage of highly active users while most of the remaining users are passive observers. ...
Preprint
Full-text available
A recurring claim in the literature is that active social media use (ASMU) leads to increases in well-being, whereas passive social media use (PSMU) leads to decreases in well-being. The aim of this review was to investigate the validity of this claim by comparing the operationalizations and results of studies into the association of ASMU and PSMU with well-being (e.g., happiness) and ill-being (e.g., depression). We found 40 survey-based studies, which utilized a hodgepodge of 36 operationalizations of ASMU and PSMU and which yielded 172 associations of ASMU and/or PSMU with well-/ill-being. Most studies did not support the hypothesized associations of ASMU and PSMU with well-/ill-being. Time spent on ASMU and PSMU may be too coarse to lead to meaningful associations with well-/ill-being. Therefore, future studies should take characteristics of the content of social media (e.g., the valence), its senders (e.g., pre-existing mood), and its receivers (e.g., differential susceptibility) into account.
... Big data can be generated by natural actors, physical phenomena, and artificial actors (Zwitter, 2014). Natural actors are not necessarily individuals, an account can hide a collective (Park and Macy, 2015), and individuals can have multiple accounts. As a result, non-random errors are constantly embedded in data. ...
Article
Full-text available
Starting from an analysis of frequently employed definitions of big data, it will be argued that, to overcome the intrinsic weaknesses of big data, it is more appropriate to define the object in relational terms. The excessive emphasis on volume and technological aspects of big data, derived from their current definitions, combined with neglected epistemological issues gave birth to an objectivistic rhetoric surrounding big data as implicitly neutral, omni-comprehensive, and theory-free. This rhetoric contradicts the empirical reality that embraces big data: (1) data collection is not neutral nor objective; (2) exhaustivity is a mathematical limit; and (3) interpretation and knowledge production remain both theoretically informed and subjective. Addressing these issues, big data will be interpreted as a methodological revolution carried over by evolutionary processes in technology and epistemology. By distinguishing between forms of nominal and actual access, we claim that big data promoted a new digital divide changing stakeholders, gatekeepers, and the basic rules of knowledge discovery by radically shaping the power dynamics involved in the processes of production and analysis of data.
Article
Full-text available
Zusammenfassung Während der Coronapandemie haben sich die ohnehin schon von Personal- und Zeitmangel geprägten Arbeitsbedingungen für Pflegekräfte weiter verschärft und es hat sich ein Diskurs über moralische Verletzungen entfacht. In diesem Beitrag untersuchen wir, wie solche Erfahrungen artikuliert werden. Dazu werten wir Twitter-Daten zum Thema ,moralischverletzt‘ mittels einer qualitativen Inhaltsanalyse aus. Unsere Ergebnisse zeigen, dass Konflikte zwischen einem ethischen (Berufs-)Anspruch und dem praktischen Berufsalltag zu einem Gefühl moralischer Verletzung führen, das über materielle Bedingungen hinausgeht. Es geht nicht nur um Geld- oder Zeitmangel, sondern auch um Brüche in Normen und sozialen Reziprozitätsgefügen. Die Verletzungswahrnehmung bezieht sich auf den auf einem Professionalitätsanspruch beruhenden Leistungswert, gute Pflege leisten zu wollen, doch dies aufgrund von Zeitdruck, Ressourcenmangel oder strukturellen Hindernissen nicht umsetzen zu können.
Article
Full-text available
In the era of digitization and Open Access, article-level metrics are increasingly employed to distinguish influential research works and adjust research management strategies. Tagging individual articles with digital object identifiers allows exposing them to numerous channels of scholarly communication and quantifying related activities. The aim of this article was to overview currently available article-level metrics and highlight their advantages and limitations. Article views and downloads, citations, and social media metrics are increasingly employed by publishers to move away from the dominance and inappropriate use of journal metrics. Quantitative article metrics are complementary to one another and often require qualitative expert evaluations. Expert evaluations may help to avoid manipulations with indiscriminate social media activities that artificially boost altmetrics. Values of article metrics should be interpreted in view of confounders such as patterns of citation and social media activities across countries and academic disciplines.
Article
Full-text available
Studies using massive, passively collected data from communication technologies have revealed many ubiquitous aspects of social networks, helping us understand and model social media, information diffusion and organizational dynamics. More recently, these data have come tagged with geographical information, enabling studies of human mobility patterns and the science of cities. We combine these two pursuits and uncover reproducible mobility patterns among social contacts. First, we introduce measures of mobility similarity and predictability and measure them for populations of users in three large urban areas. We find individuals' visitations patterns are far more similar to and predictable by social contacts than strangers and that these measures are positively correlated with tie strength. Unsupervised clustering of hourly variations in mobility similarity identifies three categories of social ties and suggests geography is an important feature to contextualize social relationships. We find that the composition of a user's ego network in terms of the type of contacts they keep is correlated with mobility behaviour. Finally, we extend a popular mobility model to include movement choices based on social contacts and compare its ability to reproduce empirical measurements with two additional models of mobility. © 2015 The Author(s) Published by the Royal Society. All rights reserved.
Article
Full-text available
Significance We combine cell phone data with survey responses to show that a person’s social signature, as we call the pattern of their interactions with different friends and family members, is remarkably robust. People focus a high proportion of their communication efforts on a small number of individuals, and this behavior persists even when there are changes in the identity of the individuals involved. Although social signatures vary between individuals, a given individual appears to retain a specific social signature over time. Our results are likely to reflect limitations in the ability of humans to maintain many emotionally close relationships, both because of limited time and because the emotional “capital” that individuals can allocate between family members and friends is finite.
Article
Social media have been proposed as a data source for influenza surveillance because they have the potential to offer real-time access to millions of short, geographically localized messages containing information regarding personal well-being. However, accuracy of social media surveillance systems declines with media attention because media attention increases "chatter" - messages that are about influenza but that do not pertain to an actual infection - masking signs of true influenza prevalence. This paper summarizes our recently developed influenza infection detection algorithm that automatically distinguishes relevant tweets from other chatter, and we describe our current influenza surveillance system which was actively deployed during the full 2012-2013 influenza season. Our objective was to analyze the performance of this system during the most recent 2012-2013 influenza season and to analyze the performance at multiple levels of geographic granularity, unlike past studies that focused on national or regional surveillance. Our system's influenza prevalence estimates were strongly correlated with surveillance data from the Centers for Disease Control and Prevention for the United States (r = 0.93, p < 0.001) as well as surveillance data from the Department of Health and Mental Hygiene of New York City (r = 0.88, p < 0.001). Our system detected the weekly change in direction (increasing or decreasing) of influenza prevalence with 85% accuracy, a nearly twofold increase over a simpler model, demonstrating the utility of explicitly distinguishing infection tweets from other chatter.
Article
Human behaviour is highly individual by nature, yet statistical structures are emerging which seem to govern the actions of human beings collectively. Here we search for universal statistical laws dictating the timing of human actions in communication decisions. We focus on the distribution of the time interval between messages in human broadcast communication, as documented in Twitter, and study a collection of over 160,000 tweets for three user categories: personal (controlled by one person), managed (typically PR agency controlled) and bot-controlled (automated system). To test our hypothesis, we investigate whether it is possible to differentiate between user types based on tweet timing behaviour, independently of the content in messages. For this purpose, we developed a system to process a large number of tweets for reality mining and implemented two simple probabilistic inference algorithms: (1) a naive Bayes classifier, which distinguishes between two and three account categories with classification performance of 84.6% and 75.8%, respectively, and (2) a prediction algorithm to estimate the time of a user's next tweet with an [Formula: see text]. Our results show that we can reliably distinguish between the three user categories as well as predict the distribution of a user's inter-message time with reasonable accuracy. More importantly, we identify a characteristic power-law decrease in the tail of inter-message time distribution by human users which is different from that obtained for managed and automated accounts. This result is evidence of a universal law that permeates the timing of human decisions in broadcast communication and extends the findings of several previous studies of peer-to-peer communication.
Conference Paper
We study several longstanding questions in media communications research, in the context of the microblogging service Twitter, regarding the production, flow, and consumption of information. To do so, we exploit a recently introduced feature of Twitter known as "lists" to distinguish between elite users - by which we mean celebrities, bloggers, and representatives of media outlets and other formal organizations - and ordinary users. Based on this classification, we find a striking concentration of attention on Twitter, in that roughly 50% of URLs consumed are generated by just 20K elite users, where the media produces the most information, but celebrities are the most followed. We also find significant homophily within categories: celebrities listen to celebrities, while bloggers listen to bloggers, etc.; however, bloggers in general rebroadcast more information than the other categories. Next, we re-examine the classical "two-step flow" theory of communications, finding considerable support for it on Twitter. Third, we find that URLs broadcast by different categories of users or containing different types of content exhibit systematically different lifespans. And finally, we examine the attention paid by the different user categories to different news topics.
Article
Data collected from mobile phones have the potential to provide insight into the relational dynamics of individuals. This paper compares observational data from mobile phones with standard self-report survey data. We find that the information from these two data sources is overlapping but distinct. For example, self-reports of physical proximity deviate from mobile phone records depending on the recency and salience of the interactions. We also demonstrate that it is possible to accurately infer 95% of friendships based on the observational data alone, where friend dyads demonstrate distinctive temporal and spatial patterns in their physical proximity and calling patterns. These behavioral patterns, in turn, allow the prediction of individual-level outcomes such as job satisfaction.
Conference Paper
We propose a classification method for group vs. individual accounts on Twitter, based solely on communication network characteristics. While such a language-agnostic, network-based approach has been used in the past, this paper motivates the task from firmly established theories of human interactional constraints from cognitive science to sociology. Time, cognitive, and social role constraints limit the extent to which individuals can maintain social ties. These constraints are expressed in observable network metrics at the node (i.e. account) level which we identify and exploit for inferring group accounts.
Article
Geographically annotated social media is extremely valuable for modern information retrieval. However, when researchers can only access publicly-visible data, one quickly finds that social media users rarely publish location information. In this work, we provide a method which can geolocate the overwhelming majority of active Twitter users, independent of their location sharing preferences, using only publicly-visible Twitter data. Our method infers an unknown user's location by examining their friends' locations. We frame the geotagging problem as an optimization over a social network with a total variation-based objective and provide a scalable and distributed algorithm for its solution. Furthermore, we show how a robust estimate of the geographic dispersion of each user's ego network can be used as a per-user accuracy measure, allowing us to discard poor location inferences and control the overall error of our approach. Leave-many-out evaluation shows that our method is able to infer location for 101,846,236 Twitter users at a median error of 6.33 km, allowing us to geotag roughly 89% of public tweets.
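The idea of inferring a user's location from friends' locations, with a dispersion-based quality check, can be illustrated with a minimal sketch. This is not the paper's total variation solver: it assumes a simplified stand-in in which the estimate is the friend location that minimizes the summed great-circle distance to all other friends (a medoid), and dispersion is the median distance from that estimate. The names `haversine_km` and `infer_location` are hypothetical.

```python
import math
import statistics

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def infer_location(friend_locations):
    """Return (estimate, dispersion_km) from friends' known (lat, lon) points.

    Assumption: the estimate is the medoid of the friends' locations, a
    simple robust proxy that one distant friend cannot drag far away.
    Returns (None, None) when no friend has a known location.
    """
    if not friend_locations:
        return None, None
    estimate = min(
        friend_locations,
        key=lambda c: sum(haversine_km(c, f) for f in friend_locations),
    )
    # Median distance from the estimate: large values flag unreliable guesses.
    dispersion_km = statistics.median(
        haversine_km(estimate, f) for f in friend_locations
    )
    return estimate, dispersion_km

# Three friends in New York City and one in Los Angeles.
friends = [(40.71, -74.00), (40.73, -73.99), (40.75, -73.98), (34.05, -118.24)]
estimate, dispersion_km = infer_location(friends)
```

An estimate whose dispersion exceeds some chosen threshold could be discarded, which mirrors the abstract's use of ego-network dispersion as a per-user accuracy measure.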
Article
We identified individual-level diurnal and seasonal mood rhythms in cultures across the globe, using data from millions of public Twitter messages. We found that individuals awaken in a good mood that deteriorates as the day progresses (which is consistent with the effects of sleep and circadian rhythm) and that seasonal change in baseline positive affect varies with change in daylength. People are happier on weekends, but the morning peak in positive affect is delayed by 2 hours, which suggests that people awaken later on weekends.
Network-based group account classification Lecture notes in computer science: Social computing, behavioral-cultural modeling and prediction
  • P Park
  • R Compton
  • T-C Lu