Commentary
The paradox of active users
Patrick Park and Michael Macy
Introduction
Twitter users vary widely in their level of activity.
Active Twitter users not only tweet more often than
others but also tend to mention other users more
frequently, and the set of time-stamped global
positioning system (GPS) locations in their tweets is
more complete. The time, location, and connection
data along with the text content in the active users’
tweets render a more complete picture of their social
connections, preferences, attitudes, interests, and
spatial mobility.
However, the activity levels of users are heavily
skewed; most Twitter content is produced by a small
fraction of highly active users (measured by the number
of followers, tweets, retweets, and mentions), while the
vast majority of registered users are passive observers.
For example, there were only 302 million monthly
active users in the first quarter of 2015 among approxi-
mately one billion registered users (Twitter, 2015).
According to Twopcharts, a company monitoring
Twitter activity, 43% of the 550 million Twitter
accounts that had at least one tweet did not create a
single tweet in the previous year (Murphy, 2014).
In terms of content, 50% of the URLs consumed on
Twitter were generated by 0.05% of all users in 2011
(Wu et al., 2011).
This skewed distribution of overall activity poses
challenges for inferring unobserved user characteristics
(e.g. the representative geographic location of each
user). In particular, the methods devised to infer user
characteristics rely on and leverage central tendencies
in the data, treating highly active users as outliers
or aberrations. In effect, the users who emit the most
information and who are expected, in principle, to
be summarized and categorized most accurately, are
paradoxically the ones whose characteristics tend to
be discounted or misclassified.
In this study, we offer concrete examples to illustrate
this paradox and discuss how methods that do not
properly deal with the active users can increase
classification error and ultimately distort our under-
standing of online social relationships at the micro
level and structural properties of the communication
network at the macro level.
Geo-location inference
Inferring the representative geographic location of
social media users is an actively developing area of
research with wide applicability in both applied and
basic research endeavors that use social media data.
Social media-based studies of early detection and
prediction, such as seasonal flu surveillance or forecasts
of commercial movie success, require knowledge of
users’ locations. Research in our lab also depends on
inference of user location, as in the study of diurnal
rhythms of affect using Twitter data (Golder and
Macy, 2011) or our on-going cross-national compara-
tive analysis of communication networks.
The state-of-the-art location inference method based
on label propagation has been shown to perform with
high coverage (90% of users geotagged) and accuracy
(median error of 6.38 km) (Compton et al., 2014). This
method relies on the fact that the majority of commu-
nication partners, or network neighbors, in the Twitter
@user mention network are geographically proximate.
If a user’s Twitter neighbors turn on their GPS while
tweeting, it is possible to estimate the focal user’s lati-
tude and longitude based on the distribution of those
neighbors’ locations. This estimated location could
then be used to further estimate the unknown locations
of the focal user’s other network neighbors who did not
enable GPS tracking in their tweets, and so on. In tech-
nical terms, the label-propagation algorithm attempts
to infer a candidate location of a given user based on
one of her neighbors’ locations that minimizes the sum
of distances to the rest of her neighbors’ locations
(i.e. L1-median location). To ensure robust results,
the algorithm incorporates a dispersion threshold
(e.g. 100 km) where the algorithm accepts the candidate
L1-median location as its final estimate only if the
median distance from that candidate location to all
the other known and inferred locations of the network
neighbors does not exceed the dispersion threshold. If
the candidate location satisfies this condition, the algo-
rithm assigns that location to the focal user as its best
estimate and propagates it to estimate other network
neighbors’ locations. For example, if a user’s candidate
location inferred by the algorithm happens to be in
New York, but her network neighbors are geographic-
ally concentrated in both New York and San
Francisco, such that the median of the distances from
the candidate location in New York to all of her neigh-
bors’ locations exceeds 100 km, then the algorithm does
not assign that candidate location in New York as the
estimated location but instead classifies the user’s loca-
tion as unidentifiable.
Cornell University, Ithaca, NY, USA
Corresponding author:
Patrick Park, 323 Uris Hall, Cornell University, Ithaca, NY 14853, USA.
Email: pp286@cornell.edu
Big Data & Society
July–December 2015: 1–4
© The Author(s) 2015
Reprints and permissions:
sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/2053951715606164
bds.sagepub.com
Creative Commons Non Commercial CC-BY-NC: This article is distributed under the terms of the Creative Commons Attribution-
NonCommercial 3.0 License (http://www.creativecommons.org/licenses/by-nc/3.0/) which permits non-commercial use, reproduction
and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages
(https://us.sagepub.com/en-us/nam/open-access-at-sage).
Despite the impressive performance in both coverage
and accuracy of this new method, we find that predic-
tion error is a U-shaped function of both user activity
level, measured as tweets per day (Compton et al.,
2014), and network degree. Furthermore, the candidate
locations inferred for the highly active users tend to
exceed the dispersion threshold, diminishing the pro-
portion of inferable cases (i.e. classification coverage).
The paradox of lower coverage and lower predictive
accuracy for active, high-degree users, who provide
disproportionately large amounts of information about
themselves, stems from the assumption that there is a
single geographical location that best characterizes a
user’s location. Although this may be true in principle,
the network neighbors from whose locations this single
location is inferred are constantly moving, and the time
at which each of those neighbors coincided
with the focal user in space may be too long ago to
be relevant. The example of the user whose network
neighbors are concentrated in both New York and
San Francisco illustrates this case. This user may
work in New York but reside in San Francisco and
form geographically segregated personal and professional
networks. Alternatively, this user may have
grown up in San Francisco but moved to New York
for work. If the focal user is active on Twitter, her
work/social or past/present relationships will each be
given equal weight in the label-propagation algorithm.
For an active user who exhibits greater geographic
diversity in her network, the candidate location is less
likely to satisfy the dispersion threshold. The algorithm
might be improved by either assigning multiple plaus-
ible locations (i.e. assuming multiple representative
locations) or by adding more constraints to what con-
stitutes a representative location (e.g. initializing the
label-propagation algorithm to start from GPS loca-
tions in tweets created exclusively at night).
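The candidate selection and dispersion check discussed in this section can be illustrated with a minimal Python sketch. It is not the implementation of Compton et al. (2014): it covers a single inference step without propagation, and the function names, the haversine distance, and the example coordinates are assumptions made for illustration only.

```python
from math import radians, sin, cos, asin, sqrt
from statistics import median


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def infer_location(neighbor_locations, dispersion_km=100.0):
    """Return the L1-median neighbor location, or None if it is too dispersed.

    Hypothetical helper: one inference step only, no propagation.
    """
    if len(neighbor_locations) < 2:
        return None
    # Candidate: the neighbor location with the smallest summed distance
    # to all neighbor locations (the L1-median among observed points).
    candidate = min(
        neighbor_locations,
        key=lambda p: sum(haversine_km(p, q) for q in neighbor_locations),
    )
    # Dispersion check: median distance from the candidate to the others.
    idx = neighbor_locations.index(candidate)
    others = neighbor_locations[:idx] + neighbor_locations[idx + 1:]
    if median(haversine_km(candidate, q) for q in others) > dispersion_km:
        return None  # location classified as unidentifiable
    return candidate


# Illustrative coordinates (assumed): three neighbors in New York,
# three in San Francisco.
new_york = [(40.71, -74.00), (40.73, -73.99), (40.75, -73.98)]
san_francisco = [(37.77, -122.42), (37.78, -122.41), (37.76, -122.43)]

local_user = infer_location(new_york)
bicoastal_user = infer_location(new_york + san_francisco)
```

Under these assumed inputs, the user whose neighbors all sit in New York receives a location, while the user with neighbors split across both cities fails the 100 km dispersion check and is left unclassified, even though twice as much information is available about her.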
Individual vs. group account classification
A broad range of prediction applications (Broniatowski
et al., 2013), behavioral models, and social network
analyses that use social media data from hundreds of
millions of users rest on implicit assumptions
about those users. An important source of heterogeneity
of the users that is often neglected is whether a user is
an individual or a group account (e.g. a company’s offi-
cial Twitter account). Often, multiple individuals such
as a PR team manage a single group account, leaving
quite different behavioral traces from individual
accounts managed by single owners. For example,
group accounts tend to possess more followers on
Twitter and the followers are arguably less related to
the group accounts. The communication ties between
group accounts and their followers are also not likely to
be ‘‘social ties’’ in the conventional sense. Furthermore,
the objectives, language, and topical interests of group
accounts differ from those of individuals.
Researchers who work primarily with traditional
survey data with accurate and well-defined sampling
frames do not have to deal with the distinction between
groups and individuals. However, computational social
scientists and data scientists who use social media data
inevitably face this problem. Failure to correctly clas-
sify and filter out group accounts (or individual
accounts, depending on the research objective) could
lead to misleading characterizations and conclusions.
Imagine a network analysis that does not properly
filter out group accounts that tend to have higher con-
nectivity than individuals. Virtually all network met-
rics, from clustering and degree distribution to mean
geodesic, will be affected by the presence of group
accounts in the data.
The method we developed to distinguish group
and individual accounts is based on the cognitive
constraints of individuals in forming and maintaining
communication ties, which arguably apply to a lesser
extent to group accounts managed by multiple individ-
uals (Park et al., 2015). These constraints are captured,
for example, in the ratio of in-degree to out-degree of
each node’s immediate neighbors in the communication
network as well as in the level of concentration of com-
munication volume across one’s network neighbors
(Saramäki et al., 2014).
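To make the two kinds of network features concrete, the following Python sketch computes a neighbor degree ratio and a concentration score from a directed mention network. It is a hedged illustration rather than the classifier of Park et al. (2015): the input format, the function names, and the use of a Herfindahl index as the concentration measure are all assumptions.

```python
from collections import defaultdict


def degree_tables(mentions):
    """Build in- and out-neighbor sets from {(sender, receiver): count}."""
    in_nbrs, out_nbrs = defaultdict(set), defaultdict(set)
    for sender, receiver in mentions:
        out_nbrs[sender].add(receiver)
        in_nbrs[receiver].add(sender)
    return in_nbrs, out_nbrs


def neighbor_degree_ratio(user, mentions):
    """Mean in-degree/out-degree ratio over the users that `user` mentions.

    Neighbors with zero out-degree are skipped to avoid division by zero.
    """
    in_nbrs, out_nbrs = degree_tables(mentions)
    ratios = [
        len(in_nbrs[n]) / len(out_nbrs[n])
        for n in out_nbrs[user]
        if out_nbrs[n]
    ]
    return sum(ratios) / len(ratios) if ratios else None


def volume_concentration(user, mentions):
    """Herfindahl index of outgoing mention volume (assumed measure).

    Equals 1/k when volume is spread evenly over k neighbors and 1.0 when
    all mentions go to a single neighbor.
    """
    volumes = [count for (sender, _), count in mentions.items() if sender == user]
    total = sum(volumes)
    if total == 0:
        return None
    return sum((v / total) ** 2 for v in volumes)


# Hypothetical mention counts: user "a" concentrates on "b".
mentions = {
    ("a", "b"): 8, ("a", "c"): 1, ("a", "d"): 1,
    ("b", "a"): 2, ("c", "a"): 1,
}
ratio_a = neighbor_degree_ratio("a", mentions)
concentration_a = volume_concentration("a", mentions)
```

The intuition is that an individual with limited attention tends to concentrate communication on a few partners, whereas an account run by a team can spread volume more evenly. The same features also explain the risk discussed in this section, since a highly active individual can produce values that resemble those of a group.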
This method shares with the methods for geo-
location estimation the problem that highly active indi-
viduals may be misclassified due to their behavioral and
structural similarities to groups. Mistakenly filtering
out central individuals (e.g. opinion leaders) would make
the network appear less clustered and more
fragmented, with a longer mean geodesic.
A potential solution is to leverage the temporal
constraints on tweeting which individuals, but not
organizations, tend to exhibit (Tavares and Faisal,
2013). Each user’s inter-tweet delays and the temporal
distribution of tweets throughout the day can be used
to enhance the overall discriminatory power between
organizations and active individuals whose networks
look similar.
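The two temporal signals mentioned here can be derived from tweet timestamps alone. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, timestamps are assumed to be already converted to the user's local time, and no classification rule is implied.

```python
from datetime import datetime


def inter_tweet_delays(timestamps):
    """Seconds elapsed between consecutive tweets, in chronological order."""
    ordered = sorted(timestamps)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]


def hourly_profile(timestamps):
    """Share of tweets falling in each of the 24 hours of the day."""
    counts = [0] * 24
    for ts in timestamps:
        counts[ts.hour] += 1
    total = len(timestamps)
    return [c / total for c in counts] if total else [0.0] * 24


# Hypothetical timestamps for one account (assumed local time).
tweets = [
    datetime(2015, 1, 1, 9, 0),
    datetime(2015, 1, 1, 9, 30),
    datetime(2015, 1, 1, 13, 0),
]
delays = inter_tweet_delays(tweets)
profile = hourly_profile(tweets)
```

An hourly profile with long empty stretches is what one would expect from a single person who sleeps, which is the kind of constraint that could help separate an active individual from an organization with a similar-looking network.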
Social vs. coworker vs. acquaintance tie classification
In the absence of respondent surveys or ethnographic
observation, the nature of a communication tie
(e.g. professional vs. friendship vs. acquaintance)
could be inferred from time-location records of
mobile phone logs (Eagle et al., 2009; Toole et al.,
2015). The intuition behind this method is that
coworkers or professionals tend to be co-located
during work hours (e.g. in the same office building on
a weekday afternoon) while friends are more likely to
be co-located during off-work hours (e.g. in a bar on a
Friday evening). Acquaintances are likely to have few
co-locations regardless of time. This approach, which
leverages time-location similarity (using cosine similar-
ity between hourly location occurrence vectors of two
individuals), yields accurate and convincing results with
mobile phone data that contain detailed time-location
records at regular time intervals for each individual
(i.e. whenever a mobile device communicates with a
cell-phone tower).
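The time-location similarity described above can be sketched in a few lines of Python. This is an illustrative reading of the approach, not the procedure of the cited studies: the record format, the location labels, the 9:00 to 18:00 work window, and the function names are assumptions.

```python
from collections import Counter
from math import sqrt


def occurrence_vector(records, hours=None):
    """Sparse count vector keyed by (hour, location).

    `records` is an iterable of (hour_of_day, location_id) pairs; `hours`
    optionally restricts the vector to a subset of hours.
    """
    return Counter((h, loc) for h, loc in records if hours is None or h in hours)


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * v[key] for key, count in u.items())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


WORK_HOURS = range(9, 18)                      # assumed work window
OFF_HOURS = [h for h in range(24) if h not in WORK_HOURS]

# Hypothetical records: the pair shares an office but not an evening venue.
person_a = [(10, "office"), (11, "office"), (20, "bar")]
person_b = [(10, "office"), (11, "office"), (20, "home")]

overall = cosine_similarity(occurrence_vector(person_a), occurrence_vector(person_b))
work = cosine_similarity(occurrence_vector(person_a, WORK_HOURS),
                         occurrence_vector(person_b, WORK_HOURS))
off = cosine_similarity(occurrence_vector(person_a, OFF_HOURS),
                        occurrence_vector(person_b, OFF_HOURS))
```

In this toy case the pair looks like coworkers, with high similarity during the assumed work window and none outside it. The measure depends on how many records each person contributes, so a user with very few observations yields a sparse vector and low similarity with almost everyone.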
Nevertheless, blindly applying this method to
Twitter users with GPS tweet data could potentially
lead to biased results. Again, the active users will
appear to have high mobility with hundreds of GPS
locations captured in the data whereas the low activity
users may appear relatively immobile. Therefore, a tie
involving an active user is more likely to be classified as
either professional or friend than acquaintance.
Conclusion
Individuals traverse multiple locations in both physical
and network space. Twitter captures information about
these interactions and movements from which research-
ers can infer attributes of individuals and their social
relationships. In this essay, we considered three exam-
ples that are relevant to a broad range of academic and
practical applications. These examples highlight the
paradox of highly active users—those who generate
most Twitter content. Because they generate more
data points with which to measure their behavior,
highly active users are less vulnerable to random measurement
error, yet they are more vulnerable to systematic
misclassification when researchers make naive
assumptions about the distribution of user activity.
The paradox of highly active users can be addressed
by developing methods that handle the complexities
and multidimensionality of social life represented in
the data. The need to do so will intensify as an increas-
ing proportion of the population establishes their
online and social media presence with more complete
pictures of their lives painted in digital form.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest
with respect to the research, authorship, and/or publi-
cation of this article.
Funding
The author(s) disclosed receipt of the following financial sup-
port for the research, authorship, and/or publication of this
article: The authors would like to acknowledge the receipt of
research funding from MINERVA Initiative, Department of
Defense, National Science Foundation (SES-1357488, SES-
1434164 and SES-1226483) and National Research
Foundation of Korea (NRF-2013S1A3A2055285).
References
Broniatowski DA, Paul MJ and Dredze M (2013) National
and local influenza surveillance through Twitter: An ana-
lysis of the 2012–2013 influenza epidemic. PLoS ONE
8(12): e83672.
Compton R, Jurgens D and Allen D (2014) Geotagging one
hundred million twitter accounts with total variation mini-
mization. In: 3rd International Congress on Big Data,
October, IEEE BigData 2014, Washington, DC, pp.
393–401.
Eagle N, Pentland A and Lazer D (2009) Inferring friendship
network structure by using mobile phone data.
Proceedings of the National Academy of Sciences 106(36):
15274–15278.
Golder S and Macy M (2011) Diurnal and seasonal
mood vary with work, sleep, and daylength across diverse
cultures. Science 333(6051): 1878–1881.
Murphy D (2014) 44 percent of Twitter accounts have never
tweeted. PC Magazine.
Park P, Compton R and Lu T-C (2015) Network-based group
account classification. In: Agarwal N, et al. (eds) Lecture
notes in computer science: Social computing, behavioral-
cultural modeling and prediction, pp. 164–172.
Saramäki JEA, Leicht EL, Roberts SGB, et al. (2014)
Persistence of social signatures in human communication.
Proceedings of the National Academy of Sciences 111(3):
942–947.
Tavares G and Faisal A (2013) Scaling-laws of human broadcast
communication enable distinction between human,
corporate and robot Twitter users. PLoS ONE 8(7): e65774.
Toole J, Herrera-Yagüe C, Schneider C, et al. (2015) Coupling
human mobility and social ties. arXiv:1502.00690.
Twitter (2015) Twitter Q1 2015 earnings report.
Wu S, Hofman J, Mason W, et al. (2011) Who says what to
whom on Twitter. In: Proceedings of the 20th international
conference on World Wide Web (WWW 2011).
This article is part of a special theme on Colloquium: Assumptions of Sociality. For a full list of all
articles in this special theme, see http://bds.sagepub.com/content/colloquium-assumptions-sociality.