
Planetary-Scale Views on an Instant-Messaging Network


Abstract

We present a study of anonymized data capturing a month of high-level communication activities within the whole of the Microsoft Messenger instant-messaging system. We examine characteristics and patterns that emerge from the collective dynamics of large numbers of people, rather than the actions and characteristics of individuals. The dataset contains summary properties of 30 billion conversations among 240 million people. From the data, we construct a communication graph with 180 million nodes and 1.3 billion undirected edges, creating the largest social network constructed and analyzed to date. We report on multiple aspects of the dataset and synthesized graph. We find that the graph is well-connected and robust to node removal. We investigate on a planetary-scale the oft-cited report that people are separated by "six degrees of separation" and find that the average path length among Messenger users is 6.6. We find that people tend to communicate more with each other when they have similar age, language, and location, and that cross-gender conversations are both more frequent and of longer duration than conversations with the same gender.
arXiv:0803.0939v1 [physics.soc-ph] 6 Mar 2008
Jure Leskovec
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA, USA
Eric Horvitz
Microsoft Research
Redmond, WA, USA
Microsoft Research Technical Report
June 2007
A shorter version of this work appears in WWW ’08: Proceedings of the 16th International Conference
on World Wide Web, 2008.
This work was performed while the first author was an intern at Microsoft Research.
2Leskovec & Horvitz
1 Introduction
Large-scale web services provide unprecedented opportunities to capture and analyze behav-
ioral data on a planetary scale. We discuss findings drawn from aggregations of anonymized
data representing one month (June 2006) of high-level communication activities of people
using the Microsoft Messenger instant-messaging (IM) network. We neither had nor sought
access to the content of messages. Rather, we consider structural properties of a commu-
nication graph and study how structure and communication relate to user demographic
attributes, such as gender, age, and location. The data set provides a unique lens for study-
ing patterns of human behavior on a wide scale.
We explore a dataset of 30 billion conversations generated by 240 million distinct users
over one month. We found that approximately 90 million distinct Messenger accounts were
accessed each day and that these users produced about 1 billion conversations, with approx-
imately 7 billion exchanged messages per day. 180 million of the 240 million active accounts
had at least one conversation during the observation period. We found that 99% of the
conversations occurred between 2 people, and the rest involved more participants. To our
knowledge, our investigation represents the largest and most comprehensive study to date of
presence and communications in an IM system. A recent report (6) estimated that approx-
imately 12 billion instant messages are sent each day. Given the estimate and the growth
of IM, we estimate that we captured approximately half of the world’s IM communication
during the observation period.
We created an undirected communication network from the data where each user is
represented by a node and an edge is placed between users if they exchanged at least one
message during the month of observation. The network represents accounts that were active
during June 2006. In summary, the communication graph has 180 million nodes, representing
users who participated in at least one conversation, and 1.3 billion undirected edges among
active users, where an edge indicates that a pair of people communicated. We note that this
graph should be distinguished from a buddy graph where two people are connected if they
appear on each other’s contact lists. The buddy graph for the data contains 240 million
nodes and 9.1 billion edges. On average each account has approximately 50 buddies on a
contact list.
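The construction of the communication graph described above can be sketched as follows. This is a minimal illustration, assuming each two-person conversation is available as a hypothetical record `(user_a, user_b, messages)`; the record layout and all names are assumptions made for the example, not the format of the Messenger logs.

```python
def build_communication_graph(conversations):
    """Return (nodes, edges) of an undirected communication graph.

    `conversations` is an iterable of hypothetical (user_a, user_b, messages)
    records. An edge is added when at least one message was exchanged.
    """
    nodes = set()
    edges = set()
    for user_a, user_b, messages in conversations:
        if messages < 1 or user_a == user_b:
            continue  # no exchanged message, or a degenerate self-pair
        nodes.update((user_a, user_b))
        edges.add(frozenset((user_a, user_b)))  # unordered pair: undirected edge
    return nodes, edges


example = [("u1", "u2", 5), ("u2", "u1", 3), ("u2", "u3", 1), ("u3", "u4", 0)]
nodes, edges = build_communication_graph(example)
```

Storing each edge as an unordered pair means that repeated conversations between the same two users collapse into a single edge, which matches the definition of an edge as "exchanged at least one message during the month".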
To highlight several of our key findings, we discovered that the communication network
is well connected, with 99.9% of the nodes belonging to the largest connected component.
We evaluated the oft-cited finding by Travers and Milgram that any two people are linked
to one another on average via a chain with “6-degrees-of-separation” (17). We found that
the average shortest path length in the Messenger network is 6.6 (median 6), which is half
a link more than the path length measured in the classic study. However, we also found
that longer paths exist in the graph, with lengths up to 29. We observed that the network
is well clustered, with a clustering coefficient (19) that decays with exponent 0.37. This
decay is significantly lower than the value we had expected given prior research (11). We
found strong homophily (9, 12) among users; people have more conversations and converse for
longer durations with people who are similar to themselves. We find the strongest homophily
for the language used, followed by conversants’ geographic locations, and then age. We
found that homophily does not hold for gender; people tend to converse more frequently
and with longer durations with the opposite gender. We also examined the relation between
communication and distance, and found that the number of conversations tends to decrease
Planetary-Scale IM Network 3
with increasing geographical distance between conversants. However, communication links
spanning longer distances tend to carry more and longer conversations.
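Two of the quantities highlighted above, the share of nodes in the largest connected component and the average shortest path length, can be illustrated on a toy graph. The sketch below uses plain breadth-first search from every node; the function names and the adjacency-dictionary representation are assumptions for the example. On a graph with 180 million nodes an exhaustive computation like this would presumably be replaced by sampling of source nodes, which is also an assumption and not something stated in the text.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop distances from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def average_path_length(adjacency):
    """Mean shortest-path length over all connected pairs of distinct nodes."""
    total = 0
    pairs = 0
    for source in adjacency:
        for target, d in bfs_distances(adjacency, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


def largest_component_fraction(adjacency):
    """Fraction of nodes that belong to the largest connected component."""
    seen = set()
    largest = 0
    for node in adjacency:
        if node not in seen:
            component = set(bfs_distances(adjacency, node))
            seen |= component
            largest = max(largest, len(component))
    return largest / len(adjacency)


# Toy graph: a path 1-2-3-4 plus one isolated node.
toy = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3], 5: []}
```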
2 Instant Messaging
IM has become widely adopted in personal and business communications.
IM clients allow users fast, near-synchronous communication, placing it between synchronous
communication mediums, such as real-time voice interactions, and asynchronous communi-
cation mediums like email (18). IM users exchange short text messages with one or more
users from their list of contacts, who have to be on-line and logged into the IM system at the
time of interaction. As conversations and the messages exchanged within them are usually very
short, it has been observed that users employ informal language, loose grammar, numerous
abbreviations, and minimal punctuation (10). Contact lists are commonly referred to as
buddy lists and users on the lists are referred to as buddies.
2.1 Research on Instant Messaging
Several studies on smaller datasets are related to this work. Avrahami and Hudson (3)
explored communication characteristics of 16 IM users. Similarly, Shi et al. (13) analyzed
IM contact lists submitted by users to a public website and explored a static contact network
of 140,000 people. Recently, Xiao et al. (20) investigated IM traffic characteristics within
a large organization with 400 users of Messenger. Our study differs from the latter study
in that we analyze the full Messenger population over a one month period, capturing the
interaction of user demographic attributes, communication patterns, and network structure.
2.2 Data description
To construct the Microsoft Instant Messenger communication dataset, we combined three
different sources of data: (1) user demographic information, (2) time and user stamped
events describing the presence of a particular user, and (3) communication session logs,
where, for all participants, the number of exchanged messages and the periods of time spent
participating in sessions are recorded.
We use the terms session and conversation interchangeably to refer to an IM interaction
among two or more people. Although the Messenger system limits the number of people
communicating at the same time to 20, people can enter and leave a conversation over time,
so large sessions can be long with many different people participating. We observed some
very long sessions with more than 50 participants joining over time.
All of our data was anonymized; we had no access to personally identifiable information.
Also, we had no access to text of the messages exchanged or any other information that
could be used to uniquely identify users. We focused on analyzing high-level characteristics
and patterns that emerge from the collective dynamics of 240 million people, rather than the
actions and characteristics of individuals. The analyzed data can be split into three parts,
presence data, communication data, and user demographic information:
Presence events: These include login, logout, first ever login, add, remove and block
a buddy, add unregistered buddy (invite new user), change of status (busy, away,
be-right-back, idle, etc.). Events are user and time stamped.
Communication: For each user participating in the session, the log contains the
following tuple: session id, user id, time joined the session, time left the session,
number of messages sent, number of messages received.
User data: For each user, the following self-reported information is stored: age,
gender, location (country, ZIP), language, and IP address. We use the IP address to
decode the geographical coordinates, which we then use to position users on the globe
and to calculate distances.
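The text does not say how distances between geographical coordinates were computed. As one plausible illustration, the sketch below uses the haversine great-circle formula with a mean Earth radius; both the choice of formula and the function name are assumptions, not the authors' stated method.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

For two points on the equator that are 180 degrees of longitude apart, the function returns half the circumference of a sphere with the assumed radius.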
We gathered data for 30 days of June 2006. Each day yielded about 150 gigabytes of
compressed text logs (4.5 terabytes in total). Copying the data to a dedicated eight-processor
server with 32 gigabytes of memory took 12 hours. Our log-parsing system employed a
pipeline of four threads that parse the data in parallel, collapse the session join/leave events
into sets of conversations, and save the data in a compact compressed binary format. This
process compressed the data down to 45 gigabytes per day. Processing the data took an
additional 4 to 5 hours per day.
A special challenge was to account for missing and dropped events, and session “id
recycling” across different IM servers in a server farm. As part of this process, we closed a
session 48 hours after the last leave session event. We closed sessions automatically if only
one user was left in the conversation.
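The two session-closing rules described above can be sketched as follows. The event layout `(time, session_id, user_id, kind)` and the function name are hypothetical stand-ins for the real logs, and the handling of sessions still open at the end of the log is an additional assumption made only so that the example is complete.

```python
TIMEOUT_SECONDS = 48 * 3600  # close a session 48 hours after its last leave event


def collapse_sessions(events):
    """Collapse time-ordered (time, session_id, user_id, kind) events into conversations.

    `kind` is "join" or "leave"; times are in seconds. Returns a list of
    (session_id, start, end, participants) tuples.
    """
    open_sessions = {}
    closed = []

    def close(session_id, end_time):
        state = open_sessions.pop(session_id)
        closed.append((session_id, state["start"], end_time, frozenset(state["seen"])))

    for time, session_id, user_id, kind in events:
        state = open_sessions.get(session_id)
        # Timeout rule: a stale id is treated as recycled, so the old session is closed.
        if state and state["last_leave"] is not None and time - state["last_leave"] > TIMEOUT_SECONDS:
            close(session_id, state["last_leave"] + TIMEOUT_SECONDS)
            state = None
        if kind == "join":
            if state is None:
                state = {"start": time, "active": set(), "seen": set(), "last_leave": None}
                open_sessions[session_id] = state
            state["active"].add(user_id)
            state["seen"].add(user_id)
        elif kind == "leave" and state is not None:
            state["active"].discard(user_id)
            state["last_leave"] = time
            # Automatic close once at most one user remains in the conversation.
            if len(state["active"]) <= 1:
                close(session_id, time)
    # Assumption: sessions still open at the end of the log end at their last known time.
    for session_id in list(open_sessions):
        state = open_sessions[session_id]
        close(session_id, state["last_leave"] if state["last_leave"] is not None else state["start"])
    return closed


events = [
    (0, "s1", "a", "join"),
    (0, "s1", "b", "join"),
    (300, "s1", "a", "leave"),   # only one user left: session closes at t=300
    (400, "s1", "b", "leave"),   # arrives after the close and is ignored
    (1000, "s1", "c", "join"),   # the id "s1" is reused for a new session
    (1000, "s1", "d", "join"),
    (1000, "s1", "e", "join"),
    (1500, "s1", "c", "leave"),  # two users remain, session stays open
    (1500 + 49 * 3600, "s1", "f", "join"),  # more than 48 hours later: timeout
]
closed = collapse_sessions(events)
```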
3 Usage & population statistics
We shall first review several statistics drawn from aggregations of users and their commu-
nication activities.
3.1 Levels of activity
Over the observation period, 242,720,596 users logged into Messenger and 179,792,538 of
these users were actively engaged in conversations by sending or receiving at least one IM
message. Over the month of observation, 17,510,905 new accounts were activated. As
a representative day, on June 1 2006, there were almost 1 billion (982,005,323) different
sessions (conversations among any number of people), with more than 7 billion IM messages
sent. Approximately 93 million users logged in with 64 million different users becoming
engaged in conversations on that day. Approximately 1.5 million new users that were not
registered within Microsoft Messenger were invited to join on that particular day.
We consider event distributions on a per-user basis in Figure 1. The number of logins
per user, displayed in Figure 1(a), follows a heavy-tailed distribution with exponent 3.6. We
note spikes in logins at 20 minute and 15 second intervals, which correspond to an auto-login
function of the IM client. As shown in Figure 1(b), many users fill up their contact lists
rather quickly. The spike at 600 buddies undoubtedly reflects the maximal allowed length
of contact lists.
Figure 2(a) displays the number of users per session. In Messenger, multiple people
can participate in conversations. We observe a peak at 20 users, the limit on the number
of people who can participate simultaneously in a conversation. Figure 2(b) shows the
distribution over the session durations, which can be modeled by a power-law distribution
with exponent 3.6.
Figure 1: Distribution of the number of events per user. (a) Number of logins per user
(x-axis: number of Login events per user; fit γ = 3.6; spikes labeled “Login every 20 minutes”
and “Login every 15 seconds”). (b) Number of buddies added per user (x-axis: number of
AddBuddy events per user; fit γ = 2.2).
Figure 2: (a) Distribution of the number of people participating in a conversation (x-axis:
number of users per session). (b) Distribution of the durations of conversations (x-axis:
conversation duration). The spread of durations can be described by a power-law distribution.
Next, we examine the distribution of the durations of periods of time when people are
logged on to the system. Let $(ti_j, to_j)$ denote a time ordered ($ti_j < to_j < ti_{j+1}$) sequence
of online and offline times of a user, where $ti_j$ is the time of the $j$th login, and $to_j$ is the
corresponding logout time. Figure 3(a) plots the distribution of $to_j - ti_j$ over all $j$ and over
all users. Similarly, Figure 3(b) shows the distribution of the periods of time when users
are logged off, i.e., $ti_{j+1} - to_j$ over all $j$ and over all users. Fitting the data to power-law
distributions reveals exponents of 1.77 and 1.3, respectively. The data shows that durations
of being online tend to be shorter and decay faster than durations that users are offline.
We also notice periodic effects of login durations of 12, 24, and 48 hours, reflecting daily
periodicities. We observe similar periodicities for logout durations at multiples of 24 hours.
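The text does not describe the fitting procedure behind the reported exponents. One common approach, shown here only as a hedged sketch, is a least-squares line fit to the distribution in log-log space; the function name and the synthetic data are assumptions for illustration and may differ from what the authors actually did.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = c * x**(-gamma) by least squares on (log x, log y).

    Returns (c, gamma, r_squared). Points with non-positive values are skipped.
    """
    pts = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    n = len(pts)
    mean_x = sum(p[0] for p in pts) / n
    mean_y = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in pts)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in pts)
    syy = sum((p[1] - mean_y) ** 2 for p in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return math.exp(intercept), -slope, r_squared


# Synthetic counts that follow an exact power law with exponent 1.77.
xs = [1, 2, 4, 8, 16, 32]
ys = [1000.0 * x ** -1.77 for x in xs]
c, gamma, r2 = fit_power_law(xs, ys)
```

On exact synthetic data the fit recovers the exponent and prefactor up to floating-point error. On real histograms the result depends on binning and on the range of values included, which is one reason to treat such fits with care.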
The weekly dynamics of MSN Messenger are also quite interesting. Figure 4 shows the number
of logins, status change and add buddy events by day of the week over a period of 5 weeks
starting in June 2006. We count the number of particular events per day of the week, and
we use the data from 5 weeks to compute the error bars. Figure 4(a) shows the average
number of logins per day of the week over a 5 week period. Note that the number of login events
is larger than the number of distinct users logging in, since a user can log in multiple times
a day. Figure 4(b) plots the average number of status change events per day of the week.
Status events include a group of 8 events describing the current status of the users, including
away, be right back, online, busy, idle, at lunch, and on the phone. Last, Figure 4(c) shows
the average number of add buddy events per day of the week. An add buddy event is triggered
every time a user adds a new contact to their contact list.
Figure 3: (a) Distribution of login duration (x-axis: login duration; fit 9.7e5 $x^{-1.77}$,
$R^2 = 1.00$). (b) Duration of times when people are not logged into the system (times between
logout and login) (x-axis: logout duration; fit 6.9e5 $x^{-1.34}$, $R^2 = 1.00$).
Figure 4: Number of events per day of the week (Mon through Sun): (a) Login, (b) Status,
(c) Add buddy. We collected the data over a period of 5 weeks starting on May 29 2006.
3.2 Demographic characteristics of the users
We compared the demographic characteristics of the Messenger population with 2005 world
census data and found differences between the statistics for age and gender. The visualization
of this comparison displayed in Figure 5 shows that users with reported ages in the 15–35
span of years are strongly overrepresented in the active Messenger population. Focusing
on the differences by gender, females are overrepresented for the 10–14 age interval. For
male users, we see overall matches with the world population for age spans 10–14 and 35–39;
for women users, we see a match for ages in the span of 30–34. We note that 6.5% of the
population did not submit an age when creating their Messenger accounts.
To further illustrate the points above, Figure 6 shows the self-reported user age distribution
and the percent difference of each age group between MSN and the world population.
The distribution is skewed to the right and has a mode at age 18. We also note that the
distribution has exponential tails.
Figure 5: World and Messenger user population age pyramid (x-axis: proportion of the
population; series: World population, MSN population). Ages 15–30 are overrepresented in
the Messenger population.
Figure 6: Distribution of self-reported ages of Messenger users and the difference of ages of
Messenger population with the world population. (a) Age distribution for all users, females,
males and unknown users (y-axis: number of people). (b) Relative difference of Messenger
population and the world population (y-axis: difference with world population [%]). Ages
15–30 are over-represented in the Messenger user population.
Figure 7: Temporal characteristics of conversations. (a) Average conversation duration per user
(x-axis: conversation duration [min]; fit 1.5e11 $x^{-3.70}$, $R^2 = 0.99$); (b) time between
conversations of users (x-axis: time between conversations [min]; fit 3.9e9 $x^{-1.53}$,
$R^2 = 0.99$; marks at 1 day, 2 days, 3 days).
4 Communication characteristics
We now focus on characteristics and patterns of communications. We limit the analysis
to conversations between two participants, which account for 99% of all conversations.
We first examine the distributions over conversation durations and times between
conversations. Let user $u$ have $C$ conversations in the observation period. Then, for every
conversation $i$ of user $u$ we create a tuple $(ts_{u,i}, te_{u,i}, m_{u,i})$, where $ts_{u,i}$ denotes the start
time of the conversation, $te_{u,i}$ is the end time of the conversation, and $m_{u,i}$ is the number of
exchanged messages between the two users. We order the conversations by their start time
($ts_{u,i} < ts_{u,i+1}$). Then, for every user $u$, we calculate the average conversation duration
$\bar{d}(u) = \frac{1}{C} \sum_i (te_{u,i} - ts_{u,i})$, where the sum goes over all of $u$’s conversations. Figure 7(a)
shows the distribution of $\bar{d}(u)$ over all the users $u$. We find that the conversation length can
be described by a heavy-tailed distribution with exponent -3.7 and a mode of 4 minutes.
Figure 7(b) shows the intervals between consecutive conversations of a user. We plot the
distribution of $ts_{u,i+1} - ts_{u,i}$, where $ts_{u,i+1}$ and $ts_{u,i}$ denote start times of two consecutive
conversations of user $u$. The power-law exponent of the distribution over intervals is 1.5.
This result is similar to the temporal distribution for other kinds of human communication
activities, e.g., waiting times of emails and letters before a reply is generated (4). The
exponent can be explained by a priority-queue model where tasks of different priorities
arrive and wait until all tasks with higher priority are addressed. This model generates a
task waiting time distribution described by a power-law with exponent 1.5.
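The two per-user quantities defined above, the average conversation duration and the intervals between consecutive conversation start times, can be computed as in the following sketch. It assumes that a user's conversations are given as `(start, end, messages)` tuples with times in seconds; the function names and the sample log are hypothetical.

```python
def average_duration(conversations):
    """Mean of (end - start) over one user's (start, end, messages) tuples."""
    return sum(end - start for start, end, _ in conversations) / len(conversations)


def inter_conversation_intervals(conversations):
    """Differences between the start times of consecutive conversations."""
    starts = sorted(start for start, _, _ in conversations)
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


# Hypothetical log of one user: three conversations, times in seconds.
user_log = [(0, 240, 6), (600, 900, 9), (4200, 4260, 2)]
mean_duration = average_duration(user_log)
intervals = inter_conversation_intervals(user_log)
```

The distributions discussed in the text would then be histograms of these values collected over all users.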
However, the total number of conversations between a pair of users (Figure 8(a)) and
the total number of exchanged messages between a pair of users (Figure 8(b)) do not seem
to follow a power law. The distributions still appear to be heavy-tailed, but not power-law.
The fits represent the MLE estimates of a log-normal distribution.
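For a log-normal distribution the maximum-likelihood estimates have a closed form: the mean and the standard deviation (with divisor n) of the logarithms of the samples. The sketch below illustrates this under the assumption that the observed counts are treated as continuous positive values; the function name and sample data are hypothetical, and the authors' exact procedure for discrete counts is not described in the text.

```python
import math


def lognormal_mle(samples):
    """Return (mu, sigma) that maximize the log-normal likelihood of positive samples."""
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


# Synthetic samples whose logarithms are 0.2, 1.2 and 2.2.
samples = [math.exp(v) for v in (0.2, 1.2, 2.2)]
mu, sigma = lognormal_mle(samples)
```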
Figure 8: Conversation statistics: (a) Number of conversations of a user in a month (x-axis:
c, conversations between a pair of users; y-axis: p(c), probability; fit LogNormal(1.2, 1.25));
(b) Number of messages exchanged per conversation (x-axis: m, total number of exchanged
messages; y-axis: p(m), probability; fit LogNormal(2.3, 1.7)).
5 Communication demographics
Next we examine the interplay of communication and user demographic attributes, i.e., how
geography, location, age, and gender influence observed communication patterns.
5.1 Communication by age
We sought to understand how communication among people changes with the reported
ages of participating users. Figures 9(a)-(d) use a heat-map visualization to communicate
properties for different age–age pairs. The rows and columns represent the ages of both
parties participating, and the color at each age–age cell captures the logarithm of the value
for the pairing. The color spectrum extends from blue (low value) through green, yellow,
and onto red (the highest value). Because of potential misreporting at very low and high
ages, we concentrate on users with self-reported ages that fall between 10 and 60 years.
Let a tuple $(a_i, b_i, d_i, m_i)$ denote the $i$th conversation in the entire dataset that occurred
among users of ages $a_i$ and $b_i$. The conversation had a duration of $d_i$ seconds during which
$m_i$ messages were exchanged. Let $C_{a,b} = \{(a_i, b_i, d_i, m_i) : a_i = a \wedge b_i = b\}$ denote the set of
all conversations between users of ages $a$ and $b$, respectively.
Figure 9(a) shows the number of conversations among people of different ages. For
every pair of ages $(a, b)$ the color indicates the size of set $C_{a,b}$, i.e., the number of different
conversations between users of ages $a$ and $b$. We note that, as the notion of a conversation
is symmetric, the plots are symmetric. Most conversations occur between people of ages 10
to 20. The diagonal trend indicates that people tend to talk to people of similar age. This
is true especially for age groups between 10 and 30 years. We shall explore this observation
in more detail in Section 6.
Figure 9(b) displays a heat map for the average conversation duration, computed as
$\frac{1}{|C_{a,b}|} \sum_{i \in C_{a,b}} d_i$. We note that older people tend to have longer conversations. We observe
a similar phenomenon when plotting the average number of exchanged messages per
conversation, computed as $\frac{1}{|C_{a,b}|} \sum_{i \in C_{a,b}} m_i$, displayed in Figure 9(c). Again, we find that older
people exchange more messages, and we observe a dip for ages 25–45 and a slight peak
for ages 15–25. Figure 9(d) displays the number of exchanged messages per unit time; for
each age pair, $(a, b)$, we measure $\frac{1}{|C_{a,b}|} \sum_{i \in C_{a,b}} \frac{m_i}{d_i}$. Here, we see that younger people have
faster-paced dialogs, while older people exchange messages at a slower pace.
Figure 9: Communication characteristics of users by reported age. We plot age vs. age and the
color (z-axis) represents the intensity of communication. (a) Number of conversations.
(b) Conversation duration. (c) Messages per conversation. (d) Messages per unit time.
We note that the younger population (ages 10–35) are strongly biased towards com-
municating with people of a similar age (diagonal trend in Figure 9(a)), and that users
who report being of ages 35 years and above tend to communicate more evenly across ages
(rectangular pattern in Fig. 9(a)). Moreover, older people have conversations of the longest
durations, with a “valley” in the duration of conversations for users of ages 25–35. Such a dip
may represent shorter, faster-paced and more intensive conversations associated with work-
related communications, versus more extended, slower, and longer interactions associated
with social discourse.
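The four age-pair statistics discussed in this section (number of conversations, average duration, average number of messages, and messages per unit time) can be sketched as one pass over conversation tuples. The tuple layout follows the `(age, age, duration, messages)` definition given above; the function name, the grouping by unordered age pair, and the skipping of zero-duration conversations in the rate are assumptions for this illustration.

```python
from collections import defaultdict


def age_pair_statistics(conversations):
    """Aggregate (age_a, age_b, duration_s, messages) tuples per unordered age pair.

    Returns {(a, b): (count, mean_duration, mean_messages, mean_rate)} with a <= b.
    mean_rate averages messages / duration over conversations with positive duration.
    """
    groups = defaultdict(list)
    for age_a, age_b, duration, messages in conversations:
        key = (min(age_a, age_b), max(age_a, age_b))  # a conversation is symmetric
        groups[key].append((duration, messages))
    stats = {}
    for key, rows in groups.items():
        n = len(rows)
        mean_duration = sum(d for d, _ in rows) / n
        mean_messages = sum(m for _, m in rows) / n
        rates = [m / d for d, m in rows if d > 0]
        mean_rate = sum(rates) / len(rates) if rates else 0.0
        stats[key] = (n, mean_duration, mean_messages, mean_rate)
    return stats


sample = [(18, 20, 100, 5), (20, 18, 300, 30), (40, 40, 600, 6)]
stats = age_pair_statistics(sample)
```

Each entry of the resulting dictionary corresponds to one cell (and its mirror image) of the heat maps described in the text.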
5.2 Communication by gender
We report on analyses of properties of pairwise communications as a function of the
self-reported gender of users in conversations in Table 1. Let
$C_{g,h} = \{(g_i, h_i, d_i, m_i) : g_i = g \wedge h_i = h\}$
denote a set of conversations where the two participating users are of genders $g$
and $h$. Note that $g$ takes 3 possible values: female, male, and unknown (unreported).
Table 1(a) relays $|C_{g,h}|$ for combinations of genders $g$ and $h$. The table shows that
approximately 50% of conversations occur between male and female and 40% of the conversations
occur among users of the same gender (20% for each). A small number of conversations
occur between people who did not reveal their gender.

(a) Conversations
          Unknown   Female   Male
Unknown   1.3       3.6      3.7
Female              21.3     49.9
Male                         20.2

(b) Conversation duration
          Unknown   Female   Male
Unknown   277       301      277
Female              275      304
Male                         252

(c) Exchanged messages per conversation
          Unknown   Female   Male
Unknown   5.7       7.1      6.7
Female              6.6      7.6
Male                         5.9

(d) Conversation intensity
          Unknown   Female   Male
Unknown   1.25      1.42     1.38
Female              1.43     1.50
Male                         1.42

Table 1: Cross-gender communication. Data is based on all two-person conversations from June
2006. (a) Percentage of conversations among users of different self-reported gender; (b) average
conversation length in seconds; (c) number of exchanged messages per conversation; (d) number of
exchanged messages per minute of conversation.
Similarly, Table 1(b) shows the average conversation length in seconds, broken down by
the gender of conversant, computed as $\frac{1}{|C_{g,h}|} \sum_{i \in C_{g,h}} d_i$. We find that male–male
conversations tend to be shortest, lasting approximately 4 minutes. Female–female conversations
last 4.5 minutes on the average. Female–male conversations have the longest durations,
taking more than 5 minutes on average. Beyond taking place over longer periods of time,
more messages are exchanged in female–male conversations. Table 1(c) lists values for
$\frac{1}{|C_{g,h}|} \sum_{i \in C_{g,h}} m_i$ and shows that, in female–male conversations, 7.6 messages are exchanged
per conversation on the average as opposed to 6.6 and 5.9 for female–female and male–male,
respectively. Table 1(d) shows the communication intensity computed as
$\frac{1}{|C_{g,h}|} \sum_{i \in C_{g,h}} \frac{m_i}{d_i}$.
The number of messages exchanged per minute of conversation for male–female conversations
is higher, at 1.5 messages per minute, than for same-gender conversations, where the
rates are 1.43 (female–female) and 1.42 (male–male) messages per minute.
We examined the number of communication ties, where a tie is established between
two people when they exchange at least one message during the observation period. We
computed 300 million male–male ties, 255 million female–female ties, and 640 million cross-
gender ties. The Messenger population consists of 100 million males and 80 million females
by self report. These findings demonstrate that ties are not heavily gender biased; based
on the population, random chance predicts 31% male–male, 20% female–female, and 49%
female–male links. We observe 25% male–male, 21% female–female, and 54% cross-gender
links, thus demonstrating a minor bias of female–male links.
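The percentages in the paragraph above can be reproduced from the reported counts. The sketch below assumes that "random chance" means each endpoint of a tie is drawn independently in proportion to the gender populations, which is an interpretation rather than a stated definition; the function names are hypothetical.

```python
def expected_tie_fractions(males, females):
    """Fractions of male-male, female-female and cross-gender ties under random mixing."""
    total = males + females
    p_m, p_f = males / total, females / total
    return p_m * p_m, p_f * p_f, 2 * p_m * p_f


def observed_tie_fractions(mm, ff, cross):
    """Fractions of each tie type among all observed ties."""
    total = mm + ff + cross
    return mm / total, ff / total, cross / total


# Counts taken from the text: 100M males, 80M females; 300M, 255M and 640M ties.
expected = expected_tie_fractions(100e6, 80e6)
observed = observed_tie_fractions(300e6, 255e6, 640e6)
print([round(100 * v) for v in expected])  # [31, 20, 49]
print([round(100 * v) for v in observed])  # [25, 21, 54]
```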
The results reported in Table 1 run counter to prior studies reporting that communication
among individuals who resemble one another (same gender) occurs more often (see (9) and
references therein). We identified significant heterophily, where people tend to communicate
more with people of the opposite gender. However, we note that link heterogeneity was very
close to the population value (8), i.e., the number of same- and cross-gender ties roughly
corresponds to random chance. This shows there is no significant bias in linking for gender.
However, we observe that cross-gender conversations tend to be longer and to include more
messages, suggesting that more effort is devoted to conversations with the opposite sex.
Figure 10: Number of users at a particular geographic location. Color represents the number of
users. Notice the map of the world appears.
5.3 World geography and communication
We now focus on the influence of geography and distance among participants on communica-
tions. Figure 10 shows the geographical locations of Messenger users. The general location
of the user was obtained via reverse IP lookup. We plot all latitude/longitude positions
linked to the position of servers where users log into the service. The color of each dot
corresponds to the logarithm of the number of logins from the respective location, again
using a spectrum of colors ranging from blue (low) through green and yellow to red (high).
Although the maps are built solely by plotting these positions, a recognizable world map is
generated. We find that North America, Europe, and Japan are very dense, with many users
from those regions using Messenger. For the rest of the world, the population of Messenger
users appears to reside largely in coastal regions.
We can condition the densities and behaviors of Messenger users on multiple geographical
and socioeconomic variables and explore relationships between electronic communications
and other attributes. As an example, we harnessed the United Nations gridded world population
data to provide estimates of the number of people living in each cell. Given this data,
and the data from Figure 10, we calculate the number of users per capita, displayed in
Figure 12. Now we see a transformed picture where several sparsely populated regions stand
out as having a high usage per capita. These regions include the center of the United States,
Canada, Scandinavia, Ireland, Australia, and South Korea.
Figure 13 shows a heat map that represents the intensities of Messenger communications
on an international scale. To create this map, we overlay a fine grid on the world map. For
every pair of conversants, we increase the count of all cells on the straight line between
their geo-locations, so that each cell of the grid contains the number of conversations that
pass through that point. The color indicates the number of conversations crossing each point,
providing a visualization of the key flows of communication. For example, Australia and New
Zealand have communications flowing towards Europe and the United States. Similar flows hold
for Japan. We see that Brazilian communications are weighted toward Europe and Asia.
We can also explore the flows of transatlantic and US transcontinental communications.
Figure 11: Number of users at particular geographic location superimposed on the map of the world.
Color represents the number of users.
Figure 12: Number of Messenger users per capita. Color intensity corresponds to the number of
users per capita in the cell of the grid.
Figure 13: A communication heat map.
Figure 14: (a) Communication among countries with at least 10 million conversations in June 2006.
(b) Countries by average length of the conversation. Edge widths correspond to logarithms of
intensity of links. (Node labels recoverable from the figure include United States, United
Kingdom, China, Hong Kong SAR, France, Portugal, Turkey, Dominican Republic, Saudi Arabia,
and Palestinian Auth.)
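The heat-map construction described in this section, counting conversations along the straight line between two geo-locations, can be sketched as follows. The sketch assumes an equirectangular latitude/longitude grid, draws the line in grid coordinates and not along a great circle, and ignores wrap-around at the date line; these choices and the function names are assumptions for illustration only.

```python
def cell_of(lat, lon, rows, cols):
    """Map latitude/longitude in degrees to a (row, col) cell of an equirectangular grid."""
    row = min(rows - 1, int((90.0 - lat) / 180.0 * rows))
    col = min(cols - 1, int((lon + 180.0) / 360.0 * cols))
    return row, col


def add_conversation_line(grid, start, end):
    """Increment every cell on the straight grid line between two (lat, lon) points."""
    rows, cols = len(grid), len(grid[0])
    r0, c0 = cell_of(*start, rows, cols)
    r1, c1 = cell_of(*end, rows, cols)
    steps = max(abs(r1 - r0), abs(c1 - c0))
    visited = set()
    for k in range(steps + 1):
        t = k / steps if steps else 0.0
        cell = (round(r0 + t * (r1 - r0)), round(c0 + t * (c1 - c0)))
        if cell not in visited:  # count each conversation once per cell
            visited.add(cell)
            grid[cell[0]][cell[1]] += 1


# Coarse 10-degree grid with one conversation between two hypothetical locations.
grid = [[0] * 36 for _ in range(18)]
add_conversation_line(grid, (45.0, -175.0), (45.0, -145.0))
```

Calling `add_conversation_line` once per conversation and then coloring cells by the logarithm of their counts would give a picture in the spirit of the map described above.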
5.4 Communication among countries
Communication among people within different countries also varies depending on the loca-
tions of conversants. We examine two such views. Figure 14(a) shows the top countries by
the number of conversations between pairs of countries. We examined all pairs of countries
with more than 10 million conversations per month. The width of edges in the figure is
Planetary-Scale IM Network 15
Country Fraction of population
Iceland 0.35
Spain 0.28
Netherlands 0.27
Canada 0.26
Sweden 0.25
Norway 0.25
Bahamas, The 0.24
Netherlands Antilles 0.24
Belgium 0.23
France 0.18
United Kingdom 0.17
Brazil 0.08
United States 0.08
Table 2: Top 10 countries with most the largest number of Messenger users. Fraction of country’s
population actively using Messenger.
Country Conversations per user per day
Afghanistan 4.37
Netherlands Antilles 3.79
Jamaica 2.63
Cyprus 2.33
Hong Kong 2.27
Tunisia 2.25
Serbia 2.15
Dominican Republic 2.06
Bulgaria 2.07
Table 3: Top 10 countries by the number of conversations per user per day.
proportional to the logarithm of the number of conversations among the countries. We find
that the United States and Spain appear to serve as hubs and that edges appear largely
between historically or ethnically connected countries. As examples, Spain is connected
with the Spanish-speaking countries in South America, Germany links to Turkey, Portugal
to Brazil, and China to Korea.
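The aggregation behind a plot such as Figure 14(a) can be sketched as below. We hedge that this is an illustration rather than the procedure used in the study: the function name, the toy threshold, and the choice to skip within-country conversations are assumptions.

```python
import math
from collections import Counter

def country_edges(conversations, min_count):
    """Aggregate conversations by unordered country pair.

    Returns {pair: log10(count)} for pairs with at least `min_count`
    conversations; the log value stands in for the edge width.
    Assumption: conversations within a single country are skipped.
    """
    counts = Counter()
    for country_a, country_b in conversations:
        if country_a != country_b:
            counts[tuple(sorted((country_a, country_b)))] += 1
    return {pair: math.log10(n) for pair, n in counts.items() if n >= min_count}

# Toy data: 1,000 Spain-Argentina conversations, 10 Germany-Turkey ones.
toy = [("ES", "AR")] * 100 + [("AR", "ES")] * 900 + [("DE", "TR")] * 10
toy_edges = country_edges(toy, min_count=100)
```

With the toy threshold of 100, only the Spain-Argentina pair survives; the study applies the same idea with a threshold of 10 million conversations per month.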
Figure 14(b) displays a similar plot where we consider country pairs by the average
duration of conversations. The width of each edge is proportional to the mean length of
conversations between the countries. The core of the network appears to be Arabic countries,
including Saudi Arabia, Egypt, United Arab Emirates, Jordan, and Syria.
Comparing the number of active users with the country population reveals interesting
findings. Table 2 shows the top 10 countries with the highest fraction of population using
Messenger. These are mainly northern European countries and Canada. The countries with
the most users (US, Brazil) tend to have a smaller fraction of their population using Messenger.
Similarly, Table 3 shows the top 10 countries by the number of conversations per user per
Country                   Messages per user per day   Minutes talking per user per day
Afghanistan               32.00                       20.91
Netherlands Antilles      24.12                       17.43
Serbia                    22.41                       12.01
Bosnia and Herzegovina    22.40                       11.41
Macedonia                 19.52                       10.46
Cyprus                    19.33                       12.37
Tunisia                   19.17                       13.54
Bulgaria                  18.94                       11.38
Croatia                   17.78                       10.05
Table 4: Top 10 countries by the number of messages and minutes talking per user per day.
day. Here the countries are very diverse, with Afghanistan topping the list. The Netherlands
Antilles appears on the top 10 lists for both the fraction of the population using Messenger and
the number of conversations per user.
Last, Table 4 shows the top 10 countries by the number of messages and minutes talking
per user per day. We note that the list of countries is similar to that in Table 3.
Afghanistan still tops the list, but now most of the talkative countries come from Eastern
Europe (Serbia, Bosnia, Bulgaria, Croatia).
5.5 Communication and geographical distance
We were interested in how communications change as the distance between people increases.
We had hypothesized that the number of conversations would decrease with geographical
distance as users might be doing less coordination with one another on a daily basis, and
where communication would likely require more effort to coordinate than might typically be
needed for people situated more locally. We also conjectured that, once initiated, conversa-
tions among people who are farther apart would be somewhat longer as there might be a
stronger need to catch up when the less-frequent conversations occurred.
Figure 15 plots the relation between communication and distance. Figure 15(a) shows
the distribution of the number of conversations between conversants at distance l. We
found that the number of conversations decreases with distance. However, we observe a
peak at a distance of approximately 500 kilometers. The other peaks and drops may reveal
geographical features. For example, a significant drop in communication at distance of 5,000
km (3,500 miles) may reflect the width of the Atlantic ocean or the distance between the
east and west coasts of the United States. The number of links rapidly decreases with
distance. This finding suggests that users may use Messenger mainly for communications
with others within a local context and environment. We found that the number of exchanged
messages and conversation lengths do not increase with distance (see plots (b)–(d) and
(f) of Figure 15). Conversation duration decreases with the distance, while the number
of exchanged messages remains constant before decreasing slowly. Figure 15(e) shows the
communications per link versus the distance among participants. The plot shows that longer
links, i.e., connections between people who are farther apart, are more frequently used than
shorter links. We interpret this finding to mean that people who are farther apart use
Messenger more frequently to communicate.
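A distance histogram of the kind plotted in Figure 15(b) could be computed along the following lines. The text does not state how distances were measured, so the great-circle (haversine) formula, the Earth radius constant, and the 100 km bin width are assumptions made for this sketch.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def conversations_by_distance(links, bin_km=100):
    """Sum conversation counts into distance bins keyed by the bin's lower edge.

    `links` holds ((lat, lon), (lat, lon), n_conversations) triples.
    """
    hist = Counter()
    for (lat1, lon1), (lat2, lon2), n_conversations in links:
        d = haversine_km(lat1, lon1, lat2, lon2)
        hist[int(d // bin_km) * bin_km] += n_conversations
    return hist

toy_links = [((0.0, 0.0), (0.0, 0.0), 5), ((0.0, 0.0), (0.0, 90.0), 2)]
toy_hist = conversations_by_distance(toy_links)
```

Dividing the conversation histogram by a histogram of link counts in the same bins would give the conversations-per-link curve of panel (e).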
Figure 15: Communication with the distance. (a) Number of links (pairs of people that commu-
nicate) with the distance. (b) Number of conversations between people at particular distance. (c)
Average conversation duration. (d) Number of exchanged messages per conversation. (e) Number
of conversations per link (per pair of communicating users). (f) Number of exchanged messages per
unit time.
             Correlation            Probability
Attribute    Rnd        Comm        Rnd      Comm
Age          -0.0001    0.297       0.030    0.162
Gender       0.0001     -0.032      0.434    0.426
ZIP          -0.0003    0.557       0.001    0.23
County       0.0005     0.704       0.046    0.734
Language     -0.0001    0.694       0.030    0.798
Table 5: Correlation coefficients and probability of users sharing an attribute for random pairs of
people versus for pairs of people who communicate.
In summary, we observe that the total number of links and associated conversations
decreases with increasing distance among participants. The same is true for the duration
of conversations, the number of exchanged messages per conversation, and the number of
exchanged messages per unit time. However, the number of times a link is used tends to
increase with the distance among users. This suggests that people who are farther apart
tend to converse with IM more frequently, which perhaps takes the place of more expensive
long-distance voice telephony; voice might be used more frequently in lieu of IM for less
expensive local communications.
6 Homophily of communication
We performed several experiments to measure the level at which people tend to communicate
with similar people. First, we consider all 1.3 billion pairs of people who exchanged at
least one message in June 2006, and calculate the similarity of various user demographic
attributes. We contrast this with the similarity of pairs of users selected via uniform random
sampling across 180 million users. We consider two measures of similarity: the correlation
coefficient and the probability that users have the same attribute value, e.g., that users come
from the same countries.
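The two similarity measures can be sketched as follows. We assume a Pearson correlation coefficient over numerically coded attribute values and a simple exact-match frequency; the text does not specify how categorical attributes such as language were coded, so the helper names and toy values here are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def match_probability(pairs):
    """Fraction of pairs whose two members share the same attribute value."""
    return sum(1 for a, b in pairs if a == b) / len(pairs)

# Toy pairs of communicating users: reported ages and languages.
toy_ages = [(20, 22), (30, 29), (40, 41), (50, 52)]
toy_languages = [("en", "en"), ("en", "fr"), ("es", "es"), ("pt", "pt")]
age_correlation = pearson([a for a, _ in toy_ages], [b for _, b in toy_ages])
language_match = match_probability(toy_languages)
```

Applying the same two functions to pairs drawn uniformly at random from the user base would give the baseline columns against which communicating pairs are compared.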
Table 5 compares correlation coefficients of various user attributes when pairs of users are
chosen uniformly at random with coefficients for pairs of users who communicate. We can
see that attributes are not correlated for random pairs of people, but that they are highly
correlated for users who communicate. As we noted earlier, gender and communication are
slightly negatively correlated; people tend to communicate more with people of the opposite
gender.
Another method for identifying association is to measure the probability that a pair of
users will show an exact match in values of an attribute, i.e., identifying whether two users
come from the same country, speak the same language, etc. Table 5 shows the results for
the probability of users sharing the same attribute value. We make similar observations as
before. People who communicate are more likely to share common characteristics, including
age, location, language, and they are less likely to be of the same gender. We note that the
most common attribute of people who communicate is language. On the flip side, the amount
of communication tends to decrease with increasing user dissimilarity. This relationship is
highlighted in Figure 15, which shows how communication among pairs of people decreases
with distance.
Figure 16 further illustrates the results displayed in Table 5, where we randomly sample
Figure 16: Numbers of pairs of people of different ages. (a) Randomly selected pairs of people; (b)
people who communicate. Correlation between age and communication is captured by the diagonal
trend.
pairs of users from the Messenger user base, and then plot the distribution over reported
ages. As most of the population comes from the age group 10–30, the distribution of random
pairs of people reaches the mode at those ages but there is no correlation. Figure 16(b) shows
the distribution of ages over the pairs of people who communicate. Note the correlation,
as represented by the diagonal trend on the plot, where people tend to communicate more
with others of a similar age.
Next, we further explore communication patterns by the differences in the reported ages
among users. Figure 17(a) plots the number of links in the communication network vs. the age
difference of the communicating pair of users. Similarly, Figure 17(b) plots on a log-linear
scale the number of conversations in the social network with participants of varying age
differences. Again we see that links and conversations are strongly correlated with the age
differences among participants. Figure 17(c) shows the average conversation duration with
the age difference among the users. Interestingly, the mean conversation duration peaks
at an age difference of 20 years between participants. We speculate that the peak may
correspond roughly to the gap between generations.
The plots reveal that there is strong homophily in the communication network for age;
people tend to communicate more with people of similar reported age. This is especially
salient for the number of buddies and conversations among people of the same ages. We also
observe that the links between people of similar attributes are used more often, to interact
with shorter and more intense (more exchanged messages) communications. The intensity of
communication decays linearly with the difference in age. In contrast to findings of previous
studies, we observe that the number of cross-gender communication links follows random
chance. However, cross-gender communication takes longer and is faster paced; it seems
that people tend to pay more attention when communicating with the opposite sex.
Recently, using the data we generated, Singla and Richardson further investigated the
homophily within the Messenger network and found that people who communicate are also
more likely to search the web for content on similar topics (14).
Figure 17: Communication characteristics with age difference between the users. (a) Number of
links (pairs communicating) with the age difference. (b) Number of conversations. (c) Average
conversation duration with the age difference. (d) Average number of exchanged messages per
conversation as a function of the age difference between the users. (e) Number of conversations per
link in the observation period with the age difference. (f) Number of exchanged messages per unit
time as a function of age difference between the users.
[Figure 18 plots: (a) Communication: p(k) (probability) vs. k (number of conversants), with fitted curve k^{-0.8} exp(-0.03k); (b) Buddies: p(b) (probability) vs. b (number of buddies), with fitted curve b^{-0.6} exp(-0.01b).]
Figure 18: (a) Degree distribution of communication network (number of people with whom a
person communicates). (b) Degree distribution of the buddy network (length of the contact list).
7 The communication network
So far we have examined communication patterns based on pairwise communications. We
now create a more general communication network from the data. Using this network,
we can examine the typical social distance between people, i.e., the number of links that
separate a random pair of people. This analysis seeks to understand how many people can
be reached within certain numbers of hops among people who communicate. Also, we test
the transitivity of the network, i.e., the degree to which pairs with a common friend tend
to be connected.
We constructed a graph from the set of all two-user conversations, where each node
corresponds to a person and there is an undirected edge between a pair of nodes if the users
were engaged in an active conversation during the observation period (users exchanged at
least 1 message). The resulting network contains 179,792,538 nodes, and 1,342,246,427 edges.
Note that this is not simply a buddy network; we only connect people who are buddies and
have communicated during the observation period.
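The graph construction described above can be sketched with plain adjacency sets. The record layout (user, user, number of messages exchanged) and the function names are assumptions made for illustration; the actual data schema is not given in the text.

```python
from collections import defaultdict

def build_communication_graph(conversations):
    """Build an undirected graph as {user: set of neighbors}.

    `conversations` holds (user_a, user_b, messages_exchanged) records
    (hypothetical layout). An edge is added when at least one message
    was exchanged during the observation period.
    """
    adjacency = defaultdict(set)
    for user_a, user_b, messages_exchanged in conversations:
        if user_a != user_b and messages_exchanged >= 1:
            adjacency[user_a].add(user_b)
            adjacency[user_b].add(user_a)
    return adjacency

def edge_count(adjacency):
    """Number of undirected edges (each edge is stored at both endpoints)."""
    return sum(len(neighbors) for neighbors in adjacency.values()) // 2

toy_records = [("u1", "u2", 3), ("u2", "u1", 1), ("u2", "u3", 0), ("u3", "u4", 5)]
toy_graph = build_communication_graph(toy_records)
```

Repeated conversations between the same pair collapse into a single edge, and the silent session between u2 and u3 adds no edge.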
Figures 18–19 show the structural properties of the communication network. The network
degree distribution shown in Figure 18(a) is heavy tailed but does not follow a power-law
distribution. Using maximum likelihood estimation, we fit a power-law with exponential
cutoff $p(k) \propto k^{-a} e^{-bk}$ with fitted parameter values $a = 0.8$ and $b = 0.03$. We found a strong
cutoff parameter and low power-law exponent, suggesting a distribution with high variance.
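To make the fitting step concrete, the sketch below maximizes the likelihood of a power law with exponential cutoff over a coarse parameter grid. The grid search, the truncation of the support at the largest observed degree, and the synthetic degree weights are all assumptions of this illustration; the text states only that maximum likelihood estimation was used.

```python
import math

def log_likelihood(counts, a, b):
    """Log-likelihood of degree counts under p(k) proportional to k^-a * exp(-b*k).

    Assumption: the support is truncated to k = 1..k_max, where k_max is
    the largest observed degree, so the normalizer is a finite sum.
    """
    k_max = max(counts)
    log_z = math.log(sum(k ** -a * math.exp(-b * k) for k in range(1, k_max + 1)))
    total = sum(counts.values())
    return sum(n * (-a * math.log(k) - b * k) for k, n in counts.items()) - total * log_z

def fit_power_law_cutoff(counts, a_grid, b_grid):
    """Return the (a, b) grid point with the highest log-likelihood."""
    candidates = ((a, b) for a in a_grid for b in b_grid)
    return max(candidates, key=lambda ab: log_likelihood(counts, ab[0], ab[1]))

# Synthetic degree weights drawn exactly from the model with a = 0.8, b = 0.03.
synthetic = {k: k ** -0.8 * math.exp(-0.03 * k) for k in range(1, 201)}
a_grid = [i / 10 for i in range(21)]    # 0.0, 0.1, ..., 2.0
b_grid = [j / 100 for j in range(11)]   # 0.00, 0.01, ..., 0.10
a_hat, b_hat = fit_power_law_cutoff(synthetic, a_grid, b_grid)
```

On this noise-free synthetic input the grid search recovers the generating parameters; on real data a finer grid or a numerical optimizer would be needed.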
Figure 18(b) displays the degree distribution of a buddy graph. We did not have access
to the full buddy network; we only had access to data on the length of the user contact list
which allowed us to create the plot. We found a total of 9.1 billion buddy edges in the graph
with 49 buddies per user. We fit the data with a power-law distribution with exponential
cutoff and identified parameters of $a = 0.6$ and $b = 0.01$. The power-law exponent now is
even smaller. This model described the data well. We note a spike at 600 which is the limit
on the maximal number of buddies imposed by the Messenger software client. The maximal
number of buddies was increased to 300 from 150 in March 2005, and was later raised to
600. With the data from June 2006, we see only the peak at 600, and could not identify
bumps at the earlier constraints.
[Figure 19 plots: (a) Clustering: c (clustering coefficient) vs. k (degree), with fitted trend c ∝ k^{-0.37}; (b) Components: distribution of weakly connected component sizes, with the largest component holding 99.9% of the nodes.]
Figure 19: (a) Clustering coefficient; (b) distribution of connected components. 99.9% of the nodes
belong to the largest connected component.
[Figure 20 plots: (a) Diameter: p(l) (probability) vs. l (path length in hops); (b) k-cores: number of nodes vs. core of order k, with the innermost cores annotated k = 60-68, n = 79.]
Figure 20: (a) Distribution over the shortest path lengths. Average shortest path has length 6.6,
the distribution reaches the mode at 6 hops, and the 90% effective diameter is 7.8; (b) distribution
of sizes of cores of order k.
Social networks have been found to be highly transitive, i.e., people with common friends
tend to be friends themselves. The clustering coefficient (19) has been used as a measure of
transitivity in the network. The measure is defined as the fraction of triangles around a node
of degree $k$ (19). Figure 19(a) displays the clustering coefficient versus the degree of a node
for Messenger. Previous results on measuring the web graph as well as theoretical analyses
show that the clustering coefficient decays as $k^{-1}$ (exponent $-1$) with node degree $k$ (11).
For the Messenger network, the clustering coefficient decays very slowly, with exponent $-0.37$,
with the degree of a node, and the average clustering coefficient is 0.137. This result suggests
that clustering in the Messenger network is much higher than expected—that people with
common friends also tend to be connected. Figure 19(b) displays the distribution of the
connected components in the network. The giant component contains 99.9% of the nodes
in the network against a background of small components, and the distribution follows a
power law.
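As an illustration of the transitivity measure, the sketch below computes the local clustering coefficient of a node (the fraction of pairs of its neighbors that are themselves connected) and averages it over nodes of equal degree. The adjacency-set representation, function names, and toy graph are assumptions of this sketch.

```python
from itertools import combinations

def local_clustering(adjacency, node):
    """Fraction of neighbor pairs of `node` that are directly connected."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention assumed here for nodes with fewer than two neighbors
    closed = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return closed / (k * (k - 1) / 2)

def clustering_by_degree(adjacency):
    """Average local clustering coefficient for each node degree k."""
    sums, counts = {}, {}
    for node, neighbors in adjacency.items():
        k = len(neighbors)
        sums[k] = sums.get(k, 0.0) + local_clustering(adjacency, node)
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

# Toy graph: a triangle a-b-c with a pendant node d attached to a.
toy_graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
toy_curve = clustering_by_degree(toy_graph)
```

Plotting such a degree-averaged curve on log-log axes is one way to read off a decay exponent like the one reported for Figure 19(a).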
Figure 21: k-core decomposition of a small graph. Nodes contained in each closed line belong to a
given k-core. Inside each k-core all nodes have degree of at least k (after removing all nodes with
degree less than k).
7.1 How small is the small world?
Messenger data gives us a unique opportunity to study distances in the social network. To
our knowledge, this is the first time a planetary-scale social network has been available to
validate the well-known “6 degrees of separation” finding by Travers and Milgram (17). The
earlier work employed a sample of 64 people and found that the average number of hops for
a letter to travel from Nebraska to Boston was 6.2 (mode 5, median 5), which is popularly
known as the “6 degrees of separation” among people. We used a population sample that is
more than two million times larger than the group studied earlier and confirmed the classic
finding.
Figure 20(a) displays the distribution over the shortest path lengths. To approximate
the distribution of the distances, we randomly sampled 1000 nodes and calculated for each
node the shortest paths to all other nodes. We found that the distribution of path lengths
reaches the mode at 6 hops and has a median at 7. The average path length is 6.6. This
result means that a random pair of nodes in the Messenger network is 6.6 hops apart on
the average, which is half a link longer than the length measured by Travers and Milgram.
The 90th percentile (effective diameter (16)) of the distribution is 7.8. 48% of nodes can be
reached within 6 hops and 78% within 7 hops. So, we might say that, via the lens provided
on the world by Messenger, we find that there are about “7 degrees of separation” among
people. We note that long paths, i.e., nodes that are far apart, exist in the network; we
found paths up to a length of 29.
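The sampling approach described above can be sketched as a breadth-first search from each sampled source node, pooling the resulting hop counts into one histogram. The function names, the fixed random seed, and the toy path graph are assumptions of this illustration.

```python
import random
from collections import Counter, deque

def bfs_distances(adjacency, source):
    """Hop distance from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def sampled_path_lengths(adjacency, n_sources, seed=0):
    """Histogram of shortest-path lengths from randomly sampled source nodes."""
    rng = random.Random(seed)
    sources = rng.sample(sorted(adjacency), min(n_sources, len(adjacency)))
    hist = Counter()
    for source in sources:
        for hops in bfs_distances(adjacency, source).values():
            if hops > 0:
                hist[hops] += 1
    return hist

def mean_path_length(hist):
    """Average path length over all (source, target) pairs in the histogram."""
    return sum(hops * n for hops, n in hist.items()) / sum(hist.values())

# Toy path graph 0-1-2-3; sampling 4 sources covers every node.
toy_path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
toy_hist = sampled_path_lengths(toy_path, n_sources=4)
```

The mode, median, and 90th percentile of the same histogram correspond to the other summary statistics quoted in the text.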
7.2 Network cores
We further study connectivity of the communication network by examining the k-cores (5)
of the graph. The concept of k-core is a generalization of the giant connected component.
The k-core of a network is a set of vertices K, where each vertex in K has at least k edges
to other vertices in K (see Figure 21). The distribution of k-core sizes gives us an idea of
how quickly the network shrinks as we move towards the core.
The k-core of a graph can be obtained by deleting from the network all vertices of degree
less than k. This process will decrease degrees of some non-deleted vertices, so more vertices
will have degree less than k. We keep pruning vertices until all remaining vertices have
degree of at least k. We call the remaining vertices a k-core.
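The pruning procedure just described can be sketched as follows, using an adjacency-set graph. The function name and the toy graph are hypothetical; the loop repeatedly deletes vertices whose remaining degree falls below k.

```python
def k_core(adjacency, k):
    """Return the set of vertices in the k-core of an undirected graph."""
    degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
    alive = set(adjacency)
    stack = [node for node in alive if degree[node] < k]
    while stack:
        node = stack.pop()
        if node not in alive:
            continue  # already deleted through an earlier stack entry
        alive.remove(node)
        for neighbor in adjacency[node]:
            if neighbor in alive:
                degree[neighbor] -= 1
                if degree[neighbor] < k:
                    stack.append(neighbor)
    return alive

# Toy graph: a triangle a-b-c with a pendant node d attached to a.
toy_graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
toy_two_core = k_core(toy_graph, 2)
```

In the toy graph the pendant node d is pruned at k = 2, leaving the triangle, and nothing survives at k = 3.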
Figure 20(b) plots the number of nodes in a core of order k. We note that the core sizes
are remarkably stable up to a value of k ≈ 20; the number of nodes in the core drops by
only an order of magnitude. After k > 20, the core size rapidly drops. The central part
of the communication network is composed of 79 nodes, where each of them has more than
68 edges inside the set. The structure of the Messenger communication network is quite
different from the Internet graph; it has been observed (2) that the size of a k-core of the
Internet decays as a power-law with k. Here we see that the core sizes remain very stable
up to a degree of about 20, and only then start to rapidly decrease. This means that the nodes
with degrees of less than 20 are on the fringe of the network, and that the core starts to
rapidly decrease as nodes of degree 20 or more are deleted.
7.3 Strength of the ties
It has been observed by Albert et al. (1) that many real-world networks are robust to node-
level changes or attacks. Researchers have shown that networks like the World Wide Web,
Internet, and several social networks display a high degree of robustness to random node
removals, i.e., one has to remove many nodes chosen uniformly at random to make the
network disconnected. On the contrary, targeted attacks are very effective. Removing a few
high degree nodes can have a dramatic influence on the connectivity of a network.
Let us now study how the Messenger communication network is decomposed when
“strong,” i.e., heavily used, edges are removed from the network. We consider several
different definitions of “heavily used,” and measure the types of edges that are most important
for network connectivity. We note that a similar experiment was performed by Shi et
al. (13) in the context of a small IM buddy network. The authors of the prior study took the
number of common friends at the ends of an edge as a measure of the link strength. As the
number of edges here is too large (1.3 billion) to remove edges one by one, we employed the
following procedure: We order the nodes by decreasing value per a measure of the intensity
of engagement of users; we then delete nodes associated with users in order of decreasing
measure and we observe the evolution of the properties of the communication network as
nodes are deleted.
We consider the following different measures of engagement:
• Average sent: The average number of sent messages per user’s conversation
• Average time: The average duration of user’s conversations
• Links: The number of links of a user (node degree), i.e., number of different people he
  or she exchanged messages with
• Conversations: The total number of conversations of a user in the observation period
• Sent messages: The total number of sent messages by a user in the observation period
• Sent per unit time: The number of sent messages per unit time of a conversation
• Total time: The total conversation time of a user in the observation period
Figure 22: Relative size of the largest connected component as a function of number of nodes
removed.
At each step of the experiment, we remove 10 million nodes in order of the specific
measure of engagement being studied. We then determine the relative size of the largest
connected component, i.e., given the network at particular step, we find the fraction of the
nodes belonging to the largest connected component of the network.
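One step-by-step version of this experiment can be sketched as below: nodes are ranked by an engagement score, removed in batches, and after each batch the largest connected component is measured as a fraction of the nodes that remain. The function names, the use of node degree as the score, and the toy star graph are assumptions of this illustration.

```python
def largest_component_fraction(adjacency, removed):
    """Fraction of the remaining nodes that lie in the largest connected component."""
    remaining = set(adjacency) - removed
    if not remaining:
        return 0.0
    seen, best = set(), 0
    for start in remaining:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor in remaining and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        best = max(best, size)
    return best / len(remaining)

def removal_curve(adjacency, score, batch_size):
    """Remove nodes in order of decreasing `score`; record (removed, fraction) per batch."""
    order = sorted(adjacency, key=lambda node: score[node], reverse=True)
    removed, curve = set(), []
    for i in range(0, len(order), batch_size):
        removed.update(order[i:i + batch_size])
        curve.append((len(removed), largest_component_fraction(adjacency, removed)))
    return curve

# Toy star graph: removing the hub first disconnects every leaf.
toy_star = {"h": {"a", "b", "c", "d"}, "a": {"h"}, "b": {"h"}, "c": {"h"}, "d": {"h"}}
toy_degree = {node: len(neighbors) for node, neighbors in toy_star.items()}
toy_curve = removal_curve(toy_star, toy_degree, batch_size=1)
```

Swapping `toy_degree` for a different score, or for a random ordering, would produce the other curves of a plot such as Figure 22.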
Figure 22 plots the evolution of the fraction of nodes in the largest connected component
with the number of deleted nodes. We plot a separate curve for each of the seven different
measures of engagement. For comparison, we also consider the random deletion of the nodes.
The decomposition procedure highlighted two types of dynamics of network change with
node removal. The size of the largest component decreases rapidly when we use as measures
of engagement the number of links, number of conversations, total conversation time, or
number of sent messages. In contrast, the size of the largest component decreases very
slowly when we use as a measure of engagement the average time per conversation, average
number of sent messages, or number of sent messages per unit time. We were not surprised
to find that the size of the largest component decreases most rapidly when nodes are
deleted in order of the decreasing number of links that they have, i.e., the number of people
with whom a user at a node communicates. Random ordering of the nodes shrinks the
component at the slowest rate. After removing 160 million out of 180 million nodes with the
random policy, the largest component still contains about half of the nodes. Surprisingly,
when deleting up to 100 million nodes, the average time per conversation measure shrinks
the component even more slowly than the random deletion policy.
Figure 23 displays plots of the number of removed edges from the network as nodes
are deleted. Similar to the relationships in Figure 22, we found that deleting nodes in order of
decreasing number of edges removes edges the fastest. As in Figure 22, the same group of
node ordering criteria (number of conversations, total conversation time or number of sent
Figure 23: Number of removed edges as nodes are deleted by order of different measures of engagement.
messages) removes edges from the network as fast as the number of links criterion. However,
we find that random node removal removes edges in a linear manner. Edges are removed
at a lower rate when deleting nodes by average time per conversation, average number of
sent messages, or number of sent messages per unit time. We believe that these findings
demonstrate that users with long conversations and many messages per conversation tend
to have smaller degrees, even given the findings displayed in Figure 22, where we saw that
removing these users is more effective for breaking the connectivity of the network than
random node deletion. Figure 23 also shows that using the average number of messages per
conversation as a criterion removes edges in the slowest manner. We believe that this makes
sense intuitively: If users invest similar amounts of time to interacting with others, then
people with short conversations will tend to converse with more people in a given amount
of time than users having long conversations.
8 Conclusion
We have reviewed a set of results stemming from the generation and analysis of an anonymized
dataset representing the communication patterns of all people using a popular IM system.
The methods and findings highlight the value of using a large IM network as a worldwide
lens onto aggregate human behavior.
We described the creation of the dataset, capturing high-level communication activities
and demographics in June 2006. The core dataset contains more than 30 billion conversations
among 240 million people. We discussed the creation and analysis of a communication
graph from the data containing 180 million nodes and 1.3 billion edges. The communication
network is the largest social network analyzed to date. The planetary-scale network allowed
us to explore dependencies among user demographics, communication characteristics, and
network structure. Working with such a massive dataset allowed us to test hypotheses such
as the average chain of separation among people across the entire world.
We discovered that the graph is well connected, highly transitive, and robust. We re-
viewed the influence of multiple factors on communication frequency and duration. We
found strong influences of homophily in activities, where people with similar characteristics
tend to communicate more, with the exception of gender, where we found that cross-gender
conversations are both more frequent and of longer duration than conversations with users of
the same reported gender. We also examined the path lengths and validated on a planetary
scale earlier research that found “6 degrees of separation” among people.
We note that the sheer size of the data limits the kinds of analyses one can perform. In
some cases, a smaller random sample may avoid the challenges of working with terabytes of
data. However, it is known that sampling can corrupt the structural properties of networks,
such as the degree distribution and the diameter of the graphs (15). Thus, while sampling
may be valuable for managing complexity of analyses, results on network properties with
partial data sets may be rendered unreliable. Furthermore, we need to consider the full data
set to reliably measure the patterns of age and distance homophily in communications.
In other directions of research with the dataset, we have pursued the use of machine
learning and inference to learn predictive models that can forecast such properties as com-
munication frequencies and durations of conversations among people as a function of the
structural and demographic attributes of conversants. Our future directions for research
include gaining an understanding of the dynamics of the structure of the communication
network via a study of the evolution of the network over time.
We hope that our studies with Messenger data serve as an example of directions in
social science research, highlighting how communication systems can provide insights about
high-level patterns and relationships in human communications without making incursions
into the privacy of individuals. We hope that this first effort to understand a social network
on a genuinely planetary scale will embolden others to explore human behavior at large
scales.
We thank Dan Liebling for help with generating the world map plots, and Dimitris Achlioptas
and Susan Dumais for helpful suggestions.
[1] R. Albert, H. Jeong, and A.-L. Barabasi. Error and attack tolerance of complex net-
works. Nature, 406:378, 2000.
[2] J. I. Alvarez-Hamelin, L. Dall’Asta, A. Barrat, and A. Vespignani. Analysis and visual-
ization of large scale networks using the k-core decomposition. In ECCS ’05: European
Conference on Complex Systems, 2005.
[3] D. Avrahami and S. E. Hudson. Communication characteristics of instant messaging:
effects and predictions of interpersonal relationships. In CSCW ’06, pages 505–514,
[4] A.-L. Barabasi. The origin of bursts and heavy tails in human dynamics. Nature,
435:207, 2005.
[5] V. Batagelj and M. Zaversnik. Generalized cores. ArXiv, (cs.DS/0202039), Feb 2002.
[6] IDC Market Analysis. Worldwide Enterprise Instant Messaging Applications 2005–2009
Forecast and 2004 Vendor Shares: Clearing the Decks for Substantial Growth. 2005.
[7] J. Leskovec and E. Horvitz. Worldwide Buzz: Planetary-Scale Views on an Instant-
Messaging Network. Tech. report MSR-TR-2006-186, 2006.
[8] P. V. Marsden. Core discussion networks of americans. American Sociological Review,
52(1):122–131, 1987.
[9] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in
social networks. Annual Review of Sociology, 27(1):415–444, 2001.
[10] B. A. Nardi, S. Whittaker, and E. Bradner. Interaction and outeraction: instant mes-
saging in action. In CSCW ’00: Proceedings of the 2000 ACM conference on Computer
supported cooperative work, pages 79–88, 2000.
[11] E. Ravasz and A.-L. Barabasi. Hierarchical organization in complex networks. Physical
Review E, 67(2):026112, 2003.
[12] E. M. Rogers and D. K. Bhowmik. Homophily-heterophily: Relational concepts for
communication research. Public Opinion Quarterly, 34:523–538, 1970.
[13] X. Shi, L. A. Adamic, and M. J. Strauss. Networks of strong ties. Physica A Statistical
Mechanics and its Applications, 378:33–47, May 2007.
[14] P. Singla and M. Richardson. Yes, there is a correlation - from social networks to
personal behavior on the web. In WWW ’08, 2008.
[15] M. P. Stumpf, C. Wiuf, and R. M. May. Subnets of scale-free networks are not scale-free:
sampling properties of networks. PNAS, 102(12), 2005.
[16] S. L. Tauro, C. Palmer, G. Siganos, and M. Faloutsos. A simple conceptual model for
the internet topology. In GLOBECOM ’01, vol. 3, pages 1667 – 1671, 2001.
[17] J. Travers and S. Milgram. An experimental study of the small world problem. So-
ciometry, 32(4), 1969.
[18] A. Voida, W. C. Newstetter, and E. D. Mynatt. When conventions collide: the tensions
of instant messaging attributed. In CHI ’02, pages 187–194, 2002.
[19] D. J. Watts and S. H. Strogatz. Collective dynamics of ’small-world’ networks. Nature,
393:440–442, 1998.
[20] Z. Xiao, L. Guo, and J. Tracey. Understanding instant messaging traffic characteristics.
In ICDCS ’07, 2007.