Twitter users change word usage according to conversation-partner social identity
Nadine Tamburrini^a, Marco Cinnirella^b, Vincent A. A. Jansen^a, John Bryden^a,∗
^a School of Biological Sciences, Royal Holloway University of London, Egham TW20 0EX, UK
^b Department of Psychology, Royal Holloway University of London, Egham TW20 0EX, UK
This paper investigates how people express social identity at a large scale on a social network. We looked at
communities of users on the Twitter web site, and tested two established social-psychology theories that are usually
studied at a local scale. We found evidence consistent with Communication Accommodation Theory, where community members
vary their language characteristics depending on which community they are communicating with. We also found that the
level of linguistic variation correlated with how isolated a community was: evidence that there is Convergence between
linked members. This demonstrates the power of methods which analyse subtle human behaviour on social networks.
Keywords: Twitter; Community structure; Social identity; Language accommodation; Linguistic convergence; Social
•We study social identity on communities found on
the Twitter web site
•We ﬁnd users adjust word usage according to the
community of the interlocutor
•More isolated communities change word usage to a
•Large scale studies of Twitter can test hypotheses on
Social identity is that proportion of an individual’s self-
concept that derives from membership of a social group
(Tajfel and Turner, 1979). Group aﬃliation has functions
of enhancing cooperation (Boyd and Richerson, 2009) and
allowing individuals to deﬁne others through the group
they belong to, in the same way that the individual de-
ﬁnes him or herself through the identity of their own group
(Ashforth and Mael, 1989). Group members share be-
haviour and social norms. This shared behaviour in social
groups is thought to be generated through processes on
social networks such as convergence of behaviour due to
social relationships (Hormuth, 1990; Ethier and Deaux, 1994).
The way we use language is strongly associated with
our social identity (Scott, 2007). The convergence of behaviour
proposed by social identity theory is often studied
through the language used within social groups. Language
is more than just a means of communication:
sociolinguistic studies have shown that varieties of a
language can be strongly associated with social or cultural
groups (Gumperz, 1958; Labov, 1966; Carroll, 2008;
Bryden et al., 2013).
By using language as a proxy for social behaviour, stud-
ies have been able to understand how expression of social
identity is often strongly context dependent: people will
behave diﬀerently depending on which social identity has
the strongest salience in the current situation (Hogg and
Reid, 2006). Studies show how this often manifests in the
accommodation of language according to the social iden-
tity of the interlocutor (Giles, 1973; Gallois et al., 2005).
Individuals negotiate the social distance between them-
selves and the person with whom they are conversing, and
are therefore in control of its creation and maintenance
(Shepard et al., 2001). For example, Iwasaki and Horie
(2000) reported how Thai speakers would adjust their lin-
guistic registers when interacting with strangers. These
studies look at speciﬁc groups or social situations, but we
do not know whether this behaviour can be found at a large
scale across many groups where these groups are allowed
to freely interact with one another.
Online social networks provide a large-scale
platform for studying human behaviour.
With over 200 million monthly active users (Costolo, 2013),
the Twitter social network is particularly useful due to its
publicly accessible nature (Virk, 2011) and network size.
The analysis of large networks brings with it considerable
statistical power that allows for the detection of patterns
that in traditional, smaller scale network studies would be
undetectable. Twitter functions as a micro-blogging web-
site, working on the premise of users sharing their opinions
and thoughts in brief messages (maximum 140 characters),
Preprint submitted to Social Networks August 15, 2014
which are referred to as “tweets”. An investigation into the
reasons why people post on the Twitter web site (Java
et al., 2007) found that about one eighth of posts were
conversational messages. This makes Twitter a prime
resource for public access to naturally occurring
communication (Danescu-Niculescu-Mizil et al., 2011), and
an excellent place to study the expression
of social identity.
The study of how identity aﬀects our use of language
online is a growing ﬁeld. There is evidence for communica-
tion accommodation between oﬄine conversation partners
(Danescu-Niculescu-Mizil et al., 2011) showing that syn-
tax, pitch, gestures, word choice, length or form can diﬀer
according to interlocutor. Evidence for linguistic convergence
online is mixed, with studies finding evidence both for
(e.g. Riordan et al., 2013) and against (e.g. Christopherson,
2011) the existence of convergence in online communication.
The anonymity sometimes engendered in computer-mediated
environments can act to enhance the significance
of social identity in contexts where a relevant shared group
membership is salient to users (Postmes et al., 2000).
Consequently, social identity can be heightened, which explains
why some group phenomena, such as polarisation of attitudes
and stereotyping, can seem enhanced in some online
environments (e.g. Postmes et al., 2001). An example is
the collective identity found amongst communities of
websites of environmental activists (Ackland and O’Neil,
2011). However, such studies of social identity in
computer-mediated communication are still in their relative infancy,
and this research aims to contribute to the further development
of this field by expressly linking communication
accommodation and convergence to social groups
that have formed on Twitter.
In order to identify online groups, we look to the study
of complex networks. In this ﬁeld, the term communities is
used to denote parts of the network that are more strongly
linked within themselves than to the rest of the network, a
phenomenon that has been observed in many human social
networks (Porter et al., 2009). In this sense, communities
are an emergent property of network structure. Much work
has gone into developing methods to detect such groups
from topological analysis (Fortunato, 2010), and the ex-
tent to which this is possible has been termed modularity
(Newman, 2006). The communities found in this way are
usually associated with groups of friends or acquaintances,
or similarity in traits (Porter et al., 2009; Bryden et al.,
2011; Traud et al., 2012) and have also been shown to share
language features (Bryden et al., 2013). We hypothesise
that communities found in online networks will share so-
cial identity and consequently we expect to ﬁnd that they
demonstrate communication accommodation and convergence.
In this study we focus on a speciﬁc aspect of behaviour
that is strongly associated with social identity, asking whether
individuals will shift their linguistic behaviour according
to which social group they are messaging. The data of
online communities that we used came from a previous
study of the Twitter web site (Bryden et al., 2013). We
tested for communication accommodation by looking to
see if users varied speciﬁc language characteristics accord-
ing to whether they had sent conversational messages to
members of the same community or to members from other
communities. We tested for convergence by looking to see
whether this level of language variation for a community
correlated with how strongly linked a community is within
The data upon which we did our tests was a network
of 189,000 Twitter users. To identify users to download
we used a snowball-sample where, for each user sampled,
all their tweets which mentioned other users (using the
‘@’ symbol) were recorded and any new users referenced
added to a list of users from which the next user to be
sampled was picked. Starting from a random user, conversational
tweets time-stamped between January 2007 and
November 2009 were sampled from the Twitter web site
during December 2009, yielding over 200 million messages.
The network was formed of bidirectional links, where both
nodes had sent at least one message to one another, and
weighted by the number of tweets sent between the two
users linked. We ignored messages that were copies of
other messages (so called retweets, which are identiﬁed by
a case-insensitive search for the text ‘RT’). In total the
network had 75 million messages (tweets) directed from
users of the network to one another.
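As an illustration of the network construction described above, the following Python sketch drops retweets and keeps only reciprocated links, weighted by the number of messages exchanged. The `(sender, receiver, text)` tuple layout, the function name `build_network` and the whole-word match for 'RT' are assumptions made for this example; they are not details of the original pipeline.

```python
import re
from collections import Counter

# Assumption: a retweet is any message containing 'RT' as a whole word,
# matched case-insensitively.
RETWEET = re.compile(r"\brt\b", re.IGNORECASE)

def build_network(messages):
    """Build a weighted, bidirectional network.

    messages: iterable of (sender, receiver, text) tuples (hypothetical layout).
    Returns {frozenset({a, b}): weight} for pairs of users who have each sent
    at least one message to the other; weight is the total number of messages
    exchanged in both directions.
    """
    directed = Counter()
    for sender, receiver, text in messages:
        if sender == receiver or RETWEET.search(text):
            continue  # skip self-mentions and retweets
        directed[(sender, receiver)] += 1

    edges = {}
    for (a, b), n in directed.items():
        if (b, a) in directed:  # keep reciprocated links only
            edges[frozenset((a, b))] = n + directed[(b, a)]
    return edges

example = [
    ("ann", "bob", "hi @bob"),
    ("bob", "ann", "hello @ann"),
    ("ann", "bob", "RT @bob: hi"),   # retweet, ignored
    ("ann", "cat", "hey @cat"),      # never reciprocated, no link
]
network = build_network(example)
```

In this toy example only the link between `ann` and `bob` survives, with a weight of 2.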
The network was partitioned into communities using a
modularity maximisation algorithm (Blondel et al., 2008)
and a partition of the network was found where 91% of the
tweets were sent by users to other users within the same
community. For each community, characteristic words were
generated that were used more commonly in that group
than the global average (see Supplementary Information
for characteristic words). These allowed us to identify
English speaking groups and also qualitatively summarise
shared characteristics of each group. For more informa-
tion on how characteristic words were generated, and an
argument that the network sampled was representative of
the complete Twitter network, see Bryden et al. (2013).
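Given a partition, the proportion of messages that stay within communities can be checked with a few lines of Python. The sketch below does not perform the modularity maximisation itself; it assumes a partition is already available as a hypothetical `community_of` mapping, and that message counts are stored per `(sender, receiver)` pair.

```python
from collections import defaultdict

def internal_fractions(message_counts, community_of):
    """Proportion of messages sent within communities.

    message_counts: {(sender, receiver): number of messages} (assumed layout).
    community_of:   {user: community id} (assumed to come from a prior
                    community-detection step, not computed here).
    Returns (overall fraction, {community id: fraction for that community}).
    """
    sent = defaultdict(int)    # messages sent by members of each community
    inside = defaultdict(int)  # of those, messages to the same community
    for (sender, receiver), n in message_counts.items():
        group = community_of[sender]
        sent[group] += n
        if community_of[receiver] == group:
            inside[group] += n

    total = sum(sent.values())
    overall = sum(inside.values()) / total if total else 0.0
    per_group = {g: inside[g] / sent[g] for g in sent if sent[g]}
    return overall, per_group

counts = {("a", "b"): 9, ("b", "a"): 9, ("a", "c"): 2, ("c", "d"): 4, ("d", "a"): 1}
partition = {"a": 0, "b": 0, "c": 1, "d": 1}
overall, per_group = internal_fractions(counts, partition)
```

Here 22 of the 25 messages stay inside their community (0.88 overall), with per-community fractions of 0.9 and 0.8. The per-community value corresponds to the measure of isolation used later in the paper.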
To investigate changes in language characteristics, we
divided messages into two collections: internal messages
that were sent to other members of the same group, and
external messages that were sent to members of diﬀerent
communities. For each group, we made sure that both
collections were of the same size by discarding messages
at random from the larger collection. The diﬀerence in
word usage between the samples from the two classes was
To calculate differences in word usage between
the two samples we used text similarity measures. We used
two diﬀerent text measures (Gomaa and Fahmy, 2013) to
conﬁrm that the result was not an artefact generated by
one of the measures. For a word w we define the numbers of
usages of w in the internal and external samples as λi(w)
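and λe(w) respectively. The first measure was the Euclidean
distance between relative word usage frequencies
for each collection, given by,
\[
\sqrt{\sum_w \left( \frac{\lambda_i(w)}{\sum_v \lambda_i(v)} - \frac{\lambda_e(w)}{\sum_v \lambda_e(v)} \right)^2}. \tag{1}
\]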
The second measure was the quantitative version of the
Jaccard distance measure (Gallagher, 1999) which is one
minus the multiset intersection of the two samples divided
by the multiset union. This is given by,
\[
1 - \frac{\sum_w \min\left[\lambda_i(w), \lambda_e(w)\right]}{\sum_v \max\left[\lambda_i(v), \lambda_e(v)\right]}. \tag{2}
\]
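The two measures can be sketched in Python as follows. This is a minimal illustration, assuming that words are obtained by lower-casing and splitting on whitespace (the tokenisation is not specified here); the function names are hypothetical.

```python
import math
from collections import Counter

def word_counts(messages):
    """Count word usages in a collection of messages.
    Assumption: words are lower-cased, whitespace-separated tokens."""
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts

def euclidean_distance(c_int, c_ext):
    """Euclidean distance between relative word usage frequencies."""
    n_int, n_ext = sum(c_int.values()), sum(c_ext.values())
    words = set(c_int) | set(c_ext)
    return math.sqrt(sum((c_int[w] / n_int - c_ext[w] / n_ext) ** 2 for w in words))

def jaccard_distance(c_int, c_ext):
    """One minus the multiset intersection divided by the multiset union."""
    words = set(c_int) | set(c_ext)
    intersection = sum(min(c_int[w], c_ext[w]) for w in words)
    union = sum(max(c_int[w], c_ext[w]) for w in words)
    return 1 - intersection / union

internal = word_counts(["good morning all", "good night"])
external = word_counts(["good morning", "hello all there"])
d_euclid = euclidean_distance(internal, external)
d_jaccard = jaccard_distance(internal, external)
```

For these two toy collections the Euclidean distance is 0.4 and the Jaccard distance is 4/7. Both functions assume non-empty collections.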
To look at other linguistic features that can be indica-
tive of changes in linguistic style (see, e.g., Bryden et al.,
2013; Wagner et al., 2013), we also calculated diﬀerences
between word-ending frequencies (using both Euclidean
and Jaccard distances) and apostrophe frequencies. Dif-
ferences between apostrophe frequencies were calculated
by calculating the frequency of apostrophes per word used
by each of the two collections and then calculating the
absolute diﬀerence between these two values.
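A short Python sketch of the two style markers is given below. The definition of a word ending as the last three characters of a word is an assumption made only for this example (the parameter `n` is hypothetical), as is the whitespace tokenisation.

```python
from collections import Counter

def ending_counts(messages, n=3):
    """Count word endings. Assumption: an ending is the last n characters
    of a lower-cased, whitespace-separated word."""
    counts = Counter()
    for text in messages:
        for word in text.lower().split():
            counts[word[-n:]] += 1
    return counts

def apostrophes_per_word(messages):
    """Frequency of apostrophes per word in a collection of messages."""
    words = sum(len(text.split()) for text in messages)
    # Count both the straight and the typographic apostrophe.
    apostrophes = sum(text.count("'") + text.count("’") for text in messages)
    return apostrophes / words if words else 0.0

def apostrophe_difference(internal, external):
    """Absolute difference between the two apostrophe frequencies."""
    return abs(apostrophes_per_word(internal) - apostrophes_per_word(external))

internal = ["don't stop", "it's fine now"]
external = ["do not stop", "it is fine"]
difference = apostrophe_difference(internal, external)
endings = ending_counts(["running jumping", "cat"])
```

The counts returned by `ending_counts` can be passed to the same Euclidean and Jaccard distance functions that are used for whole words.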
The partitioning of the sample network of Twitter users
yielded 414 groups, with 42 groups having more than 250
users. A variety of languages were found with diﬀerent
groups using diﬀerent languages. To eliminate the eﬀects
of a user simply changing between diﬀerent languages de-
pending on which group they were speaking to, we did the
study on the 24 groups (of a size greater than 250 users)
that used the English language which were selected in a
previous study (Bryden et al., 2013, and see methods).
With these English-speaking groups, we formed col-
lections of internal and external messages for each group,
and then measured the Euclidean distance in word usage
frequencies between the two collections. Since diﬀerences
of word-usage frequencies can arise because users within
a group communicate about one or a limited number of
subjects, we also measured distances of word-ending us-
age frequencies and apostrophe usage frequencies to look
at markers of linguistic style. We found a variety of dis-
tances between internal and external messages in all three
measurements (Figure 1).
Figure 1: A comparison of the 24 English speaking groups in the
Twitter network showing the extent of linguistic variation between
internal and external tweets. The bars show Euclidean distances on
a group-by-group basis between internal and external tweets for the
three measurements: word-usage frequencies (solid bars at the top
of each plot), word-ending frequencies (slashed bars in the middle)
and apostrophe usage (crossed bars at the bottom). For each measurement,
all groups were scaled so that the values ranged between
0.0 and 1.0. Each group has a short description and a group number.
The short description was generated by qualitatively inspecting
unusual words generated for each group (see Supplementary Information).
It is possible that the differences in word usage shown in
Figure 1 could have happened by random
chance. To test this on a group-by-group basis we used
a bootstrap, resampling (with replacement) new random
pairs of collections of messages from the union of the
original internal and external collections used to generate
Figure 1. By calculating linguistic distances between the
newly sampled pairs of collections, we can test whether the
difference found between the original collections happened
by chance. Repeating this procedure 1,000 times for each
group, we calculated the p-value: the proportion of resampled
pairs of collections for which the linguistic distance exceeded
that of the original internal and external collections. In
fact, using both the Euclidean and Jaccard measures, none
of the distances between the word, or word-ending usages,
of the resamples exceeded that of the original collections
(p≤0.001). This showed that the users we studied do
indeed change their word and word-ending usage accord-
ing to whether they are messaging other members of the
group or not. For distance between internal and external
apostrophe usages, 17 of the 24 groups were signiﬁcant
The diﬀerence between the language use of external
and internal messages raises a question as to how much
this change in language characteristics is due to the sender
of a message conforming to the language use of the receiver.
An alternative scenario is that external messages
have their own language patterns. We investigated
this by comparing, using both the Jaccard and the
Euclidean measures, the external messages to and from
a focal community against the internal messages of every
community. We found that the most similar community in
each case was the original focal community. This indicates
that the change in language characteristics is indeed due
to the sender of a message conforming to the language use
of the receiver.
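The comparison can be sketched as a nearest-neighbour search over communities. The Python example below uses the Jaccard measure; the data layout (one word-count `Counter` per community) and the names `most_similar_group`, `pets` and `tech` are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter

def jaccard_distance(a, b):
    """One minus the multiset intersection divided by the multiset union."""
    words = set(a) | set(b)
    union = sum(max(a[w], b[w]) for w in words)
    if union == 0:
        return 0.0  # two empty collections are treated as identical
    return 1 - sum(min(a[w], b[w]) for w in words) / union

def most_similar_group(external, internal_by_group, distance=jaccard_distance):
    """Return the group whose internal word usage is closest to the
    external word usage of a focal group."""
    return min(internal_by_group,
               key=lambda group: distance(internal_by_group[group], external))

internal_by_group = {
    "pets": Counter(cat=5, dog=4, treat=1),
    "tech": Counter(code=6, bug=3, cat=1),
}
# External messages of the (hypothetical) focal group 'pets'.
external_of_pets = Counter(cat=2, dog=1, code=1)
closest = most_similar_group(external_of_pets, internal_by_group)
```

In this toy example the external messages of `pets` are closer to the internal messages of `pets` (distance 8/11) than to those of `tech` (distance 5/6), which mirrors the pattern reported above.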
Figure 2: Linguistic variation between internal and external tweets
increases with the proportion of tweets sent within a group. a) distance
between word-usage (circles with regression line, two-tailed
$p = 5.6 \times 10^{-6}$), b) distance between word-endings (triangles and
regression line, two-tailed $p = 0.052$), c) distance between apostrophe
usage (crosses and regression line, two-tailed $p = 0.0074$).
The groups of Twitter users analysed in this work were
generated by partitioning the sampled network of Twitter
users such that the proportion of messages sent within the
groups was maximised: so called modularity maximisation
(Blondel et al., 2008; Newman and Girvan, 2004). This
generated closely interlinked groups that are relatively iso-
lated from the rest of the network. We assessed whether
there is any relationship between the level of isolation of
a group, measured as the proportion of messages sent by
that group to other members of the same group, and the
amount of linguistic variation between internal and ex-
ternal messages. We found that the distances between
word and apostrophe usage correlated signiﬁcantly with
the proportion of messages sent within the groups (Fig-
ure 2). This indicates that the more a group was isolated
from the rest of the network, the more it showed linguistic
We did not ﬁnd a signiﬁcant correlation for word-ending
variation against the proportion of internal tweets (Figure
2, panel b). A visual inspection of the ﬁgure reveals that
one of the groups is an outlier from the rest across all three
measurements of linguistic variation. This group (number
93) is made up of a network of people that organise on-
line parties called ‘pawpawties’ to raise money for animal
charities (Manning, 2009). It is intriguing that this group,
which largely exists on Twitter, has much stronger lan-
guage accommodation features compared to similar groups
which appear to have much stronger oﬄine interaction.
When we remove this outlier from the regression, we ﬁnd
that there is a signiﬁcant correlation for word-ending vari-
ation against the proportion of internal tweets (two-tailed
Our work demonstrates how computational methods
can be used to study social processes on large-scale social
networks. Our study was done on an unrestricted large-
scale sample of Twitter where individuals interact freely
with one another. We used topological analysis to identify
social groups in the network and then demonstrated how
linguistic behaviour will change according to the group
membership of the interlocutors. This shows how sub-
tle trends in linguistic behaviour aggregate to form social
identity through communication accommodation and lin-
The work illustrates an important methodological tool
for studying social processes on large scale social networks.
Measurements of social behaviour, especially language fea-
tures, rarely appear to conform to a normal distribution
and are thus diﬃcult to analyse with traditional statistical
methods. In this work we use a bootstrap method which,
through resampling our data, is independent of whichever
distribution the original measurements might come from.
The bootstrap is a simple, but powerful, tool for statisti-
cal analysis of subtle social processes at such a large scale
(Efron and Tibshirani, 1993).
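The resampling test can be sketched in Python as below. This is an illustration under stated assumptions rather than the code used for the study: the tokenisation, the function names and the fixed random seed are all choices made for the example, and the Euclidean measure stands in for any of the linguistic distances.

```python
import math
import random
from collections import Counter

def relative_frequencies(messages):
    """Relative word usage frequencies (assumption: whitespace tokens)."""
    counts = Counter(word for text in messages for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def euclidean_distance(messages_a, messages_b):
    fa, fb = relative_frequencies(messages_a), relative_frequencies(messages_b)
    return math.sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2
                         for w in set(fa) | set(fb)))

def bootstrap_p_value(internal, external, distance=euclidean_distance,
                      n_resamples=1000, seed=0):
    """Proportion of resampled pairs of collections whose distance exceeds
    the distance between the original internal and external collections."""
    rng = random.Random(seed)
    observed = distance(internal, external)
    pooled = list(internal) + list(external)  # union of both collections
    exceeded = 0
    for _ in range(n_resamples):
        # Resample, with replacement, two collections of the original sizes.
        resampled_a = rng.choices(pooled, k=len(internal))
        resampled_b = rng.choices(pooled, k=len(external))
        if distance(resampled_a, resampled_b) > observed:
            exceeded += 1
    return exceeded / n_resamples

p_distinct = bootstrap_p_value(["cat cat cat"] * 20, ["dog dog dog"] * 20)
p_same = bootstrap_p_value(["a b", "c d"] * 10, ["a b", "c d"] * 10)
```

When the two collections share no words, no resampled pair can exceed the observed distance, so `p_distinct` is 0. When the collections are identical the observed distance is zero and most resampled pairs exceed it, so `p_same` is large.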
Our study has found evidence of behaviour on the Twit-
ter social network that is consistent with theory on social
identity. The results show that people are aware, either
implicitly or explicitly, of the social identity of their inter-
locutor and change their language usage accordingly. This
demonstrates that interaction networks with limited com-
munication channels are still sophisticated enough to allow
their members to express social identities. We have also
found that the extent to which members change their lan-
guage characteristics depends on how isolated their group
is from the rest of the network. This shows that social con-
vergence between several individuals is strongly related to
the proportion of their total interaction that they spend
within the group.
This study is compatible with other studies of linguis-
tic variation within and between groups (e.g. Bell, 1984;
Gregory and Carroll, 1978), and the idea that communi-
ties may develop unique linguistic styles which can become
intertwined with, and markers of, their identity. Our ﬁnd-
ing of linguistic diﬀerences between internal and external
tweets echoes sociolinguistic work on situational ﬂuctua-
tions in linguistic registers (e.g. Iwasaki and Horie, 2000)
and supports a social identity perspective that views such
linguistic variation as part of the process of social cate-
An important diﬀerence with previous studies is the
scale at which this study took place. For instance, previ-
ous studies that have looked at convergence did not ﬁnd
signiﬁcance with sample sizes of 30 conversations (Christo-
pherson, 2011). Our approach surpasses the boundaries
of survey or interview, and laboratory or ﬁeld based in-
vestigations, with millions of conversations being analysed
yielding signiﬁcant statistical power. While the environ-
ment of Twitter is somewhat speciﬁc and does not relate
to many other on- and oﬄine environments, the fact that
our results here were replicated for each community tested
indicates that our result is likely to be generalisable.
The diﬀerences in word usage between the internal and
external messages of each group may be due to each group
sharing interests in certain subjects. To go beyond sub-
ject areas, we also looked at word endings and apostrophe
usage. This is consistent with theory which shows how,
as groups become associated with particular communication
styles, members may reference those styles in their
communicative acts as a means of claiming or expressing the
identity in question (Rampton, 1995).
Our study was restricted to English language groups
because a large proportion of the groups in our sample of
Twitter used English. While there were groups that spoke
other languages in our data, we did not have the quantity
of data to adequately resolve sub-groups for non-English
speaking Twitter users. We would expect, with more data,
to be able to resolve sub-groups for non-English speaking
users, and thus be able to test the theory across many
It is possible that the sampling algorithm that we originally
used to sample the Twitter messages may have
introduced some biases which would mean that our sample
is not representative of Twitter as a whole. The first issue is
that the sampling process can have some bias toward Twitter users that
have had messages sent to them. To mitigate this, we made
sure that unsampled users were only placed once on the list
of users to be sampled, even if they had been messaged
by several previously sampled users. The second issue is
that there may be a bias toward certain communities - es-
pecially toward the community of the user ﬁrst sampled.
We cover this in more detail in a previous paper (Bryden
et al., 2013) arguing that the sampler will move to random
communities relatively quickly. We found that our sam-
pling method detected a broad variety of communities and
this indicates the sample is likely to be representative of
Possible extensions of our work include testing theory
on out-groups, where theories
such as Communication Accommodation Theory and the
Social Identity Model of Deindividuation predict divergence
when interlocutors message certain external groups.
We did not find any evidence of this in our study, as we
found that external messages for a particular group were
still closer to the internal messages of that group than to those of any
other. Further investigations of how language character-
istics converge and/or diverge over time may shed some
light on this topic and be of interest in their own right.
Finally, we may also be able to improve an algorithm that
predicts the groups of individuals based on their language
patterns (Bryden et al., 2013), by comparing an individ-
ual’s language use against that of only the internal tweets
of the groups.
Even though the conversations we studied on Twitter
were made up of very short text messages which are publicly
posted, these results indicate that many complex
features of normal oﬄine communication take place on-
line. While such behaviour may not be evident at a small
scale, the large quantities of data used in this study meant
that we were able to identify these subtle patterns. This
indicates that future studies on social identity, social be-
haviour and cooperation are likely to prove fruitful.
Thanks to Shaun Wright, Tim Harrison and the anony-
mous reviewers. This work was supported by the Eco-
nomic and Social Research Council (grant ES/L000113/1).
Ackland, R., O’Neil, M., 2011. Online collective identity: The case
of the environmental movement. Social Networks 33 (3), 177–190.
Ashforth, B. E., Mael, F., 1989. Social identity theory and the orga-
nization. Academy of Management Review 14 (1), 20–39.
Bell, A., 1984. Language style as audience design. Language in
society 13 (2), 145–204.
Blondel, V. D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E., 2008.
Fast unfolding of communities in large networks. Journal of Sta-
tistical Mechanics: Theory and Experiment 2008 (10), P10008.
Boyd, R., Richerson, P. J., 2009. Culture and the evolution of hu-
man cooperation. Philosophical Transactions of the Royal Society
B: Biological Sciences 364 (1533), 3281–3288.
Bryden, J., Funk, S., Geard, N., Bullock, S., Jansen, V. A., 2011. Sta-
bility in ﬂux: Community structure in dynamic networks. Journal
of The Royal Society Interface 8 (60), 1031–1040.
Bryden, J., Funk, S., Jansen, V. A., 2013. Word usage mirrors com-
munity structure in the online social network Twitter. EPJ Data
Science 2 (1), 1–9.
Carroll, K. S., 2008. Puerto Rican language use on MySpace.com.
Centro Journal 20 (1), 96–111.
Christopherson, L., 2011. Can u help me plz?? Cyberlanguage
accommodation in virtual reference conversations. Proceedings
of the American Society for Information Science and Technology
48 (1), 1–9.
Costolo, R., 2013. Twitter, Inc.: Initial public oﬀering.
Danescu-Niculescu-Mizil, C., Gamon, M., Dumais, S., 2011. Mark
my words!: Linguistic style accommodation in social media. In:
Proceedings of the 20th international conference on World wide
web. p. 745–754.
Efron, B., Tibshirani, R. J., 1993. An Introduction to the Bootstrap.
Chapman and Hall/CRC, New York.
Ethier, K. A., Deaux, K., 1994. Negotiating social identity when con-
texts change: Maintaining identiﬁcation and responding to threat.
Journal of Personality and Social Psychology 67 (2), 243.
Fortunato, S., 2010. Community detection in graphs. Physics
Reports 486 (3), 75–174.
Gallagher, E. D., 1999. COMPAH documentation.
Gallois, C., Ogay, T., Giles, H., 2005. Communication accommo-
dation theory: A look back and a look ahead. In: Gudykunst,
W. B. (Ed.), Theorizing About Intercultural Communication.
Sage, Thousand Oaks, CA, p. 121–148.
Giles, H., 1973. Accent mobility: A model and some data. Anthro-
pological Linguistics 15 (2), 87–105.
Gomaa, W. H., Fahmy, A. A., 2013. A survey of text similarity ap-
proaches. International Journal of Computer Applications 86 (13),
Gregory, M., Carroll, S., 1978. Language and situation: Language
varieties and their social contexts. Routledge and Kegan Paul,
London, Henley and Boston.
Gumperz, J. J., 1958. Dialect diﬀerences and social stratiﬁcation in
a North Indian village. American Anthropologist 60 (4), 668–682.
Hogg, M. A., Reid, S. A., 2006. Social identity, self-categorization,
and the communication of group norms. Communication Theory
16 (1), 7–30.
Hormuth, S. E., 1990. The ecology of the self: Relocation and
self-concept change. Cambridge University Press.
Iwasaki, S., Horie, P. I., 2000. Creating speech register in Thai
conversation. Language in Society 29 (04), 519–554.
Java, A., Song, X., Finin, T., Tseng, B., 2007. Why we twitter:
Understanding microblogging usage and communities. In: Pro-
ceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop
on Web mining and social network analysis. p. 56–65.
Labov, W., 1966. The linguistic variable as structural unit. Wash-
ington Linguistics Review 3, 4–22.
Manning, S., 2009. Animal lovers throw ‘pawpawties’ for charity.
Animal-lovers-throw-pawpawties-for-charity-3285436.
Newman, M. E., 2006. Modularity and community structure in net-
works. Proceedings of the National Academy of Sciences 103 (23),
Newman, M. E., Girvan, M., 2004. Finding and evaluating commu-
nity structure in networks. Physical Review E 69 (2), 026113.
Porter, M. A., Onnela, J.-P., Mucha, P. J., 2009. Communities in
networks. Notices of the AMS 56 (9), 1082–1097.
Postmes, T., Spears, R., Lea, M., 2000. The formation of group
norms in computer-mediated communication. Human Communi-
cation Research 26 (3), 341–371.
Postmes, T., Spears, R., Sakhel, K., De Groot, D., 2001. Social
inﬂuence in computer-mediated communication: The eﬀects of
anonymity on group behavior. Personality and Social Psychology
Bulletin 27 (10), 1243–1254.
Rampton, B., 1995. Crossing: Language and ethnicity among ado-
lescents. Longman, London.
Riordan, M. A., Markman, K. M., Stewart, C. O., 2013. Communication
accommodation in instant messaging: An examination of
temporal convergence. Journal of Language and Social Psychol-
ogy 32 (1), 84–95.
Scott, C. R., 2007. Communication and social identity theory: Ex-
isting and potential connections in organizational identiﬁcation
research. Communication Studies 58 (2), 123–138.
Shepard, C. A., Giles, H., Le Poire, B. A., 2001. Communication
accommodation theory. In: Robinson, W. P., Giles, H. (Eds.),
The new handbook of language and social psychology. John Wiley,
New York, p. 33–56.
Tajfel, H., Turner, J. C., 1979. An integrative theory of intergroup
conﬂict. In: Austin, W. G., Worchel, S. (Eds.), The social
psychology of intergroup relations. Brooks/Cole, Monterey CA,
Traud, A. L., Mucha, P. J., Porter, M. A., 2012. Social structure
of Facebook networks. Physica A: Statistical Mechanics and its
Applications 391 (16), 4165–4180.
Virk, A., 2011. Twitter: The strength of weak ties. University of
Auckland Business Review 13 (1), 19–21.
Wagner, C., Asur, S., Hailpern, J., 2013. Religious politicians and
creative photographers: Automatic user categorization in Twitter.
In: ASE/IEEE International Conference on Social Computing