Mark My Words! Linguistic Style Accommodation in Social Media

Cristian Danescu-Niculescu-Mizil
Cornell University
Ithaca, NY 14853,USA
cristian@cs.cornell.edu
Michael Gamon
Microsoft Research
Redmond, WA 98053,USA
mgamon@microsoft.com
Susan Dumais
Microsoft Research
Redmond, WA 98053,USA
sdumais@microsoft.com
ABSTRACT
The psycholinguistic theory of communication accommoda-
tion accounts for the general observation that participants in
conversations tend to converge to one another’s communica-
tive behavior: they coordinate in a variety of dimensions in-
cluding choice of words, syntax, utterance length, pitch and
gestures. In its almost forty years of existence, this theory
has been empirically supported exclusively through small-
scale or controlled laboratory studies. Here we address this
phenomenon in the context of Twitter conversations. Un-
doubtedly, this setting is unlike any other in which accom-
modation was observed and, thus, challenging to the theory.
Its novelty comes not only from its size, but also from the
non real-time nature of conversations, from the 140 charac-
ter length restriction, from the wide variety of social rela-
tion types, and from a design that was initially not geared
towards conversation at all. It is
not clear a priori whether accommodation is robust enough
to occur under the constraints of this new environment. To
investigate this, we develop a probabilistic framework that
can model accommodation and measure its effects. We ap-
ply it to a large Twitter conversational dataset specifically
developed for this task. This is the first time the hypothesis
of linguistic style accommodation has been examined (and
verified) in a large scale, real world setting.
Furthermore, when investigating concepts such as stylis-
tic influence and symmetry of accommodation, we discover
a complexity of the phenomenon which was never observed
before. We also explore the potential relation between stylis-
tic influence and network features commonly associated with
social status.
Categories and Subject Descriptors: J.4 [Computer
Applications]: Social and behavioral sciences
General Terms: Measurement, Experimentation, Theory
Keywords: linguistic style accommodation, linguistic con-
vergence, social media, Twitter conversations
The research described herein was conducted while the first
author was a summer intern at Microsoft Research.
Copyright is held by the International World Wide Web Conference Com-
mittee (IW3C2). Distribution of these papers is limited to classroom use,
and personal use by others.
WWW 2011, March 28–April 1, 2011, Hyderabad, India.
ACM 978-1-4503-0632-4/11/03.
Language is a social art.
—Willard van Orman Quine, Word and Object
1. INTRODUCTION
The theory of communication accommodation was devel-
oped to account for the general observation that in conver-
sations people tend to nonconsciously converge to one an-
other’s communicative behavior: they coordinate in a vari-
ety of dimensions including choice of words, syntax, pausing
frequency, pitch and gestures [12]. In the last forty years,
this phenomenon has received significant attention and nu-
merous studies indicate that such convergence occurs almost
instantly for a very diverse set of communication patterns
(see Table 1 for examples). These findings suggest that the
communicative behaviors of conversational partners are “patterned
and coordinated, like a dance” [32]. However, up to
now this “dance” was exclusively studied in controlled labo-
ratory experiments or through small scale studies. The work
presented here demonstrates for the first time the robustness
of accommodation theory in a large scale, real world envi-
ronment: Twitter.
Conversations on Twitter: a new hope. Even though
not originally developed as a conversation medium, Twitter
turns out to be a fertile ground for dyadic interactions. It is
estimated that a quarter of all its users hold conversations
with other users on this platform [22] and that around 37%
of all tweets are conversational [36]. The fact that these
conversations are public renders Twitter one of the largest
publicly available resources of naturally occurring conversa-
tions.
Undoubtedly, Twitter conversations are unlike those used
in previous studies of accommodation. One of the main dif-
ferences is that these conversations are not face-to-face and
do not happen in real-time. As with email, a user does not
need to immediately reply to another user’s message; this
might affect the incentive to use accommodation as a way to
increase communication efficiency. Another difference is the
(famous) restriction of 140 characters per message, which
might constrain the freedom one user has to accommodate
the other. It is not a priori clear whether accommodation is
robust enough to occur under these new constraints.
Also, with very few exceptions, accommodation was only
tested in the initial phase of the development of relations
between people (i.e., during the acquaintance process) [11].
The relations between Twitter users, on the other hand, are
expected to cover a much wider spectrum of development,
ranging from newly-introduced to old friends (or enemies).
WWW 2011 – Session: Information Spread
March 28–April 1, 2011, Hyderabad, India
Thus, also from this perspective, the Twitter environment
constitutes a new challenge to the theory.
Linguistic style accommodation. One of the dimensions
on which people were shown to accommodate is linguistic
style [32, 39, 14], where style denotes the components of
language that are unrelated to content: how things are said
as opposed to what is said. This work will focus on this type
of accommodation. This is a rather important dimension,
since, even though only 0.05% of the English vocabulary is
composed of style words (such as articles and prepositions),
an estimated 55% of all words people employ are style words
[38]. These numbers do not necessarily hold on Twitter,
where one might expect style to be sacrificed in favor of
content given the length constraint; however, some recent
studies also advocate for the importance of style in Twitter
[8, 24, 35]. Linguistic style has also been central to a series
of NLP applications like authorship attribution and forensic
linguistics [30, 41, 18, 23], gender detection [25, 31, 17] and
personality type detection [1].
Linguistic style is also known to be, for the most part, gen-
erated and processed nonconsciously [26], and thus a suit-
able vehicle for studying the phenomenon of accommoda-
tion, which itself is assumed to occur nonconsciously.
Probabilistic framework. Previous work on accommodation
relied mainly on simple correlation-based measures. A
new framework is necessary in order to correctly model and
measure the effects of linguistic style accommodation in a
real world, uncontrolled environment. The main desiderata
from such a framework are:
• Comparability: the effects of accommodation on different
  components of style should be comparable.
• Expressivity: the framework should be expressive enough
  to permit the evaluation of particular properties of
  accommodation (discussed in Section 2).
• Purity: accommodation should not be confounded with
  other phenomena.
The last of these desiderata is probably the hardest to
achieve and thus deserves some discussion here. The main
challenge is to distinguish accommodation from the effects
of homophily: people that converse are likely to employ a
similar linguistic style simply because they know each other.
As detailed in Section 5, we control for this effect by using
the temporal aspect specific to accommodation: a person
can accommodate to her conversational partner only after
receiving her input. Another type of potential confusion
is that between linguistic style accommodation and topic
accommodation; this is avoided in this work by a careful
selection of the stylistic features following a methodology
employed in psycholinguistic literature (discussed in Section
4).
Stylistic influence and symmetry. Another advantage
of the proposed framework is that a new concept of stylistic
influence emerges naturally: given two conversational part-
ners, one can influence the style of the other more than vice-
versa. This concept is a finer-grained version of the concept
of symmetry of accommodation proposed in the psycholinguistic
literature [12]: accommodation can occur symmetrically
when both participants in a conversation accommodate
to each other or asymmetrically when only one accommo-
dates. In the latter case, the non-accommodating partici-
pant can either maintain her default behavior, or adjust her
behavior in the opposite direction from that of the accom-
modating participant (i.e., diverge). We are able to show
that imbalance in stylistic influence between Twitter users
is preponderant and that symmetry in accommodation is
dependent on the stylistic dimension (Figure 4); for exam-
ple, users are more likely to accommodate symmetrically
on the use of 1st person singular pronouns but to accom-
modate asymmetrically on the use of prepositions. This is
the first time such a rich complexity of the accommodation
phenomenon is revealed.
A variety of studies relate accommodation and social sta-
tus. For example, it was hypothesized that a person of lower
status will try to accommodate to a person of higher status
in order to gain her approval [11, 37]. We take the first steps
towards understanding the relation between the concepts of
stylistic influence and social status, as reflected in Twitter
network features, like number of followers and number of
friends, that could be considered (rough) proxies for social
status (Section 6.3). Rather surprisingly, we observe almost
no correlation between these features and stylistic influence.
Applicability. Apart from its appealing theoretical importance,
accommodation also has a variety of potential practical
uses. Based on the premise that accommodation has a
subtle positive effect on interpersonal communication, Giles
et al. [13] discuss applications of accommodation in mediating
police-civilian interactions. On a similar note, Taylor
and Thomas [39] show its relevance in the context of
hostage negotiations. Accommodation was also shown to be
practical in the treatment of mental disability [16] and psy-
chotherapy [9]. In Section 8 we also venture into proposing
three new potential applications specific to linguistic style
accommodation. We believe that by providing a way to
model accommodation and by demonstrating its robustness
in a real world environment, the present work provides a
framework which supports a wider implementation of such
applications.
2. COMMUNICATION ACCOMMODATION
THEORY
The psycholinguistic theory of communication accommo-
dation was developed around the following main hypothesis:
in dyadic conversations the participants converge to one an-
other’s communicative behavior in terms of a wide range of
dimensions [12], both verbal and non-verbal. Table 1 pro-
vides a sample of such converging dimensions. Many studies
seem to indicate that the communicative behaviors of the
participants “are patterned and coordinated, like a dance”
[32].
Table 1: Examples of dimensions for which accommodation was observed and the respective studies.

Dimension          Canonical study
Posture            Condon and Ogston, 1967
Pause length       Jaffe and Feldstein, 1970
Utterance length   Matarazzo and Wiens, 1973
Self-disclosure    Derlenga et al., 1973
Head nodding       Hale and Burgoon, 1984
Backchannels       White, 1989
Linguistic style   Niederhoffer and Pennebaker, 2002

Among various properties of accommodation discussed in
the literature, here we briefly review a few that are relevant
to our work. First, one should keep in mind that the
coordination occurs nonconsciously. Second, accommodation
does not necessarily occur simultaneously on all dimensions,
as shown in [9]. Moreover, convergence on some dimensions
does not exclude divergence on others: for example, [3]
showed that when conversing with males, females converged
on frequency of pauses but diverged on laughter. Another
property that is relevant to this work is that of symmetry
of accommodation: accommodation can occur symmetrically
when both participants in a conversation accommodate
to each other or asymmetrically when only one accommodates.
For example, White [40] presents a study in which
Americans accommodated to Japanese on the frequency of
backchannels (e.g., ‘hmm’, ‘uh-huh’) but the Japanese did
not reciprocate. Asymmetric accommodation has two fla-
vors, depending on the behavior of the non-accommodating
participant:
• Default asymmetry: the non-accommodating participant
  maintains her default behavior (as in the previous example);
• Divergent asymmetry: the non-accommodating participant
  adjusts her behavior in the opposite direction from that of
  the accommodating participant (i.e., diverges) [12].
It is also worth pointing out that the subject of this work
is instant accommodation, occurring from one conversational
turn to another. Long-term accommodation is considered to
be a separate phenomenon with potentially different properties
[9, 12]. With a few notable exceptions [9, 32], empirical
support for long-term accommodation is largely absent, mostly
because it requires longitudinal data.
Various potential explanations for why accommodation
occurs have been proposed. One hypothesis is that accom-
modation occurs from a desire to increase communicational
efficiency [37]. Another hypothesis is that a person’s convergence
to another person’s communicative patterns is (nonconsciously)
driven by the desire to gain the other’s social approval
[11, 37]. Yet another possible motivation is that accommodation
is used to “maintain a positive social identity”
[20] with the other. The last two hypotheses and several
other studies draw a clear relation between social status and
accommodation (see also [12]), which will become relevant
later in our discussion.
In the present work the focus is on linguistic style accom-
modation, and therefore the work of Niederhoffer and Pen-
nebaker [32] is particularly relevant, being the first study
to quantify this phenomenon. It consists of two controlled
laboratory experiments (involving 94 dyads) and one study
based on transcripts of the Watergate tapes (conversations
between Nixon and 3 of his aides) in which coordination on
various linguistic style dimensions, like usage of prepositions,
adverbs and tentative words is shown to occur between the
participants.
In its almost forty years of existence, communication ac-
commodation theory was empirically supported exclusively
through small scale studies or controlled laboratory experi-
ments. Also the respective studies focused mainly on real-
time interactions (mostly face-to-face, but sometimes com-
puter mediated like in [32]). With this work we aim to
change this state of affairs and demonstrate the robustness
of this theory in a large scale, real world environment where
conversations are not as richly supported as they are in real-
time interactions.
3. CONVERSATIONAL DATA
As discussed in Section 1, Twitter is a good environment
for our study not only because of its fertility in dyadic in-
teractions, but also because it poses new challenges to the
theory of communication accommodation in terms of robust-
ness.
Drawing from this resource, Ritter et al. [36] build the
largest conversational corpus available to date, made up of
1.3 million conversations between 300,000 users. We will re-
fer to this corpus as conversational dataset A. In spite of its
size, this corpus presents some major drawbacks with respect
to the purpose of this paper. First, it has a low density of
conversations per pair of conversing users: on average only
4.3 conversations per user; this is not sufficient to model
the linguistic style of each pair individually (as required by
the accommodation framework proposed in this work and
detailed in Section 5). Also, more than half of the pairs
of users in this dataset only have unidirectional interaction,
i.e., one of the users in a pair never writes to the other.
This would not only introduce a bias with respect to the type
of conversations and relations studied (unidirectional interactions
are generally not classified as normal conversations),
but would also drastically limit the potential to compare
accommodation between users.
To overcome these limitations, we construct a new conver-
sational dataset with very high density of conversations per
pair and with reciprocated interactions. We start from con-
versational dataset A and select all pairs in which both users
initiated a conversation at least 2 times. We then collect all
tweets posted by these users using the Twitter API¹ and
then reconstruct all the conversations between the selected
pairs. The resulting dataset contains 15 million tweets which
make up the complete² public Twitter activity (a.k.a. public
timeline) of 7,800 users; for each user Twitter metadata
(such as the number of friends, the number of followers, the
location, etc.) is also available. From these tweets we recon-
structed 215,000 conversations between the 2,200 pairs of
users with reciprocal relations selected from conversational
dataset A, using the same methodology for reconstructing
conversations employed in [36]³. This conversational dataset
is complete, in the sense that all Twitter conversations ever
held within each pair are available. To the best of our knowl-
edge, this is the largest complete conversational dataset.
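The pair-selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the input format (a list of `(initiator, recipient)` tuples, one per conversation) and the function name `reciprocal_pairs` are hypothetical and are not taken from the original pipeline.

```python
from collections import Counter
from typing import FrozenSet, Iterable, Set, Tuple


def reciprocal_pairs(
    initiations: Iterable[Tuple[str, str]], min_each: int = 2
) -> Set[FrozenSet[str]]:
    """Keep unordered user pairs in which BOTH users initiated
    at least `min_each` conversations with the other."""
    # counts[(u, v)] = number of conversations that u started with v
    counts = Counter(initiations)
    return {
        frozenset((u, v))
        for (u, v), n in counts.items()
        if u != v and n >= min_each and counts.get((v, u), 0) >= min_each
    }


# Toy data: a and b both initiate at least twice; c initiates only once with a.
toy = [("a", "b")] * 2 + [("b", "a")] * 3 + [("a", "c")] * 5 + [("c", "a")]
selected = reciprocal_pairs(toy)
```

In this toy input only the pair {a, b} survives, since c initiated a single conversation with a.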
The diversity of the user relations and conversations con-
tained in this conversational dataset, dubbed conversational
dataset B, is illustrated in the following table summarizing
per-pair statistics:
Per-pair statistic         Mean   Median   Min   Max
Number of conversations    98     60       1     1744
Average number of turns    2.7    2.6      2     16.8
Days of contact            270    257      1     886

¹ http://apiwiki.twitter.com/
² Complete up to a maximum of 3200 most recent tweets per
user, a limitation imposed by the Twitter API.
³ Additionally, we remove self-replies and retweets from the
data on the belief that they are not part of a proper
dyadic interaction.
The main unit of interaction in this work is a conversa-
tional turn, which is defined as two consecutive tweets in a
conversation. The two tweets in a turn are always sent by
different users and are not re-tweets. Conversational dataset
A contains 2.6 million turns and conversational dataset B
contains 420,000 turns.⁴
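As a rough illustration of this definition, the sketch below extracts turns from one conversation. The representation of a tweet as an `(author, text)` tuple and the helper name `turns` are assumptions made for the example; retweet filtering is assumed to have happened upstream.

```python
from typing import List, Tuple

Tweet = Tuple[str, str]  # (author, text); hypothetical minimal representation


def turns(conversation: List[Tweet]) -> List[Tuple[Tweet, Tweet]]:
    """Return consecutive (initial, reply) tweet pairs written by different users."""
    return [
        (first, second)
        for first, second in zip(conversation, conversation[1:])
        if first[0] != second[0]  # a turn never pairs a user with herself
    ]


conversation = [("a", "hi"), ("b", "hello you"), ("a", "how are you"), ("b", "fine")]
example_turns = turns(conversation)
```

A conversation of four alternating tweets therefore yields three turns, because every interior tweet acts both as a reply and as the initial tweet of the next turn.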
4. MEASURING LINGUISTIC STYLE
Miller [28] shows that style and topic are processed dif-
ferently in the brain. The distinction between the two is
important in our investigation of linguistic style accommo-
dation. In order to measure style and avoid confusion with
topic we follow a psycholinguistic methodology used in a variety
of applications, known as the Linguistic Inquiry and
Word Count (LIWC) method.
LIWC [34] measures word use in psychologically meaning-
ful categories (e.g., articles, auxiliary verbs, positive emo-
tions). It uses over 60 such categories, and dictionaries of
words related to each category. This method has been used
in a variety of applications (summarized in [38]) including to
identify social relations, mental health, and individual traits
such as gender, age and relative status. More importantly,
LIWC is the basis of all recent work on linguistic style ac-
commodation [32, 39, 14] to which we want to relate.
Following the example of these studies, we eliminate all
categories related to topic, such as Leisure, Religion or Death.
We refer to the 50 remaining dimensions as style dimensions.
In order to facilitate the presentation of the empirical results,
we will focus our discussion on a subset of 14 dimensions that
we call strictly non-topical style dimensions:
Dimension                     Examples           Size
Article                       an, the            3
Certainty                     always, never      83
Conjunction                   but, whereas       28
Discrepancy                   should, would      76
Exclusive                     without, exclude   17
Inclusive                     with, include      18
Indefinite pronoun            it, those          46
Negation                      not, never         57
Preposition                   to, with           60
Quantifier                    few, much          89
Tentative                     maybe, perhaps     155
1st person singular pronoun   I, me              12
1st person plural pronoun     we, us             12
2nd person pronoun            you, your          20
For completeness, we mention that all the results presented
in this paper also hold for all the other style dimensions
(see [38] for a complete list), unless otherwise noted.
We say that a tweet exhibits a given stylistic dimension
if it contains at least one word from the respective LIWC
vocabulary. A tweet can exhibit multiple dimensions and,
in fact, the vast majority do.
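The "exhibits" test above amounts to a set intersection between a tweet's tokens and a category vocabulary. The sketch below uses small toy vocabularies invented for illustration (the actual LIWC dictionaries are not reproduced here), and its tokenizer is a simplifying assumption.

```python
import re
from typing import Dict, Set

# Toy vocabularies for illustration only; NOT the real LIWC word lists.
TOY_VOCAB: Dict[str, Set[str]] = {
    "article": {"a", "an", "the"},
    "negation": {"not", "never", "no"},
    "1st_person_singular": {"i", "me", "my"},
}


def exhibited_dimensions(tweet: str, vocab: Dict[str, Set[str]] = TOY_VOCAB) -> Set[str]:
    """Return every dimension for which the tweet contains at least one vocabulary word."""
    # Simplistic tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = set(re.findall(r"[a-z']+", tweet.lower()))
    return {dim for dim, words in vocab.items() if tokens & words}


dims = exhibited_dimensions("I never saw the movie")
```

Because the test is binary per dimension, a tweet that uses three articles counts the same as one that uses a single article.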
Although we experimented with different methods of ex-
tending the LIWC vocabularies with Twitter-specific expres-
sions, we prefer to keep in line with previous literature on
linguistic style matching by using the original vocabularies.
⁴ We are unable to make the data public at the time of publication
in consideration of the Twitter terms of service.
5. PROBABILISTIC FRAMEWORK
This section introduces a probabilistic framework that can
model the phenomenon of accommodation. In defining such
a framework, the desirable properties discussed in Section
1 are accounted for: comparability, expressivity and purity.
Although designed to be applicable to any type of conversa-
tional data and style dimensions, for notational consistency
with the rest of the paper, we use the term “tweet” to refer
to a conversational utterance.
5.1 Stylistic cohesion
We start by addressing the more general phenomenon of
stylistic cohesion. It reflects the intuition that tweets be-
longing to the same conversation are closer stylistically than
tweets that do not. Cohesion is defined by comparing the
probability that a stylistic dimension is exhibited in tweets
that are part of a conversation with the probability that the
same dimension is exhibited in unrelated tweets. If the for-
mer equals the latter, it means that the distribution of the
stylistic dimension is the same whether tweets are part of a
conversation or not. If the former is larger than the latter,
it means that tweets in a conversation tend to “agree” with
respect to the stylistic dimension. If the former is smaller
than the latter, it means that tweets in a conversation tend
to “disagree” with respect to the stylistic dimension. For-
mally, for a given dimension C, the measure of stylistic co-
hesion can be expressed through the following probabilistic
expression:
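\[ \mathrm{Coh}(C) \triangleq P\left(T^C \wedge R^C \mid T \leftrightarrow R\right) - P\left(T^C \wedge R^C\right) \tag{1} \]
where $T^C$ (respectively $R^C$) is the event in which a tweet $T$
(respectively $R$) exhibits $C$, and $T \leftrightarrow R$ is the condition that
tweets $T$ and $R$ form a conversational turn⁵. Thus, demonstrating
that cohesion is observable for stylistic dimension
$C$ is reduced to showing that $\mathrm{Coh}(C) > 0$.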
It should be emphasized that accommodation is only one
of the possible causes for stylistic cohesion. Another ex-
planation can be the indirect effect of homophily already
discussed in Section 1: people that converse are likely to
employ a similar linguistic style simply because they know
each other or are like each other (we will refer to this as
background style similarity). This observation motivates the
need for a measure which can exclusively target accommo-
dation, discussed next.
5.2 Stylistic accommodation
When defining a probabilistic framework for linguistic style
accommodation it is important to control for the effects of
background style similarity (and thus satisfy the purity desideratum
introduced in Section 1). Here this is achieved by measuring
accommodation for each user pair separately and by
taking into account the distinctive temporal nature of ac-
commodation: a user can accommodate to her conversa-
tional partner only after receiving her input. In doing so,
the concern is eliminated: background style similarity effects,
like homophily, would not be expected to cause differences
within a single pair depending on which of the two users
initiates a conversational turn. Therefore, the goal is to measure,
for a given pair of users $a$ and $b$ who engage in a conversation, whether
the use of a stylistic dimension $C$ in the initial tweet (of
user $a$) increases the probability of that stylistic dimension
in the reply (of user $b$) beyond what is normally expected
from user $b$ (when replying to user $a$).

⁵ The sample space considered throughout this work is the
set of all possible ordered conversational tweet pairs.
Formally, for a given stylistic dimension $C$ and pair of
users $(a, b)$, the accommodation of user $b$ to user $a$ is measured
by how much the fact that user $a$ exhibits $C$ in a tweet
$T_a$ increases the probability of $b$ to also exhibit $C$ in a reply
to $T_a$:
\[ \mathrm{Acc}_{(a,b)}(C) \triangleq P\left(T_b^C \mid T_a^C,\ T_b \hookrightarrow T_a\right) - P\left(T_b^C \mid T_b \hookrightarrow T_a\right) \tag{2} \]
where $T_a^C$ (respectively $T_b^C$) is the event in which a tweet
posted by user $a$ (respectively $b$) exhibits $C$, and $T_b \hookrightarrow T_a$ is
the condition that $T_b$ is a reply to $T_a$. This condition, present
in both the minuend and the subtrahend⁶, has the role of
restricting this measure of accommodation only to replies of
$b$ to $a$, therefore controlling for differences in the background
linguistic similarity between users. Also note that by using
the condition $\hookrightarrow$ instead of the condition $\leftrightarrow$ employed in the
definition of cohesion (1), we embed the distinctive temporal
aspect of accommodation mentioned earlier. This ability to
integrate temporal disparity is an essential advantage of this
framework over the correlation based measures previously
used in studies of stylistic accommodation [32, 39, 14]⁷.
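The advantage over correlation can be illustrated with a small constructed example. In the sketch below each turn is coded as a pair of 0/1 flags (initial tweet exhibits $C$, reply exhibits $C$); the two toy datasets and the helper names `pearson` and `accommodation` are assumptions made for the illustration, not data or code from the study.

```python
from math import sqrt
from typing import List, Tuple

Turn = Tuple[int, int]  # (initial tweet exhibits C, reply exhibits C), coded as 0/1


def pearson(turns: List[Turn]) -> float:
    """Pearson correlation between the two binary indicators."""
    n = len(turns)
    mx = sum(x for x, _ in turns) / n
    my = sum(y for _, y in turns) / n
    cov = sum((x - mx) * (y - my) for x, y in turns)
    var_x = sum((x - mx) ** 2 for x, _ in turns)
    var_y = sum((y - my) ** 2 for _, y in turns)
    return cov / sqrt(var_x * var_y)


def accommodation(turns: List[Turn]) -> float:
    """P(reply has C | initial has C) - P(reply has C), as in Equation (2)."""
    triggered = [y for x, y in turns if x == 1]
    return sum(triggered) / len(triggered) - sum(y for _, y in turns) / len(turns)


# Mismatching turns are all "initial has C, reply does not" ...
initial_only = [(1, 1)] * 5 + [(1, 0)] * 3 + [(0, 0)] * 2
# ... versus all "reply has C, initial does not".
reply_only = [(1, 1)] * 5 + [(0, 1)] * 3 + [(0, 0)] * 2
```

Both toy datasets have the same correlation (0.5), yet the accommodation measure differs (0.125 for `initial_only`, 0.2 for `reply_only`), because swapping the roles of initial tweet and reply leaves correlation unchanged but not the conditional probability.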
Since the main goal is to address global accommodation
(as opposed to the within-pair accommodation described
above), the accommodation for a given dimension Cis de-
fined as:
\[ \mathrm{Acc}(C) = E\left[\mathrm{Acc}_{(a,b)}(C)\right] \tag{3} \]
where the expectation is taken over all possible conversing
pairs $(a, b)$. Under this framework, proving that accommodation
is observable for stylistic dimension $C$ is reduced to
showing that $\mathrm{Acc}(C) > 0$.
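A minimal sketch of Equations (2) and (3) on count data is given below. The data layout (for each ordered pair, a list of boolean flags per reply) and the `min_count` parameter are assumptions for illustration; pairs whose conditional probability is undefined are simply skipped here.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

# One entry per reply of b to a: (a's initial tweet exhibits C, b's reply exhibits C)
Reply = Tuple[bool, bool]


def pair_accommodation(replies: List[Reply], min_count: int = 1) -> Optional[float]:
    """Estimate Acc_(a,b)(C); return None when too few replies are available."""
    triggered = [reply_c for init_c, reply_c in replies if init_c]
    if len(triggered) < min_count or len(replies) < min_count:
        return None
    conditional = sum(triggered) / len(triggered)                    # P(T_b^C | T_a^C, reply)
    baseline = sum(reply_c for _, reply_c in replies) / len(replies)  # P(T_b^C | reply)
    return conditional - baseline


def global_accommodation(by_pair: Dict[Tuple[str, str], List[Reply]]) -> float:
    """Estimate Acc(C) as the mean of the defined per-pair values."""
    values = [pair_accommodation(r) for r in by_pair.values()]
    return mean(v for v in values if v is not None)


toy = {
    ("a", "b"): [(True, True), (True, True), (True, False), (False, False)],
    ("c", "d"): [(True, True), (False, True)],
}
acc_ab = pair_accommodation(toy[("a", "b")])
acc_global = global_accommodation(toy)
```

For the pair (a, b) the conditional probability is 2/3 and the baseline is 1/2, giving roughly 0.167; for (c, d) both are 1, giving 0, so the toy global value is about 0.083.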
5.3 Stylistic influence and symmetry
One important property of the way Acc is defined is its
asymmetry: the accommodation of user $b$ to user $a$ on stylistic
dimension $C$ is potentially different from the accommodation
of user $a$ to user $b$ on the same stylistic dimension.
The notion of stylistic influence arises naturally:
\[ I_{(a,b)}(C) \triangleq \mathrm{Acc}_{(a,b)}(C) - \mathrm{Acc}_{(b,a)}(C) \tag{4} \]
for a given stylistic dimension $C$. If $I_{(a,b)}(C) > 0$ we can say
that $b$ accommodates more to $a$ on $C$ than $a$ does to $b$.
A related concept is accommodation symmetry (discussed
in Section 2), which is tied to the accommodation measure
in the following way. Given that $b$ accommodates to $a$, i.e.,
$\mathrm{Acc}_{(a,b)}(C) > 0$, we have
• Symmetry when $\mathrm{Acc}_{(b,a)}(C) > 0$,
• Default asymmetry when $\mathrm{Acc}_{(b,a)}(C) = 0$,
• Divergent asymmetry when $\mathrm{Acc}_{(b,a)}(C) < 0$.
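The influence measure and the three symmetry cases can be sketched as below. The function names and the tolerance `tol` are hypothetical: on estimated values an exact zero is unlikely, so treating values within `tol` of zero as "default" is an assumption of this sketch rather than part of the definitions.

```python
def influence(acc_ab: float, acc_ba: float) -> float:
    """I_(a,b)(C): positive when b accommodates to a more than a does to b."""
    return acc_ab - acc_ba


def symmetry_type(acc_ab: float, acc_ba: float, tol: float = 0.0) -> str:
    """Classify the accommodation pattern, given that b accommodates to a."""
    if acc_ab <= tol:
        raise ValueError("the classification presupposes Acc_(a,b)(C) > 0")
    if acc_ba > tol:
        return "symmetry"
    if acc_ba < -tol:
        return "divergent asymmetry"
    return "default asymmetry"


label = symmetry_type(0.2, -0.1)
```

With `tol=0.0` the function reproduces the three cases literally; a small positive `tol` widens the "default asymmetry" band around zero.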
⁶ By minuend we mean the left term of a subtraction
and by subtrahend the right one.
⁷ As a concrete example, correlation does not distinguish between
the case in which the initial tweet exhibits $C$ but
the reply does not, and the reverse case in which the
initial tweet does not exhibit $C$ but the reply does.
6. EMPIRICAL VALIDATION
Equipped with the probabilistic framework introduced in
the previous section, here we proceed with an empirical val-
idation of the accommodation phenomenon on the conver-
sation data described in Section 3. As previously discussed,
this setting is fundamentally different from all other circum-
stances in which the theory of communication accommoda-
tion was validated, therefore challenging its robustness.
6.1 Validation of stylistic cohesion
We start by asking whether Twitter conversations are
characterized by stylistic cohesion, since this is a precon-
dition for accommodation. The stylistic cohesion model de-
scribed in Section 5.1 does not distinguish between users
and therefore can be directly applied to the conversational
dataset A (introduced in Section 3).
In order to demonstrate that cohesion is exhibited in our
data we estimate the two probabilities involved in (1) as
follows. We estimate the first probability as the fraction of
all turns in which both tweets exhibit dimension C:
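\[ \hat{P}\left(T^C \wedge R^C \mid T \leftrightarrow R\right) = \frac{\left|\{(t, r) \mid t \leftrightarrow r,\ t^C,\ r^C\}\right|}{\left|\{(t, r) \mid t \leftrightarrow r\}\right|} \tag{5} \]
where $t^C$ denotes the condition that a tweet $t$ exhibits $C$.⁸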
To estimate the second probability, we first construct a
set of “fake turns” by randomly pairing together tweets from
the entire conversational data (regardless of their authors).
We can then write:
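\[ \hat{P}\left(T^C \wedge R^C\right) = \frac{\left|\{(t, r) \mid t \not\leftrightarrow r,\ t^C,\ r^C\}\right|}{\left|\{(t, r) \mid t \not\leftrightarrow r\}\right|} \tag{6} \]
where $t^C$ is the condition that the tweet $t$ exhibits $C$ and
$t \not\leftrightarrow r$ is the condition that the tweets $t$ and $r$ are paired
together in a fake turn.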
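The two estimates (5) and (6) can be sketched as follows. The way fake turns are built here (shuffling all tweets and pairing neighbours) is one simple choice assumed for the illustration; the function name `cohesion`, the `exhibits` predicate and the fixed seed are likewise hypothetical.

```python
import random
from typing import Callable, List, Tuple


def cohesion(
    turns: List[Tuple[str, str]], exhibits: Callable[[str], bool], seed: int = 0
) -> float:
    """Estimate Coh(C): rate of turns where both tweets exhibit C,
    minus the same rate over randomly paired ("fake") turns."""
    rng = random.Random(seed)
    real = sum(bool(exhibits(t) and exhibits(r)) for t, r in turns) / len(turns)
    # Fake turns: shuffle every tweet in the data and pair neighbours at random.
    tweets = [tweet for pair in turns for tweet in pair]
    rng.shuffle(tweets)
    fake_turns = list(zip(tweets[0::2], tweets[1::2]))
    fake = sum(bool(exhibits(t) and exhibits(r)) for t, r in fake_turns) / len(fake_turns)
    return real - fake


# Toy data: half of the turns use an article in both tweets, half in neither.
toy_turns = [("the cat", "the dog")] * 50 + [("hello", "bye")] * 50
has_article = lambda tweet: bool({"a", "an", "the"} & set(tweet.lower().split()))
coh = cohesion(toy_turns, has_article)
```

In this toy data the real rate is exactly 0.5, while random pairing puts two article-bearing tweets together only about a quarter of the time, so the estimate is positive.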
Establishing that cohesion is exhibited in the data corresponds
to rejecting the null hypothesis of these two probabilities
being equal. Fisher’s exact test⁹ rejects this hypothesis
with p-value smaller than 0.0001 for each of the strictly non-
topical style dimensions. Figure 1 shows the estimates of the
two probabilities for each of these style dimensions (the dif-
ference between the two is shown in red/dark). While this
result is not surprising, it is a necessary precondition for ver-
ifying the more subtle hypothesis of accommodation that we
are going to address next.
6.2 Validation of stylistic accommodation
We now proceed to answer the main question of this work:
does the hypothesis of stylistic accommodation proposed in
the psycholinguistic literature hold in social media conver-
sations? Since the probabilistic framework for accommoda-
tion is applied at the level of user pairs, the conversational
dataset B is employed for this analysis.
For each ordered user pair (a, b) and stylistic dimension C,
we estimate the minuend in (2) as the fraction of b’s replies
to ain which b’s tweet tbexhibits C:
8Lowercase letters are used to represent tweets that make
up our dataset, distinguish them from the uppercase letters
that refer to probabilistic events in the framework defined
in Section 5.
9We use this exact variant of the χ2test since for some style
dimensions the expected counts are low.
WWW 2011 – Session: Information Spread
March 28–April 1, 2011, Hyderabad, India
749
Figure 1: The effect of stylistic cohesion observed as the difference
between $\hat{P}(T^C \wedge R^C \mid T \leftrightarrow R)$ (composite bars)
and $\hat{P}(T^C \wedge R^C)$ (blue bars). The differences, shown in
red/dark, are statistically significant (p < 0.0001). The dimensions are
shown in decreasing order of the difference.
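$$\hat{P}\left(T_b^C \mid T_b \hookrightarrow T_a\right) = \frac{\left|\{(t_a, t_b) \mid t_b \hookrightarrow t_a,\ t_b^C\}\right|}{\left|\{(t_a, t_b) \mid t_b \hookrightarrow t_a\}\right|} \tag{7}$$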
Similarly, the subtrahend is estimated as:
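$$\hat{P}\left(T_b^C \mid T_a^C,\ T_b \hookrightarrow T_a\right) = \frac{\left|\{(t_a, t_b) \mid t_b \hookrightarrow t_a,\ t_b^C,\ t_a^C\}\right|}{\left|\{(t_a, t_b) \mid t_b \hookrightarrow t_a,\ t_a^C\}\right|} \tag{8}$$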
We can then measure the amount of accommodation $\widehat{Acc}(C)$
exhibited in our dataset as the difference between the mean
of the set of subtrahend estimations
$$\left\{\hat{P}\left(T_b^C \mid T_a^C,\ T_b \hookrightarrow T_a\right) \;\middle|\; (a, b) \in Pairs\right\}$$
and the mean of the minuend estimations
$$\left\{\hat{P}\left(T_b^C \mid T_b \hookrightarrow T_a\right) \;\middle|\; (a, b) \in Pairs\right\},$$
where $Pairs$ is the set of all ordered pairs.¹⁰ Figure 2 compares
these means (the former is illustrated in red/right, the latter
in blue/left) for each strictly non-topical stylistic dimension.
According to a two-tailed paired t-test,¹¹ the differences are
statistically significant with a p-value smaller than 0.0001 for
all strictly non-topical style dimensions except the 2nd person
pronoun dimension, for which the difference is not statistically
significant.
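The per-pair estimators (7) and (8) and the resulting dataset-level difference can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the record layout `(a, b, ta_c, tb_c)`, the function names and the toy data are hypothetical; the threshold of 10 follows footnote 10; and the significance test and the one-ordered-pair-per-user-pair restriction of footnote 11 are left out.

```python
from collections import defaultdict
from statistics import mean


def pair_estimates(replies, min_count=10):
    """Map each ordered pair (a, b) to (baseline, conditional) estimates.

    `replies` holds tuples (a, b, ta_c, tb_c): b replied to a tweet by a,
    and ta_c / tb_c say whether each tweet exhibits dimension C.
    """
    counts = defaultdict(lambda: [0, 0, 0, 0])  # n, n_tb, n_ta, n_both
    for a, b, ta_c, tb_c in replies:
        c = counts[(a, b)]
        c[0] += 1
        c[1] += int(tb_c)
        c[2] += int(ta_c)
        c[3] += int(ta_c and tb_c)

    estimates = {}
    for pair, (n, n_tb, n_ta, n_both) in counts.items():
        # Discard pairs whose denominator in either estimate is too small.
        if n < min_count or n_ta < min_count:
            continue
        baseline = n_tb / n          # in the spirit of estimate (7)
        conditional = n_both / n_ta  # in the spirit of estimate (8)
        estimates[pair] = (baseline, conditional)
    return estimates


def accommodation(estimates):
    """Mean conditional estimate minus mean baseline estimate over all kept pairs."""
    baseline = mean(e[0] for e in estimates.values())
    conditional = mean(e[1] for e in estimates.values())
    return conditional - baseline


# Toy data: b replies to a 20 times; c and d interact too rarely to be kept.
replies = (
    [("a", "b", True, True)] * 8 + [("a", "b", True, False)] * 2
    + [("a", "b", False, True)] * 2 + [("a", "b", False, False)] * 8
    + [("c", "d", True, True)] * 5
)
estimates = pair_estimates(replies)
acc_c = accommodation(estimates)
```

In this toy example b exhibits C in half of all replies to a, but in 8 of the 10 replies to tweets that themselves exhibit C, so the estimated accommodation is positive.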
Even though our focus is on the strictly non-topical style
dimensions, for completeness we also measured accommodation
on the remaining 36 dimensions and observed a statistically
significant effect for all of them except for Fillers (like
‘blah’, ‘yaknow’), for which the data was insufficient.
¹⁰ We discard all user pairs for which the denominator of either
of these two estimations is less than 10.
¹¹ In order to allay concerns regarding the independence
assumption of this test, for each two users a and b we only
consider one of the two possible ordered pairs (a, b) and (b, a).
Note that by design, our probabilistic framework allows
comparison between the accommodation effects exhibited
for each dimension C (i.e., it fulfills the comparability
desideratum introduced in Section 1). Here are some of the
comparisons worth pointing out:
• Users accommodate significantly more on tentativeness
than on certainty (p-value smaller than 0.01 according
to an independent t-test).¹²
• Users accommodate significantly more on negative emotions
than on positive emotions (not illustrated;
$\widehat{Acc}(\text{Neg. emo.}) = 0.07$, $\widehat{Acc}(\text{Pos. emo.}) = 0.04$;
p-value smaller than 0.01 according to an independent
t-test for the difference).
• 1st person singular pronoun vs. 2nd person pronoun.
In retrospect, the fact that accommodation is not exhibited
for the 2nd person pronoun dimension seems natural: words
like ‘you’ have a different meaning for the two participants
involved in a conversation. However, the same holds for the
1st person singular pronoun dimension, for which accommodation
is observed. This could be explained by the social-psychology
hypothesis of disclosure reciprocity in dyadic relationships [6].
With the results presented here we are able to verify that
accommodation does indeed hold in a large-scale, real-world
conversational setting with properties that a priori seemed
challenging to the theory. In the remainder of this section
we will use our framework to investigate what properties
linguistic style accommodation exhibits in this conversational
setting.
6.3 Stylistic influence and symmetry
Here we seek to understand the role that the concept of
stylistic influence (introduced in Section 5.3) has in Twitter
conversations. We start by asking whether stylistic influ-
ence is prevalent in the data: in general, is there a balance
between the amount two participants in a conversation ac-
commodate? Or, on the contrary, is one user stylistically
dominating the other?
In terms of our framework, we can test whether in expectation
there is an imbalance of accommodation between participants
in a conversation by verifying whether we can reject the null
hypothesis $E\left[\mathrm{abs}\left(I_{(a,b)}(C)\right)\right] = 0$, where the
expectation is taken over all conversing pairs (a, b). Using
definition (4), this is reduced to rejecting:
$$E\left[\mathrm{abs}\left(Acc_{(a,b)}(C) - Acc_{(b,a)}(C)\right)\right] = 0,$$
and further to rejecting:
$$E\left[\max\left(Acc_{(a,b)}(C),\ Acc_{(b,a)}(C)\right)\right] = E\left[\min\left(Acc_{(a,b)}(C),\ Acc_{(b,a)}(C)\right)\right],$$
where the first term is the expected accommodation of the
most accommodating users (where the accommodation is
always compared within each pair), and can be estimated by
the mean of:
$$\left\{\max\left(\widehat{Acc}_{(a,b)}(C),\ \widehat{Acc}_{(b,a)}(C)\right) \;\middle|\; (a, b) \in Pairs\right\},$$
and the second term is the expected accommodation of the
least accommodating users, estimated by the mean of:
$$\left\{\min\left(\widehat{Acc}_{(a,b)}(C),\ \widehat{Acc}_{(b,a)}(C)\right) \;\middle|\; (a, b) \in Pairs\right\}.$$
¹² Therefore doubt appears to be more “contagious” than
confidence.
Figure 2: The effect of accommodation $\widehat{Acc}(C)$ for each strictly
non-topical stylistic dimension C, observed as the difference between the
means of $\{\hat{P}(T_b^C \mid T_b \hookrightarrow T_a) \mid (a, b) \in Pairs\}$ (blue, left) and
$\{\hat{P}(T_b^C \mid T_a^C,\ T_b \hookrightarrow T_a) \mid (a, b) \in Pairs\}$ (red, right).
All the differences are statistically significant (p < 0.0001), except for
the 2nd person pronoun category. The dimensions are ordered according to
the amount of accommodation observed.
Using the same method for estimating $Acc_{(a,b)}(C)$ discussed
in Section 6.2, we reject this hypothesis for all strictly
non-topical style dimensions C (paired t-test with p-value
smaller than 0.0001).¹³ Figure 3 illustrates the difference
between the expected accommodation of the least accommodating
users (red/left) and that of the most accommodating users
(blue/right) in a pair. A difference in the type of imbalance
between dimensions is revealed; for example, while for 1st
person plural pronouns the least accommodating users in general
still match the style of the most accommodating participants
(even though significantly less than vice versa), for certainty
the least accommodating users in general diverge from the style
of the most accommodating participants.
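The within-pair comparison behind this test can be sketched as below, assuming per-pair accommodation estimates are already available for both directions. The dictionary layout, user identifiers and values are hypothetical. The sketch also checks the identity used in the reduction above: the gap between the two means equals the mean absolute difference, since |x − y| = max(x, y) − min(x, y).

```python
from statistics import mean


def influence_summary(acc):
    """Return (mean of within-pair maxima, mean of within-pair minima).

    `acc` maps an ordered pair (a, b) to the estimated accommodation for that
    direction; only user pairs with estimates in both directions are used.
    """
    most, least = [], []
    for (a, b), acc_ab in acc.items():
        # Visit each unordered pair once (assumes comparable user ids).
        if a < b and (b, a) in acc:
            acc_ba = acc[(b, a)]
            most.append(max(acc_ab, acc_ba))
            least.append(min(acc_ab, acc_ba))
    return mean(most), mean(least)


# Hypothetical estimates; ("u5", "u6") lacks the reverse direction and is skipped.
acc = {
    ("u1", "u2"): 0.10, ("u2", "u1"): 0.02,
    ("u3", "u4"): -0.05, ("u4", "u3"): 0.07,
    ("u5", "u6"): 0.20,
}
mean_most, mean_least = influence_summary(acc)
gap = mean_most - mean_least

# Same quantity computed directly as the mean absolute difference.
mean_abs_diff = mean(
    abs(acc[(a, b)] - acc[(b, a)]) for (a, b) in acc if a < b and (b, a) in acc
)
```

A negative `mean_least`, as in this toy example, corresponds to the diverging behavior described for the certainty dimension.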
To further investigate this intriguing difference between
style dimensions, we turn our attention to the property of
symmetry. Figure 4 shows the percentage split between sym-
metrically accommodating pairs (blue/left), asymmetrically
default accommodating pairs (yellow/center) and asymmet-
rically diverging accommodating pairs (red/right), as de-
fined in Section 5.3.
The conclusion that can be drawn from analyzing these results
is that accommodation is a much more complex behavior than
previously reported in the literature, where it was assumed
that only one type of accommodation occurs for a given
dimension.¹⁴ But as can be observed in Figure 4, all three types
of accommodation have a considerable share. Furthermore, in all
previous work on linguistic style accommodation, no distinction
was made between the types of accommodation occurring for each
dimension. However, our study indicates a clear difference
between dimensions:
• Symmetric accommodation is dominant for 1st pron. pl.,
Discrepancy and Indef. pron.;
• Asymmetric accommodation (of both types) is dominant in
most of the other dimensions;
• Asymmetric diverging accommodation is dominant for
2nd person pronoun.
¹³ The same holds for all the other dimensions except Fillers,
for which the data was insufficient.
A potential explanation for the fact that such complex
accommodation behavior was not previously observed may
be the difference between the Twitter conversational setting
and that traditionally used in the literature (discussed
in Section 1), especially in the spectrum of relation types
covered (mostly limited to one type in the previous studies).
Another explanation may be the increased expressiveness of
our probabilistic framework over the correlation-based
framework used in previous studies.
¹⁴ Here we refer to any dimension of accommodation, like the
ones in Table 1, not only to linguistic style dimensions.
Figure 3: The effect of stylistic influence for each strictly non-topical
stylistic dimension C, observed as the difference between the means of
$\{\min(\widehat{Acc}_{(a,b)}(C),\ \widehat{Acc}_{(b,a)}(C)) \mid (a, b) \in Pairs\}$ (red, left) and
$\{\max(\widehat{Acc}_{(a,b)}(C),\ \widehat{Acc}_{(b,a)}(C)) \mid (a, b) \in Pairs\}$ (blue, right).
All the differences are statistically significant (p < 0.0001).
The dimensions are shown in decreasing order of the difference.
6.4 Relation to social status
As pointed out in Section 2, the psycholinguistic literature
draws a clear connection between the social status of a user
and their tendency to accommodate. Therefore, it is natural to
ask whether stylistic influence correlates with differences in
social status between the users, and we take the first steps to
address this question. For lack of a better proxy, we employ
user features that were previously reported to be related
to social influence on Twitter [2]. For each pair of users
in our data we compare: #followers, #followees, #posts,
#days on Twitter, #posts per day and ownership of a personal
website. We find that for all style dimensions none of
these features correlate strongly with stylistic influence; the
largest positive Pearson correlation coefficient obtained was
0.15, between #followees and stylistic influence on 1st pron.
pl. Also, for the task of predicting the most influential user
in each pair, a decision tree classifier¹⁵ rendered relatively
poor results. The best improvement over the majority class
baseline was only 7%, for the 1st pron. pl. dimension (in
this case the most predictive features were the difference in
#friends and the difference in ownership of a personal
website). All this suggests that stylistic influence appears to
be only weakly connected to these social features. However,
one should take this observation with a grain of salt: the
proxies for social status available on Twitter and employed
here are far from ideal. Future work should seek to use better
proxies for social status, possibly in environments with
richer social data.
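The kind of correlation check described in this section can be sketched as follows. The text does not specify how each feature is paired with stylistic influence; the sketch assumes, purely for illustration, that each user pair contributes the difference in one feature (here a made-up follower-count difference) and the corresponding stylistic influence value. The numbers are invented and carry no empirical meaning.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-pair values: feature difference and stylistic influence.
follower_diff = [120, -40, 300, -15, 60]
influence = [0.03, 0.01, -0.02, 0.04, 0.00]
r = pearson(follower_diff, influence)
```

A coefficient close to zero on real data would be in line with the weak connection reported above.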
7. RELATED WORK
Here we briefly touch on related work not already discussed.
Much of the research in understanding social media focuses
on the network relations between users. More recently, this
line of work has been complemented with a rich analysis of
the content of posts as well as structural relations among
posters. In one early study combining these two dimensions of
analysis, Paolillo [33] examined linguistic variations
associated with strong and weak ties in an early Internet
Relay Chat system. The strength of friendship ties on Facebook
was related by Gilbert and Karahalios [10] to various language
features including intimacy words, positive and negative
emotions. Eisenstein et al. [8] investigated the role
geographic variation of language has in Twitter and
Kiciman [24] examined the extent to which differences in
language models of Twitter posts (as measured by perplexity)
were related to metadata associated with the senders.
Such work demonstrates the importance of linguistic style
variations in Twitter, which also play a crucial role in our
study.
¹⁵ We used the Weka implementation of the C4.5 decision
tree, available at www.cs.waikato.ac.nz/ml/weka/
Figure 4: The percentage of accommodating pairs
that exhibit each of the three types of accommodation:
symmetric, default asymmetry and diverging asymmetry.
Latent variable models have also been used to summarize
more general linguistic patterns in social media. Ramage et
al. [35] developed a partially supervised learning model (La-
beled LDA) to summarize key linguistic trends in a large col-
lection of millions of Twitter posts. They identified four gen-
eral types of dimensions, which they characterized as sub-
stance, status, social and style. These included dimensions
about events, ideas, things, or people (substance), related
to some socially communicative end (social), related to
personal updates (status), and indicative of broader trends of
language use (style). This representation was used to improve
filtering of tweets and recommendations of people to
follow. In the task of tweet ranking, a different approach is
taken by [7], which employs Twitter-specific features in
conjunction with textual content. Another way of characterizing
key trends in text data is to use known distinctions or di-
mensions. In addition to the already discussed work based
on LIWC, see [29, 33] for other examples of analyses of lin-
guistic variation with respect to position or status in a social
network. Of particular interest in our work is the distinction
between linguistic style and content. Style or function words
make up about 55% of the words that we speak, read and
hear according to Tausczik and Pennebaker [38], similar to
findings of Ramage et al. [35] in their analysis of Twitter.
In our research, we use LIWC to characterize the linguistic
style of posts as well as individuals.
One particularly interesting type of linguistic activity in
social media has to do with conversations, that is, with
exchanges between two or more individuals. Twitter conversations
are the main focus in this work. Java et al. [22] found
that 21% of users in their study used Twitter for conversa-
tional purposes (as measured by the use of @, a convention
to address a post to a particular user), and that 12.5% of
all posts were part of conversations. Honeycutt and Herring
[19] analyzed conversational exchanges on the Twitter public
timeline, focusing on the function of the @ sign. They found
that short dyadic conversations occur frequently, along with
some longer multi-participant conversations. Ritter et al.
[36] developed an unsupervised learning approach to iden-
tify conversational structure from open-topic conversations.
Specifically they trained an LDA model which combined
conversational (speech acts) and content topics on a corpus
of 1.3 million Twitter conversations, and discovered inter-
pretable speech acts (reference broadcast, status, question,
reaction, comment, etc.) by clustering utterances with sim-
ilar conversational roles. In our research, we build on this
data set and extend it to include the complete conversational
history of individuals over a period of almost one year.
Since the notion of linguistic style is central to this work,
we also want to point out other instances in which it plays an
important role. Linguistic style was shown to be crucial in
the area of authorship attribution and in forensic linguistics
(for an overview see [23]). To identify an author, it is neces-
sary to look beyond content into the — often subconscious
— stylistic properties of the language. Simple measures like
word length, word complexity, sentence length and vocab-
ulary complexity were at the forefront of earlier research
into attribution problems (e.g. [41, 18]). Since Mosteller
and Wallace’s seminal work on the Federalist Papers [30],
however, a trend has emerged to focus on the distribution of
function words as a diagnostic for authorship, a method that
in various incarnations now dominates the research. Other
areas using similar methods include gender detection from
text [25, 31, 17] and personality type detection [1].
8. CONCLUSIONS AND FUTURE WORK
In this paper we have shown that the hypothesis of lin-
guistic style accommodation can be confirmed in a real life,
large scale dataset of Twitter conversations. We have de-
veloped a probabilistic framework that allows us to mea-
sure accommodation and, importantly, to distinguish ef-
fects of style accommodation from those of homophily and
topic-accommodation. We also have demonstrated how this
framework allows us to formalize and investigate the notions
of stylistic influence and asymmetric accommodation.
It is important to highlight that our findings are anything
but obvious, given that Twitter is a medium unlike any other
setting in which the phenomenon was previously observed.
Its novelty comes not only from its size, but also from the
wide variety of social relation types, from the non real-time
nature of conversations, from the 140 character length re-
striction and from a design that was initially not geared
towards conversation at all. This work demonstrates that
accommodation is robust enough to occur under these new
constraints, presumably because it is deeply ingrained in hu-
man social behavior.
We believe that this line of research has a number of nat-
ural extensions. One question we have not addressed is the
issue of long-term accommodation: can we measure accom-
modation over a longer period of time, from the first interac-
tion of two users on? Answering this question is challenging
because it requires richer longitudinal data. It would also
be very interesting to explore the interplay between the
accommodating behavior and the type of social relation.
As for practical applications, on the premise that accom-
modation renders conversations more pleasant and effective,
we posit that having the linguistic style of automated di-
alogue systems match that of the user would increase the
quality of the interaction. Personalized ranking of tweets
could also benefit by selecting tweets with styles that match
that of the tweets issued by the target user. Finally, given
the evidence that this work brings to support the universality
of the accommodation phenomenon, we envision its use
in the detection of forged conversations.¹⁶
Finally, we hope that our findings will stimulate further
research and refinements of the communication accommo-
dation theory in the psycholinguistic world.
Acknowledgments. We thank Lillian Lee for inspiring conversations;
Munmun De Choudhury, Scott Counts, Sumit Basu, Dan Liebling,
Magdalena Narożniak, Alexandru Niculescu-Mizil, Tim Paek, Bo Pang
and Chris Quirk for technical advice; and the anonymous reviewers
for helpful comments. This paper is based upon work supported in
part by the NSF grant IIS-0910664.
¹⁶ We are inspired here by the use of Benford’s law in detecting
forged financial reports [4]. Though potentially not as
common as such forgery, situations in which conversational
transcripts are contested are not infrequent. One recent
example is the October 2010 release of phone conversations
between top Romanian political leaders and a compromised
businessman. Another one is the controversy surrounding
the reality TV shows “The Jersey Shore” and “The Hill”,
which are suspected of being scripted.
9. REFERENCES
[1] S. Argamon, S. Dhawle, and M. Koppel. Lexical
predictors of personality type. Joint Annual Meeting
of the Interface and the Classification Society of North
America, 2005.
[2] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J.
Watts. Everyone’s an influencer: Quantifying influence
on Twitter. WSDM, 2011.
[3] F. Bilous and R. Krauss. Dominance and
accommodation in the conversational behavior of
same- and mixed-gender dyads. Language and
Communication, 8:183–194, 1988.
[4] W. T. Cho and B. Gaines. Breaking the (Benford)
law. The American Statistician, 61(3):218–223, 2007.
[5] Condon and Ogston. A segmentation of behavior.
Journal of psychiatric research, 1973.
[6] V. Derlega, M. Harris, and A. Chaikin. Self-disclosure
reciprocity, liking and the deviant. Journal of
Experimental Social Psychology, 9(4):277–284, 1973.
[7] Duan, Jiang, Qin, Zhou, and Shum. An empirical
study on learning to rank of tweets. COLING, 2010.
[8] J. Eisenstein, B. O’Connor, N. A. Smith, and E. P.
Xing. A latent variable model for geographic lexical
variation. EMNLP, 2010.
[9] K. Ferrara. Accommodation in therapy. In Contexts of
accommodation: developments in applied
sociolinguistics. Cambridge University Press, 1991.
[10] E. Gilbert and K. Karahalios. Predicting tie strength
with social media. HCI, 2009.
[11] H. Giles. Communication accommodation theory. In
Engaging theories in interpersonal communication:
multiple perspectives. Sage Publications, 2008.
[12] H. Giles, J. Coupland, and N. Coupland.
Accommodation theory: Communication, context, and
consequences. In Contexts of accommodation:
developments in applied sociolinguistics. Cambridge
University Press, 1991.
[13] H. Giles, M. Willemyns, C. Gallois, and M. Anderson.
Accommodating a new frontier: The context of law
enforcement. K. Fiedler (Ed.), Social Communication.
New York: Psychology Press., 2006.
[14] A. L. Gonzales, J. T. Hancock, and J. W. Pennebaker.
Language style matching as a predictor of social
dynamics in small groups. Communication Research,
37(1):3–19, Feb 2010.
[15] J. Hale and J. Burgoon. Models of reactions to
changes in nonverbal immediacy. Journal of Nonverbal
Behavior, 8(4):287–314, 1984.
[16] H. Hamilton. Accommodation and mental disability.
In Contexts of accommodation: developments in
applied sociolinguistics. Cambridge University Press,
1991.
[17] S. Herring and J. Paolillo. Gender and genre variation
in weblogs. Journal of Sociolinguistics, Jan 2006.
[18] D. Holmes. Authorship attribution. Computers and
the Humanities, 28(2):87–106, Apr 1994.
[19] C. Honeycutt and S. Herring. Beyond microblogging:
Conversation and collaboration via twitter. HICSS,
2009.
[20] D. A. Infante, A. S. Rancer, and D. F. Womack.
Building communication theory. Waveland Press, 2006.
[21] J. Jaffé and S. Feldstein. Rhythms of dialogue.
Academic Press, 1970.
[22] A. Java, X. Song, T. Finin, and B. Tseng. Why we
twitter: understanding microblogging usage and
communities. WebKDD/SNA-KDD, Aug 2007.
[23] P. Juola. Authorship Attribution. Now Publishers,
2008.
[24] E. Kiciman. Language differences and metadata
features on Twitter. Web N-gram Workshop, 2010.
[25] M. Koppel, S. Argamon, and A. Shimoni.
Automatically categorizing written texts by author
gender. Literary and Linguistic Computing, 2002.
[26] W. Levelt and S. Kelter. Surface form and memory in
question answering. Cognitive Psychology,
14(1):78–106, 1982.
[27] J. Matarazzo and A. Wiens. Interview: Research on
Its Anatomy and Structure. Chicago: Aldine, 1973.
[28] G. Miller. The science of words. Scientific American
Library. Scientific American Library, 1996.
[29] L. Milroy and J. Milroy. Social network and social
class: Toward an integrated sociolinguistic model.
Language in Society, 21(1):1–26, Mar 1992.
[30] F. Mosteller and D. Wallace. Inference in an
authorship problem. Journal of the American
Statistical Association, 58(302):275–309, Jun 1963.
[31] A. Mukherjee and B. Liu. Improving gender
classification of blog authors. EMNLP, 2010.
[32] K. Niederhoffer and J. Pennebaker. Linguistic style
matching in social interaction. Journal of Language
and Social Psychology, 2002.
[33] J. Paolillo. Language variation on internet relay chat:
A social network approach. Journal of Sociolinguistics,
2001.
[34] J. W. Pennebaker, R. J. Booth, and M. E. Francis.
Linguistic Inquiry and Word Count (LIWC): A
computerized text analysis program. LIWC.net, 2007.
[35] D. Ramage, S. Dumais, and D. Liebling.
Characterizing microblogs with topic models.
International AAAI Conference on Weblogs and Social
Media, 2010.
[36] A. Ritter, C. Cherry, and B. Dolan. Unsupervised
modeling of Twitter conversations. NAACL, 2010.
[37] R. L. Street and H. Giles. Speech accommodation
theory. In Social cognition and communication. Sage
Publications, 1982.
[38] Y. R. Tausczik and J. W. Pennebaker. The
psychological meaning of words: LIWC and
computerized text analysis methods. Journal of
Language and Social Psychology, 29(1):24–54, Mar
2010.
[39] P. Taylor and S. Thomas. Linguistic style matching
and negotiation outcome. Negotiation and Conflict
Management Research, 1(3):263–281, 2008.
[40] S. White. Backchannels across cultures: A study of
Americans and Japanese. Language in Society,
18(1):59–76, Mar 1989.
[41] G. Yule. On sentence-length as a statistical
characteristic of style in prose: With application to
two cases of disputed authorship. Biometrika,
30(3/4):363–390, Jan 1939.
WWW 2011 – Session: Information Spread
March 28–April 1, 2011, Hyderabad, India
754
... This has been shown to produce more successful conversations than conversations without entrainment (Reitter and Moore, 2007;Nenkova et al., 2008). In written and spoken domains of communication, people entrain on multiple dimensions of language production: diction and syntax (Danescu-Niculescu-Mizil et al., 2011), speaking rate, voice quality, and pause frequency (Giles et al., 1991;Levitan and Hirschberg, 2011;Chen et al., 2023), jokes and laughter (Schmidt et al., 2014), and facial expression and gesture Figure 1: The novelty of our work comes from incorporating acoustic-prosodic features in the study of entrainment in code-switched speech. Here we highlight the value of identifying multiple dimensions and feature sets of entrainment in CSW. ...
... Greater evidence of entrainment has been correlated with greater task success and improved social outcomes (Reitter and Moore, 2007;Levitan et al., 2012). Analysis of entrainment patterns in relation to speakers' demographic characteristics has revealed interactions with gender dynamics (Bilous and Krauss, 1988;Levitan et al., 2012;Cabarrão et al., 2016) and power differentials (Danescu-Niculescu-Mizil et al., 2011). Although some work has studied entrainment in languages other than English, few have examined entrainment in multilingual contexts where CSW occurs, giving rise to our first research question: RQ1: Do previously established patterns of entrainment in monolingual settings generalize to code-switched settings -is entrainment a universal phenomenon of spoken communication? ...
... Prior research suggests that power differentials between speakers and other sociolinguistic factors tend to have an influence on such asymmetric entraining behavior, e.g. Danescu-Niculescu-Mizil et al. (2011). We suspect that L2 speaker proficiency could additionally play a role in this, although this has not yet been explored, to our knowledge, because existing data sets do not include this information. ...
... Ritter et al. [51] developed an unsupervised learning approach to identify conversational structure from open-topic conversations. Danescu-Niculescu-Mizil et al. [17] studied how people adopt linguistic styles while in conversation on Twitter. Eisenstein et al. [20] studied the role of geography and demographics on the language in Twitter. ...
Preprint
Full-text available
Compounding of natural language units is a very common phenomena. In this paper, we show, for the first time, that Twitter hashtags which, could be considered as correlates of such linguistic units, undergo compounding. We identify reasons for this compounding and propose a prediction model that can identify with 77.07% accuracy if a pair of hashtags compounding in the near future (i.e., 2 months after compounding) shall become popular. At longer times T = 6, 10 months the accuracies are 77.52% and 79.13% respectively. This technique has strong implications to trending hashtag recommendation since newly formed hashtag compounds can be recommended early, even before the compounding has taken place. Further, humans can predict compounds with an overall accuracy of only 48.7% (treated as baseline). Notably, while humans can discriminate the relatively easier cases, the automatic framework is successful in classifying the relatively harder cases.
... They train an LDA model on a combined dataset of conversational (speech acts) and content topics of 1.3 million Twitter conversations, and identify interpretable speech acts (reference broadcast, status, question, reaction, comment, etc.) by clustering the similar conversational roles. Danescu-Niculescu-Mizil et al. in [35] build on this data set in [34] and extend it to include the complete conversational history of individuals over a period of almost one year. They study how people adopt linguistic styles while in conversation on Twitter. ...
Preprint
Twitter is one of the most popular social media. Due to the ease of availability of data, Twitter is used significantly for research purposes. Twitter is known to evolve in many aspects from what it was at its birth; nevertheless, how it evolved its own linguistic style is still relatively unknown. In this paper, we study the evolution of various sociolinguistic aspects of Twitter over large time scales. To the best of our knowledge, this is the first comprehensive study on the evolution of such aspects of this OSN. We performed quantitative analysis both on the word level as well as on the hashtags since it is perhaps one of the most important linguistic units of this social media. We studied the (in)formality aspects of the linguistic styles in Twitter and find that it is neither fully formal nor completely informal; while on one hand, we observe that Out-Of-Vocabulary words are decreasing over time (pointing to a formal style), on the other hand it is quite evident that whitespace usage is getting reduced with a huge prevalence of running texts (pointing to an informal style). We also analyze and propose quantitative reasons for repetition and coalescing of hashtags in Twitter. We believe that such phenomena may be strongly tied to different evolutionary aspects of human languages.
... • Cultural and social studies: The vast digital repositories of social media platforms are a goldmine for sociocultural researchers. Analyzing the geographic spread of terms, specific hashtags, regional discussions, or even patterns of music or meme sharing can unravel the nuances of cultural diffusion (Danescu-Niculescu-Mizil et al., 2011;Saxton et al., 2015). It provides insights into how traditions evolve, merge, or fade in the digital age, and aids in preserving intangible heritages. ...
Chapter
Full-text available
In the contemporary era dominated by digital advancements, social media platforms have evolved beyond simple communication tools to become integral parts of our socio-cultural fabric. While platforms like Facebook, Twitter, and WeChat have facilitated global interactions, erasing geographical barriers, they have also underscored the continuing significance of geography. This chapter delves into the complex relationship between geographical realities and digital interactions, challenging the idea that the digital realm has diminished the role of geography. On the contrary, social media trends, preferences, and biases are deeply rooted in and influenced by users’ geospatial contexts. Through an exploration of the origins, patterns, and intricacies of geo-social media, this chapter seeks to provide a comprehensive understanding of how our physical world shapes and is reflected in our digital interactions. Emphasizing the enduring importance of geography, it highlights how our digital and geographical worlds intersect, reinforcing the intertwined nature of our tangible and virtual experiences.
... The studies surveyed in this paper examined several types of data categorized into linguistic data, psycholinguistic data, metadata, and interaction data (Tausczik and Pennebaker 2010;Chatterjee et al. 2022). Linguistic data was central to a series of NLP applications and includes,for example, authorship attribution and forensic linguistics, gender detection, and personality type detection (Danescu-Niculescu-Mizil et al. 2011 Metadata features are pieces of information that describe digital data, which can be account metadata or post/message metadata. Account metadata are the data that describe the account, such as the owner's name, profile information, biography, and location. ...
Article
Full-text available
Social media platforms have transformed traditional communication methods by allowing users worldwide to communicate instantly, openly, and frequently. People use social media to express their opinions and share their personal stories and struggles. Negative feelings that express hardship, thoughts of death, and self-harm are widespread in social media, especially among young generations. Therefore, using social media to detect and identify suicidal ideation will help provide proper intervention that will eventually dissuade others from self-harming, prevent suicide, as well as mitigate the spread of suicidal ideations on social media. Many studies have been carried out to identify suicidal ideation and behaviors in social media. This paper presents a comprehensive summary of current research efforts to detect suicidal ideation using machine learning algorithms on social media. This review encompasses thirty-seven studies investigating the feasibility of social media usage for suicidal ideation detection is intended to facilitate further research in the field and will be a beneficial resource for researchers engaged in suicidal text classification.
Article
AI chatbots are increasingly integrated into various sectors, including healthcare. We examine their role in responding to queries related to Alzheimer’s Disease and Related Dementias (AD/ADRD). We obtained real-world queries from AD/ADRD online communities (OC): Reddit (r/Alzheimers) and ALZConnected. First, we conducted a small-scale qualitative examination where we prompted ChatGPT, Bard, and Llama-2 with 101 OC posts to generate responses and compared them with OC responses through inductive coding and thematic analysis. We found that although AI chatbots can provide emotional and informational support like OCs, they do not engage in deeper conversations, provide references, or share personal experiences. These insights motivated us to conduct a large-scale quantitative examination comparing AI (GPT) and OC responses (90K) to 13.5K posts, in terms of psycholinguistics, lexico-semantics, and content. AI responses tend to be more verbose, readable, and complex. AI responses exhibited greater empathy, but more formal and analytical language, lacking personal narratives and linguistic diversity. We found that various LLMs, including GPT, Llama, and Mistral, exhibit consistent patterns in responding to AD/ADRD-related queries, underscoring the robustness of our insights across LLMs. Our study sheds light on the potential of AI in digital health and underscores design considerations of AI to complement human interactions.
Article
Variations in language abilities, use, and production style are ubiquitous within any given population. While research on language evolution has traditionally overlooked the potential importance of such individual differences, these can have an important impact on the trajectory of language evolution and ongoing change. To address this gap, we use a group communication game for studying this mechanism in the lab, in which micro‐societies of interacting participants develop and use artificial languages to successfully communicate with each other. Importantly, one participant in the group is assigned a keyboard with a limited inventory of letters (simulating a speech impairment that individuals may encounter in real life), forcing them to communicate differently than the rest. We test how languages evolve in such heterogeneous groups and whether they adapt to accommodate the unique characteristics of individuals with language idiosyncrasies. Our results suggest that language evolves differently in groups where some individuals have distinct language abilities, eliciting more innovative elements at the cost of reduced communicative success and convergence. Furthermore, we observed strong partner‐specific accommodation to the minority individual, which carried over to the group level. Importantly, the degree of group‐wide adaptation was not uniform and depended on participants’ attachment to established language forms. Our findings provide compelling evidence that individual differences can permeate and accumulate within a linguistic community, ultimately driving changes in languages over time. They also underscore the importance of integrating individual differences into future research on language evolution.
Article
Authorship verification is the task of determining if two distinct writing samples share the same author and is typically concerned with the attribution of written text. In this paper, we explore the attribution of transcribed speech, which poses novel challenges. The main challenge is that many stylistic features, such as punctuation and capitalization, are not informative in this setting. On the other hand, transcribed speech exhibits other patterns, such as filler words and backchannels (e.g., um, uh-huh), which may be characteristic of different speakers. We propose a new benchmark for speaker attribution focused on human-transcribed conversational speech transcripts. To limit spurious associations of speakers with topic, we employ both conversation prompts and speakers participating in the same conversation to construct verification trials of varying difficulties. We establish the state of the art on this new benchmark by comparing a suite of neural and non-neural baselines, finding that although written text attribution models achieve surprisingly good performance in certain settings, they perform markedly worse as conversational topic is increasingly controlled. We present analyses of the impact of transcription style on performance as well as the ability of fine-tuning on speech transcripts to improve performance.
Article
The paper introduces the Special Issue on Language Contact and Speaker Accommodation, which originates from the conference Phonetics and Phonology in Europe (PaPE) held at the University of Lecce, Italy, in 2019. It discusses the topics of language contact and speaker accommodation, summarizing the contributions included in the Special Issue, and arguing explicitly in favour of a unitary view of how both temporary and stable changes happen in (part of) the linguistic systems. Accommodation is seen as the same gradual and non-homogeneous process at play in different contact settings. In the introductory sections, a discussion is offered on various situations in which linguistic systems are in contact and on the main factors that may be at play; the following sections offer an overview of the papers included in the Special Issue, which focus on accommodation in L2 and heritage speakers as well as on the time dimension of dialect or language societal contact. Finally, accommodation is discussed as the same process that is at work in any interaction: it may modify the system of L2 learners and bilinguals (e.g., immigrants) temporarily or in the long term, it usually affects the heritage speakers’ system in the long term, and only in the long term can it lead to language changes involving entire communities.
Article
The problem of automatically determining the gender of a document's author would appear to be a more subtle problem than those of categorization by topic or authorship attribution. Nevertheless, it is shown that automated text categorization techniques can exploit combinations of simple lexical and syntactic features to infer the gender of the author of an unseen formal written document with approximately 80 per cent accuracy. The same techniques can be used to determine if a document is fiction or non-fiction with approximately 98 per cent accuracy.
Chapter
The theory of accommodation is concerned with motivations underlying and consequences arising from ways in which we adapt our language and communication patterns toward others. Since accommodation theory's emergence in the early 1970s, it has attracted empirical attention across many disciplines and has been elaborated and expanded many times. In Contexts of Accommodation, accommodation theory is presented as a basis for sociolinguistic explanation, and it is the applied perspective that predominates in this edited collection. The book seeks to demonstrate how the core concepts and relationships invoked by accommodation theory are available for addressing altogether pragmatic concerns. Accommodative processes can, for example, facilitate or impede language learners' proficiency in a second language as well as immigrants' acceptance into certain host communities; affect audience ratings and thereby the life of a television program; affect reaction to defendants in court and hence the nature of the judicial outcome; and be an enabling or detrimental force in allowing handicapped people to fulfil their communicative potential. Contexts of Accommodation will appeal to researchers and advanced students in language and communication sciences, as well as to sociolinguists, anthropologists, sociologists and psychologists.
Book
Authorship attribution, the science of inferring characteristics of the author from the characteristics of documents written by that author, is a problem with a long history and a wide range of application. It is an important problem not only in information retrieval but in many other disciplines as well, from technology to teaching and from finance to forensics. The idea that authors have a statistical "fingerprint" that can be detected by computers is a compelling one that has received a lot of research attention. Authorship Attribution surveys the history and present state of the discipline, presenting some comparative results where available. It also provides a theoretical and empirically-tested basis for further work. Many modern techniques are described and evaluated, along with some insights for application for novices and experts alike. Authorship Attribution will be of particular interest to information retrieval researchers and students who want to keep up with the latest techniques and their applications. It is also a useful resource for people in other disciplines, be it the teacher interested in plagiarism detection or the historian interested in who wrote a particular document.
Article
This research examined the relationship between Linguistic Style Matching - the degree to which negotiators coordinate their word use - and negotiation outcome. Nine hostage negotiations were divided into 6 time stages and the dialogue of police negotiators and hostage takers compared across 12 linguistic dimensions. Correlational analyses showed that successful negotiations were associated with higher aggregate levels of Linguistic Style Matching (LSM) than unsuccessful negotiations. This result was due to dramatic fluctuations of LSM during unsuccessful negotiations, with negotiators unable to maintain the constant levels of rapport and coordination that occurred in successful negotiations. A further analysis of LSM at the local turn-by-turn level revealed complex but organized variations in behavior across outcome. In comparison to unsuccessful negotiations, the dialogue of successful negotiations involved greater coordination of turn taking, reciprocation of positive affect, a focus on the present rather than the past, and a focus on alternatives rather than on competition.
Article
Three experiments were conducted to determine the psychometric properties of language in dyadic interactions. Using text analysis, it was possible to assess the degree to which people coordinate their word use in natural conversations. In Experiments 1 (n = 130) and 2 (n = 32), college students interacted in dyadic conversations in laboratory-based private Internet chat rooms. Experiment 3 analyzed the official transcripts of the Watergate tapes involving the dyadic interactions between President Richard Nixon and his aides H. R. Haldeman, John Ehrlichman, and John Dean. The results of the three studies offer substantial evidence that individuals in dyadic interactions exhibit linguistic style matching (LSM) at both the conversation level and the turn-by-turn level. Furthermore, LSM is unrelated to ratings of the quality of the interaction by both participants and judges. We propose that a coordination-engagement hypothesis is a better description of linguistic behaviors than the coordination-rapport hypothesis that has been proposed in the nonverbal literature.