Available via license: CC BY-NC-ND 4.0
Chapter 6
YouTube as text
Spoken interaction analysis and
digital discourse
Phil Benson
The rise of digital discourse raises questions about the adequacy of what
we might call ‘traditional’ discourse analysis tools to the task of analysing
computer-mediated communication (CMC). This chapter explores some
of these questions in the context of a study of multimodal interaction on
YouTube that draws upon tools for the analysis of spoken interaction.
Historically, spoken interaction has been placed at the heart of discourse
analysis by two major schools that have treated it as their primary object of
interest: Conversation Analysis (CA) (Sacks et al. 1974) and the Birmingham
school of discourse analysis (Sinclair and Coulthard 1975). While observing
that technology-based interaction was a likely growth area for CA research,
Seedhouse (2005) questioned how far CA principles could be applied to
written asynchronous online interaction. The same argument could be
made of any set of principles designed for the analysis of spoken interac-
tion. Nevertheless, CMC researchers have largely relied on these principles,
adapting them, in particular, to analysis of conversation-like interac-
tions conducted through the medium of writing (Herring 2001; Herring
et al. 2013).
CMC research initially focused on environments that allow written
interaction to emulate spoken interaction, either synchronously (e.g. chat
rooms) or asynchronously (e.g. threaded discussion forums). New multi-
modal text-types, such as blogs (Herring et al. 2005) and social media (boyd
and Heer 2006), have also been seen as inherently conversational, although
they have been positioned at some distance from the norm of face-to-face
spoken interaction. Herring et al. (2005: 1) find that the blogosphere is
only ‘sporadically conversational’, while boyd and Heer (2006) raise a number
of questions about the differences between online and spoken interaction,
among which the absence of clearly identified recipients and contexts
for messages is especially important.
While the various forms of multimodal CMC clearly differ from spoken
interaction, the degree to which they resemble it is an important issue.
Because spoken interaction is often viewed as a fundamental mode of
social interaction, any attempt to understand multimodal CMC in similar
DOI: 10.4324/9781315726465-6
terms touches on the senses in which CMC texts are also products of social
interaction. Through a case study of the multimodal discourse of YouTube,
this chapter argues that, in spite of evident differences from spoken conversa-
tions, YouTube pages are products of social interactions that can be analysed
using tools designed for analysis of the structures of spoken interaction. It
asks what steps need to be taken in order to make these tools work in analysis
of the multimodal discourse of YouTube, and what their limitations may be.
The context for this discussion is a case study of YouTube pages based
on a series of videos, entitled ‘Cantonese Word of the Week!’, which have
attracted a considerable number of views and comments since they were
posted in 2011–12. These pages form part of the data set for a project
designed to investigate evidence of informal language and intercultural
learning in YouTube comments. This chapter focuses on the framework
for analysis of YouTube discourse, involving application of the spoken dis-
course categories of ‘exchange’, ‘turn’, ‘move’ and ‘act’ (Coulthard 1985;
Stenström and Stenström 1994), that was developed during the project. A
key feature of this framework is the application of these categories not only
to written comments, but to the broader multimodal processes that gener-
ate the text of the YouTube page as a whole.
YouTube as text
As the fastest growing web service of recent years, YouTube now ranks third
behind Google and Facebook in measurements of web traffic. YouTube has
also attracted academic interest in an emerging literature that tends to view
it as a technological, media or cultural phenomenon (Burgess and Green
2009; Kavoori 2011; Lovink and Niederer 2008; Snickars and Vonderau
2009; Strangelove 2010). On the face of it, YouTube is a website where peo-
ple watch videos, and not a ‘text’. Nevertheless, YouTube pages are some-
times discussed in terms of text and discourse (Androutsopoulos 2013;
Kavoori 2011) and several studies have pointed to the role of language in
the management and retrieval of videos. Kessler and Schäfer (2009: 279)
point out that, because YouTube cannot machine-read the semantic content
of moving image files, information management relies on ‘metadata
that names, describes or categorises whatever there is to be seen’, which
comes in the form of ‘user-generated input provided as text’.
This comment points to two different ways of viewing YouTube as text,
which focus either on the use of writing or on the YouTube page as a whole.
In their work on the image sharing site Flickr, Barton and Lee (2012:
285) observe that, ‘[a]lthough Flickr is a site primarily devoted to images,
there is a great deal of user-generated writing on the site, especially writ-
ing around an uploaded image’. This user-generated writing includes titles
and descriptions of images, tags and geotags, notes added to photos, and
comments (see also Barton this volume). There is also a good deal of these
kinds of writing on YouTube, and written comments on videos are promi-
nent among them. Another way of viewing media sharing websites such as
Flickr and YouTube as texts is to examine how pages are organised around
samples of media, with writing and other semiotic modes playing comple-
mentary roles. The video is the focal point of a YouTube page. It is the main
reason why people visit the page and, possibly, the only part of the page that
most users pay close attention to. Yet it is also difficult to understand how
YouTube works, both technologically and culturally, without attending to
the ways in which various semiotic modes work together to make up the text
of the YouTube page.
This second view of YouTube as text takes us into the terrain of mul-
timodal discourse analysis (MMDA) (Baldry and Thibault 2006; Bateman
2008; Jones 2013; Kress 2010). Much of the research on CMC focuses on
written interaction independently of its multimodal contexts. Herring
et al.’s (2013) handbook on the pragmatics of CMC, for example, includes
discussion of interaction in almost every chapter, but no chapter deals
substantially with multimodal interaction. Baldry and Thibault (2006: 18),
however, refer to ‘the resource integration principle’, which treats multi-
modal texts as ‘composite products of the combined effects of all the resources
used to create and interpret them’. Applying this principle to YouTube,
Androutsopoulos (2013: 50) comments that ‘[a]lthough each textual bit
on a YouTube page can be viewed as a distinct textual unit, videos and
comments co-occur in a patterned way and are interrelated in meaning
making’. These comments suggest that, if tools for the analysis of spoken
interaction are to be used in analysis of YouTube pages, written text should
not be isolated from its multimodal context. Instead, interactional analysis
should encompass the multimodality of the page as a whole.
Viewed in this light, three characteristics of YouTube as text stand out:
1 YouTube pages deploy multiple semiotic modes, including moving images,
spoken word, music and sound, still images, written words, and a variety
of clickable objects, icons and links. The number of identifiable communicative
elements found on a YouTube page is typically more than
100, and this number increases as written comments are added. Each
comment sits within a space that contains eleven different elements in
addition to the comment itself, each of which leads to a different action
when clicked.
2 YouTube pages are products of multiple authorship. A page is created
when a user uploads a video and inputs written text to describe it, but
much of the text of the page is machine-generated and includes boil-
erplate text from YouTube and text created by advertisers and other
users. Users subsequently add to the text in various ways: by adding
written comments, but also by actions that do not involve writing, such
as ‘liking’ and ‘disliking’ the video or individual comments.
3 YouTube pages are highly dynamic in the sense that the text of the page
constantly changes in response to user and machine-generated input.
Simply by viewing a page, a user alters the number of views that is dis-
played below the video. The text of the page that surrounds a particular
video also varies according to the location of the user, automatically or
in response to user-defined settings.
These characteristics make YouTube particularly amenable to MMDA, but
two further steps are needed to support the use of tools intended for analysis
of spoken interaction in the context of MMDA. The first step involves
treating YouTube as a form of social media, while the second involves treating
YouTube pages as products of mediated social interaction. Zourou and
Lamy (2013: 1) define social media as ‘artefacts with a networking dimension,
which are designed so as to make that dimension central to their use’.
YouTube fits this definition well as a particular kind of social media, in
which networking is mediated by uploading media (videos, images, written
text) and commenting. This definition encompasses YouTube and other
media sharing services, in which social interaction is mediated by the activities
of uploading and viewing media.
The second step involves a particular understanding of the sense in
which YouTube is ‘interactive’. Rafaeli and Ariel (2007) make a useful dis-
tinction between two kinds of interactivity in digital media: one concerned
with ‘responsiveness’ to user input, the other with interpersonal interaction.
In the first sense, the interactivity (or responsiveness) of YouTube is a
matter of how the interface responds to user input; in the second it is a matter
of the quality of the social interaction that can be observed on YouTube
pages. Adopting a discourse-based view, Rafaeli and Ariel define this second
sense of interactivity as ‘the extent to which messages in a sequence relate
to each other and especially the extent to which later messages recount the
relatedness of earlier messages’ (73).
This definition of interactivity takes cohesion and coherence as the main
indicator of the quality of digitally mediated social interaction. Herring’s
(2013) comparison of message sequences in a recreational Internet Relay
Chat (IRC) chat room and YouTube comments also attends to linkages
among messages. Herring found that comments on media sharing sites
tend to be ‘prompt-focused’ and ‘respond to an initial prompt, such as a
news story, a photo, or a video, more often than to other users’ responses’
(13). Few comments are linked to a previous comment, which means that
the ‘stepwise’ patterns of topic development observed on IRC chat are
largely absent. This comparison is problematic, however, to the extent
that it only considers linkages among written messages. Sindoni (2013:
205) goes a step further by positing a ‘multimodal relevance maxim’ for
YouTube comments, which states that ‘[c]omments need to be consistent
with the main communicative focus of multimodal interaction and the most
salient semiotic resource: the foregrounded video’; but at the same time,
she appears to remove the video from the field of interaction, by describing
it as a ‘master-text’ and comments on it as ‘meta-texts’, or ‘adjuncts’ to the
video (180). The framework that I will describe develops this view by treat-
ing the uploading of a video as an interactional turn, which begins a process
of multimodal social interaction in which users ‘respond’ to the ‘initiation’
of the video using a variety of semiotic modes.
A framework for analysing YouTube interaction
The framework used in this study is based on the framework for analysing
the structure of spoken interaction developed at Birmingham University
in the 1970s (Sinclair and Coulthard 1975). Originally based on classroom
interaction, this framework was later applied to analysis of everyday English
conversation (Stenström and Stenström 1994; Tsui 1994). The framework
identifies a hierarchy of nested units – ‘transaction’, ‘exchange’, ‘move’ and
‘act’ – through which participants organise spoken interaction. Transactions
are composed of exchanges, which are typically composed of two or three
moves (Initiation, Response, Follow-up). In Stenström and Stenström’s
(1994) account, on which this study is based, Sinclair and Coulthard’s three
moves are extended to eight: Summons, Focus, Initiate, Repair, Response,
Re-open, Follow-up, Backchannel. In both accounts, an exchange mini-
mally consists of two moves: an Initiation followed by a Response.
Stenström and Stenström (1994: 30) define a move as ‘what a speaker
does in a turn in order to start, carry on and finish an exchange’. An
Initiation (I) begins an exchange and ‘predicts’ or ‘constrains’ the following
move, which will normally be a Response (R) (Coulthard 1985). Stenström
and Stenström insert the CA category of ‘turn’ (Sacks et al. 1974) between
exchange and move. An IR exchange may take place over two turns, but the
categories of move and turn are not coterminous. A turn that follows an
Initiation often consists of two moves: a Response followed by an Initiation
(R+I). Coulthard also gives an example of a single move that contains both
a Response and an Initiation (R/I).
1  Teacher:  can anyone tell me what this means    I
2  Pupil:    does it mean danger men at work       R/I
3  Teacher:  Yes . . .                             R
Source: Adapted from Coulthard (1985: 135)
In this example, Turn 2 is both a Response to the Initiation in Turn 1 and
an Initiation that elicits a Response in Turn 3. Importantly, R+I and R/I
turns are a resource for speakers to produce sequences of topically linked
exchanges and crucial to the stepwise topic development that Herring
(2013) fails to find in YouTube comments. Exchanges typically flow into
new exchanges through R+I and R/I moves, but terminate with turns that
consist only of a Response move.
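The relationship between turns, moves and exchanges described above can be sketched in code. The following is only an illustrative sketch, not part of the framework as published: the `Turn` class and the helper names `move_set`, `exchanges` and `closes_sequence` are hypothetical, and the pairing rule assumes that a Response always occurs in the turn immediately following the Initiation, as it does in the Coulthard example.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    text: str
    moves: str  # analyst's coding: "I", "R", "R+I" or "R/I"


def move_set(turn: Turn) -> set:
    """Split a coding such as 'R+I' or 'R/I' into its component moves."""
    return set(turn.moves.replace("+", "/").split("/"))


def exchanges(turns: list) -> list:
    """Pair each turn containing an I move with an immediately following
    turn containing an R move, returning 1-based turn numbers."""
    pairs = []
    for i in range(len(turns) - 1):
        if "I" in move_set(turns[i]) and "R" in move_set(turns[i + 1]):
            pairs.append((i + 1, i + 2))
    return pairs


def closes_sequence(turn: Turn) -> bool:
    """A turn made up only of a Response move opens no new exchange."""
    return move_set(turn) == {"R"}


# The example adapted from Coulthard (1985: 135).
example = [
    Turn("Teacher", "can anyone tell me what this means", "I"),
    Turn("Pupil", "does it mean danger men at work", "R/I"),
    Turn("Teacher", "Yes . . .", "R"),
]

print(exchanges(example))
print(closes_sequence(example[-1]))
```

On this coding, Turn 2 belongs to two exchanges at once (as the Response of the first and the Initiation of the second), which is the property that allows one exchange to flow into the next.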
Act is the smallest interactional unit, signalling ‘what the speaker
intends, what s/he wants to communicate’. A move may consist of one, two
or several acts. Coulthard (1985) emphasises that interactional acts differ
from ‘speech acts’ (Searle 1969), because they are defined principally by
their interactive function. The speech act ‘statement’, for example, might
be classified interactionally as an ‘inform’ if it occurs in an I move, or as an
‘answer’ if it occurs in an R move responding to a ‘question’. The speech act
‘question’ might also be classified interactionally as a ‘challenge’ or ‘clarification
check’ if it occurs in an R move. Because moves are realised by acts,
they predict both the move and the range of acts that must follow if the
interaction is not to break down. An I move that contains a ‘question’, for
example, predicts an R move containing an ‘answer’. In the study discussed
later in this chapter, YouTube data were coded using a modified version of
Stenström and Stenström’s (1994) taxonomies of moves and acts, which are
among several in the literature that differ considerably in their terminology
and in the distinctions they make (cf. Tsui 1994). While the classification
of moves and acts remains an inexact science, these taxonomies proved
useful as heuristic devices for exploring interactional patterns within the
multimodal discourse of YouTube.
In mapping YouTube discourse on to this framework, two assumptions
were made. The first was that interactional moves need not be spoken or
written. From the perspective of MMDA, a framework that could only be
applied to spoken (or conversation-like written) interaction would be nar-
row, because interaction is invariably multimodal (Norris 2013). The sec-
ond assumption was that an exchange may be multimodal in the particular
sense that it begins with a turn that draws on one set of semiotic resources
and is completed by a turn that draws on another. The premise of analysis
was, therefore, that any action that modifies the content of a YouTube page
potentially counts as an interactional turn that can be coded in terms of
moves and acts. The relevant user actions are multiple and include, in addi-
tion to writing a comment or replying to a comment, uploading a video and
‘liking’ or ‘disliking’ a video or comment. The most important implication
of this premise is the treatment of uploaded videos as complex I moves
that predict R moves of various kinds. Rafaeli and Ariel’s (2007) distinction
between ‘responsiveness’ and ‘interactivity’ is again useful. By clicking on
icons on the video player (play, pause, etc.) users produce a correspond-
ing effect but they do not modify the text of the page. This is a matter of
the responsiveness of the YouTube interface. By clicking on the ‘like’ icon
below the video player, on the other hand, the user modifies the page by
increasing the number of ‘likes’ displayed. This is a matter of interactiv-
ity, or social interaction. By clicking a ‘like’ icon a user makes an R move
realised by an ‘evaluate’ act, producing a simple exchange that begins with
the I move of the video and terminates with the R move of the ‘like’. This
approach to analysis also has an impact on our understanding of written
comments, because it allows us to recast them as ‘interactional turns’ that
can generally be classified as I, R, R+I, or R/I moves.
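The premise that only page-modifying actions count as interactional turns can be illustrated with a small lookup. This is a minimal sketch under stated assumptions: the action labels (`upload_video`, `like`, and so on) and the function `code_action` are hypothetical names chosen for illustration, and the coding of written comments is left open because it depends on the analyst's reading of each comment.

```python
from typing import Optional, Tuple

# Actions that change what the page displays (hypothetical labels).
# Each value is (move, act); None means "to be coded by the analyst".
PAGE_MODIFYING = {
    "upload_video": ("I", None),        # a complex I move
    "like": ("R", "evaluate"),
    "dislike": ("R", "evaluate"),
    "write_comment": (None, None),      # may be I, R, R+I or R/I
    "reply_to_comment": (None, None),
}

# Actions that only produce an effect in the player (responsiveness).
PLAYER_CONTROLS = {"play", "pause"}


def code_action(action: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (kind, move, act) for a user action.

    kind is 'turn' when the action modifies the text of the page,
    and 'responsiveness' when it does not."""
    if action in PAGE_MODIFYING:
        move, act = PAGE_MODIFYING[action]
        return ("turn", move, act)
    if action in PLAYER_CONTROLS:
        return ("responsiveness", None, None)
    raise ValueError(f"unknown action: {action}")


print(code_action("like"))
print(code_action("pause"))
```

The design choice here mirrors the distinction drawn from Rafaeli and Ariel (2007): the test for a turn is whether the page changes, not which semiotic mode the user employs.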
The study
The remainder of this chapter discusses how this framework was used in a
study that aimed to uncover evidence of language and intercultural learn-
ing in comments on YouTube videos. The starting point of the study was
an assumption that learning is a social process embedded in interaction
(Seedhouse and Walsh 2010: 127), which, although previous research has
focused largely on learning and spoken interaction, might also be evident
in interactions among YouTube commenters. Videos involving the use of
English and Chinese were identified, and comments related to language
and culture were selected for analysis. A first step in the study was the
design of a framework to analyse the discourse of these comments, which
gradually developed into a more comprehensive framework for analys-
ing the processes of multimodal interaction that generate the text of the
YouTube page as a whole.
The use of this framework is illustrated by data from the ‘Cantonese Word
of the Week!’ series, in which Carlos Vidal (YouTube username, ‘carlosdouh’)
explains the meaning of popular Cantonese colloquial expressions
in an entertaining manner. The series elicited more than 6,054 comments,
of which 1,296 were related to language or culture and analysed in detail. In
the most popular video in the series (‘I am a Hong Kong Girl with 公主病
[Gung Jyuh Behng]’, which recorded more than 1.3 million views) Carlos
introduces a Cantonese expression meaning ‘princess sickness’. This video
elicited 1,260 comments, of which 246 were language-culture related; this
proportion (20 per cent) was similar to the proportion for the whole series
(21 per cent). Comments on this video are used as examples in the follow-
ing sections.
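The two proportions reported above can be checked directly from the comment counts. The snippet below is a simple arithmetic check; the helper name `percent` is hypothetical, and the series total is treated as exactly 6,054 for the purpose of the calculation.

```python
def percent(part: int, whole: int) -> int:
    """Share of `part` in `whole`, rounded to the nearest whole per cent."""
    return round(100 * part / whole)


# Language-culture related comments as a share of all comments.
series_share = percent(1296, 6054)   # whole series
video_share = percent(246, 1260)     # 'Gung Jyuh Behng' video

print(series_share, video_share)
```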
Turns
CA researchers treat conversation as an orderly self-managed system, anal-
ogous to the turn-taking systems that govern games (Sacks et al. 1974).
Although interaction on YouTube is not a self-managed system of this kind,
turn is relevant as a basic unit of interactional analysis. Turns are framed by
the affordances of the YouTube interface, which governs how users’ contri-
butions will appear on the page. In this context, any user action that con-
tributes semantic content to the page counts as a turn. This content may be
produced offline, before the action is performed (e.g. writing a comment
and then uploading it) or it may be embedded in the action (e.g. clicking
a ‘like’ icon). In order for a turn to be interactive, this action, whatever its
form, must somehow be linked to an action performed by another user. In
other words, it must contribute to an interactional exchange.
By uploading a video, together with a title, description and tags, a user
makes the all-important I move that begins the interaction that will gener-
ate the ongoing text of a YouTube page. This move is both multimodal and
interactively complex. In conversation, an utterance becomes an I move
when it is followed by an R move. A speaker may also make more than one
potential I move in the same turn. The R move in the next turn, therefore,
indicates how the next speaker orients towards the preceding turn as an I
move (i.e. the part of the turn they are responding to and how they inter-
pret it interactionally). This point is particularly important in understand-
ing interaction on YouTube pages, because users typically respond to videos
in a variety of ways, orienting either towards the act of uploading the video
or towards some particular I move within the video or the written text that
accompanies it. It is also worth noting that the uploader of the video
has the power to decide who will participate in this interaction, by allowing
unrestricted comments, by limiting viewing and commenting to an identified
group of users, or by disallowing comments altogether.
Response moves
YouTube offers three semiotic modes for R moves: 1) video responses;
2) the ‘like’ and ‘dislike’ icons; and 3) written ‘comments’. A video response
responds to video through the medium of another video. ‘Liking’ or ‘dislik-
ing’ a video or comment is simply a matter of clicking an icon and repre-
sents a simple ‘evaluate’ act. Written comments represent a range of acts,
among which ‘evaluate’ appears to be most frequent on most pages. An
‘evaluate’ in a written comment (‘i like you XD’, ‘Funny!’) is interactively
equivalent to liking or disliking a video and differs only in its modality and
specificity. Written comments also offer three distinctive interactional affordances:
1) they allow R moves to be directed at a particular aspect of the
video; 2) they allow performance of a range of acts within R moves; and 3)
they allow performance of I moves, which potentially lead to prolonged
exchange sequences. Important as they are in the interactional world of
YouTube pages, the ‘like’ and ‘dislike’ icons are interactively limited. They
allow only one move (R) and act (‘evaluate’), which terminates any IR
exchange in which they play a part.
Analysis of written comments provides a good deal of evidence that they
are not simply ‘comments’ on the video ‘prompt’ or ‘metatext’, but interactive
turns that complete or prolong exchanges that begin with the uploading
of a video. Comments that can be classified as simple R moves typically
display different orientations towards the video as an I move. In the ‘Gung
Jyuh Behng’ video, Carlos speaks directly to the camera and addresses
viewers as ‘you’. The 1:27 minute video is divided into three segments:
1) Carlos begins by saying, in Cantonese with English subtitles, ‘I am a
Hong Kong girl with Gung Jyuh Behng’; he then acts out several Cantonese
utterances as if they were spoken by such a girl (e.g. ‘Hurry up and buy
me things! I want Louis Vuitton and Gucci!!’), repeating the phrase Gung
Jyuh Behng after each; 2) he explains the meaning of Gung Jyuh Behng in
English and finishes with his catchphrase for the series, ‘Hear it! Speak it!
Memorise it!’; and 3) he ends the video by directing viewers to his other
videos, Facebook page and Twitter feed. Extract 1 shows five different orientations
towards the video as an I move.
Extract 1
Comment                                        Orientation
1  Funny!                                      The whole video
2  i like you XD                               Carlos
3  Omg! Your cantonese is amazing! And         Segment 1 (Carlos speaking
   fantastically funny your rendition of       Cantonese)
   Princess Syndrome!
4  LOL I totally thought you mean princess     Specific item in Segment 1 (Carlos’s
   cookie/pastry this whole time . . .         pronunciation of 病)
5  Actually we call ‘公主病’ Princess           Specific item in Segment 2 (Carlos’s
   Syndrome instead of ‘princess sickness’     translation of 公主病)
Comments 1 and 2 are simple ‘evaluate’ moves, oriented to the video as
a whole and to Carlos’s performance. The vast majority of non-language-culture
related comments for this video are of these kinds. Comments 3
and 4 are oriented towards the first segment of the video. Comment 3 is
an ‘evaluate’ of this segment, while Comment 4 points specifically
to Carlos’s pronunciation of 病, which could be interpreted as ‘cookie/pastry’,
rather than ‘sickness’. Most of the language-culture related comments
for this video are of these kinds and orient towards Carlos’s use of
Cantonese in Segment 1. Comment 5 is one of a small number oriented
towards the explanation in Segment 2, and none were found that oriented
to Segment 3.
Extract 1 shows how comments, typically, ‘respond to’, rather than ‘comment
on’, videos or specific aspects of them. Three other kinds of evidence
for interactivity can also be seen in Comments 3–5. First, Comments 3 and 4
both use second person address, which is characteristic of comments on vid-
eos in which the uploader speaks directly to the viewer. All three comments
also begin with an ‘uptake’ act (‘OMG’, ‘LOL’, ‘Actually’) that explicitly
links the comment to the video. Lastly, comments are often marked for
affective and epistemic stance (Ochs 1996), which signals an orientation
towards commenting as an interactive event. In the comments above, it
is the ‘uptake’ acts that are affectively (‘OMG’, ‘LOL’) and epistemically
(‘Actually’) marked (note also, ‘I totally thought’, in Comment 4). While
the stance markers used in YouTube comments tend to differ from those
used in face-to-face interaction (notably the use of acronyms and punctua-
tion marks as affective markers), their interactive functions appear to be
very similar.
Complex moves in written comments
While the majority of comments on the ‘Gung Jyuh Behng’ video consist
of a single R move, three other types are also found: turns that consist of
both R and I moves (R+I or R/I) and turns that consist of a single I move.
In the first case, the R move completes an exchange and a further I move
is added. In the latter, the comment is usually linked to the video semanti-
cally, but not interactively. Comments 4 and 5 in Extract 1 are examples of
R/I moves. Comment 4 clearly responds to Carlos’s pronunciation of 病
(‘evaluate’) but also adds the commenter’s opinion that it sounds like the
word for ‘cookie/pastry’ (‘opine’). Comment 5 differs in that an ‘object’
act both responds to Carlos’s opinion on the translation and signals a new
opinion. Extract 2 shows an example of an R+I move followed by an R move.
Extract 2
             Comment                                                  Move
A:           OMG~ why can you pronounce cantonese so freaking well!   R+I
             and I have yr shelf behind u, IKEA product~ha
B (Carlos):  haha, yah i love that shelf!! :D                         R
A’s comment consists of an ‘evaluate’ of Carlos’s Cantonese pronunciation,
followed by an ‘inform’ referring to a shelf that appears behind Carlos in
the video. The following R move, made by Carlos, closes the exchange.
Extract 3 shows an example of an I move.
Extract 3
    Comment                                                            Move
A:  Hey r u really non chinese? u speak mandarin really fluently~      I
B:  when did he speak mandarin??                                       R/I
A’s comment is semantically related to the video, but it is clearly not a
Response to it. Instead, A directs a ‘question’ to Carlos about his ethnicity
(in fact, Carlos is not ethnically Chinese). B’s response is a ‘challenge’ to
this question, which implicitly refers to the fact that Carlos is speaking Can-
tonese, rather than Mandarin.
In all of these examples, new topics are raised in the I moves, which allow
interaction to move forward. R/I and R+I moves are especially important
in this respect because they can, in principle, be followed by further R/I
and R+I moves ad infinitum. The longest sequence of this kind in the comments
on the Gung Jyuh Behng video consists of ten comments. Incomplete
exchanges, in which a potential I move is not followed by an R move, are
also characteristic of the discourse of YouTube comments. Extract 3 exemplifies
this, as B’s R/I does not elicit a Response. This tendency for potential
exchanges to ‘hang’ for want of an R move is one important difference
between YouTube interaction and spoken interaction, in which closure of
exchanges is the norm.
Written comments – patterns of interaction
Based on the analysis presented so far, it is clear that the characteristic
pattern of interaction in written comments is the IR exchange, in which
the video functions as the I move. The vast majority of first comments
in the ‘Cantonese Word of the Week!’ series consist of or begin with
an R move directed at the video or some aspect of it, especially those
that are not related to language or culture. Among the language-culture
related comments, extended comment sequences are more frequent. Of
the 1,296 language-culture related comments in this series, 421 (32 per
cent) were found in sequences of comments that begin with an R+I,
R/I, or I move. There were 154 sequences of this kind with an average
length of 2.7 comments. The fact that such sequences occur primarily
among the language-culture related comments is also significant.
Language and culture is, broadly, the substantive topic raised by the
videos in the series. A language-culture related comment is a comment
oriented towards the substantive topic of the Initiation that the video
represents, and a sequence of comments represents a development of
this topic. The number and length of exchange sequences are, in this
sense, indicators of the semantic breadth and depth of interaction on
the topic of a video.
The patterns that are found within sequences of written comments
are varied and are somewhat similar to multi-party spoken interactions.
Extract 4 illustrates some of the possibilities:
Extract 4
A:  Video: [Carlos saying 公主病]                                   I
B:  공주병 Wow it sounds really similar haha                         R/I
C:  how does it pronounce?                                          R/I (reply to B)
B:  Gong ju byung. Something like that.                             R (reply to C)
D:  That’s cause Chinese is the oldest language compared to         R+I (reply to B)
    Japanese and Korean. The Japanese have ‘kanji’ and the
    Koreans have ‘hanja’, where they take some Chinese
    characters (called ‘hánzi’) and write them the same but
    pronounce them differently. They also have similar sounds
    for meanings, such as 공주병 and 公主病.
E:  i totally agree with you~~~many japanese and korean             R+I (reply to D)
    vocabularies sound like Chinese. I found lots of korean vocab
    have similar sounds to Hakka, one of the dialects of Chinese.
In Extract 4, there are five participants, including Carlos (A), whose
pronunciation of 公主病 is the I move to which B responds. B’s turn is
coded as an R/I move, in which the I move is realised by the information
that 公主病 sounds like a Korean phrase 공주병 with the same meaning. C
then makes an R/I move asking B how the Korean phrase is pronounced
and B responds by providing the pronunciation. The interaction continues
with a comment from D in reply to B’s first comment, offering an explanation
of why the two phrases have similar pronunciations. Lastly, E responds
to D with an R+I move, agreeing with D’s explanation and adding the infor-
mation that many Korean words sound like Hakka words. The sequence of
comments ‘hangs’ at this point, as E’s Initiation does not attract a response.
Extract 4 shows how a topic develops through commenters ending their
turns with I moves. Occasionally, a topic develops in this way through a
series of dyadic conversation-like comments posted by two commenters
(typically when there is a point of dispute). However, multi-party
interactions are more characteristic, in which topics develop in a ‘chain’
of comments, where, for example, A responds to B, C responds to B, D
responds to C, and so on. In Extract 4, B makes the initial R/I move, C
responds to B, and B responds to C and then plays no further part. D then
responds to B’s first turn and E responds to D. This pattern reflects the
way in which asynchronous interactions develop, with participants not
necessarily attending to the whole of a sequence of comments. In this
context, it is worth noting that, in principle, interactive turns that
contain I moves remain open for R moves indefinitely. This is especially
true of videos; the Gung Jyuh Behng video has continued to elicit
Responses over a period of three years. It is also true of written
comments (note, for example, how B’s first comment in Extract 4 elicited
independent Responses from C and D), though less so because comments
move out of view as they are replaced by newer comments.
Displaying interaction
The way in which comments that are available for response become less
likely to elicit responses as they move out of view points to a tension between
the textual product of the YouTube page and the interactional processes
that produce it. While analysts of spoken interaction and conversation-like
written interaction rely on records that reflect the sequence in which turns
were taken, an interactional analysis of a YouTube page can recover this
sequence only partially. ‘Likes’ and ‘dislikes’ can be understood as
interactional moves, but they are recorded only as aggregate numbers; no
trace is left of when or by whom these moves were made. The layout of
written comments has some relationship to the chronological sequence in
which they were posted, but the design of the comments section makes it
difficult to read them in this sequence (see below). The fate of video
responses is
also interesting in this respect, because their name originally signalled an
orientation on the part of YouTube towards interactive dialogue through
the medium of video. In August 2013, however, YouTube announced the
discontinuation of video responses, because they were ‘little-used’ (YouTube
2013). Users were encouraged, instead, to embed links to videos in written
comments, or to indicate that their own videos are responding to others by
using titles, tags and descriptions. Comments on the blog post that discuss
this change suggest that many users saw this as a move away from an orienta-
tion towards ‘conversation’.
The layout of written comments, in particular, reflects a tension between
the sequential order in which they are posted and the ways in which the
page designers assume that users prefer to read them. Comments are
displayed using a modified version of a threaded discussion forum, in which
it is possible, for example, to reply to a comment or to reply to a reply.
Where comments are explicitly linked in this way, they are nested and
displayed in sequential order. The comments section as a whole, however, is
organised differently, so that comments can be viewed in one of two ways.
By default, comments are displayed in descending order according to the
number of ‘likes’ they receive (the ‘top comments’ option). The alternative
is to display them in reverse sequential order, with the most recent
comment appearing at the head of the list (‘newest first’). In addition,
only the most popular or most recent comments are displayed; users need to
click ‘all comments’ to see the section as a whole. The effect is that it
is both difficult and counter-intuitive to read comments in sequential
order (and presumably only a discourse analyst would want to do so!).
Nevertheless, comments can be read in this way and this is even encouraged
by the layout of nested comments. The tension here is, presumably, between
the imperative of providing a satisfactory multimodal reading experience
and that of representing YouTube as an interactional space.
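The difference between the two display options and the sequential reading that an analyst needs can be made concrete in a few lines of code. This is a minimal sketch under stated assumptions, not a description of how YouTube actually ranks comments: the Comment class, the function names, and the sample users, posting order and ‘like’ counts are all hypothetical, and the orderings follow only the description given above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    author: str
    posted: int   # hypothetical posting order (1 = earliest)
    likes: int    # hypothetical number of 'likes'
    replies: List["Comment"] = field(default_factory=list)


def top_comments(comments: List[Comment]) -> List[Comment]:
    """Top-level comments in descending order of 'likes'."""
    return sorted(comments, key=lambda c: c.likes, reverse=True)


def newest_first(comments: List[Comment]) -> List[Comment]:
    """Top-level comments in reverse sequential order."""
    return sorted(comments, key=lambda c: c.posted, reverse=True)


def sequential(comments: List[Comment]) -> List[Comment]:
    """All comments and nested replies in the order they were posted."""
    flat: List[Comment] = []
    for comment in comments:
        flat.append(comment)
        flat.extend(comment.replies)
    return sorted(flat, key=lambda c: c.posted)


# Invented sample data: three top-level comments, one with nested replies.
sample = [
    Comment("u1", 1, 2, replies=[Comment("u4", 2, 0), Comment("u5", 5, 1)]),
    Comment("u2", 3, 7),
    Comment("u3", 4, 0),
]
```

With this sample, top_comments puts u2 first and newest_first puts u3 first, while sequential recovers the posting order only by flattening the nested replies and re-sorting, which is the work the page layout leaves to the reader.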
Conclusion
The aim of this chapter has been to discuss the usability of tools designed
for analysing spoken interaction in the analysis of YouTube pages. It has
shown that these tools are useful, not only in the analysis of written
comments, but also in facilitating a broader understanding of the senses in
which the multimodal text of a YouTube page is a product of interactional
processes of various kinds. In this context, the basic ideas and categories
associated with exchange structure are seen to have applications beyond
spoken interaction. They remain usable and useful in the context of
multimodal digital discourse because, in its essentials, communicative
interaction has certain properties that are relatively independent of
modality. They are also especially useful in the analysis of texts such as
YouTube pages, which do not at first sight appear to be records of
interaction. Androutsopoulos
(2013: 50) describes YouTube as a ‘participatory spectacle’; we are able to
both view and participate in the construction of the text of YouTube pages
as it evolves over time. What we observe in this ‘spectacle’ is interaction
among YouTube users, although the structure of this interaction is somewhat
concealed by the design of the page. Spoken interaction analysis tools,
therefore, help us to see how the text of a YouTube page is, in fact, a
product of interactional processes.
At the same time, there are clearly areas of YouTube to which exchange
structure analysis is not immediately applicable. The analysis of videos as
Initiation moves is probably the most problematic area in this respect. If
interactional moves are made up of acts, then it is clear that the action
of uploading a video involves a large number of acts. While it is relatively
easy to identify the element in a video that commenters are responding
to, it is often difficult to characterise this element as an act (see, for
example, Extract 4, in which the comment responds to Carlos’s pronunciation
of a word). Although they do not appear in the data for this study, video
responses and links among videos using tags and other cross-referencing
devices add another layer of complexity. At one point YouTube appears to
have imagined ‘conversations’ conducted through the medium of video;
how would exchange structure analysis cope with conversations of this
complexity? Another interesting area for further investigation is
communicative actions that are clearly interactive but directed beyond the
page, such as ‘reporting’ a video or comment as spam or abuse and ‘sharing’
a video through other social media services. The latter reminds us that the
YouTube page on which a video is first uploaded is often not the only place
in which it can be seen and discussed.
Acknowledgements
This chapter is based on a project funded by the Hong Kong Research
Grants Council General Research Fund, entitled Informal Language
Learning in Social Media Environments: A YouTube-based Study (Ref. No.
840211). I am grateful to Ada Fong for her work in collecting and analysing
data for this project.
References
Androutsopoulos, J. (2013) ‘Participatory culture and metalinguistic discourse:
performing and negotiating German dialects on YouTube’, in D. Tannen and
A. M. Trester (eds) Discourse 2.0: language and new media, 47–71, Washington, DC:
Georgetown University Press.
Baldry, A. and Thibault, P. J. (2006) Multimodal Transcription and Text Analysis: a
multimedia toolkit and coursebook, London: Equinox.
Barton, D. and Lee, C. (2012) ‘Redefining vernacular literacies in the age of Web
2.0’, Applied Linguistics, 33(3): 282–298.
Bateman, J. A. (2008) Multimodality and Genre: a foundation for the systematic analysis of
multimodal documents, New York: Palgrave Macmillan.
boyd, d. and Heer, J. (2006) ‘Profiles as conversation: networked identity
performance on Friendster’, proceedings of the Hawai’i International Conference on
System Sciences (HICSS-39), Kauai, HI: IEEE Computer Society.
Burgess, J. and Green, J. (2009) YouTube: online video and participatory culture,
Cambridge: Polity Press.
Coulthard, M. (1985) An Introduction to Discourse Analysis, 2nd edn (1st edn 1977),
London: Longman.
Herring, S. C. (2001) ‘Computer-mediated discourse’, in D. Schiffrin, D. Tannen
and H. E. Hamilton (eds) The Handbook of Discourse Analysis, 612–634, Oxford:
Blackwell Publishers.
Herring, S. C. (2013) ‘Discourse in Web 2.0: familiar, reconfigured, and emergent’,
in D. Tannen and A. M. Trester (eds) Discourse 2.0: language and new media, 1–25,
Washington, DC: Georgetown University Press.
Herring, S. C., Kouper, I., Paolillo, J. C., Scheidt, L. A., Tyworth, M., Welsch,
P., Wright, E. and Yu, N. (2005) ‘Conversations in the blogosphere: an analysis
“from the bottom up”’, proceedings of the 38th Hawaii International Conference
on System Sciences, Kauai, HI: IEEE Computer Society.
Herring, S. C., Stein, D. and Virtanen, T. (eds) (2013) Pragmatics of Computer-mediated
Communication, Handbooks of Pragmatics, vol. 9, Berlin: De Gruyter Mouton.
Jones, R. H. (2013) ‘Multimodal discourse analysis’, in C. A. Chapelle (ed.) The
Encyclopedia of Applied Linguistics, Oxford: Blackwell.
Kavoori, A. (2011) Reading YouTube: the critical viewers guide, New York: Peter Lang.
Kessler, F. and Schäfer, M. T. (2009) ‘Navigating YouTube: constituting a hybrid
information management system’, in P. Snickars and P. Vonderau (eds) The
YouTube Reader, 275–291, Stockholm: National Library of Sweden.
Kress, G. (2010) Multimodality: a social semiotic approach to contemporary communication,
New York: Routledge.
Lovink, G. and Niederer, S. (eds) (2008) The Video Vortex Reader: responses to YouTube,
Amsterdam: Institute of Network Cultures.
Norris, S. (2013) ‘Multimodal interaction analysis’, in C. A. Chapelle (ed.) The
Encyclopedia of Applied Linguistics, Oxford: Blackwell.
Ochs, E. (1996) ‘Linguistic resources for socializing humanity’, in J. Gumperz and S.
Levinson (eds) Rethinking Linguistic Relativity, 407–437, Cambridge: Cambridge
University Press.
Rafaeli, S. and Ariel, Y. (2007) ‘Assessing interactivity in computer-mediated
research’, in A. N. Joinson, K. Y. A. McKenna, T. Postmes and U.-D. Reips (eds)
The Oxford Handbook of Internet Psychology, 71–88, Oxford: Oxford University
Press.
Sacks, H., Schegloff, E. A. and Jefferson, G. (1974) ‘A simplest systematics for the
organisation of turn-taking for conversation’, Language, 50(4): 696–735.
Searle, J. (1969) Speech Acts, Cambridge: Cambridge University Press.
Seedhouse, P. (2005) ‘Conversation analysis and language learning’, Language
Teaching, 38: 165–187.
Seedhouse, P. and Walsh, S. (2010) ‘Learning a second language through classroom
interaction’, in P. Seedhouse, S. Walsh and C. Jenks (eds) Conceptualising
‘Learning’ in Applied Linguistics, 127–146, Basingstoke: Palgrave Macmillan.
Sinclair, J. M. and Coulthard, M. (1975) Towards an Analysis of Discourse, Oxford:
Oxford University Press.
Sindoni, M. G. (2013) Spoken and Written Discourse in Online Interactions: a multimodal
approach, London: Routledge.
Snickars, P. and Vonderau, P. (eds) (2009) The YouTube Reader, Stockholm: National
Library of Sweden.
Stenström, A.-B. (1994) An Introduction to Spoken Interaction, London:
Longman.
Strangelove, M. (2010) Watching YouTube: extraordinary videos by ordinary people,
Toronto: University of Toronto Press.
Tsui, A. (1994) English Conversation, Oxford: Oxford University Press.
YouTube (2013) ‘So long, video responses . . . , Next Up: better ways to connect’,
post on the YouTube Partners and Creators blog. Online. Available HTTP:
<http://youtubecreator.blogspot.ca/2013/08/so-long-video-responsesnext-up-
better.html> (accessed 25 February 2014).
Zourou, K. and Lamy, M.-N. (2013) ‘Introduction’, in M.-N. Lamy and K. Zourou
(eds) Social Networking for Language Education, 1–7, Basingstoke, UK: Palgrave
Macmillan.