
Multimodal (Inter)action Analysis in a Nutshell: Philosophy, Theory, Method and Methodology



This paper presents a concise introduction to Multimodal (inter)action analysis (MIA), which began to be developed in the early 2000s in tandem with technological advances for visual qualitative research. By now, MIA has grown into a fully-fledged research framework, including multimodal philosophy, theory, method and methodology for the study of human action, interaction and identity. With systematic phases from data collection to transcription (including transcription conventions) and data analysis, this framework allows researchers to work in a data-driven and replicable manner, moving past common interpretive paradigms (Norris 2019, 2020).
QUIVIRR, VOL. 2 (2021) DOI: 10.5278/ojs.quivirr.v2.2021.a0004
Sigrid Norris
School of Communication Studies
Auckland University of Technology
Keywords: multimodal (inter)action analysis, multimodal philosophy, multimodal theory,
multimodal method, multimodal methodology
1. Introduction
Video-based qualitative research is an exciting area to work in. There are many new
developments and perspectives. The one that we work in at the AUT Multimodal
Research Centre is multimodal (inter)action analysis (MIA) (Norris 2004a, b, 2011a,
2019, 2020; Geenen et al. 2015), which has grown from theory and methodology to a
fully-fledged framework. I will take a moment to set out the state of this framework
and discuss the role that audio-visual technology plays in MIA.
2. What is MIA?
First, multimodal (inter)action analysis is a coherent framework to analyse video-
based qualitative research of human action, interaction and identity, which was made
possible through technological advances. Second, and often in connection with video-
based data, this framework can be used to analyse other types of data from interviews,
images and chats to diaries.
What makes MIA coherent is that all parts from philosophy to theory, method and
methodology are interconnected. On the overarching philosophical stratum, MIA posits
that human action, interaction and identity come about through a primacy of
perception and a primacy of embodiment (Norris, 2019). On the theoretical stratum,
MIA follows MDA (Scollon, 1998, 2001), insisting on the principles of social action
(including communication) and history (Norris, 2020). On this theoretical level, MIA
further follows MDA by arguing that all human actions are mediated, and that
mediated actions with a history are practices. Then, MIA moves on from MDA to
multimodal mediated theory, noting that mediated actions appear on different levels
(lower-level, higher-level and frozen) (Norris, 2004a). Multimodal mediated theory
further suggests that modes are systems of mediated actions (Norris, 2013) and
discourses are practices (mediated actions with a history) with an institutional and/or
ideological dimension (Norris, 2020).
[Figure 1 content: philosophy (Primacy of Perception & Primacy of Embodiment); theory (Principles of Social action (incl. communication) & history; Units of Analysis: Mediated action (lower-level, higher-level & frozen), Mode as system of mediated action, Practices & Discourses); Data Collection; Phase I to Phase V; analysing data by engaging with methodological tools such as Modal density, Scale of Actions, Levels of action, Agency, Site of engagement etc.]
Figure 1 - Multimodal (inter)action analysis is a coherent framework for the analysis of
human action, interaction and identity
When looking at Figure 1, we see how the philosophical and the theoretical strata
seemingly sit above method and methodology. However, a two-dimensional image
does not do this framework justice. Rather, philosophy and theory play an integral
part in all phases depicted in the method and methodology sections.
2.1. The Phases of MIA and the technology needed
The method part is divided into four phases (Norris, 2019), all of which depend on
audio-visual technology, but none are technology-specific. In other words, researchers
can use the kind of technology they have easy access to, without having to download
any kind of specific software. The first two phases help us collect and keep track of
multimodal, and often diverse, data pieces in a systematic manner. In these first two
phases, the researcher initially produces data collection tables and then data set tables.
Regular computer software from Word to Excel, Free Writing software, or some kind
of qualitative data analysis software, can be used for this.
The next two phases are the transcription phases. In Phase III, interviews, videos
as well as all other data are transcribed into higher-level action tables and the higher-
level actions are then bundled. By doing this, we systematically work through all of
the data that has been collected in a consistent manner, allowing us to work in a data-
driven way that moves beyond common interpretive paradigms (Norris 2019, 2020).
As with Phases I & II, easily accessible computer software chosen by the researcher
is used.
During and/or after the bundling of higher-level actions in Phase III, we select
video-data pieces for micro analyses, which are then transcribed in detail by using
multimodal transcription conventions (Norris 2004a, 2011a, 2019, 2020) in Phase IV.
Here, we again rely on readily accessible audio-visual technology. Rather than relying
on a particular video-editing software, researchers can use whichever one allows
them to examine individual frames and take screenshots.1 As transcription continues
and researchers assemble the screenshots and embed circles, arrows, overlaid
language, etc., they again choose easily accessible software that allows
this kind of assembly. This can be done in writing programs, photo-editing programs,
or even in software such as PowerPoint.
The video excerpts that we transcribe in this detail are usually quite small (often
from a few seconds to less than a minute). The reason for this is that we are interested
in the great detail of an unfolding moment: how does the moment unfold with and
through the multiple modes that are involved? As we transcribe the actions performed
in great detail, we begin to see what we cannot see when watching a video clip. There
are small movements, gaze-changes, etc. that are so very easily missed when watching
a video, but which in fact drive the interaction under scrutiny in the direction that it is
moving. During multimodal transcription, we embed those findings in the transcript
and highlight them through the circles, arrows, etc. mentioned above to make them
clearly visible (Norris, 2016). Because of our transcription conventions, we can ensure
that our process is replicable.
Once we have produced transcripts of pertinent excerpts from our video data, we
engage with methodological tools that are relevant for the data pieces such as modal
density (Norris 2004a), modal configuration (Norris 2009a), the foreground-
background continuum of attention/awareness (Norris 2004a; 2008),
semantic/pragmatic means (Norris 2004a), levels of action (Norris 2009b), scales of
action (Norris 2017b), agency (Norris 2005; Pirini, 2017), or the site of engagement
(Scollon 1998, 2001; Norris, 2004a, 2019, 2020; Norris and Jones 2005). Here again,
we rely on audio-visual technology without, however, favouring any one kind.
1 As to which frames are selected, please have a look at Norris (2004a, 2019).
2.2. Some areas of human action, interaction, and identity where MIA has been used
Multimodal (inter)action analysis is thus a coherent and comprehensive research
framework for the analysis of qualitative video-based data.2 All the pieces in this
framework fit together (Norris 2012; Pirini 2014b), allowing the researcher to build a
coherent picture of whatever human action, interaction or identity is being studied. In
this way, we have made strides in examining space and place or children’s acquisition
(Geenen 2013; Geenen 2017, 2018); identity (Norris 2005, 2007, 2008, 2011a; Norris
and Makboon 2015; Matelau-Doherty and Norris 2021); video conferences (Norris
2017a; Norris and Pirini 2017); business coaching, high school tutoring and
intersubjectivity (Pirini 2013, 2014a, 2016), to name but a few areas in which the
framework has been used. What we at the AUT Multimodal Research Centre are
finding is that with a coherent framework such as MIA, there is much potential to
discover new insight and knowledge about any kind of human action, interaction, and identity.
3. In summary
Multimodal (inter)action analysis is a framework that allows us to make new
discoveries about human action, interaction and identity, and while MIA uses and
relies on audio-visual technology, MIA is software-independent. No doubt, as
technology advances, MIA will advance in tandem.
2 MIA also allows the integration of other kinds of data and allows for the same systematic collection,
analysis and transcription of all varied data.
References
Geenen, Jarret. 2013. Actionary Pertinence: Space to Place in Kitesurfing.
Multimodal Communication 2 (2): 123–153.
Geenen, Jarret. 2017. Show (and Sometimes) Tell: Identity Construction and the
Affordances of Video‐Conferencing. Multimodal Communication 6 (1): 118.
Geenen, Jarret. 2018. The Acquisition of Interactive Aptitudes: A Microgenetic Case
Study. Pragmatics & Society 9 (4): 518–544.
Geenen, Jarret, Sigrid Norris, and Boonyalakha Makboon. 2015. Multimodal
Discourse Analysis. In The International Encyclopedia of Language and
Social Interaction, edited by Karen Tracy, Christina Ilie and Todd Sandel, 1–17.
Hoboken, NJ: Wiley‐Blackwell.
Matelau-Doherty, M. Tui, and Sigrid Norris. 2021. A Methodology to Examine
Identity: Multimodal (Inter)action Analysis. In The Cambridge Handbook of
Identity, edited by Michael Bamberg, Carolin Demuth and Meike Watzlawik,
304–323. Cambridge: Cambridge University Press.
Norris, Sigrid. 2004a. Analyzing Multimodal Interaction: A Methodological
Framework. London: Routledge.
Norris, Sigrid. 2004b. Multimodal Discourse Analysis: A Conceptual Framework. In
Discourse and Technology: Multimodal Discourse Analysis, edited by Philip
Levine and Ron Scollon, 101–115. Washington, DC: Georgetown University
Press.
Norris, Sigrid. 2005. Habitus, Social Identity, the Perception of Male Domination
And Agency? In Discourse in Action: Introducing Mediated Discourse
Analysis, edited by Sigrid Norris and Rodney Jones, 183–197. London:
Routledge.
Norris, Sigrid. 2006. Multiparty Interaction: A Multimodal Perspective on
Relevance. Discourse Studies 8 (3): 401–421.
Norris, Sigrid. 2007. The Micropolitics of Personal National and Ethnicity Identity.
Discourse and Society 18 (5): 653–674.
Norris, Sigrid. 2008. Some Thoughts on Personal Identity Construction: A
Multimodal Perspective. In Advances in Discourse Studies, edited by Vijay
Bhatia, John Flowerdew and Rodney Jones, 132–149. London: Routledge.
Norris, Sigrid. 2009a. Modal Density and Modal Configurations: Multimodal
Actions. In The Routledge Handbook of Multimodal Analysis, edited by Carey
Jewitt. London: Routledge.
Norris, Sigrid. 2009b. Tempo, Auftakt, Levels of Actions, and Practice: Rhythms in
Ordinary Interactions. Journal of Applied Linguistics 6 (3): 333–356.
Norris, Sigrid. 2011a. Identity in (Inter)action: Introducing Multimodal (Inter)action
Analysis. Berlin: De Gruyter Mouton.
Norris, Sigrid. 2012. Multimodal (Inter)action Analysis: An Integrative
Methodology. In Body – Language – Communication, edited by Cornelia
Müller, Ellen Fricke, Alan Cienki and David McNeill, 275-286. Berlin: De
Gruyter Mouton.
Norris, Sigrid. 2013. What is a Mode? Smell, Olfactory Perception, and the Notion
of Mode in Multimodal Mediated Theory. Multimodal Communication 2 (2),
Norris, Sigrid. 2016. Concepts in Multimodal Discourse Analysis with Examples
from Video Conferencing. Yearbook of the Poznan Linguistic Meeting 2 (1),
Norris, Sigrid. 2017a. Rhythmus und Resonanz in Internationalen
Videokonferenzen. In Resonanz, Rhythmus & Synchronisierung:
Erscheinungsformen und Effekte, edited by Thiemo Breyer, Michael B.
Buchholz, Andreas Hamburger, Stefan Pfänder and Elke Schumann, 85-102.
Bielefeld: transcript-Verlag.
Norris, Sigrid. 2017b. Scales of Action: An Example of Driving & Car Talk in
Germany and North America. Text & Talk 37 (1), 117–139.
Norris, Sigrid. 2019. Systematically Working with Multimodal Data: Research
Methods in Multimodal Discourse Analysis. Hoboken, NJ: Wiley Blackwell.
Norris, Sigrid. 2020. Multimodal Theory and Methodology: For the Analysis of
(Inter)action and Identity. London: Routledge.
Norris, Sigrid, and Jones, Rodney H. 2005. Discourse in Action: Introducing
Mediated Discourse Analysis. London: Routledge.
Norris, Sigrid, and Boonyalakha Makboon. 2015. Objects, Frozen Actions, and
Identity: A Multimodal (Inter)action Analysis. Multimodal Communication 4
(1): 43–60.
Norris, Sigrid, and Jesse Pirini. 2017. Communicating Knowledge, Getting
Attention, and Negotiating Disagreement via Videoconferencing Technology:
A Multimodal Analysis. Journal of Organizational Knowledge
Communication 3 (1): 23–48.
Pirini, Jesse. 2013. Analysing Business Coaching: Using Modal Density as a
Methodological Tool. Multimodal Communication 2 (2): 195–215.
Pirini, Jesse. 2014a. Producing Shared Attention/Awareness in High School
Tutoring. Multimodal Communication 3 (2): 163–179.
Pirini, Jesse. 2014b. Introduction to Multimodal (Inter)action Analysis. In
Interactions, Images and Texts: A Reader in Multimodality, edited by Sigrid
Norris and Carmen D. Maier, 77–92. Berlin: De Gruyter Mouton.
Pirini, Jesse. 2016. Intersubjectivity and Materiality: A Multimodal Perspective.
Multimodal Communication 5 (1): 114.
Pirini, Jesse. 2017. Agency and Co‐production: A Multimodal Perspective.
Multimodal Communication 6 (2): 120.
Scollon, Ron. 1998. Mediated Discourse as Social Interaction: A Study of News
Discourse. London: Longman.
Scollon, Ron. 2001. Mediated Discourse: The Nexus of Practice. London:
Routledge.