Quantifying Collaboration Quality in
Face-to-Face Classroom Settings Using MMLA
Anonymized for review
Anonymized for review
Abstract. The estimation of collaboration quality through manual observation and coding is a tedious and difficult task. Researchers have proposed automating this process by estimating collaboration into a few categories (e.g., high vs. low collaboration). However, such categorical estimation lacks depth and actionability, which can be critical for practitioners. We present a case study that evaluates the feasibility of quantifying collaboration quality and its multiple sub-dimensions (e.g., collaboration flow) in an authentic classroom setting. We collected multimodal data (audio and logs) from two groups collaborating face-to-face on a collaborative writing task. The paper describes our exploration of different machine learning models and compares their performance with that of human coders, in the task of estimating collaboration quality along a
continuum. Our results show that it is feasible to quantitatively estimate
collaboration quality and its sub-dimensions, even from simple features
of audio and log data, using machine learning. For instance, our support
vector regression model covered approximately 50% of the performance
gap between randomness and human-level performance. These findings
open possibilities for in-depth automated quantification of collaboration
quality, and the use of more advanced features and algorithms to get
their performance closer to that of human coders.
Keywords: Computer-Supported Collaborative Learning · Multimodal Learning Analytics · Collaboration Quality
1 Introduction
Collaboration has traditionally been studied using observation, interviews and ethnographic methods [11]. Although these methods offer detailed information, they also demand a lot of human effort and time, which makes them difficult to scale up [11]. The use of technology to mediate collaboration has provided researchers with large amounts of learner activity data (in the form of logs), offering an alternative to traditional analyses of collaboration. Researchers have used a variety of data (e.g., system logs, chats, discussion forums) to understand the underlying process of collaboration using Learning Analytics (LA) methods like content analysis and interaction analysis [4]. The results of these analyses have been employed to develop various kinds of feedback systems, from mirroring to guiding support [6]. While Computer-Supported Collaborative Learning (CSCL)
often involves both face-to-face and computer-mediated interactions, collaborative LA support often relies on digital logs alone, thus offering only a partial picture of the interactions. Aware of this limitation, the field of Multimodal Learning Analytics (MMLA) [2] emerged with the goal of understanding learning through multimodal data from digital and physical spaces. Recent MMLA studies showed that it is feasible to estimate collaboration aspects categorically (e.g., high vs. low collaboration) in face-to-face settings by combining physical and digital activity traces [15, 14]. In addition, researchers have found verbal interaction and speaking activity features to be important indicators of collaboration behavior [8, 1]. However, most of these studies were conducted in laboratory settings [3], so their results might not hold under authentic classroom constraints (e.g., noisy data). Moreover, the qualitative estimation of collaboration quality into a few classes provides end users (e.g., a teacher) with little information about the underlying problem or reason why collaboration quality is high or low.
In order to estimate the quality of collaboration in a more fine-grained fashion, this paper explores regression models to quantitatively estimate collaboration quality in a classroom setting from audio and log data. To reach that goal, we carried out a case study in which we collected data from two groups (each with four participants) in an authentic classroom setting. The learning activity involved face-to-face discussion and collaborative writing using digital means. We
applied various regression models and compared their performance with that of
human coders, in the task of coding collaboration quality and its sub-dimensions
along a continuum.
2 Related Work
Researchers have investigated the problem of estimating collaboration into a limited set of categories, in various settings: pair-programming [5], project-based learning [14], and tabletop-based collaborative learning [9]. These studies collected data through different means (audio [15, 7, 1], Kinect sensors [5], system logs [8, 15], and video [15]) and extracted a wide variety of features from them, e.g.: non-verbal features like intensity, pitch, or speaking rate [7]; or spatial and dynamic features like hand movement or distance between learners [14]. These features were in turn used to estimate different aspects of the collaboration process: detecting rapport [7], collaboration quality [15, 1, 9], or success in collaboration [14]. Certain studies [5, 15, 9] have included data from both physical and digital spaces to investigate collaboration behavior.
While most of these studies devised their own coding schemes to annotate or
classify collaboration quality, others [9] have used collaboration rating schemes
that are widely used in the collaborative learning sciences (e.g., [10]). Although
these rating schemes often output quantitative scores (e.g., collaboration quality
[9], grading of collaboration work [14]), such scores have often been mapped into
two or three categories (e.g., high vs. low collaboration), as binary classification is
an easier problem (from an information theory point of view) and often results in
better performance when using statistical analysis [7] or machine learning models
[15, 14]. However, this “flattening” of the scores also takes away much of the
nuance and the different aspects that contribute to high-quality collaboration.
In terms of performance, reported classification accuracies have ranged from modest (48% [5]) to moderate (69% [8]) and high (80% [14], 96% [15]).
A number of gaps emerge from the aforementioned state of the art. First, MMLA researchers have mostly built models to estimate collaboration quality categorically, which offers limited information about the reasons or underlying structure of that judgement (i.e., limited explainability and actionability). Second, as a consequence, there is still a lack of understanding regarding whether (or to what extent) we can estimate collaboration quality along a continuum. Third, MMLA studies often report their results without frames of reference that can help the community understand how far (or how close) we are to developing solutions of practical relevance to our classrooms (e.g., how they compare with human-level classification or quantification of collaboration quality).
3 Methodology
To address the gaps identified in the previous section, we set up a study to explore the following research questions: RQ1. How well can we estimate collaboration quality using machine learning on audio and log data from an authentic classroom setting in upper secondary school? RQ2. How well can we estimate the various sub-dimensions of collaboration quality with machine learning, using audio and log data from an authentic classroom setting?
To start addressing these questions, we have conducted a first case study
[16] in an authentic classroom setting, where learners performed collaborative
discussion and writing tasks as part of their normal classes. Such a case study methodology allowed us to understand the situation in depth and to explore multiple aspects of the research questions (e.g., data fusion and regression models, collaboration sub-dimensions).
4 Case Study
A 30-minute collaborative activity was co-designed by a researcher and a teacher, in which the students had to discuss and fill in a worksheet on genetic mutations. The activity was enacted in a secondary education biology course
with 10 students in autumn 2019.
Two researchers were present in the classroom for data collection purposes
and technical support. A brief introduction was given to the students about the aim of the study, and their written consent for data collection was obtained before the activity. In the case study below, the data from two groups (four
students each) are analyzed.
4.1 Data Collection
The students used an audio-capturing prototype with an omni-directional microphone (CoTrack) placed in the center of the group's table, and Etherpad (an open source real-time collaborative text editor, see https://etherpad.org) for the collaborative writing. CoTrack uses a voice activity detection (VAD) algorithm to detect the presence of voice, and a direction-of-arrival (DOA) algorithm to identify the direction of the sound. The prototype then maps the direction to a particular learner and extracts various features (e.g., speaking time, IP addresses, number of characters added or deleted) from the audio and Etherpad logs. Our analyses below use a total of 12 features: three features (speaking time, number of characters added, and number of characters deleted) for each of the four students in the group, for every 30-second window of time (see below).
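To make this concrete, the sketch below shows one way such a 12-column per-window feature table could be assembled from timestamped speech segments and Etherpad log entries. This is our own illustration, not CoTrack's actual implementation; the input column names, units, and the assumption that timestamps are pandas datetimes are ours.

```python
# Illustrative sketch only: CoTrack's real pipeline is not reproduced here, so
# the input layout (column names, units) is assumed for the sake of the example.
import pandas as pd

WINDOW = "30s"  # the 30-second windows used throughout the paper

def per_window_features(speech: pd.DataFrame, edits: pd.DataFrame) -> pd.DataFrame:
    """speech: one row per detected speech segment, with columns
         ['timestamp', 'student', 'duration_s'] (VAD segments mapped to a
         student via the DOA estimate).
       edits: one row per Etherpad log entry, with columns
         ['timestamp', 'student', 'chars_added', 'chars_deleted'].
       Returns one row per 30-second window with 4 students x 3 features."""
    win = pd.Grouper(key="timestamp", freq=WINDOW)

    talk = (speech.groupby(["student", win])["duration_s"].sum()
                  .unstack("student").add_prefix("speak_"))
    added = (edits.groupby(["student", win])["chars_added"].sum()
                  .unstack("student").add_prefix("added_"))
    deleted = (edits.groupby(["student", win])["chars_deleted"].sum()
                    .unstack("student").add_prefix("deleted_"))

    # Windows without speech or edits for a student are filled with zeros.
    return pd.concat([talk, added, deleted], axis=1).fillna(0)
```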
4.2 Data Annotation
We used the collaboration quality rating scheme by Rummel et al. [13] (itself adapted from [10]), assigning a collaboration quality score along seven dimensions: Sustaining Mutual Understanding (SMU), Collaboration Flow (CF), Knowledge Exchange (KE), Cooperative Orientation (CO), Argumentation (ARG), Structuring Problem Solving Process and Time Management (SPST), and Individual Task Orientation (ITO). Two raters coded these dimensions at the group level (except ITO, which was coded at the individual level and averaged to obtain the group-level score). Following the recommendations by Martínez et al. [8], we used time windows of 30 seconds, in which each of the aforementioned sub-dimensions was assigned a score between −2 (very bad) and +2 (very good). The sub-dimension scores at the group level were then added up to get the overall collaboration quality score of the group for that time window (which can theoretically range from −14 to +14). This resulted in a dataset with 121 data points from the collaboration of two learner groups. The two raters went through four iterations of coding before reaching substantial agreement on each sub-dimension in terms of Cohen's kappa (Table 1).
Table 1: Inter-rater agreement of human coders in each collaboration quality
sub-dimension (Cohen’s kappa)
SMU CF KE ARG SPST CO ITO-1 ITO-2 ITO-3 ITO-4
0.71 0.91 0.74 0.80 0.65 0.68 0.72 0.76 0.75 0.78
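To make the scoring concrete, the following sketch (our own illustration, not the raters' tooling) computes one window's overall quality score from hypothetical sub-dimension ratings, averaging the individually coded ITO scores first.

```python
# Hypothetical example window: each rating lies in [-2, +2].
window_ratings = {
    "SMU": 1, "CF": 2, "KE": 0, "CO": 1, "ARG": -1, "SPST": 0,
    "ITO": [1, 2, 0, 1],  # ITO is rated per student, then averaged
}

def collaboration_quality(ratings: dict) -> float:
    """Sum of the six group-level ratings plus the averaged ITO rating;
       the result can theoretically range from -14 to +14."""
    group_dims = ["SMU", "CF", "KE", "CO", "ARG", "SPST"]
    ito = sum(ratings["ITO"]) / len(ratings["ITO"])
    return sum(ratings[d] for d in group_dims) + ito

print(collaboration_quality(window_ratings))  # 1+2+0+1-1+0 + 1.0 = 4.0
```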
4.3 Data Analysis
To map the individual student audio and log features to group-level features, we explored three different approaches: simple averaging of individual scores, dimensionality reduction, and entropy-based fusion (the data analysis source code is available at a link anonymized for review). In the dimensionality reduction approach we applied principal component analysis (PCA) to all individual-level features and extracted the four components that explained most of the variance. The entropy-based approach has been used to map individual features to group-level features in previous research [1]: each individual feature is converted into a proportion p(x) (by dividing its value by the sum of that feature over the group in the corresponding time window), and these proportions are then used to compute a group-level feature using Shannon's entropy:

H = -\sum_{x \in X} p(x) \log_2 p(x)
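The sketch below illustrates the entropy-based and PCA-based fusion approaches under the assumptions above (a windows-by-4 array for one feature type in the entropy case, and the full windows-by-12 matrix for PCA). The choice of four principal components follows the text; the function and variable names are ours.

```python
import numpy as np
from sklearn.decomposition import PCA

def entropy_fusion(X: np.ndarray) -> np.ndarray:
    """X: (n_windows, 4) array with one feature type (e.g. speaking time)
       per student. Returns one Shannon-entropy value per window."""
    X = X.astype(float)
    totals = X.sum(axis=1, keepdims=True)
    # Proportion of the group's total for each student; all-zero windows -> 0.
    p = np.divide(X, totals, out=np.zeros_like(X), where=totals > 0)
    # Shannon entropy, with the convention 0 * log2(0) = 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log2(p), 0.0)
    return -plogp.sum(axis=1)

def pca_fusion(X_all: np.ndarray, n_components: int = 4) -> np.ndarray:
    """X_all: (n_windows, 12) matrix of all individual-level features.
       Keeps the four components that explain most of the variance."""
    return PCA(n_components=n_components).fit_transform(X_all)
```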
For the training and evaluation of the machine learning models, we used Python's Scikit-learn library [12]. We randomly divided our dataset into training and test sets using a 70:30 ratio. We trained regression models of different kinds on our training set and investigated their performance on the test set. Concretely, the model families explored included K-Nearest Neighbors, Random Forest, AdaBoost, Gradient Boosting, XGBoost, Support Vector Regression (SVR), neural networks, and an ensemble (voting) regressor (combining SVR, Random Forest and AdaBoost). We used GridSearchCV (from Scikit-learn) with 3-fold cross-validation to tune each model's hyperparameters.
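A sketch of this training setup for the SVR case is shown below. The scikit-learn calls are standard; the parameter grid, the scaling step, the random seed and the stand-in data are our own assumptions, since the paper does not report them.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Stand-in data so the sketch runs on its own: 121 windows of 4 PCA-fused
# features and their overall quality scores (replace with the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(121, 4))
y = rng.uniform(-14, 14, size=121)

# 70:30 random train/test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Grid-searched SVR with 3-fold cross-validation; grid values are assumed.
svr = make_pipeline(StandardScaler(), SVR())
grid = GridSearchCV(
    svr,
    param_grid={"svr__kernel": ["rbf", "linear"],
                "svr__C": [0.1, 1, 10, 100],
                "svr__epsilon": [0.1, 0.5, 1.0]},
    cv=3,
    scoring="neg_root_mean_squared_error")
grid.fit(X_train, y_train)

rmse_train = np.sqrt(mean_squared_error(y_train, grid.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, grid.predict(X_test)))
print(f"train RMSE: {rmse_train:.2f}, test RMSE: {rmse_test:.2f}")
```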
4.4 Results
We used RMSE (Root Mean Square Error) as the performance metric to compare the different regression models. As frames of reference, we computed the RMSE that the human coders had achieved in their last round of manual collaboration quality scoring. We also computed the RMSE of two “no-information” regressors: one that simply estimates random values within the range of possible quality scores, and one that always predicts the average value of the collaboration quality (quality = 1.93 for this dataset).
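The two no-information frames of reference can be sketched as follows, reusing the y_train and y_test variables from the previous sketch (the random seed is arbitrary):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

# "Random" baseline: guesses drawn uniformly from the theoretical quality
# range [-14, +14].
rmse_random = np.sqrt(mean_squared_error(
    y_test, rng.uniform(-14, 14, size=len(y_test))))

# "Average" baseline: always predicts the mean quality seen in training
# (about 1.93 for the dataset in this paper).
rmse_average = np.sqrt(mean_squared_error(
    y_test, np.full(len(y_test), y_train.mean())))
```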
From our analysis of the three fusion approaches, we found PCA-based fusion to be a better option than the entropy-based and averaging approaches, in terms of the performance of the different regression models on test data. Figure 1(a) shows that the regression models based on PCA fusion achieved lower RMSE on test data than those based on entropy or average fusion.
For the comparative analysis among regression models, we used PCA-based fusion and trained different kinds of regression models, computing RMSE for both the training and test data. All regression models (except Gradient Boosting) performed better than the average estimation model (Figure 1(b)). The XGBoost and neural network models showed the largest gap between training and test errors, which can probably be explained by these models overfitting the small dataset available. The support vector regression (SVR) model performed better than the other models, both in terms of a lower RMSE and a smaller difference between training and test error. Comparing the performance of this SVR model (which, we should recall, used only very basic audio and log
Fig. 1: Performance scores of the regression models: (a) RMSE for the different fusion approaches; (b) RMSE of the different regression models
features) with that of the no-information models (average and random) and the
human coders’ own RMSE values, we find that SVR covered about 50% of the
gap between the best no-information predictors and human-level performance.
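Our reading of “covering about 50% of the gap” can be made explicit as the share of the RMSE distance between the best no-information baseline and the human coders that a model closes. This is our own formalization of the comparison; the actual RMSE values are those shown in Figure 1.

```python
def gap_covered(rmse_model: float, rmse_baseline: float, rmse_human: float) -> float:
    """Fraction of the baseline-to-human RMSE gap closed by a model:
       0.0 = no better than the best no-information baseline,
       1.0 = human-level error."""
    return (rmse_baseline - rmse_model) / (rmse_baseline - rmse_human)

# Illustrative values only: gap_covered(3.0, 4.0, 2.0) returns 0.5 (50% of the gap).
```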
Fig. 2: SVR performance on various sub-dimensions of collaboration quality
We also applied similar regression models to estimate the seven sub-dimensions of collaboration quality. Again, support vector regression models performed better than the other models in estimating the majority of the sub-dimensions. Figure 2 shows the RMSE scores of the support vector regression model, compared with the no-information and human-level frames of reference. For these dimensions,
SVR covered 50% or more of the gap between no-information and human-level
performance.
5 Conclusions and future work
This paper investigated the feasibility of estimating the quality of collaboration in face-to-face classroom settings using simple features from audio and log data, and machine learning regression models. Our results suggest that it is feasible to quantitatively estimate collaboration quality along a continuum, and even open the door to more in-depth estimation of the different collaboration sub-dimensions (e.g., collaboration flow, knowledge exchange), which can be of greater value to practitioners. We also provided three frames of reference (average and random no-information estimators, as well as the human coders' performance) with the aim of offering a more interpretable view of model performance. We suggest that future MMLA researchers analyze their models' performance using such frames of reference, to help our research community better understand how far our models and solutions have to go to achieve human-level performance.
This work is not without limitations. The small size of our dataset is probably the main weakness of our results so far, greatly limiting the generalizability of the particular models and performance claims made. This issue can also explain the discrepancy between training and test errors of some of the regression models (e.g., Gradient Boosting, AdaBoost, neural networks), due to overfitting. The expansion of this dataset with data from groupwork performed in different kinds of authentic classroom settings is one of our most important avenues of future work.
Moreover, in the current case study we only used simple audio and log features and a limited set of machine learning models, considering all dataset samples independently (i.e., not looking at their sequence). The use of more complex features (e.g., MFCCs for audio data, or conversion of voice to text and subsequent analyses of content), different data fusion models, and the exploration of time-dependent machine learning models (e.g., Hidden Markov Models, sequence analysis) will also be important ways to expand our work towards automated estimation of collaboration quality that approaches human-level performance.
Acknowledgement
Anonymized for review
References
1. Bassiou, N., Tsiartas, A., Smith, J., Bratt, H., Richey, C., Shriberg, E., D’Angelo,
C., Alozie, N.: Privacy-preserving speech analytics for automatic assessment of stu-
dent collaboration. In: Proceedings of the Annual Conference of the International
Speech Communication Association, INTERSPEECH. pp. 888–892 (2016)
2. Blikstein, P., Worsley, M.: Multimodal learning analytics and education data min-
ing: using computational technologies to measure complex learning tasks. Journal
of Learning Analytics 3(2), 220–238 (2016)
3. Chua, Y.H.V., Dauwels, J., Tan, S.C.: Technologies for automated analysis of co-located, real-life, physical learning spaces: Where are we now? In: Proceedings of the 9th International Conference on Learning Analytics & Knowledge. pp. 11–20. LAK19, ACM, NY, USA (2019)
4. Dillenbourg, P., Järvelä, S., Fischer, F.: The evolution of research on computer-supported collaborative learning. In: Balacheff, N., Ludvigsen, S., de Jong, T., Lazonder, A., Barnes, S. (eds.) Technology-Enhanced Learning: Principles and Products, pp. 3–19. Springer Netherlands, Dordrecht (2009)
5. Grover, S., Bienkowski, M., Tamrakar, A., Siddiquie, B., Salter, D., Divakaran, A.: Multimodal analytics to study collaborative problem solving in pair programming. In: Proceedings of the Sixth International Conference on Learning Analytics & Knowledge. pp. 516–517. LAK '16, ACM, NY, USA (2016)
6. Jermann, P., Soller, A., Muehlenbrock, M.: From mirroring to guiding: A review of state of the art technology for supporting collaborative learning. International Journal of Artificial Intelligence in Education (IJAIED) 15, 261–290 (2005)
7. Lubold, N., Pon-Barry, H.: Acoustic-prosodic entrainment and rapport in collaborative learning dialogues. In: Proceedings of the 2014 ACM Workshop on Multimodal Learning Analytics Workshop and Grand Challenge. pp. 5–12. MLA '14, ACM, NY, USA (2014)
8. Martinez, R., Wallace, J.R., Kay, J., Yacef, K.: Modelling and identifying col-
laborative situations in a collocated multi-display groupware setting. In: Biswas,
G., Bull, S., Kay, J., Mitrovic, A. (eds.) Artificial Intelligence in Education. pp.
196–204. Springer, Berlin, Heidelberg (2011)
9. Martinez-Maldonado, R., Dimitriadis, Y., Martínez-Monés, A., Kay, J., Yacef, K.: Capturing and analyzing verbal and physical collaborative learning interactions at an enriched interactive tabletop. International Journal of Computer-Supported Collaborative Learning 8(4), 455–485 (2013)
10. Meier, A., Spada, H., Rummel, N.: A rating scheme for assessing the quality of
computer-supported collaboration processes. International Journal of Computer-
Supported Collaborative Learning 2(1), 63–86 (2007)
11. Mercer, N., Littleton, K., Wegerif, R.: Methods for studying the processes of inter-
action and collaborative activity in computer-based educational activities. Tech-
nology, Pedagogy and Education 13(2), 195–212 (2004)
12. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A.,
Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine
learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
13. Rummel, N., Deiglmayr, A., Spada, H., Kahrimanis, G., Avouris, N.: Analyzing
collaborative interactions across domains and settings: an adaptable rating scheme.
In: Puntambekar, S., Erkens, G., Hmelo-Silver, C. (eds.) Analyzing Interactions in
CSCL: Methods, Approaches and Issues. pp. 367–390. Springer US, Boston, MA
(2011)
14. Spikol, D., Ruffaldi, E., Dabisias, G., Cukurova, M.: Supervised machine learning
in multimodal learning analytics for estimating success in project-based learning.
Journal of Computer Assisted Learning 34(4), 366–377 (2018)
15. Viswanathan, S.A., VanLehn, K.: Using the tablet gestures and speech of pairs of
students to classify their collaboration. IEEE Transactions on Learning Technolo-
gies 11(2), 230–242 (2018)
16. Yin, R.K.: Case study methods. APA Handbooks in Psychology®. American Psychological Association, Washington, DC, US (2012)