Predicting Success in Massive Open Online Courses (MOOCs) Using
Cohesion Network Analysis
Scott A. Crossley, Georgia State University, scrossley@gsu.edu
Mihai Dascalu, University Politehnica of Bucharest, mihai.dascalu@cs.pub.ro
Danielle S. McNamara, Arizona State University, dsmcnamara1@gmail.com
Ryan Baker, University of Pennsylvania, ryanshaunbaker@gmail.com
Stefan Trausan-Matu, University Politehnica of Bucharest, trausan@gmail.com
Abstract: This study uses Cohesion Network Analysis (CNA) indices to identify student patterns
related to course completion in a massive open online course (MOOC). This analysis examines a
subsample of 319 students who completed at least one graded assignment and produced at least 50
words in discussion forums in a MOOC on educational data mining. The findings indicate that CNA
indices predict with substantial accuracy (76%) whether students complete the MOOC, helping us
to better understand student retention in this MOOC and to develop more actionable automated
signals of student success.
Introduction
Massive Open Online Courses (MOOCs) open a number of educational opportunities for traditional and non-
traditional learners. However, the size of classes, which easily reaches into the thousands of students, requires
educators and administrators to reconsider traditional approaches to instructor intervention and the manner in which
student engagement, motivation, and success are assessed, especially since attrition rates in MOOCs are notoriously high
(Ramesh, Goldwasser, Huang, Daume, & Getoor, 2014). The uniqueness of MOOCs and the difficulties associated
with them have opened new research areas, especially in predicting or explaining completion rates and general student
success. Research has mainly focused on predicting success using click-stream data (i.e., student interactions within
the MOOC software). Other recent approaches include the use of Natural Language Processing (NLP) tools to gauge
students’ affective states (Wen, Yang, & Rose, 2014a, 2014b), measure the sophistication and organization of
students’ discourse within a MOOC (Crossley et al., 2015; Crossley, Paquette, Dascalu, McNamara, & Baker, 2016),
and a combination of click-stream and NLP data (Crossley et al., 2016). In this study, we examine new NLP
approaches grounded in text cohesion and Social Network Analysis (SNA) to predict success in a MOOC related to
educational data mining. Social interaction has long been recognized as an important component of learning
(Vygotsky, 1978). However, while the relationship between language and social participation has been studied in
MOOCs (Dowell et al., 2015), social interaction reflected through the language produced by MOOC students has not
been investigated within large-scale, on-line learning environments.
The variables used in this study are based on Cohesion Network Analysis (CNA), which can be used to
analyze discourse structures within collaborative conversations (Dascalu, Trausan-Matu, McNamara, & Dessus,
2015). CNA indices estimate cohesion between text segments based on similarity measures of semantic proximity.
We hypothesize that students who produce forum posts that are on topic, are more related to other student posts, are
more central to the conversation, and are more collaborative will be more likely to complete the MOOC than those
who are not. We focus specifically on student completion rates because they are an important component of student
success within the course, as well as after its completion (Wang, 2014). We assess links between completion and CNA
indices because CNA indices afford a wide array of opportunities for better understanding student success in terms of
collaboration. Using CNA indices to better understand student completion rates has the potential to inform pedagogical
interventions that provide individualized feedback to MOOC participants and teachers regarding social interactions
such as collaboration. Ultimately, our objective is to enhance participation and active involvement, to increase
completion rates, as well as to increase our understanding of the factors associated with MOOC completion.
MOOC Analysis
MOOCs have become an important component of education research for both instructors and researchers because they
have the potential to increase educational accessibility to distance and lifelong learners (Koller, Ng, Do, & Chen,
2013). Researchers examine links between click-stream data in MOOCs and academic performance because MOOCs
provide a tremendous amount of data via click-stream logs containing detailed records of the students' interactions
with the course content. The measures typically computed from click-stream data that have been used in MOOC
analyses include variables related to counts of the different possible types of actions, the timing of actions, forum
interactions, and assignment attempts, among others (Seaton, Bergner, Chuang, Mitros, & Pritchard, 2014).
More recently, researchers have applied NLP tools to MOOC data (Chaturvedi, Goldwasser, & Daume, 2014;
Wen, Yang, & Rose, 2014a, 2014b; Crossley et al., 2015; Crossley et al., 2016). Traditional usage of NLP tools in
this context focuses on a text’s syntactic and lexical properties. The simplest approaches count the length of words or
sentences, or use pre-existing databases to compare the word properties in a single text to that of a larger, more
representative corpus of texts. More advanced NLP tools measure linguistic features related to the use of rhetorical
structures, syntactic similarity, text cohesion, topic development, and sophisticated indices of word usage. Such tools
have been used to examine text complexity (e.g., cohesion, lexical, and syntactic complexity) in forum posts and the
degree to which these indicators are predictive of MOOC completion. For instance, Crossley et al. (2015) found that
language related to forum post length, lexical sophistication, situational cohesion, cardinal numbers, trigram
production, and writing quality were significantly predictive of whether a MOOC student completed the course
(reporting an accuracy of 67%). In a follow up study, Crossley et al. (2016) combined click-stream data and NLP
approaches to examine whether students' on-line activity and the language they produced in the on-line discussion forum
were predictive of MOOC completion. They found that click-stream variables (e.g., weekly lecture coverage and how
early students submitted their assignments) were the strongest predictors of MOOC completion but that NLP variables
(e.g., the number of entities in a forum post, the post length, the overall quality of the written post, the linguistic
sophistication of the post, cohesion between posts, and word certainty) significantly increased the accuracy of the
model. In total, click-stream and NLP indices predicted which students would complete the course with 76% accuracy.
Combined, these findings indicate that students who are more involved in the course and demonstrate more advanced
linguistic skills are more likely to complete a MOOC.
Current Study
The goal of the study is to test new indices that measure social integration and collaboration using Cohesion Network
Analysis in order to examine student success in a MOOC. Specifically, we perform a longitudinal analysis of the
weekly evolution of CNA indices to predict MOOC success and examine whether students who engage in greater
social interaction that is on topic and central to the MOOC are more successful (i.e., complete the course).
Method
The MOOC: Big Data in Education
In this paper, we evaluate course completion in the context of the Big Data in Education MOOC (BDEMOOC), using
the data from the first iteration of this course, offered through the Coursera platform in 2013. This is the same MOOC
investigated by Crossley et al. (2015, 2016). The course was designed to support
students in learning how to apply a range of educational data mining (EDM) methods to address education research
questions and to develop models that could be used for automated intervention in online learning, or to inform teachers,
curriculum designers, and other stakeholders. This course was targeted to the postgraduate level, and covered material
comparable to a graduate course taught by the instructor. The MOOC ran from October 24, 2013 to December 26,
2013, and included several lecture videos in each of the 8 weeks, and one assignment per week.
In each of the weekly assignments, students conducted a set of analyses on a given data set and answered
questions about the analyses. All assignments were automatically graded, and students had up to three attempts to
complete each assignment successfully. Students received a certificate by obtaining an overall average grade of 70%
or better on at least 6 of the 8 assignments. The course had an official enrollment of over 48,000 at the time of the
course’s official end. 13,314 students watched at least one video, 1,242 students watched all videos, 1,380 students
completed at least one assignment, and 710 made a post in the discussion forums. Of those with posts, 426 students
completed at least one class assignment; overall, 638 students completed the online course and received a certificate. As
such, some students earned a certificate for BDEMOOC without ever posting to the discussion forums.
Student Completion Rates
We selected completion rate as our variable of success because it is one of the most common metrics used in MOOC
research (He, Bailey, Rubinstein, & Zhang, 2015), and correlates with future career participation (Wang, 2014). For this
study, completion was based on a smaller sample of forum posters as described below. “Completion” was pre-defined
as earning an overall grade average of 70% or above. The overall grade was calculated by averaging the 6 highest
grades extracted out of the total of 8 assignments.
Discussion Posts
Discussion posts are of interest within research on student participation in MOOCs because they are one of the core
methods that students use to participate in social learning (Ramesh, Goldwasser, Huang, Daume, & Getoor, 2014).
Discussion forums provide students with a platform to exchange ideas, discuss lectures, ask questions about the course,
and seek technical help, all of which lead to the production of language in a natural setting. Such natural language can
provide researchers with a window into individual student motivation, linguistic skills, writing strategies, and
affective states. This information can in turn be used to develop models to improve student learning experiences
(Ramesh, Goldwasser, Huang, Daume, & Getoor, 2014). In BDEMOOC, students and teaching staff participated
actively in weekly forum discussions. Each week, sub-forums with corresponding discussion threads were created for
that week's specific content, including both videos and assignments. Forum participation did not count toward
students’ final grades. For this study, we focused on participation in the weekly course discussions and extracted all
forum posts and corresponding comments from the MOOC environment for all 426 students who both made at least
one forum post and completed an assignment. We removed all
data from instructors and teaching assistants. We analyzed data from those students who produced at least 50 words
in their aggregated posts (n = 319). Fifty words was used as a cut-off to ensure sufficient linguistic information. Of
these 319 students, 132 did not successfully complete the course while the remaining 187 completed the course.
Cohesion Network Analysis
In Computer Supported Collaborative Learning (CSCL) environments, Cohesion Network Analysis analyzes
discourse structure by combining NLP approaches with SNA (Dascalu, Trausan-Matu, McNamara, & Dessus, 2015).
In CNA, cohesion is computationally represented as an average (or aggregated) value of semantic similarity measures
(Budanitsky & Hirst, 2006) computed using WordNet (Miller, 1995), Latent Semantic Analysis (LSA; Landauer &
Dumais, 1997), and Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003). We used the Touchstone
Applied Science Associates (TASA) corpus (approximately 13 million words; http://lsa.colorado.edu/spaces.html)
together with a collection of articles extracted from the Learning Analytics & Knowledge dataset (652 Learning
Analytics and Knowledge and Educational Data Mining conference papers and 45 journal papers;
https://www.w3.org/TR/REC-rdf-syntax/) to train dedicated LSA and LDA semantic models. The resulting corpora
covered the curriculum of the MOOC and also provided a general knowledge background. Before training, the texts
were preprocessed: stop-words were removed and all words were lemmatized.
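As a rough illustration of this averaging idea (not the authors' implementation, which trains LSA and LDA models on the TASA and LAK/EDM corpora), the cohesion between two text segments can be sketched as the mean of their similarities in an LSA space and an LDA topic space. Here scikit-learn stands in for the dedicated models, the four-sentence corpus is invented, and stop-word removal and lemmatization are omitted for brevity:

```python
# Sketch: cohesion as the average of LSA and LDA similarities between
# two segments, trained on a toy corpus (illustrative values only).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "students mine click stream data from the course logs",
    "educational data mining models predict student success",
    "the forum thread discusses the weekly assignment grades",
    "lectures cover regression models for education research",
]

# LSA space: tf-idf weighting followed by truncated SVD.
tfidf = TfidfVectorizer().fit(corpus)
lsa = TruncatedSVD(n_components=2, random_state=0).fit(tfidf.transform(corpus))

# LDA space: raw counts followed by topic inference.
counts = CountVectorizer().fit(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(
    counts.transform(corpus))

def cohesion(a: str, b: str) -> float:
    """Average the LSA and LDA similarities of two text segments."""
    lsa_sim = cosine_similarity(lsa.transform(tfidf.transform([a])),
                                lsa.transform(tfidf.transform([b])))[0, 0]
    lda_sim = cosine_similarity(lda.transform(counts.transform([a])),
                                lda.transform(counts.transform([b])))[0, 0]
    return (lsa_sim + lda_sim) / 2

score = cohesion(corpus[0], corpus[1])
```

The real system additionally uses WordNet path-based distances; any weighting between the component measures is a design choice of the implementation.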
A cohesion graph (Dascalu, Trausan-Matu, & Dessus, 2013) was generated using cohesion values in order
to determine connections between discourse elements. This graph represents a generalization of the utterance graph
(Trausan-Matu, Stahl, & Sarmiento, 2007) and can be used as a proxy for the semantic content of discourse. The
cohesion graph is a multi-layered structure containing different nodes (Dascalu, 2014) and the links between them. A
central node, representing the conversation’s thread, is divided into contributions, which are further divided into
sentences and words. Links are then built between nodes in order to determine a cohesion score that denotes the
relevance of a contribution within the conversation, or the impact of a word within a sentence or contribution. Other
links are generated between adjacent contributions and are used to detect changes in topic or in the conversation’s
thread; such changes are reflected by cohesion gaps between units of text. Explicit links, created
using an interface functionality such as the “reply-to” option, are contained within the cohesion graph as well. In
addition, cohesive links determined using semantic similarity techniques are added between related contributions
within a window of at most 20 successive contributions, which can be considered the maximum span for this type of
cohesive link (Rebedea, 2012).
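The link-building step above can be sketched as follows. This is a simplified stand-in, not the authors' code: the `similarity` function is a toy word-overlap measure replacing the LSA/LDA/WordNet models, and the 0.2 threshold is an invented cutoff (the paper does not report one):

```python
# Sketch of cohesion-graph link construction over a thread's contributions:
# explicit "reply-to" links are kept, and implicit cohesive links are added
# between contributions at most 20 positions apart whose similarity clears
# a threshold.
from itertools import combinations

MAX_SPAN = 20      # maximum span for implicit cohesive links (per the paper)
THRESHOLD = 0.2    # illustrative cutoff, not taken from the paper

def similarity(a: str, b: str) -> float:
    # Toy stand-in for the LSA/LDA/WordNet similarity measures.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def cohesion_links(contributions, reply_to):
    """Return (i, j, weight, kind) edges of a contribution-level graph."""
    explicit = {frozenset((c, p)) for c, p in reply_to.items()}
    links = [(c, p, 1.0, "explicit") for c, p in reply_to.items()]
    for i, j in combinations(range(len(contributions)), 2):
        if j - i <= MAX_SPAN and frozenset((i, j)) not in explicit:
            s = similarity(contributions[i], contributions[j])
            if s >= THRESHOLD:
                links.append((i, j, s, "cohesive"))
    return links

posts = ["how do I grade the week two assignment",
         "the assignment is graded automatically",
         "thanks, that helps"]
edges = cohesion_links(posts, reply_to={2: 1})
```

In the full model, the same linking is applied at each layer of the graph (conversation, contribution, sentence, word), not only between contributions as shown here.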
Cohesion Scoring Mechanism
Active engagement, in terms of participation in the MOOC, is quantified from the relations established
between nodes in the cohesion graph. The contributions are analyzed
to determine their importance in relation to the discussion’s thread, coverage of topics, and their relatedness to other
contributions. The relevance score of a node in the cohesion graph is based on the relevance of underlying words and
on its relation to other components. For example, a contribution’s relevance score is computed as the sum of its
constituent words’ scores, which are based on statistical presence and semantic relatedness (Dascalu, Trausan-Matu, Dessus, &
McNamara, 2015). Statistical presence represents the word frequency within the text, while semantic relatedness refers
to semantic similarity between the word and the entire conversation thread that contains it. Keywords for the whole
conversation are determined by considering the aggregated score of the two factors.
Afterwards, the cohesion scoring mechanism assigns contribution scores by multiplying each word’s
previously determined score with its normalized term frequency (Dascalu, 2014), estimating an on-topic relevance of
the utterance. Links with other contributions, stored within the cohesion graph, are further used to improve contribution scores.
Each contribution’s local relevance is then calculated with regard to related contributions. Thus, each textual
element’s score can be viewed as its importance within the discourse, covering both the topic and the semantic
relatedness with other elements.
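The two-step scoring above (word scores from statistical presence times semantic relatedness, then contribution scores as tf-weighted sums) can be sketched as follows. This is an illustrative simplification: `relatedness` is a stub standing in for the word-to-thread semantic similarity computed by the LSA/LDA models, and the two sample posts are invented:

```python
# Illustrative version of the cohesion scoring mechanism.
from collections import Counter

def relatedness(word: str, thread_words: Counter) -> float:
    # Stand-in for the semantic similarity of a word to the whole
    # conversation thread; the real system uses LSA/LDA models.
    return 1.0 if word in thread_words else 0.0

def contribution_scores(contributions):
    """Score each contribution from its words' presence and relatedness."""
    thread_words = Counter(w for c in contributions for w in c.split())
    total = sum(thread_words.values())
    # Word score = statistical presence x semantic relatedness.
    word_score = {w: (n / total) * relatedness(w, thread_words)
                  for w, n in thread_words.items()}
    scores = []
    for c in contributions:
        tf = Counter(c.split())
        length = sum(tf.values())
        # Contribution score: word scores weighted by normalized tf.
        scores.append(sum(word_score[w] * (n / length) for w, n in tf.items()))
    return scores

posts = ["data mining data", "mining models"]
scores = contribution_scores(posts)
```

A post dominated by the thread's frequent keywords (here, the first post) receives a higher on-topic relevance score than one using rarer terms.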
Collaboration Assessment
Social knowledge-building (KB) processes (Bereiter, 2002) are derived through collaboration (i.e., scores calculated
on the inter-animation of interactions between different participants). Social KB refers to the external dialogue between
at least two participants supporting collaboration, while inner dialogue is reflected in the continuation of ideas or in
explicitly referenced contributions belonging to the same speaker.
Each contribution has a previously defined importance score and an effect score in terms of both personal and
social KB. The personal score is initially assigned as each utterance’s importance score, while the social score is
initially assigned a zero. By analyzing the links from the cohesion graph, these scores are augmented. If a link is
established between contributions belonging to the same speaker, the knowledge (personal and social) from the
referred contribution is transferred to the personal dimension of the current contribution through the cohesion score.
If the link is established between different users, only the social dimension of the currently analyzed contribution is
increased by the cohesion measure. This enables a measurement of collaboration, perceived as the sum of social KB
effects, each consisting of a contribution’s score multiplied by the cohesion value to related contributions (Dascalu,
Trausan-Matu, McNamara, & Dessus, 2015).
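The personal/social bookkeeping described above can be sketched as a single pass over the cohesion-graph links. This is a minimal reading of the description, not the authors' implementation; the importance values, speakers, and links are invented, and the exact form of the cross-speaker increment is one plausible interpretation of the text:

```python
# Sketch of personal vs. social knowledge-building (KB) score updates.
def knowledge_building(importance, speakers, links):
    """links: (src, dst, cohesion) triples, with src referring back to dst."""
    personal = list(importance)          # initialized to importance scores
    social = [0.0] * len(importance)     # initialized to zero
    for src, dst, cohesion in links:
        if speakers[src] == speakers[dst]:
            # Same speaker: transfer the referred contribution's knowledge
            # to the personal dimension through the cohesion score.
            personal[src] += (personal[dst] + social[dst]) * cohesion
        else:
            # Different speakers: only the social dimension grows,
            # increased by the cohesion measure.
            social[src] += cohesion
    collaboration = sum(social)          # sum of social KB effects
    return personal, social, collaboration

importance = [1.0, 0.8, 0.5]
speakers = ["ana", "ben", "ana"]
links = [(1, 0, 0.6), (2, 0, 0.4), (2, 1, 0.7)]
personal, social, collab = knowledge_building(importance, speakers, links)
```

Contribution 2 here gains personal knowledge from its author's earlier post (contribution 0) and social knowledge from replying to another speaker (contribution 1).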
Interaction Modeling and Integration of Multiple CNA Graphs
The sociogram reflects information exchanges between users and represents the central structure for modeling
interaction and information transfer between participants (Dascalu, 2014). The nodes represent users, while the edges
represent interchanged contributions. This graph considers not only the number of exchanged contributions, but
weights each utterance as a sum of social KB effects to other MOOC participants. Specific SNA metrics are further
computed starting from the sociogram in order to measure centrality or involvement (Dascalu, 2014). Some examples
include the number of links to (out-degree) and from (in-degree) other participants for a specific user. Betweenness
centrality (Bastian, Heymann, & Jacomy, 2009) is computed to determine central nodes and highlights the information
exchange between participants who, if eliminated, would highly reduce communication. The participant’s connection
to other nodes, called closeness centrality (Sabidussi, 1966), is computed as the inverse distance to all other nodes. A
higher value represents a participant’s stronger connection to all other discussion thread participants. The maximal
distance between a node and all other nodes, called eccentricity (Freeman, 1977), reflects how far a user is from the
other participants. These models were extended to facilitate the evaluation of not only a single discussion, but of an
entire MOOC by considering the aggregation of multiple discussion threads. Such a global analysis was used to build
a social network consisting of all involved participants and their contributions, thus enabling the evaluation of
participation at a macroscopic level, not only for specific discussions, but for the entire MOOC. The sociogram
between all participants was generated considering the sum of contribution scores per discussion thread within the
forum. The overview of different user goals, distributions, and interactions provides a broader perspective of each
participant’s evolution within the MOOC.
Longitudinal Analysis
We performed a longitudinal analysis by measuring the distribution of each participant’s involvement throughout the
duration of the MOOC which enabled us to quantify the evolution of learners’ participation, collaboration and
interaction patterns across time. In order to generate each participant’s time distribution, specific sociograms were
built for incremental weekly timeframes and CNA-derived quantitative indices were evaluated, covering the following
elements, as discussed above: a) cumulative utterance scores per participant (i.e., the sum of individual contribution
importance scores that were uttered by a certain participant), b) social KB effect as the cumulative effect of a
participant’s contribution in relation to other speakers, and c) specific SNA metrics (i.e., in-degree, out-degree,
betweenness, closeness and eccentricity centrality measures) computed on the CNA interaction graph.
As expected due to attrition, the density of the interaction graphs decreased markedly between the first and the last
week of the course. The values of each
CNA index per timeframe were used to create individual time series reflecting each participant’s evolution throughout
the course. Afterwards, the longitudinal analysis indices presented in Table 1 were used to model the trends of the
time series generated per participant and per CNA quantitative index. This approach creates an in-depth NLP-centered
perspective of our longitudinal analysis built on top of CNA.
Statistical Analysis
CNA indices that yielded non-normal distributions were removed. A multivariate analysis of variance (MANOVA)
was conducted to examine which indices reported differences between students who completed or did not complete
the MOOC. The MANOVA was followed by a stepwise discriminant function analysis (DFA) using CNA indices that
were normally distributed and demonstrated significant differences between students who completed the course and
those who did not. CNA indices were also checked for multicollinearity (r > .90). In the case of multicollinearity
between indices, the index demonstrating the largest effect size in the MANOVA was retained in the analysis. The
DFA was used to develop an algorithm to predict group membership through a discriminant function coefficient. A
DFA model was first developed for the entire corpus of student forum posts. This model was then used to predict
group membership (completers v. non-completers) for the student forum posts using leave-one-out-cross-validation
(LOOCV) in order to ensure that the model was stable across the dataset.
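The analysis pipeline above can be sketched with scikit-learn's linear discriminant analysis and leave-one-out cross-validation. This is a stand-in, not a reproduction: the data below are synthetic, the group means loosely echo the three retained indices in Table 2 (standard deviation of recurrence, slope of closeness, average closeness), and the stepwise selection and normality/multicollinearity screening steps are omitted:

```python
# Sketch of the DFA + LOOCV step on synthetic "CNA index" data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
# Two synthetic groups over three features (invented scales).
completers = rng.normal([1.4, 0.024, 0.118], [0.5, 0.01, 0.04], size=(60, 3))
dropouts = rng.normal([2.4, 0.006, 0.063], [0.5, 0.01, 0.04], size=(40, 3))
X = np.vstack([completers, dropouts])
y = np.array([1] * 60 + [0] * 40)   # 1 = completed, 0 = did not complete

# Full-sample discriminant model, as in the first DFA of the paper.
model = LinearDiscriminantAnalysis().fit(X, y)

# Leave-one-out cross-validated predictions for the stability check.
loocv_pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y,
                               cv=LeaveOneOut())
accuracy = (loocv_pred == y).mean()
```

With n = 100 observations, LOOCV fits 100 models, each held out on a single student, which is the same stability check the paper applies to its 319 students.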
Table 1. Longitudinal analysis indices applied on students' social media contributions across time.
Average & standard deviation — Average and standard deviation of the considered CNA quantitative index across all timeframes.

Slope — The degree of the slope of the linear regression fit to the time series. The slope indicates whether students became more actively involved (slope > 0), had a uniform involvement (slope = 0), or lost interest throughout the semester (slope < 0).

Entropy — Considering the probability of posting within each timeframe, Shannon's entropy formula (Shannon, 1948) captures discrepancies or inconsistencies in participation patterns. For example, if students are active in only one timeframe, their entropy is 0, whereas if they have constant activity throughout the course, their entropy converges toward the maximum value of log(n), where n is the number of timeframes.

Uniformity — The degree of uniformity is measured as the Jensen–Shannon divergence (JSD) (Manning & Schütze, 1999) to a uniform distribution of 1/n. The JSD is a symmetric function based on the Kullback–Leibler divergence and measures the similarity between two distributions, in our case the student's time series and an ideal, uniform participation in each week.

Local extreme points — The number of local extreme points, determined as the number of timeframes at which the direction of the evolution of the CNA index changes. This reflects the monotonicity of the evolution, or inconsistency in participation or collaboration: if multiple spikes are encountered, they are identified as local minimum or maximum points, so more local extreme points are identified within the time series.

Average & standard deviation of recurrence — Recurrence is expressed as the distance between timeframes in which the learner had at least one contribution in the time series. This is useful for identifying and quantifying pauses, i.e., adjacent weeks without any activity. If each timeframe has at least one event, recurrence is 0, whereas if students take long pauses that generate timeframes with 0 events, recurrence increases (e.g., if they post every 2 weeks, recurrence becomes 1, and so forth).
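The Table 1 indices can be computed for a single participant's weekly time series as follows. This is an illustrative sketch on invented data, not the authors' code; JSD is computed here with the natural logarithm, which is one common convention:

```python
# Toy computation of the Table 1 longitudinal indices for one weekly
# time series of a CNA index (eight invented weekly values).
import math
from statistics import mean, pstdev

series = [3.0, 2.0, 0.0, 1.0, 0.0, 0.0, 2.0, 1.0]
n = len(series)

avg, sd = mean(series), pstdev(series)            # average & std. deviation

# Slope of the least-squares regression line over weeks 0..n-1.
xs = range(n)
xbar = mean(xs)
slope = (sum((x - xbar) * (y - avg) for x, y in zip(xs, series))
         / sum((x - xbar) ** 2 for x in xs))

# Shannon entropy of the posting-probability distribution.
total = sum(series)
entropy = -sum((v / total) * math.log(v / total) for v in series if v > 0)

# Jensen-Shannon divergence to the uniform distribution 1/n.
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
dist = [v / total for v in series]
uniform = [1 / n] * n
mid = [(p + q) / 2 for p, q in zip(dist, uniform)]
jsd = (kl(dist, mid) + kl(uniform, mid)) / 2

# Local extreme points: sign changes in the first difference.
diffs = [b - a for a, b in zip(series, series[1:])]
extremes = sum(1 for d1, d2 in zip(diffs, diffs[1:]) if d1 * d2 < 0)

# Recurrence: gaps (in weeks) between consecutive active timeframes.
active = [i for i, v in enumerate(series) if v > 0]
recurrence = [b - a - 1 for a, b in zip(active, active[1:])]
```

For this declining, spiky series the slope is negative (loss of involvement) and the recurrence list records the pauses between active weeks.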
Results
A MANOVA was conducted using the CNA indices as the dependent variables, and whether the student completed
or did not complete the MOOC as the independent variable. Of the 56 indices, 15 indices were not normally distributed
and were removed. Of the remaining 41 indices, 27 indices did not demonstrate multicollinearity and were retained.
Of these 27 indices, 26 of them demonstrated significant differences between students who completed the MOOC and
students who did not complete the MOOC (see Table 2 for details). These indices demonstrated that MOOC
completers produced posts that were on topic, were more related to other posts, demonstrated greater collaboration,
and were more central to the conversation. These indices were used in the subsequent DFA.
A stepwise DFA using the 26 indices selected through the MANOVA retained three variables: Standard
deviation of recurrence (Overall Score), Slope degree (Closeness), and Average (Closeness). The results demonstrate
that the DFA using these three indices correctly classified 243 of the 319 students in the total set, χ2(df=1) = 86.325,
p < .001, for an accuracy of 76.2%. For the leave-one-out cross-validation (LOOCV), the discriminant analysis
correctly classified 242 of the 319 students, for an accuracy of 75.9%. See Table 3 for recall, precision, and F1 scores for this
analysis. The Cohen’s Kappa measure of agreement between the predicted and actual class label was .518,
demonstrating moderate agreement.
Table 2. Means (SD), F values, and effect sizes for CNA indices demonstrating significant differences between students who completed and did not complete the MOOC.

Index | Did not complete: Mean (SD) | Completed: Mean (SD) | F | η2
Standard deviation of recurrence (Overall score) | 2.433 (0.839) | 1.395 (0.994) | 95.666** | .232
Local extremes (Overall score) | 2.106 (1.134) | 3.401 (1.550) | 66.842** | .174
Slope degree (Closeness) | 0.006 (0.011) | 0.024 (0.022) | 71.637** | .184
Slope degree (Eccentricity) | 0.084 (0.115) | 0.281 (0.252) | 69.91** | .181
Local extremes (Out-degree) | 1.864 (1.247) | 3.198 (1.678) | 60.045** | .159
Degree of uniformity (Overall score) | 0.639 (0.099) | 0.518 (0.169) | 54.739** | .147
Entropy (Overall score) | 0.277 (0.349) | 0.634 (0.542) | 44.412** | .123
Standard deviation of recurrence (In-degree) | 2.113 (0.949) | 1.338 (0.996) | 48.713** | .133
Standard deviation of recurrence (Out-degree) | 2.207 (1.117) | 1.434 (1.062) | 39.325** | .110
Average (Closeness) | 0.063 (0.056) | 0.118 (0.093) | 36.965** | .104
Local extremes (In-degree) | 2.265 (1.313) | 3.166 (1.492) | 31.108** | .089
Entropy (Closeness) | 0.309 (0.432) | 0.702 (0.654) | 36.333** | .103
Average recurrence (Overall score) | 2.628 (0.949) | 1.856 (1.456) | 28.548** | .083
Local extremes (Betweenness) | 1.409 (1.266) | 2.369 (1.709) | 29.997** | .086
Degree of uniformity (Closeness) | 0.606 (0.123) | 0.494 (0.198) | 33.153** | .095
Degree of uniformity (In-degree) | 0.613 (0.110) | 0.522 (0.165) | 30.736** | .088
Entropy (Out-degree) | 0.162 (0.290) | 0.416 (0.469) | 30.461** | .088
Entropy (In-degree) | 0.319 (0.372) | 0.598 (0.534) | 26.909** | .078
Average recurrence (Out-degree) | 3.181 (1.593) | 2.232 (1.724) | 24.941** | .073
Degree of uniformity (Out-degree) | 0.646 (0.098) | 0.574 (0.143) | 24.844** | .073
Standard deviation of recurrence (Betweenness) | 1.905 (1.394) | 1.318 (1.129) | 17.236** | .052
Entropy (Betweenness) | 0.123 (0.287) | 0.284 (0.416) | 14.893** | .045
Standard deviation (Closeness) | 0.121 (0.083) | 0.155 (0.087) | 12.586** | .038
Average recurrence (In-degree) | 2.449 (1.392) | 1.889 (1.640) | 10.219* | .031
Degree of uniformity (Betweenness) | 0.626 (0.110) | 0.583 (0.128) | 9.909* | .030
Average recurrence (Betweenness) | 3.999 (1.931) | 3.320 (2.239) | 7.956* | .024
* p < .010, ** p < .001
Table 3. Recall, precision, and F1 scores for LOOCV DFA.

Group | Recall | Precision | F1
Did not complete | .687 | .765 | .724
Discussion and Conclusion
Previous MOOC studies have investigated completion rates through click-stream data, NLP techniques, or a
combination of both. Our interest in this study was to focus on language indices related to social interaction and
collaboration, which are important components of learning, both inside and outside the classroom (Vygotsky, 1978).
This study examined MOOC completion rates using novel Cohesion Network Analysis indices to estimate connections
between discourse elements in order to develop models of the underlying semantic content of the MOOC forum posts.
The findings from this study indicate that CNA indices are important predictors of student completion rates, with
students who produce posts that are more on topic, more strongly related to other posts, and more central to the
conversation being more likely to finish. Thus, the results support the notion that students who collaborate more are more likely to
complete the MOOC. These findings have important implications for how students’ interactions within the MOOC in
reference to collaboration and social integration can be used to predict success.
The results indicate that overall contribution scores showed the strongest differences between those that
completed the MOOC and those that did not (see MANOVA results in Table 2). In addition, overall contribution
scores, which reflect an estimate of on-topic relevance for each utterance made by each participant, were a significant
predictor in the DFA model. The mean scores (see Table 2) show that participants who produced a greater number of
on-topic posts (i.e., were more engaged with the topic of the MOOC) were more likely to complete the course. The
next strongest predictors of whether students completed or did not complete the course were related to closeness and
eccentricity applied on weekly CNA interaction graphs. These indices reflect how strongly a student’s posts are related
to other posts made by other students (i.e., strength of connection to other posts). The results indicate that students are
more likely to complete the MOOC if their posts share semantic commonalities with posts made by other students.
Two indices related to closeness were included in the final DFA model. After closeness and eccentricity indices, the
next strongest indices were related to in-degree and out-degree. These indices are also computed based on interaction
graphs and measure the number and the semantic strength of links to and from other students. The findings show that
students who complete the MOOC have a greater number of semantically related links to and from other students in
the MOOC. Lastly, a number of betweenness indices demonstrated significant differences between students who
completed the MOOC and those who did not. Betweenness is a measure of how central a node is to communication in
terms of the information exchanged between participants. Importantly, betweenness indices indicate how much
information exchange would be lost if a participant were removed from the conversation. The findings from this study
indicate that participants who were more critical to forum discussion threads were more likely to complete the MOOC.
In terms of comparison to previous findings, our CNA indices alone are as powerful as the ones employed in
previous studies that combined both NLP and click-stream data (Crossley et al., 2016) with accuracies of 76% in both
cases, and more powerful than using NLP indices alone (67% with NLP indices compared to 76% with the CNA indices
used in the longitudinal analysis; Crossley et al., 2015). More importantly, the indices indicate that patterns of
collaboration and social interaction are important for understanding success, going beyond individual linguistic
differences and click-stream patterns. Thus, the findings support the basic notion that cognitive engagement during
learning is a key component of learning and success (Corno & Mandinach, 1983) and that cooperative work may lead
to greater learning gains (Johnson & Johnson, 1990). Moreover, these theories of collaboration within learning
environments can be extended to large-scale online classrooms, such as MOOCs. Even
in MOOCs, it appears that those students who deviate less from the expected content (Standard deviation of recurrence
[Overall Score]), and have higher and stronger connections to other participants (Slope degree and Average
[Closeness]) are more likely to be successful. Other CNA indices that were not included in the DFA, but that
demonstrated significant differences between students who completed the course and those who did not, indicated that
more successful students had more links to and from other students (in- and out-degree), were more central within the
community (low eccentricity), and facilitated conversation among students (betweenness).
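Although the paper relies on discriminant function analysis, the closely related LinearDiscriminantAnalysis classifier in scikit-learn gives a feel for how a set of per-student centrality indices can be turned into a completion classifier. The data below are synthetic stand-ins for the CNA features (completers are simply sampled with slightly higher centrality), not the study's data, so the resulting accuracy is illustrative only.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for five per-student CNA indices (e.g., closeness,
# eccentricity, in-degree, out-degree, betweenness): completers are sampled
# with slightly higher centrality than non-completers.
n = 160
completers = rng.normal(loc=0.6, scale=0.2, size=(n, 5))
dropouts = rng.normal(loc=0.4, scale=0.2, size=(n, 5))
X = np.vstack([completers, dropouts])
y = np.array([1] * n + [0] * n)

# A linear discriminant classifier, evaluated with 10-fold cross-validation.
dfa = LinearDiscriminantAnalysis()
accuracy = cross_val_score(dfa, X, y, cv=10).mean()
```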
The models presented in this paper could be employed to monitor and support students less likely to complete
the course by providing timely and personalized feedback in order to increase MOOC engagement and long-term
completion. However, much of this depends on the availability of textual traces, which are not present in many
MOOCs. While we focused on forum posts in this study, the mechanisms employed here should generalize and, as
such, could be applied to other textual traces, such as participation in collaborative chats, written assignments scored
in terms of how effectively they summarize course lectures, or automatically assessed responses to open-answer
questions. In all cases, the results reported here need to be substantiated in follow-up studies that evaluate
the applicability of the introduced CNA indices in the analysis of MOOCs from other domains and on MOOCs built
on other platforms. The LSA and LDA spaces developed for this study may need to be retrained for new domains,
although this needs to be tested. In addition, the CNA indices introduced here could be combined with more traditional
NLP indices, click-stream variables, and individual difference measures to further enhance our understanding of
student success in on-line classes.
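For readers who want a concrete picture of the semantic layer underlying the CNA links, the sketch below builds a minimal LSA space (tf-idf followed by truncated SVD) and scores pairwise post similarity. It uses scikit-learn rather than the ReaderBench pipeline used in the study, and the four forum posts are invented; a real space would, as noted above, be trained on a domain corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy forum posts: the first two are on-topic, the last two administrative.
posts = [
    "decision trees overfit without pruning on small datasets",
    "pruning a decision tree reduces overfitting",
    "the deadline for assignment two was extended",
    "when is the next assignment due",
]

# A minimal LSA pipeline: tf-idf weighting followed by truncated SVD.
tfidf = TfidfVectorizer().fit_transform(posts)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Cosine similarity in the reduced space approximates semantic relatedness;
# such scores could weight the edges of a CNA interaction graph.
sim = cosine_similarity(lsa)
```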
References
Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: An open source software for exploring and manipulating
networks. In International AAAI Conference on Weblogs and Social Media (pp. 361–362). San Jose, CA: AAAI
Press.
Bereiter, C. (2002). Education and mind in the knowledge age. Mahwah, NJ: Lawrence Erlbaum Associates.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research,
3(4–5), 993–1022.
Budanitsky, A., & Hirst, G. (2006). Evaluating WordNet-based Measures of Lexical Semantic Relatedness.
Computational Linguistics, 32(1), 13–47.
Chaturvedi, S., Goldwasser, D., & Daume, H. (2014). Predicting instructor's intervention in MOOC forums. In 52nd
Annual Meeting of the Association for Computational Linguistics (pp. 1501–1511). Baltimore, MD: ACL.
Corno, L., & Mandinach, E. (1983). The role of cognitive engagement in classroom learning and motivation.
Educational Psychologist, 18, 88–100.
Crossley, S. A., McNamara, D. S., Baker, R., Wang, Y., Paquette, L., Barnes, T., & Bergner, Y. (2015). Language to
completion: Success in an educational data mining massive open online class. In 8th Int. Conf. on Educational
Data Mining (pp. 388–392). Madrid, Spain.
Crossley, S. A., Paquette, L., Dascalu, M., McNamara, D. S., & Baker, R. S. (2016). Combining Click-Stream Data
with NLP Tools to Better Understand MOOC Completion. In 6th Int. Conf. on Learning Analytics & Knowledge
(LAK '16) (pp. 6–14). Edinburgh, UK: ACM.
Dascalu, M. (2014). Analyzing discourse and text complexity for learning and collaborating, Studies in Computational
Intelligence (Vol. 534). Cham, Switzerland: Springer.
Dascalu, M., Trausan-Matu, S., & Dessus, P. (2013). Cohesion-based analysis of CSCL conversations: Holistic and
individual perspectives. In N. Rummel, M. Kapur, M. Nathan & S. Puntambekar (Eds.), 10th Int. Conf. on
Computer-Supported Collaborative Learning (CSCL 2013) (pp. 145–152). Madison, WI: ISLS.
Dascalu, M., Trausan-Matu, S., McNamara, D. S., & Dessus, P. (2015). ReaderBench – Automated Evaluation of
Collaboration based on Cohesion and Dialogism. International Journal of Computer-Supported Collaborative
Learning, 10(4), 395–423. doi: 10.1007/s11412-015-9226-y
Dowell, N., Skrypnyk, O., Joksimovic, S., Graesser, A., Dawson, S., Gasevic, D., de Vries, P., Hennis, T., &
Kovanovic, V. (2015). Modeling Learners’ Social Centrality and Performance through Language and Discourse.
In 8th Int. Conf. on Educational Data Mining (EDM 2015) (pp. 130–137). Madrid, Spain: International
Educational Data Mining Society.
Freeman, L. (1977). A set of measures of centrality based on betweenness. Sociometry, 40(1), 35–41.
He, J., Bailey, J., Rubinstein, B. I. P., & Zhang, R. (2015). Identifying at-risk students in massive open online courses.
In Twenty-Ninth AAAI Conf. on Artificial Intelligence (pp. 1749–1755). Austin, TX: AAAI Press.
Johnson, D. W., & Johnson, R. T. (1990). Cooperative learning and achievement. In S. Sharan (Ed.), Cooperative
learning: Theory and research (pp. 23–37). New York, NY: Praeger.
Koller, D., Ng, A., Do, C., & Chen, Z. (2013). Retention and Intention in Massive Open Online Courses. EDUCAUSE
Review, 48(3), 62–63.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: the Latent Semantic Analysis theory of
acquisition, induction and representation of knowledge. Psychological Review, 104(2), 211–240.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical Natural Language Processing. Cambridge, MA:
MIT Press.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39–41.
Ramesh, A., Goldwasser, D., Huang, B., Daume, H., & Getoor, L. (2014). Understanding MOOC Discussion Forums
using Seeded LDA. In 9th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 28–33).
Baltimore, MD: ACL.
Rebedea, T. (2012). Computer-Based Support and Feedback for Collaborative Chat Conversations and Discussion
Forums. (Doctoral dissertation), University Politehnica of Bucharest, Bucharest, Romania.
Sabidussi, G. (1966). The centrality index of a graph. Psychometrika, 31, 581–603.
Seaton, D. T., Bergner, Y., Chuang, I., Mitros, P., & Pritchard, D. E. (2014). Who does what in a massive open online
course? Communications of the ACM, 57(4), 58–65.
Shannon, C. E. (1948). A Mathematical Theory of Communication. The Bell System Technical Journal, 27, 379–423
& 623–656.
Trausan-Matu, S., Dascalu, M., & Dessus, P. (2012). Textual complexity and discourse structure in Computer-
Supported Collaborative Learning. In S. A. Cerri, W. J. Clancey, G. Papadourakis & K. Panourgia (Eds.), 11th
Int. Conf. on Intelligent Tutoring Systems (ITS 2012) (pp. 352–357). Chania, Greece: Springer.
Trausan-Matu, S., Stahl, G., & Sarmiento, J. (2007). Supporting polyphonic collaborative learning. E-service Journal,
Indiana University Press, 6(1), 58–74.
Vygotsky, L. S. (1978). Mind in society. Cambridge, MA: Harvard University Press.
Wang, Y. (2014). MOOC Learner Motivation and Learning Pattern Discovery. In J. Stamper, Z. Pardos, M. Mavrikis
& B. M. McLaren (Eds.), 7th Int. Conf. on Educational Data Mining (pp. 452–454). London, UK.
Wen, M., Yang, D., & Rose, C. P. (2014a). Linguistic Reflections of Student Engagement in Massive Open Online
Courses. In Int. Conf. on Weblogs and Social Media.
Wen, M., Yang, D., & Rose, C. P. (2014b). Sentiment Analysis in MOOC Discussion Forums: What does it tell us?
In J. Stamper, Z. Pardos, M. Mavrikis & B. M. McLaren (Eds.), 7th Int. Conf. on Educational Data Mining (pp.
130–137). London, UK.
Acknowledgments
This research was partially supported by the FP7 2008-212578 LTfLL project, by the 644187 EC H2020 Realising an
Applied Gaming Eco-system (RAGE) project, by University Politehnica of Bucharest through the “Excellence
Research Grants” Program UPBGEX 12/26.09.2016, as well as by the NSF grant 1417997 to Arizona State
University.