Towards Automated Content Analysis of Discussion Transcripts: A Cognitive Presence Case


Vitomir Kovanović
School of Informatics
The University of Edinburgh
Edinburgh, UK

Srećko Joksimović
Moray House School of Education
The University of Edinburgh
Edinburgh, UK

Zak Waters
Queensland University of Technology
Brisbane, Australia

Dragan Gašević
Moray House School of Education and School of Informatics
The University of Edinburgh
Edinburgh, UK

Kirsty Kitto
Queensland University of Technology
Brisbane, Australia

Marek Hatala
School of Interactive Arts and Technology
Simon Fraser University
Burnaby, Canada

George Siemens
LINK Research Lab
University of Texas at Arlington
Arlington, USA
ABSTRACT

In this paper, we present the results of an exploratory study that examined the problem of automating content analysis of student online discussion transcripts. We looked at the problem of coding discussion transcripts for the levels of cognitive presence, one of the three main constructs in the Community of Inquiry (CoI) model of distance education. Using Coh-Metrix and LIWC features, together with a set of custom features developed to capture discussion context, we developed a random forest classification system that achieved 70.3% classification accuracy and 0.63 Cohen’s κ, which is significantly higher than the values reported in previous studies. Besides the improvement in classification accuracy, the developed system is also less sensitive to overfitting, as it uses only 205 classification features – around 100 times fewer than similar systems based on bag-of-words features. We also provide an overview of the classification features most indicative of the different phases of cognitive presence, which gives additional insight into the nature of the cognitive presence learning cycle. Overall, our results show the great potential of the proposed approach, with the added benefit of providing further characterization of the cognitive presence coding scheme.
Keywords: Community of Inquiry (CoI) model, content analysis, content analytics, online discussions, text classification
1. INTRODUCTION

Online discussions are commonly used in modern higher education, both for blended and fully online learning [42]. In distance
education, given the absence of face-to-face interactions, online
discussions represent an important component of the whole edu-
cational experience. This is especially important for the social-
constructivist pedagogies which emphasize the value of social con-
struction of knowledge through interactions and discussions among
a group of learners [3]. In this regard, the Community of Inquiry
(CoI) model [23,24] represents perhaps one of the best researched
and validated models of online and distance education, focused on
explaining important dimensions – also known as presences – that
shape students’ online learning experience.
The most commonly used approaches to the analysis of online
discussion transcripts are based on the quantitative content analysis
(QCA) [12,54,50,15]. According to Krippendorff [37], content analysis is “a research technique for making replicable and valid inferences from texts (or other meaningful matter) to the contexts of their use” [p. 18]. In the case of the study presented in this paper, the context is online learning environments. QCA is a well-defined research technique commonly used in social science research, and
it makes use of specifically designed coding schemes to analyze text
artifacts with respect to the defined research goals and objectives.
For instance, the CoI model defines a set of coding schemes which
are used by the educational researchers to assess the levels of three
CoI presences.
In the domain of educational research, QCA of student discussion data has mainly been used for retrospective research after courses are over, without an impact on the courses’ learning outcomes [53]. In the field of content analytics [36] – which focuses
on building analytical models based on the learning content includ-
ing student-produced content such as online discussion messages –
there have been some attempts to automate some of those coding
schemes. Most notable are the efforts of McKlin [44] and Corich
et al. [11] on automation of the CoI coding schemes, which served
as a starting point for our research in the area [35,62]. One of
the main challenges for automation of content analysis is the fact
that the most important constructs from the educational perspective
(e.g., student group learning progress, motivation, engagement, so-
cial climate) are latent constructs not explicitly present in the dis-
cussion transcripts. This means the assessment of these constructs
requires human interpretation and judgment.
This paper presents the results of a study that explored the use of
content analytics for automating content analysis of student online
discussions based on the CoI coding schemes. We focused on au-
tomation of the content analysis of cognitive presence, one of the
main constructs in the CoI model. By building upon the existing
work in the fields of text mining and text classification and our previous work in this area [35,62], we developed a random forest classifier which makes use of a novel set of classification features and provides a classification accuracy of 70.3% and Cohen’s κ of 0.63 in our cross-validation testing. In this paper, we describe the developed classifier and the adopted classification features. We also
report on the findings of the empirical evaluation of the classifier
and critically discuss the findings.
2.1 The Community of Inquiry (CoI) model
The Community of Inquiry (CoI) model is a widely researched
model that explains different dimensions of social learning in on-
line learning communities [23,24]. Central to the model are the
three constructs, also known as presences, which together provide
a comprehensive understanding of learning processes [23,24]:
1) Cognitive presence which is the central construct in the CoI
model and describes different phases of student knowledge con-
struction within a learning community [24].
2) Social presence captures different social relationships within a
learning community that have a significant impact on the success
and quality of the learning process [51].
3) Teaching presence explains the role of instructors during the
course delivery as well as their role in the course design and
preparation [4].
The focus of this study is on the analysis of cognitive presence,
which is defined by Garrison et al. [24] as “an extent to which the
participants in any particular configuration of a community of in-
quiry are able to construct meaning through sustained communication” [p. 11]. Cognitive presence is grounded in the constructivist
views of Dewey [14] and is “the element in this [CoI] model that
is most basic to success in higher education” [23, p89]. Cogni-
tive presence is operationalized by the practical inquiry model [24],
which defines the following four phases:
1) Triggering event: In this phase, an issue, dilemma or problem
is identified. In the case of a formal educational context, those
are often explicitly defined by the instructors; however, they can
also be initiated by other discussion participants [24].
2) Exploration: This phase is characterized by the transition be-
tween the private world of reflective learning and the shared
world of social construction of knowledge [24]. Questioning,
brainstorming and information exchange are the main activities
which characterize this phase [24].
3) Integration: In this phase, students move between reflection
and discourse. The phase is characterized by the synthesis of
the ideas generated in the exploration phase. The synthesis ulti-
mately leads to the construction of meaning [24]. From a teach-
ing perspective, this is the most difficult phase to detect from
the discussion transcripts, as the integration of ideas is often not
clearly identifiable.
4) Resolution: In this phase, students resolve the original prob-
lem or dilemma that started the learning cycle. In the formal
educational setting, this is typically achieved through vicarious hypothesis testing or consensus building within a learning
community [24].
The CoI model defines its own multi-dimensional content analy-
sis schemes [23,24] and a 34-item Likert-scale survey instrument [5]
which are used for the assessment of the three presences. The model
has gained considerable attention in the research community, resulting in a fairly large number of replication studies and empirical
validations (for an overview, see [25]), including studies of the
interaction dynamics between the three presences [26]. In general,
the model has been shown to be robust, and its coding scheme ex-
hibits sufficient levels of inter-rater reliability for it to be considered
a valid construct [25].
While the CoI model has proven to be a very useful model for the assessment of social distance learning, there are several practical issues that still remain open. First, the use of the CoI coding schemes requires a substantial amount of manual work, which is very time-consuming and requires trained coders. For example, to
code the dataset used in this study, two experienced coders spent
around 130 hours each to manually code 1,747 messages [27]. The
coding process started with the calibration of the use of the coding
scheme which was then followed by the independent coding, and
finally reconciliation of the coding disagreements.
One major consequence of the manual coding of messages is that the CoI model has been used mostly for research purposes and not for the real-time monitoring of students’ learning progress and guiding instructional interventions. This is not unique to the CoI model and is very common with most content analysis schemes used in
education. The lack of automated content analysis approaches has
been identified by Donnelly and Gardner [15] as one of the main rea-
sons why transcript analysis techniques have had almost zero impact
on educational practice. The development of the CoI survey instru-
ment [5] is one attempt to eliminate, or at least lessen, the need for the manual content analysis of discussion transcripts. Still, the
instrument is based on self-reported survey data, which makes it not well suited for the real-time monitoring and guidance of student learning.

In order to enable broader adoption of the CoI model, the
coding process needs to be automated and this is precisely the goal
of the current study. While this study focuses on automation of cod-
ing online discussion transcripts for the levels of cognitive presence,
a more general goal is to automate coding for all three presences, which would enable a more comprehensive view of social learning phenomena and the development of more sophisticated social
learning environments [60]. This in turn could be used by the in-
structors to inform their interventions leading to better achievement
of learning objectives. From the standpoint of self-regulated learn-
ing research [8] – a major theory in contemporary education – in
order to regulate their own learning effectively, learners need real-
time feedback, which is an “inherent catalyst” for all self-regulated
activities [8]. If learners are provided with timely feedback on their own learning and the learning of their peers, they will be in a better position to regulate their own learning activities.
2.2 Automating Cognitive Presence Analysis
Several studies have investigated automating content analysis us-
ing the cognitive presence coding scheme. A study by McKlin [44]
describes a system built using feed-forward, back-propagation ar-
tificial neural network that was trained on a single semester worth
of discussion messages (N=1,997). The classification features were the counts of words in each of the 182 word categories defined in the General Inquirer category model [52]. McKlin [44]
also used a binary indicator of whether a message is a reply to another
message, as triggering events are more likely to be the discussion
starters and thus not replies to other messages. Finally, McKlin [44]
defined custom categories of words and phrases, which are thought
to be indicative of the different phases of cognitive presence and
included counts of words in those categories as additional classification features. For example, the “indicative words” category contains
“compared to”, “I agree”, “that reminds me of”, and “thanks” as it is
hypothesized that integration messages would contain a larger number of these phrases in order to connect the message with the previously given information. Unfortunately, these additional coding categories are described only very briefly, and thus it is not possible to replicate them and evaluate their usability in future studies. McKlin’s
findings show that the classification system overgeneralized the exploration phase and undergeneralized the integration phase. Furthermore, given the very low frequency of messages in the resolution
phase (i.e., <1% and only 3 messages in total in their data set), the
neural network developed by McKlin simply ignored the resolution
category and never predicted the resolution phase for any message
in the corpus. Overall, they reported a Holsti’s Coefficient of Reliability [31] of 0.69 and a Cohen’s κ of 0.31, which shows some potential of the proposed approach, with much room for improvement in order to reach the reliability levels commonly found between two independent coders – usually a Cohen’s κ of at least 0.70 [29].
Following the work of McKlin [44], a study by Corich et al. [11] presented ACAT, a general classification framework, also based on word-count features, that can support coding schemes other than cognitive presence. In order to use ACAT, users are required to provide a set of labeled training examples, which are used
for training of classification models. Furthermore, as ACAT does
not specify a particular set of word categories that are used as classi-
fication features, users are required to provide definitions (i.e., cate-
gory name and list of words) that are used as classification features.
Interestingly, the ACAT system was also evaluated on the problem of coding cognitive presence in the CoI model. However,
instead of classifying each message to one of the four phases of
cognitive presence, Corich et al. [11] classified each sentence of each message into one of the four cognitive presence levels. This poses some
theoretical challenges as the CoI coding schemes are originally de-
signed to be used for message-level content analysis. The dataset
used by Corich et al. [11] consists of 484 sentences originating from 74 discussion messages, and they report a Holsti’s coefficient of reliability of 0.71 in their best test case. However, given that their report did not provide sufficient details about the classification scheme used, in terms of the specific indicators for each category of cognitive presence, nor did it discuss the types of features that were used for classification, it is hard to evaluate the significance of their results.
Besides the studies by McKlin [44] and Corich et al. [11], we
should also mention our previous work in this domain. A study
by Kovanović et al. [35] investigated the use of Support Vector Ma-
chines (SVMs) [59] classification for the automation of cognitive
presence coding using a bag-of-words approach based on N-gram and Part-of-Speech (POS) N-gram features. Using a 10-fold cross-validation, a Cohen’s κ of 0.41 was achieved – which is higher than the values reported in the previous studies [44,11].
Several challenges related to the classification of online discus-
sion messages based on cognitive presence were observed in our existing work [35]. First, the distribution of classes in the used dataset
(i.e., phases of cognitive presence) was uneven, which is in agree-
ment with the findings commonly reported in the literature [25].
This poses some challenges to the classification accuracy. This was
already seen in the McKlin [44] study whose classifier completely
ignored the resolution phase (as only three messages were coded
as being in resolution phase). Secondly, the use of bag-of-words
features (i.e., n-grams, POS n-grams, and back-off n-grams) cre-
ates a very large feature space (i.e., more than 20,000 features) rel-
ative to the number of classification instances (i.e., 1,747) which
poses a challenge of over-fitting. Next, the use of bag-of-words features makes the classification system highly domain-dependent, as
the space of bag-of-words features is defined based on the training
set. For instance, a classification system trained on an introductory programming course would likely have a bigram feature “java programming” which is highly specific to a particular domain and would
impede the performance of the classifier in other domains. Finally,
given that each message belongs to a discussion and represents a
part of the overall conversation, the context of the previous mes-
sages in the discussion thread is very important. For example, given
the structure and cyclic nature of the inquiry process, it is highly unlikely that a discussion would start with a resolution message, or that the first response to a triggering message will be an integration
message [27]. These “dependencies” between discussion messages
are not taken into account when each message is classified independently of other messages in the discussion.
In order to address the challenge of isolated classification of dis-
cussion messages, Waters et al. [62] developed a structured classi-
fication system using conditional random fields (CRFs) [38]. This
classifier makes a prediction for the whole sequence of messages within a discussion, taking into account the ordering of messages within
a discussion thread. Using a 10-fold cross-validation, the devel-
oped classifier achieved a Cohen’s κ of 0.48, which is significantly higher than the Cohen’s κ of 0.41 reported by [35], showing the promise of the structured classification approach. However, there are still a couple of unresolved issues which warrant further investigation. First of all, although the classification accuracy is improved, it is still far below the Cohen’s κ of 0.7 which is considered the norm for assessing the quality of coding in the CoI research community [29]. Secondly, CRFs are black-box classification methods [28]
that are hard to interpret, which limits their potential use for under-
standing how cognitive presence is captured in the discourse.
3.1 Data set
The dataset used in this study is the same dataset that was used in the studies by Kovanović et al. [35] and Waters et al. [62]. The data comes from a master’s-level, research-intensive course in software engineering offered through a fully online instructional condition at a Canadian open public university. The dataset consists
of six offerings of the course between 2008 and 2011, with a total of 81 students who produced 1,747 discussion messages (Table 1). On average, each offering of the course had 13–14 students (SD = 5.1) who produced on average 291 messages, albeit with a large variation in the number of messages per course offering (SD = 192.4). The whole dataset was coded by two expert coders for the four levels of cognitive presence, enabling a supervised learning approach. The inter-rater agreement was excellent (percent agreement = 98.1%, Cohen’s κ = 0.974), with a total of only 33 disagreements.
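The two agreement statistics reported above are standard and easy to reproduce. The following sketch is purely illustrative (it is not the authors' implementation): percent agreement is the observed-agreement term, while Cohen's κ corrects it for the agreement two raters would reach by chance.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters.

    Undefined (division by zero) in the degenerate case where both
    raters always assign one and the same single label."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, assuming the raters label independently.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)
```

For two identical label sequences (over more than one category), the function returns 1.0; κ shrinks toward 0 as agreement approaches the chance level.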
Table 2 shows the distribution of the four phases of cognitive presence. In addition to the four categories of cognitive presence, we included the category “other”, which is used for messages that did not exhibit signs of any phase of cognitive presence. The most frequent messages were exploration messages (39% of messages), while the least frequent were the resolution messages (6% of messages). This
large difference between the frequencies of the four phases was expected. It is consistent with the previous studies of cognitive presence [26], which found that a majority of students were not progressing to the later stages of integration and resolution.

Table 1: Course offerings statistics

Offering       Student count   Message count
Winter 2008    15              212
Fall 2008      22              633
Summer 2009    10              243
Fall 2009      7               63
Winter 2010    14              359
Winter 2011    13              237
Average (SD)   13.5 (5.1)      291.2 (192.4)
Total          81              1,747

Table 2: Distribution of cognitive presence phases

ID   Phase              Messages   (%)
0    Other              140        8.0%
1    Triggering Event   308        17.6%
2    Exploration        684        39.2%
3    Integration        508        29.1%
4    Resolution         107        6.1%
     Average (SD)       349.4 (245.7)   20.0% (10.0%)
     Total              1,747      100%

While
there are various interpretations for this pattern, including the va-
lidity of the model, the design and expectations of the courses –
i.e., not requiring students to move to those phases – seems to be
the most compelling reason, as shown by its growing acceptance
in the literature [25]. Psychologically, if students are going through
the four phases of the practical inquiry model that underlies the cog-
nitive presence construct, it does seem reasonable that students will
spend more time exploring and hypothesizing different solutions,
before they could come up with a final resolution [2,27]. More-
over, as discussions were designed to occur between the third and
the fifth week of the course, students did not typically move to the
resolution phase this early in the course. Specifically, the discus-
sions were organized to provide the students with opportunities to
discuss ideas that would inform the individual research projects that
they planned for the later stages of the course.
3.2 Feature Extraction
While the majority of the previous work related to text classi-
fication is based on lexical N-gram features (e.g., unigrams, bi-
grams, trigrams) and similar features (e.g., POS bigrams, depen-
dency triplets), we eventually decided not to include N-gram and
similar features described in the Kovanović et al. [35] study for sev-
eral reasons. First of all, the use of those features inflates the fea-
ture space, generating thousands of features even for small datasets.
This strongly increases the chances of over-fitting the training data.
Secondly, the use of those features is also very “dataset dependent”,
as data itself defines the classification space. Thus, it is hard to
define a fixed set of classification features in advance, as the par-
ticular choice of words in the training documents will define what
features are used for classification (i.e., what N-gram variables are
extracted). Finally and most importantly, given that N-grams and
other simple text mining features are not based on any existing the-
ory of human cognition related to the CoI model, it is hard to un-
derstand what they might theoretically mean. Given that our goal
is also to understand how cognitive presence is captured within
discourse, we focused our work on extracting features which are
strongly theory-driven and based on empirical studies. In total, we
extracted 205 classification features, which are described in the remainder of this subsection.
3.2.1 LIWC features
In this study, we used the LIWC (Linguistic Inquiry and Word
Count) tool [57], to extract a large number of word counts which
are indicative of different psychological processes (e.g., affective,
cognitive, social, perceptual). Our previous research [32] showed
that different linguistic features operationalized through the LIWC
word categories offer distinct proxies of cognitive presence.
In contrast to extracting N-grams, which produce a very large
number of independent features, LIWC provides us with exactly 93
different word counts, which are all based on extensive empirical research [cf. 58]. LIWC features essentially “merge” related – and
domain-independent – N-gram features together to produce more
meaningful classification features. We used the 2015 version of the
LIWC software package, which also provides four high-level aggre-
gate measures of i) analytical thinking, ii) social status, confidence,
and leadership, iii) authenticity, and iv) emotional tone.
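LIWC itself is a commercial, closed dictionary, but the general form of its features – per-category word rates normalized by message length – can be sketched as follows. The three toy categories below are hypothetical stand-ins for the 93 empirically derived LIWC 2015 categories, and the matching is simplified (real LIWC also supports word-stem wildcards).

```python
import re

# Toy word categories standing in for the LIWC dictionaries (hypothetical;
# the real LIWC 2015 dictionary defines 93 empirically derived categories).
CATEGORIES = {
    "cognitive": {"think", "know", "because", "consider"},
    "social":    {"we", "they", "talk", "share"},
    "affective": {"happy", "worry", "great", "hate"},
}

def liwc_style_counts(message):
    """Per-category word rates, as a percentage of all tokens in the
    message -- the normalization LIWC uses when reporting its counts."""
    tokens = re.findall(r"[a-z']+", message.lower())
    total = len(tokens) or 1  # avoid division by zero on empty messages
    return {cat: 100.0 * sum(t in words for t in tokens) / total
            for cat, words in CATEGORIES.items()}
```

For the message "We think this works because we share ideas", the sketch reports a cognitive-word rate of 25% (2 of 8 tokens) and a social-word rate of 37.5%.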
3.2.2 Coh-Metrix features
To extract classification features, we also used Coh-Metrix [30,45], a computational linguistics tool that provides 108
different metrics of text coherence (i.e., co-reference, referential,
causal, spatial, temporal, and structural cohesion), linguistic com-
plexity, text readability, and lexical category use. Coh-Metrix has been extensively used in a large number of studies to measure subtle differences in different forms of text and discourse and is currently
used by the Common Core initiative to analyze learning texts in K-
12 education [45].
Coh-Metrix has previously been used in the domain of social learning to measure student performance [16] and the development
of social ties [33,34] based on the language used in the discourse.
For example, a study by Dowell et al. [16] showed that character-
istics of the discourse – as measured by Coh-Metrix – were able
to account for 21% of the variability in the performance of active
MOOC students. Students performed significantly better when they engaged in exploratory-style discourse, with high levels of deep cohesion and the use of simple syntactic structures and abstract language. Given that the existing CoI coding schemes prescribe different indicators of important socio-cognitive processes in the discourse, Coh-Metrix provides a valuable set of metrics that can be easily extracted and used for automation of the CoI coding schemes.
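Coh-Metrix's actual indices are considerably more sophisticated (they rely on syntactic parsers, LSA, and psycholinguistic databases), but the flavor of a referential-cohesion measure can be conveyed with a crude, self-contained sketch that scores content-word overlap between adjacent sentences; the tiny stop-word list is an illustrative placeholder.

```python
import re

def referential_cohesion(text):
    """Crude referential-cohesion proxy: mean Jaccard overlap of
    content words between adjacent sentences. Coh-Metrix's real
    referential cohesion indices are far more elaborate."""
    stop = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it"}
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    bags = [{w for w in re.findall(r"[a-z]+", s.lower())} - stop
            for s in sents]
    if len(bags) < 2:
        return 0.0  # cohesion between sentences needs >= 2 sentences
    overlaps = [len(bags[i] & bags[i + 1]) / (len(bags[i] | bags[i + 1]) or 1)
                for i in range(len(bags) - 1)]
    return sum(overlaps) / len(overlaps)
```

A message that keeps referring to the same entities ("The model uses features. The features improve the model.") scores 0.5 here, while sentences with no shared content words score 0.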
3.2.3 Discussion context features
Drawing on the study by Waters et al. [62], we also focused on
incorporating more context information in our feature space. Thus,
we included all features (except unigrams) which were used in the
Waters et al. study. Those included:
Number of replies: An integer variable indicating the number
of replies a given message received.
Message depth: An integer variable showing the position of a message within a discussion.
Cosine similarity to previous/next message: The rationale be-
hind these features is to capture how much a message builds
on the previously presented information.
Start/end indicators: Simple 0/1 indicator variables showing
whether a message is first/last in the discussion.
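Assuming a discussion thread is represented simply as an ordered list of message texts (a simplification: true reply counts would require the reply tree, which this sketch omits), the remaining context features above can be sketched as:

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between bag-of-words term-count vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def context_features(thread):
    """Context features for each message in a thread (ordered list of
    texts): depth, start/end indicators, and similarity to the
    previous/next message."""
    feats = []
    for i, text in enumerate(thread):
        feats.append({
            "depth": i,  # position of the message within the discussion
            "is_first": int(i == 0),
            "is_last": int(i == len(thread) - 1),
            "sim_prev": cosine(text, thread[i - 1]) if i > 0 else 0.0,
            "sim_next": (cosine(text, thread[i + 1])
                         if i < len(thread) - 1 else 0.0),
        })
    return feats
```

A message that closely paraphrases its predecessor gets a `sim_prev` near 1.0, while a topic shift pushes it toward 0 – exactly the "builds on previous information" signal the cosine features are meant to capture.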
As the CoI model – from the perspective of educational psychology
– is a process model [25], students’ cognitive presence is viewed as
being developed over time through discourse and reflection. There-
fore, in order to reach higher levels of cognitive presence, students need to either: i) construct knowledge in the shared world through
the exchange of a certain number of discussion messages, or ii) construct knowledge in their own private world of reflective learning. Given the social-constructivist view of learning in the CoI
model, we can expect that the distribution of messages exhibiting
the characteristics of the different phases of cognitive presence will
tend to change over time, as the students progress through those
phases. Thus, we can expect that triggering and exploration mes-
sages will be more frequent in the early stages of the discussions,
while integration and resolution messages will be more common in
the later stages.
3.2.4 LSA similarity
Messages belonging to different phases of cognitive presence are characterized by various socio-cognitive processes [24]. The trig-
gering phase introduces a certain topic in a tentative form, presenting concepts that might not be completely developed, while the
exploration phase further elaborates on various approaches to the
inquiry initiated in the triggering phase. More precisely, the explo-
ration phase introduces new ideas, divergent from the community,
or even several contrasting topics within the same message [49].
On the other hand, the integration phase assumes a continuous pro-
cess of reflection and integration, which leads to the construction
of meaning from the introduced ideas [24]. Finally, the resolu-
tion phase presents explicit guidelines for applying knowledge con-
structed through the inquiry process [24,49]. Based on these in-
sights, we assumed that information presented in the various stages
of the learning process might have an important influence on mes-
sage comprehension. Still, given the differences among the learners
and their learning habits, we did not expect this to be manifested as
a general rule, but more as a slight tendency which would be useful
in combination with the other classification features.
Following the approach suggested by Foltz et al. [20], we used
LSA with the sentence as a unit of analysis to define a single vari-
able lsa.similarity, which represents the average sentence sim-
ilarity (i.e., coherence) within a message. As LSA determines the
coherence based on the semantic relatedness between terms (i.e.,
terms that tend to occur in a similar context) [13], we first had to
define a semantic space in which the similarity estimates are given.
Bearing in mind that different discussions might relate to different concepts, we decided to create a separate semantic space for
each discussion. We identified the most important concepts from the first message in a discussion with the semantic annotation tool TAGME [19], and then each identified concept was linked to an
appropriate Wikipedia page from which we extracted information
about that concept [19]. Given that previous studies [55,22] showed
that Wikipedia can be used for estimation of semantic similarity be-
tween different concepts, we used information from the extracted
pages to construct the semantic space on which LSA similarity of
the concepts is calculated.
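A minimal sketch of this procedure, assuming NumPy is available and substituting a toy background corpus for the Wikipedia pages retrieved via TAGME: build a term-document matrix over the background documents, take a truncated SVD to obtain latent term vectors, embed each sentence as the average of its term vectors, and average the cosine similarity of consecutive sentences within the message.

```python
import re
import numpy as np

def lsa_message_coherence(sentences, space_docs, k=2):
    """Average cosine similarity between consecutive sentences of a
    message, computed in a k-dimensional LSA space built from
    `space_docs` (in the study: text of Wikipedia pages for the
    TAGME-detected concepts; here: any small background corpus)."""
    vocab = sorted({w for d in space_docs
                    for w in re.findall(r"[a-z]+", d.lower())})
    index = {w: i for i, w in enumerate(vocab)}
    # Term-document matrix of the background semantic space.
    tdm = np.zeros((len(vocab), len(space_docs)))
    for j, d in enumerate(space_docs):
        for w in re.findall(r"[a-z]+", d.lower()):
            tdm[index[w], j] += 1
    # Truncated SVD yields the latent term space.
    u, s, _ = np.linalg.svd(tdm, full_matrices=False)
    terms_k = u[:, :k] * s[:k]  # term vectors, scaled by singular values

    def embed(sentence):
        words = [w for w in re.findall(r"[a-z]+", sentence.lower())
                 if w in index]
        if not words:
            return np.zeros(k)
        return terms_k[[index[w] for w in words]].mean(axis=0)

    vecs = [embed(s_) for s_ in sentences]
    sims = []
    for a, b in zip(vecs, vecs[1:]):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sims.append(float(a @ b / denom) if denom else 0.0)
    return sum(sims) / len(sims) if sims else 0.0
```

Two sentences drawing on the same latent topic in the background space score close to 1, which is the "average sentence similarity within a message" that `lsa.similarity` encodes.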
3.2.5 Number of named entities
Based on the work described in [47] and our previous study [35],
we hypothesized that messages belonging to the different phases of cognitive presence would contain different counts of named entities
(e.g., named objects such as people, organizations, and geographi-
cal locations). The basis for this is taken from the definition of the
cognitive presence construct [24]. Exploration messages are char-
acterized by the brainstorming and exploration of new ideas, and
thus, those messages are expected to contain more named entities
than integration and resolution messages. Given the subject of the
course in which the data for this study were collected, we extracted
from each message a number of entities that are related to the com-
puter science category of Wikipedia by using the DBPedia Spotlight
annotation tool [46].
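As a toy illustration of this feature (not the actual DBpedia Spotlight pipeline), a gazetteer-based count against a small, entirely hypothetical list of computer-science concept labels might look like this:

```python
# Minimal stand-in for the entity-count feature: count occurrences of known
# concept labels in a message. The gazetteer below is a hypothetical example;
# the study resolved entities against Wikipedia via DBpedia Spotlight.
import re

CS_CONCEPTS = {"neural network", "operating system", "database", "compiler"}  # hypothetical

def count_entities(message, gazetteer=CS_CONCEPTS):
    text = message.lower()
    return sum(len(re.findall(re.escape(label), text)) for label in gazetteer)

n = count_entities("A compiler translates code; a database stores it. The compiler matters.")
# counts: "compiler" x2 + "database" x1 -> 3
```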
3.3 Data preprocessing
As the first step in our analysis, we addressed the problem of the
unequal number of messages across the five classification categories
(i.e., the four phases of cognitive presence and “other”). Class
imbalance can have very negative effects on the results of
classification analyses [56]. Generally speaking, there are two pos-
sible ways of addressing this problem [10]: i) cost-sensitive clas-
sification, in which different penalties are assigned for misclassifi-
cation of instances from different categories (higher penalties for
smaller classes), and thus forcing the algorithm to put more em-
phasis on properly recognizing smaller classes; and ii) resampling
methods, either by oversampling smaller classes, undersampling
large classes, or through a combination of these two approaches.
Given that cost-sensitive classification is typically used for two-class
problems (“positive” vs. “negative”) in which correctly classifying
one of the classes is the primary goal (e.g., patients
with a disease, fraudulent banking transactions), it makes sense there to
assign different misclassification costs, since correctly identifying the
“negative” class is less important. However, in our case, we are equally
interested in all five classes (four cognitive presence categories and
the other messages), as they represent different phases in student
learning cycles and it is not immediately clear whether misclassi-
fication of resolution messages is “worse” than misclassification of
triggering event messages. Thus, in our study, we used resampling
techniques and in particular a very popular SMOTE algorithm [9],
which is a hybrid approach that combines oversampling the minor-
ity class with undersampling of the majority class.
One interesting property of SMOTE is that instead of simply re-
sampling minority class instances – which would generate exact
copies of the existing data points – it generates new synthetic in-
stances which are “similar” to the existing instances but not exactly
the same. In an n-dimensional feature space, for every data point
X = {f1, f2, ..., fn} of the class Ci that is selected for resampling,
SMOTE:
1) Finds the K (in our case five) nearest neighboring instances from
the class Ci. As the distances between the original Ci data points
are known in advance, the lists of K nearest neighbors for all
instances in the Ci class are calculated and stored in an N × K ma-
trix (where N is the number of data points in the Ci class).
2) Randomly picks one of the identified neighbors (Y).
3) Generates a new data point Z as:
Z = X + rand(0, 1) × (Y − X)
where rand(0, 1) is a function returning a random number
between 0 and 1.
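The three steps above can be sketched in a few lines of Python. This is an illustration assuming plain Euclidean distances over minority-class points only; the study itself used Weka's SMOTE implementation.

```python
# Sketch of the SMOTE synthesis step: each synthetic point lies on the
# segment between a minority instance X and one of its k nearest minority
# neighbours Y, i.e. Z = X + rand(0, 1) * (Y - X).
import numpy as np

def smote_samples(X_min, n_new, k=5, rng=None):
    rng = rng or np.random.default_rng(0)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(d, axis=1)[:, :k]  # the N x K neighbour index matrix
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # pick a minority point X
        j = rng.choice(neighbours[i])        # pick one of its neighbours Y
        gap = rng.random()                   # rand(0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_samples(minority, n_new=3, k=2)
```

Because each synthetic point is a convex combination of two existing minority points, it always falls inside the minority class's convex hull.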
Figure 1 shows the results of applying SMOTE to our dataset.
As our original dataset consists of 1,747 messages, the class distri-
bution would be uniform if each of the classes contained approxi-
mately 350 messages (i.e., 1,747/5 ≈ 350). Thus, we first used the
SMOTE oversampling procedure explained previously to generate an
additional 210, 42, and 243 instances of the “Other”, “Triggering”, and
“Resolution” classes, respectively, increasing each of these three
classes to 350 messages. We then undersampled the “Exploration” and
“Integration” categories, removing 334 “Exploration” and 158
“Integration” messages, so that these two classes also contained 350
messages each. Overall, after applying SMOTE, the new dataset consists of
1,750 messages, with each of the five categories of messages repre-
sented by exactly 350 messages.
Besides compensating for the class imbalance problem, we also re-
moved the two duplicate features that were provided by both LIWC
and Coh-Metrix: i) the total number of words in a message, and
ii) the average number of words in a sentence. We decided to re-
move the LIWC values and use only the ones provided by Coh-Metrix.
Figure 1: SMOTE preprocessing for class balancing. Dark blue
– original instances which are preserved, light blue – synthetic
instances, red – original instances which are removed.
The primary reason for using the Coh-Metrix features is consistency:
there are small differences in how the two systems handle corner
cases (e.g., hyphenated words, punctuation marks), and given that
Coh-Metrix provides an additional set of metrics (e.g., number of
sentences, number of paragraphs), we wanted consistent
calculations for all of the included metrics.
3.4 Model Selection and Evaluation
To build our classifier, we used random forests [7], a state-of-the-art
tree-based classification technique. A large comparative analysis
of 179 general-purpose (i.e., not domain-specific, offline, and un-
structured) classification algorithms on 121 different datasets used
in previously published studies by Fernández-Delgado et al. [18]
found that random forests were the top-performing classification al-
gorithm, matched only by Gaussian kernel SVMs. Random forests
are an ensemble tree-based method that combines bagging (bootstrap
aggregating) with the random-subspace idea to create a robust
classification system that has low variance without increased
bias [18]. Random forests work by creating a large number of trees,
with the final prediction decided by a majority voting
scheme. Each tree is constructed on a different bootstrap sample
(a sub-sample of the same size, drawn with replacement) and evaluated on
data points that did not enter the bootstrap sample (in general, around
one third of the training dataset). In addition, each tree does
not use the complete feature set, but rather a random selection of N
attributes (i.e., a subspace), which is then used for growing the in-
dividual tree without any pruning. Random forests are a widely used
technique that can handle large datasets with thousands of features.
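The ingredients just described – bootstrap-sampled trees, a random feature subspace per split, majority voting, and out-of-bag evaluation – can be illustrated with scikit-learn. The study itself used the randomForest R package; the synthetic dataset and parameter values below are assumptions for illustration only.

```python
# Illustrative random forest with an out-of-bag (OOB) error estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
forest = RandomForestClassifier(
    n_estimators=200,   # number of trees (ntree in the R package)
    max_features=4,     # features tried at each split (mtry)
    oob_score=True,     # evaluate each tree on its out-of-bag points
    random_state=0,
).fit(X, y)
oob_error = 1 - forest.oob_score_  # misclassification rate on out-of-bag data
```

The OOB error gives a nearly free generalization estimate, because each tree is scored only on the roughly one third of points absent from its bootstrap sample.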
It is important to note that random forests can also be used to
measure the importance of individual classification features. While the
importance of individual classification features can be calculated in
many different ways [41], one popular measure is Mean Decrease
Gini (MDG), which is based on the reduction in the Gini impurity mea-
sure. Generally speaking, the Gini impurity index measures to what
extent the data points in a given tree node belong to the same class (i.e.,
how “clean” the node is). For every internal (split) node we can
measure the decrease in Gini impurity, which shows how useful a
given tree node is for separating the data (i.e., how much it reduces
the impurity of the resulting groups of data). For random forests, the
MDG measure for a feature Xj is calculated as the mean decrease in
Gini impurity over all tree nodes where the feature Xj is used.
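The quantities behind MDG can be sketched in a few lines of Python; this is an illustration of the definitions above, not part of the study's pipeline.

```python
# Gini impurity of a node, and the impurity decrease achieved by one split.
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(parent, left, right):
    """Weighted reduction in impurity from splitting `parent` into two children."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# A perfectly separating split removes all impurity:
parent = np.array([0, 0, 0, 1, 1, 1])
drop = gini_decrease(parent, parent[:3], parent[3:])
# gini(parent) = 0.5 and both children are pure, so the decrease is 0.5
```

MDG for a feature then averages such decreases over every forest node that splits on that feature.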
As there are two parameters used for the configuration of random
forests (i.e., ntree – the number of trees constructed, and mtry – the
number of randomly selected features), we used cross-validation
to select the optimal random forest parameters. As the performance
of random forests typically stabilizes after a certain number of trees
are built, we decided to build a large ensemble of 1,000 trees to
make sure that convergence was reached. Thus, we focused on se-
lecting the optimal number of features used in every tree (i.e., the mtry
parameter). We used a 10-fold cross validation and repeated it 10
Figure 2: Random forest parameter tuning results.
times in order to reduce variability and get more accurate estimates
of cross validated performance. In each run of the cross validation,
we examined 20 different values for the mtry parameter: {2, 12, 23,
34, 44, 55, 66, 76, 87, 98, 108, 119, 130, 140, 151, 162, 172, 183,
194, 205}. The exact set of these values was obtained using the
var_seq function from R’s caret package.
Before training and evaluating our classification models, we split the
data into 75% for model training and 25% for testing. We used strat-
ified sampling, so that the class distribution in both sub-samples is the
same. We selected the best mtry value using the 10 repetitions of
the 10-fold cross validation and then reported the classification ac-
curacy of the best performing model on the testing data.
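A scaled-down sketch of this protocol – a stratified 75/25 split followed by repeated cross-validation over candidate mtry values (max_features in scikit-learn) – might look as follows. The dataset, grid, and fold counts are reduced illustrative assumptions, not the study's actual settings.

```python
# Stratified hold-out split plus repeated CV tuning of the mtry parameter.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     train_test_split)

X, y = make_classification(n_samples=400, n_features=30, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)  # stratified 75/25 split

search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_features": [2, 5, 10, 20]},       # candidate mtry values
    cv=RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0),
).fit(X_tr, y_tr)
test_accuracy = search.score(X_te, y_te)  # accuracy of the best model on hold-out data
```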
3.5 Implementation
We implemented our classifier in the R and Java programming
languages using several software packages:
– for feature extraction, we used the Coh-Metrix [45,30] and LIWC
2015 [58] software packages;
– for developing the random forest classifier, we used the randomForest
R package [40];
– for running repeated cross validation and aggregating model per-
formance, we used the caret R package [21];
– for running the SMOTE algorithm, we used the Weka [63] Java
package; and
– for the calculation of the LSA similarity measure, we used the Text Min-
ing Library for LSA (TML)1.
The complete dataset for the study and the source code of the implementation
are publicly available at the
_classification repository.
3.6 Limitations
The major limitations of our approach relate to the size of
our dataset. Although we have six course offerings, they are all
from the same course at a single university, and together with the
particular details of the adopted pedagogical and instructional approach,
this might have an effect on the generalizability of our
classification model. Thus, in our future work, we plan to test the
generalization power of our classifier on a different dataset, which
would preferably also account for other important confounding vari-
ables recognized in research of the CoI model such as subject do-
main [6], level of education (i.e., undergraduate vs. graduate) [26],
and mode of instruction (blended vs. fully online vs. MOOC) [61].
4.1 Model training and evaluation
Figure 2 shows the results of our model selection and evaluation
procedure. The best classification accuracy of 0.72 (SD = 0.04)
and 0.65 Cohen’s κ (SD = 0.05) was obtained with an mtry value of
12, which means that each decision tree takes into account only
Table 3: Random forest parameter tuning results
            mtry   Accuracy      Kappa
Min         194    0.68 (0.04)   0.59 (0.04)
Max         12     0.72 (0.04)   0.65 (0.05)
Difference         0.04          0.06
Figure 3: Best random forest configuration performance.
12 out of the 205 features. The difference between the best- and worst-
performing configurations was 0.06 Cohen’s κ (Table 3), which sug-
gests that parameter optimization plays an important role in the final
classifier performance. Looking at the best performing configura-
tion (Figure 3), we can see that the use of 1,000 trees in the ensem-
ble resulted in reasonably stable error rates, with an average out-of-
bag (OOB) error rate of 0.29 (i.e., the average misclassification rate
for data points in the cases when they were not used in the bootstrap
samples). As expected, the highest error rates were associated with
the undersampled classes (i.e., exploration and integration) and the
smallest with the classes that were most heavily oversampled (i.e.,
resolution and “other”).
Following the model building, we evaluated its performance on
the hold-out 25% of the data. Our random forest classifier obtained
70.3% classification accuracy (95% CI [0.66, 0.75]) and 0.63 Co-
hen’s κ, which are significant improvements over the 0.41 and 0.48
reported in the Kovanović et al. [35] and Waters et al. [62] studies, re-
spectively. Table 4 shows the confusion matrix obtained on the test-
ing dataset. We can see that the most significant misclassifications
are between exploration and integration messages, which are the hard-
est to distinguish. This was already observed in [62], where most
of the misclassifications were related to exploration and integration
messages.
4.2 Variable importance analysis
Figure 4 shows the variable importance measures for all 205
classification features. The median MDG score was 4.43, with
most of the features having smaller MDG scores and only a few fea-
tures having very high MDG scores. Table 5 shows the values of the top
20 variables based on their MDG scores and their average values in
each class (i.e., cognitive presence phase). We can see that the most
important variable was cm.DESWC, i.e., the number of words in
a message; that is, the longer the message, the more likely it
was to be in the later phases of the cognitive presence
cycle. Also, the number of paragraphs, number of sentences, and
Table 4: Confusion matrix for the best performing model
Actual       Other  Triggering  Explorat.  Integrat.  Resolut.
Other          79        2          2          2          2
Triggering      5       67          9          6          0
Exploration     9       15         35         27          1
Integration     2        2         23         44         16
Resolution      0        0          4          2         81
Figure 4: Variable importance by Mean Decrease Gini measure.
Blue line separates top twenty features.
average sentence length showed similar trends, with higher values
being associated with the later phases of cognitive presence.
The most important Coh-Metrix features were related to lexical
diversity of the student vocabulary with the highest lexical diver-
sity being displayed by “other” messages. Standard deviation of
the number of syllables – which is an indicator of the use of words
of different lengths – had the strongest association with the trig-
gering event phase. In contrast, the givenness (i.e., how much of
the information in text is previously given) had the highest associ-
ation with the resolution phase messages. Finally, the low Flesch-
Kincaid Grade level readability score and the low overlap between
verbs used had the strongest association with “other” messages (i.e.,
messages without traces of cognitive presence).
The most important LIWC features were i) the number of ques-
tion marks used, which was strongly associated with the trigger-
ing event phase, ii) the number of first person pronouns, which was
highly associated with the other (i.e., non-cognitive presence) mes-
sages, and iii) the use of money-related words, which was mostly associ-
ated with the integration and resolution phases.
Message context features also scored high, with message depth
being higher for the later stages of cognitive presence, and highest
for “other” messages. A similar trend was observed for similarity
with the previous message, which was highest for the integration
and resolution messages and lowest for the triggering event mes-
sages. In contrast, similarity with the next message and number of
replies were highest for triggering events and lowest for the “other”
messages. It is interesting to note that both LSA similarity and the
number of named entities obtained high MDG scores. The number
of named entities was the second most important feature and was
highly associated with the later stages of the cognitive presence cy-
cle. A similar trend was also observed for LSA similarity; however,
its importance was much lower.
Based on the testing results of the developed classifier, we can see
that the use of the LIWC and Coh-Metrix features, together with
a small number of thread-based context features could be used to
provide reasonably high classification performance. The obtained
Cohen’s κ value of 0.63 falls in the range of “substantial” inter-
rater agreement [39], and is just slightly below the 0.70 Cohen’s κ
that the CoI research community commonly uses as a threshold
before coding results are considered valid.
We can also see that parameter tuning plays an important role
in optimizing the classifier performance, as the different classifier
configurations obtained results differing by up to 0.06 Cohen’s κ and
0.04 classification accuracy (Table 3).
Given that the same dataset was used as in the [35] and [62] stud-
ies, it is possible to directly compare the results of the classification
algorithms. The obtained Cohen’s κ is 0.15 and 0.22 higher than
the values reported by Waters et al. [62] and Kovanović et al. [35], re-
spectively. Furthermore, the resulting feature space is much smaller,
Table 5: Twenty most important variables and their mean scores for messages in different phases of cognitive presence
                                                                        Cognitive presence phase
#   Variable           Description                       MDG    Other           Triggering      Exploration     Integration      Resolution
1   cm.DESWC           Number of words                   32.91  55.41 (61.06)   80.91 (41.56)   117.71 (67.23)  183.30 (102.94)  280.68 (189.62)
2   ner.entity.cnt     Number of named entities          26.41  13.44 (15.36)   21.67 (10.55)   28.84 (16.93)   44.75 (24.85)    64.18 (32.54)
3   cm.LDTTRa          Lexical diversity, all words      21.98  0.85 (0.12)     0.77 (0.09)     0.71 (0.10)     0.65 (0.09)      0.58 (0.09)
4   message.depth      Position within discussion        19.09  2.39 (1.13)     1.00 (0.90)     1.84 (0.97)     1.87 (0.94)      2.00 (0.68)
5   cm.LDTTRc          Lexical diversity, content words  17.12  0.95 (0.06)     0.90 (0.06)     0.86 (0.08)     0.82 (0.07)      0.78 (0.07)
6   cm.LSAGN           Avg. givenness of each sentence   16.63  0.10 (0.07)     0.14 (0.06)     0.18 (0.07)     0.21 (0.06)      0.24 (0.06)
7   liwc.QMark         Number of question marks          16.59  0.27 (0.85)     1.84 (1.63)     0.92 (1.26)     0.58 (0.82)      0.38 (0.55)
8   message.sim.prev   Similarity with previous message  16.41  0.20 (0.17)     0.06 (0.13)     0.22 (0.21)     0.30 (0.24)      0.39 (0.19)
9   cm.LDVOCD          Lexical diversity, VOCD           15.43  12.92 (33.93)   28.99 (50.61)   53.57 (54.68)   83.47 (43.00)    97.16 (28.95)
10                     Number of money-related words     14.38  0.21 (0.69)     0.32 (0.74)     0.32 (0.75)     0.65 (1.12)      0.99 (1.04)
11  cm.DESPL           Avg. number of paragraphs sent.   12.47  4.26 (2.98)     6.37 (2.76)     7.49 (4.11)     10.17 (5.64)     14.05 (8.88)
12                     Similarity with next message      11.74  0.08 (0.14)     0.34 (0.40)     0.20 (0.22)     0.22 (0.24)      0.22 (0.23)
13  message.reply.cnt  Number of replies                 11.67  0.42 (0.67)     1.44 (1.89)     0.82 (1.70)     1.10 (2.66)      0.84 (1.24)
14  cm.DESSC           Sentence count                    11.67  4.28 (3.17)     6.36 (2.75)     7.49 (4.11)     10.17 (5.64)     14.29 (10.15)
15  lsa.similarity     Avg. LSA sim. between sentences   9.69   0.29 (0.27)     0.47 (0.23)     0.54 (0.23)     0.62 (0.20)      0.67 (0.17)
16  cm.DESSL           Avg. sentence length              9.60   11.88 (6.82)    13.62 (5.85)    16.69 (6.54)    19.36 (8.39)     21.73 (8.61)
17  cm.DESWLsyd        SD of word syllables count        8.92   0.98 (0.69)     1.33 (0.70)     0.98 (0.18)     0.97 (0.14)      0.97 (0.11)
18  liwc.i             Number of FPS pronouns            8.84   4.33 (3.53)     2.82 (2.06)     2.37 (1.94)     2.51 (1.65)      2.19 (1.23)
19  cm.RDFKGL          Flesch-Kincaid Grade Level        8.29   7.68 (4.28)     10.30 (3.50)    10.19 (3.11)    11.13 (3.46)     11.99 (3.37)
20  cm.SMCAUSwn        WordNet overlap between verbs     8.14   0.38 (0.25)     0.48 (0.20)     0.51 (0.13)     0.50 (0.10)      0.47 (0.06)
MDG - Mean decrease Gini impurity index, FPS - first person singular
with only 205 classification features in total, which is around 100 times
smaller than the number of bag-of-words features used by the Kovanović
et al. [35] classifier. This limits the chances of over-fitting the train-
ing data and also improves the performance of the classifier. This
is particularly important for the prospective use of the classifier in
different subject domains and pedagogical contexts.
Another important finding of this study is the list of important
classification features. We see that a small subset of features is
highly predictive of the different phases of cognitive presence, while
a majority of the features have a much lower predictive power (Fig-
ure 4). It is interesting to note that most of the discussion context
features (except the discussion start/end indicators) obtained high
importance scores, indicating the value of providing contextual in-
formation to the classification algorithm. In our future work, we will
focus on investigation of the additional features that would provide
even more contextualized information to the classifier.
It is important to note that the list of the most important vari-
ables is aligned with the conceptions of cognitive presence in the
existing CoI literature. If we look at the messages in the four phases
of cognitive presence, we can see that the higher levels of cognitive
presence are associated with messages that are i) generally longer,
with more sentences and paragraphs, ii) adopt more complex lan-
guage with generally longer sentences, iii) include more named en-
tities (e.g., names of different constructs, theories, people, compa-
nies, and geographical locations), iv) have lower lexical diversity,
v) occur later in the discussion, vi) have higher givenness of the
information, higher coherence, and higher verb overlap, vii) use
fewer question marks and first-person singular pronouns, viii) ex-
hibit higher similarity with the previous messages, and ix) more
frequently use money-related terms. Interestingly, the feature of the
highest importance is the simple word count, implying that the
longer the message, the more likely it is to be in the higher levels of the
cognitive presence cycle. This is also consistent with the findings of a pre-
vious study with the same dataset [32]. Joksimović et al. [32] found
that word count was the only LIWC 2007 variable that yielded sta-
tistically significant differences among all four cognitive presence
categories. This is not entirely surprising, as similar findings are
reported by essay grading studies, which found that the strongest pre-
dictor of the final essay grade is the length of the essay [48].
Looking at the non-cognitive or “other” messages, we can see
that they are characterized by large lexical diversity. This is
expected, as non-cognitive messages tend to be shorter (i.e., fewer
words, paragraphs, and sentences) and more informal. Higher lev-
els of lexical diversity are known to be associated with very short
texts or texts of low cohesion [1]. As “other” messages often are
not related to the course topic, they also tend to have a lower num-
ber of named entities, and lower givenness and verb overlap. Such
messages also tend to adopt a simpler language, as indicated by the
lowest scores on the Flesch-Kincaid grade level. “Other” messages
also tend to occur more frequently near the end of the discussion,
as indicated by their high values for the message.depth feature, and
are also more often related to the expression of personal information,
as indicated by the highest values for the use of first-person singu-
lar pronouns. This is expected as many discussions would typically
finish with students thanking each other for their contributions.
The contributions of this paper are twofold. First, we developed a clas-
sifier for coding student discussion transcripts for the levels of cog-
nitive presence with a much higher performance (0.63 Cohen’s κ)
than previously reported [35,62] in studies with the same
dataset. The performance of the developed classifier is in the range
generally considered to be a substantial level of agree-
ment [39]. We can see that the proposed approach, which is based
on the use of Coh-Metrix, LIWC, and discussion context features,
shows great promise for providing a fully automated system for
coding cognitive presence. The feature space that is used is also
much smaller, which limits the chances for over-fitting the data and
makes the developed classifier more generalizable to other contexts.
Secondly, we can see a particular subset of classification features
that are very highly predictive of the different phases of cognitive
presence. The most predictive feature is simple word count, which
implies that the longer the message is, the higher the chances are
for the message to display higher levels of cognitive presence. We
also identified several additional features which are also highly pre-
dictive of the cognitive presence phase, in particular the number
of named entities that are used (higher values are associated with
integration and resolution phase) and lexical diversity (lower val-
ues are associated with “other” and triggering messages). We also
see that the features that provide information on the discussion context
(i.e., similarity with the previous/next message, order in the discus-
sion thread, and number of replies) are highly valuable and provide
important information to the classification algorithm.
In our future work, we will focus on exploring additional fea-
tures for improving the classification performance [43]. The study
presented in this paper and our previous work [35] indicate that con-
textual features have a significant effect on classification accuracy
and we will examine additional features of this kind. As our results
reveal that the number of named entities has a significant effect on
classification accuracy, we will further explore similar features,
such as concept maps [64], which would provide additional infor-
mation about relationships between important concepts discussed
in text-based messages. Finally, we will look at the different data
preprocessing steps, including the use of the different algorithms
for resolving the class imbalance problem. As we also observed
that some of the students used direct quotes of other students’ mes-
sages, which can cause problems for many of the text metrics that
we used for classification, we will further examine the effects of
quotation on the final classification accuracy.
Finally, following the results presented in [17], we are explor-
ing ideas for the development of a system that would – besides class
labels – provide the associated probabilities. Such a classifier could
be used to develop a semi-automated classification system in which
only the part of the data for which probabilities are sufficiently high
would be automatically classified, and the rest would be manually
classified. This would be advantageous, as the desired combined
accuracy of automatic–manual coding could be reached by setting
a corresponding probability threshold. Even for high levels
of accuracy, a large majority of the data would be classified automati-
cally, eliminating a large part of the manual work. Besides its use
for coding discussion transcripts for research purposes, such a sys-
tem could be used, for example, to provide a real-time overview of
the progress of a group of students and to point out the students for
whom the progress estimates are uncertain.
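A minimal sketch of such a probability-thresholded routing scheme follows; all names, the synthetic data, and the 0.8 threshold are illustrative assumptions, not from the paper.

```python
# Route predictions: accept automatic labels only where the forest's class
# probability clears a threshold; send the rest to manual coding.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_new, y_tr, _ = train_test_split(X, y, test_size=0.5, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

proba = forest.predict_proba(X_new)
confidence = proba.max(axis=1)            # highest class probability per message
THRESHOLD = 0.8                           # tuned against the desired combined accuracy
auto_mask = confidence >= THRESHOLD       # classified automatically
manual_idx = np.where(~auto_mask)[0]      # indices routed to human coders
auto_labels = forest.classes_[proba.argmax(axis=1)][auto_mask]
```

Raising the threshold trades a larger manual-coding workload for higher expected accuracy of the automatically assigned labels.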
[1] Coh-Metrix 3.0 indices.
[2] Z. Akyol, J. B. Arbaugh, M. Cleveland-Innes, D. R. Garrison, P. Ice,
J. C. Richardson, and K. Swan. A response to the review of the com-
munity of inquiry framework. Journal of Distance Education, 23(2),
[3] T. Anderson and J. Dron. Three generations of distance education
pedagogy. The International Review of Research in Open and Distance
Learning, 12(3):80–97, 2010.
[4] T. Anderson, L. Rourke, D. R. Garrison, and W. Archer. Assessing
teaching presence in a computer conferencing context. Journal of
Asynchronous Learning Networks, 5:1–17, 2001.
[5] J. Arbaugh, M. Cleveland-Innes, S. R. Diaz, D. R. Garrison, P. Ice,
J. C. Richardson, and K. P. Swan. Developing a community of inquiry
instrument: Testing a measure of the community of inquiry framework
using a multi-institutional sample. The Internet and Higher Education,
11(3–4):133–136, 2008.
[6] J. B. Arbaugh, A. Bangert, and M. Cleveland-Innes. Subject matter
effects and the community of inquiry (coi) framework: An exploratory
study. The Internet and Higher Education, 13(1):37–44, 2010.
[7] L. Breiman. Random Forests. Machine Learning, 45(1):5–
32, Oct. 2001. ISSN 0885-6125, 1573-0565. doi: 10.1023/A:
1010933404324. URL
[8] D. L. Butler and P. H. Winne. Feedback and self-regulated learning:
A theoretical synthesis. Review of Educational Research, 65(3):245–
281, 1995.
[9] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote:
synthetic minority over-sampling technique. Journal of artificial in-
telligence research, pages 321–357, 2002.
[10] N. V. Chawla, N. Japkowicz, and A. Kotcz. Editorial: special issue
on learning from imbalanced data sets. ACM Sigkdd Explorations
Newsletter, 6(1):1–6, 2004.
[11] S. Corich, K. Hunt, and L. Hunt. Computerised content analysis for
measuring critical thinking within discussion forums. Journal of e-
Learning and Knowledge Society, 2(1), 2012.
[12] B. De Wever, T. Schellens, M. Valcke, and H. Van Keer. Content anal-
ysis schemes to analyze transcripts of online asynchronous discussion
groups: A review. Computers & Education, 46(1):6–28, 2006.
[13] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and
R. Harshman. Indexing by latent semantic analysis. Journal of the
American Society for Information Science, 41(6):391–407, 1990.
[14] J. Dewey. My pedagogical creed. School Journal, 54(3):77–80, 1897.
[15] R. Donnelly and J. Gardner. Content analysis of computer conferenc-
ing transcripts. Interactive Learning Environments, 19(4):303–315,
[16] N. Dowell, O. Skrypnyk, S. Joksimović, A. C. Graesser, S. Dawson,
D. Gašević, P. de Vries, T. Hennis, and V. Kovanović. Modeling Learn-
ers’ Social Centrality and Performance through Language and Dis-
course. In Submitted to the 8th International Conference on Educa-
tional Data Mining (EDM 2015), Madrid, Spain, June 2015.
[17] P. Dönmez, C. Rosé, K. Stegmann, A. Weinberger, and F. Fischer.
Supporting CSCL with automatic corpus analysis technology. In Pro-
ceedings of th 2005 conference on Computer support for collaborative
learning: learning 2005: the next 10 years!, page 125–134, 2005.
[18] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do
we need hundreds of classifiers to solve real world classification prob-
lems? The Journal of Machine Learning Research, 15(1):3133–3181,
[19] P. Ferragina and U. Scaiella. Fast and accurate annotation of short texts
with wikipedia pages. Software, IEEE, 29(1):70–75, 2012.
[20] P. W. Foltz, W. Kintsch, and T. K. Landauer. The measurement of
textual coherence with latent semantic analysis. Discourse Processes,
25:285–307, 1998.
[21] M. Kuhn, with contributions from J. Wing, S. Weston, A. Williams,
C. Keefer, A. Engelhardt, T. Cooper, Z. Mayer, B. Kenkel, the R Core
Team, M. Benesty, R. Lescarbeau, A. Ziem, L. Scrucca, Y. Tang, and
C. Candan. caret: Classification and Regression Training, 2015. R
package version 6.0-
[22] E. Gabrilovich and S. Markovitch. Computing Semantic Relatedness
Using Wikipedia-based Explicit Semantic Analysis. In Proceedings
of the 20th International Joint Conference on Artifical Intelligence,
IJCAI’07, pages 1606–1611, San Francisco, CA, USA, 2007. Morgan
Kaufmann Publishers Inc. URL
[23] D. R. Garrison, T. Anderson, and W. Archer. Critical inquiry in a text-
based environment: Computer conferencing in higher education. The
Internet and Higher Education, 2(2–3):87–105, 1999.
[24] D. R. Garrison, T. Anderson, and W. Archer. Critical thinking, cogni-
tive presence, and computer conferencing in distance education. Amer-
ican Journal of Distance Education, 15(1):7–23, 2001.
[25] D. R. Garrison, T. Anderson, and W. Archer. The first decade of the
community of inquiry framework: A retrospective. The Internet and
Higher Education, 13(1–2):5–9, 2010.
[26] R. Garrison, M. Cleveland-Innes, and T. S. Fung. Exploring causal
relationships among teaching, cognitive and social presence: Student
perceptions of the community of inquiry framework. The Internet and
Higher Education, 13(1–2):31–36, 2010.
[27] D. Gašević, O. Adesope, S. Joksimović, and V. Kovanović. Externally-
facilitated regulation scaffolding and role assignment to develop cog-
nitive presence in asynchronous online discussions. The Internet and
Higher Education, 24:53–65, Jan. 2015. doi: 10.1016/j.iheduc.2014.
[28] L. Getoor. Introduction to Statistical Relational Learning. MIT Press,
2007. ISBN 978-0-262-07288-5.
[29] P. Gorsky, A. Caspi, I. Blau, Y. Vine, and A. Billet. Toward a coi
population parameter: The impact of unit (sentence vs. message) on
the results of quantitative content analysis. The International Review
of Research in Open and Distributed Learning, 13(1):17–37, 2011.
[30] A. C. Graesser, D. S. McNamara, and J. M. Kulikowich. Coh-
Metrix Providing Multilevel Analyses of Text Characteristics. Edu-
cational Researcher, 40(5):223–234, June 2011. ISSN 0013-189X,
1935-102X. doi: 10.3102/0013189X11413260. URL http://edr.
[31] O. R. Holsti. Content analysis for the social sciences and humanities.
[32] S. Joksimović, D. Gašević, V. Kovanović, O. Adesope, and M. Hatala. Psychological characteristics in cognitive presence of communities of inquiry: A linguistic analysis of online discussions. The Internet and Higher Education, 22:1–10, July 2014. doi: 10.1016/j.iheduc.2014.03.001.
[33] S. Joksimović, N. Dowell, O. Skrypnyk, V. Kovanović, D. Gašević,
S. Dawson, and A. C. Graesser. Exploring the Accumulation of So-
cial Capital in cMOOC Through Language and Discourse. Journal of
Educational Data Mining, (submitted), 2015.
[34] S. Joksimović, V. Kovanović, J. Jovanović, A. Zouaq, D. Gašević, and
M. Hatala. What Do cMOOC Participants Talk About in Social Me-
dia?: A Topic Analysis of Discourse in a cMOOC. In Proceedings of
the Fifth International Conference on Learning Analytics And Knowl-
edge, LAK ’15, pages 156–165, New York, NY, USA, 2015. ACM.
ISBN 978-1-4503-3417-4. doi: 10.1145/2723576.2723609.
[35] V. Kovanović, S. Joksimović, D. Gašević, and M. Hatala. Automated
Content Analysis of Online Discussion Transcripts. In Proceedings
of the Workshops at the LAK 2014 Conference co-located with 4th In-
ternational Conference on Learning Analytics and Knowledge (LAK
2014), Indianapolis, IN, Mar. 2014.
[36] V. Kovanović, S. Joksimović, D. Gašević, M. Hatala, and G. Siemens.
Content Analytics: the definition, scope, and an overview of published
research. In C. Lang and G. Siemens, editors, Handbook of Learning
Analytics. 2015.
[37] K. H. Krippendorff. Content Analysis: An Introduction to Its Method-
ology. Sage Publications, 2003.
[38] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML 2001), 2001.
[39] J. R. Landis and G. G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, Mar. 1977.
[40] A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.
[41] G. Louppe, L. Wehenkel, A. Sutera, and P. Geurts. Understanding
variable importances in forests of randomized trees. In Advances in
Neural Information Processing Systems, pages 431–439, 2013.
[42] R. Luppicini. Review of computer mediated communication research
for education. Instructional Science, 35(2):141–185, 2007.
[43] E. Mayfield and C. Penstein-Rosé. Using feature construction to avoid
large feature spaces in text classification. In Proceedings of the 12th
annual conference on Genetic and evolutionary computation, pages 1299–1306, 2010.
[44] T. McKlin. Analyzing Cognitive Presence in Online Courses Using
an Artificial Neural Network. PhD thesis, Georgia State University,
College of Education, Atlanta, GA, United States, 2004.
[45] D. S. McNamara, A. C. Graesser, P. M. McCarthy, and Z. Cai. Auto-
mated Evaluation of Text and Discourse with Coh-Metrix. Cambridge
University Press, Mar. 2014.
[46] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia spot-
light: shedding light on the web of documents. In Proceedings of the
7th International Conference on Semantic Systems, pages 1–8, 2011.
[47] J. Mu, K. Stegmann, E. Mayfield, C. Rosé, and F. Fischer. The
ACODEA framework: Developing segmentation and classification
schemes for fully automatic analysis of online discussions. Interna-
tional Journal of Computer-Supported Collaborative Learning, 7(2):
285–305, 2012.
[48] E. B. Page and N. S. Petersen. The computer moves into essay grading:
Updating the ancient test. Phi Delta Kappan, 76(7):561, Mar. 1995.
[49] C. L. Park. Replicating the Use of a Cognitive Presence Measurement
Tool. Journal of Interactive Online Learning, 8:140–155, 2009.
[50] L. Rourke, T. Anderson, D. R. Garrison, and W. Archer. Methodolog-
ical issues in the content analysis of computer conference transcripts.
International Journal of Artificial Intelligence in Education (IJAIED),
12:8–22, 2001. Part II of the Special Issue on Analysing Educational
Dialogue Interaction (editor: Rachel Pilkington).
[51] L. Rourke, T. Anderson, D. R. Garrison, and W. Archer. Assessing so-
cial presence in asynchronous text-based computer conferencing. The
Journal of Distance Education / Revue de l’Éducation à Distance, 14
(2):50–71, 2007.
[52] P. J. Stone, D. C. Dunphy, and M. S. Smith. The general inquirer: A computer approach to content analysis. MIT Press, Cambridge, MA, 1966.
[53] J.-W. Strijbos. Assessment of (computer-supported) collaborative
learning. IEEE Transactions on Learning Technologies, 4(1):59–73, 2011.
[54] J.-W. Strijbos, R. L. Martens, F. J. Prins, and W. M. G. Jochems. Con-
tent analysis: what are they talking about? Computers & Education,
46(1):29–48, 2006.
[55] M. Strube and S. P. Ponzetto. WikiRelate! Computing Semantic Re-
latedness Using Wikipedia. In Proceedings of the 21st National Con-
ference on Artificial Intelligence - Volume 2, AAAI’06, pages 1419–
1424, Boston, Massachusetts, 2006. AAAI Press. ISBN 978-1-57735-
281-5.
[56] P.-N. Tan, V. Kumar, and M. Steinbach. Introduction to Data Mining.
Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA,
2005. ISBN 0-321-32136-7.
[57] Y. R. Tausczik and J. W. Pennebaker. The Psychological Meaning of
Words: LIWC and Computerized Text Analysis Methods. Journal of
Language and Social Psychology, 29(1):24–54, 2010.
[58] Y. R. Tausczik and J. W. Pennebaker. The Psychological Meaning
of Words: LIWC and Computerized Text Analysis Methods. Jour-
nal of Language and Social Psychology, 29(1):24–54, Mar. 2010.
ISSN 0261-927X, 1552-6526. doi: 10.1177/0261927X09351676.
[59] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1st edition, 1998.
[60] J. Vassileva. Toward social learning environments. IEEE Transactions
on Learning Technologies, 1(4):199–214, 2008.
[61] N. Vaughan and D. R. Garrison. Creating cognitive presence in a
blended faculty development community. The Internet and Higher
Education, 8(1):1–12, 2005.
[62] Z. Waters, V. Kovanović, K. Kitto, and D. Gašević. Structure mat-
ters: Adoption of structured classification approach in the context of
cognitive presence classification. In Proceedings of the 11th Asia In-
formation Retrieval Societies Conference, AIRS 2015, 2015.
[63] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Ma-
chine Learning Tools and Techniques. Morgan Kaufmann, 3rd edition, 2011.
[64] A. Zouaq and R. Nkambou. Building domain ontologies from text for
educational purposes. IEEE Transactions on Learning Technologies,
1(1):49–62, 2008.