Machine Learning and Natural Language Processing in Psychotherapy
Research: Alliance as Example Use Case
Simon B. Goldberg
University of Wisconsin–Madison
Nikolaos Flemotomos and Victor R. Martinez
University of Southern California
Michael J. Tanana and Patty B. Kuo
University of Utah
Brian T. Pace
University of Utah and Veterans Affairs Palo Alto Health
Care System, Palo Alto, California
Jennifer L. Villatte
University of Washington
Panayiotis G. Georgiou
University of Southern California
Jake Van Epps and Zac E. Imel
University of Utah
Shrikanth S. Narayanan
University of Southern California
David C. Atkins
University of Washington
Artificial intelligence generally and machine learning specifically have become deeply woven into
the lives and technologies of modern life. Machine learning is dramatically changing scientific
research and industry and may also hold promise for addressing limitations encountered in mental
health care and psychotherapy. The current paper introduces machine learning and natural language
processing as related methodologies that may prove valuable for automating the assessment of
meaningful aspects of treatment. Prediction of therapeutic alliance from session recordings is used
as a case in point. Recordings from 1,235 sessions of 386 clients seen by 40 therapists at a university
counseling center were processed using automatic speech recognition software. Machine learning
algorithms learned associations between client ratings of therapeutic alliance exclusively from
session linguistic content. Using a portion of the data to train the model, machine learning algorithms
modestly predicted alliance ratings from session content in an independent test set (Spearman’s ρ =
.15, p < .001). These results highlight the potential to harness natural language processing and
machine learning to predict a key psychotherapy process variable that is relatively distal from
linguistic content. Six practical suggestions for conducting psychotherapy research using machine
Editor’s Note. Sigal Zilcha-Mano served as the action editor for this
article.—DMK Jr.
Simon B. Goldberg, Department of Counseling Psychology, University
of Wisconsin–Madison; Nikolaos Flemotomos, Department of Electrical
Engineering, University of Southern California; Victor R. Martinez,
Department of Computer Science, University of Southern California; Mi-
chael J. Tanana, College of Social Work, University of Utah; Patty B. Kuo,
Department of Educational Psychology, University of Utah; Brian T. Pace,
Department of Educational Psychology, University of Utah, and Veterans
Affairs Palo Alto Health Care System, Palo Alto, California; Jennifer L.
Villatte, Department of Psychiatry and Behavioral Sciences, University of
Washington; Panayiotis G. Georgiou, Department of Electrical Engineering,
University of Southern California; Jake Van Epps, University of Utah Coun-
seling Center, University of Utah; Zac E. Imel, Department of Educational
Psychology, University of Utah; Shrikanth S. Narayanan, Department of
Electrical Engineering, University of Southern California; David C. Atkins,
Department of Psychiatry and Behavioral Sciences, University of Washington.
Michael J. Tanana, David C. Atkins, Shrikanth S. Narayanan, and Zac E.
Imel are cofounders with equity stake in a technology company,,
focused on tools to support training, supervision, and quality assurance of
psychotherapy and counseling. Shrikanth S. Narayanan is chief scientist
and co-founder with equity stake of Behavioral Signals, a technology
company focused on creating technologies for emotional and behavioral
machine intelligence. The remaining authors report no conflicts of interest.
Portions of the data presented in this article were reported at the North
American Society for Psychotherapy Research meeting in Park City, UT in
September 2018. Funding was provided by the National Institutes of
Health/National Institute on Alcohol Abuse and Alcoholism (Award R01/
AA018673). Support for this research was also provided by the University
of Wisconsin-Madison, Office of the Vice Chancellor for Research and
Graduate Education with funding from the Wisconsin Alumni Research
Correspondence concerning this article should be addressed to Simon B.
Goldberg, Department of Counseling Psychology, University of Wisconsin–
Madison, 335 Education Building, 1000 Bascom Mall, Madison, WI 53703.
This document is copyrighted by the American Psychological Association or one of its allied publishers.
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
Journal of Counseling Psychology
© 2020 American Psychological Association. 2020, Vol. 67, No. 4, 438–448
ISSN: 0022-0167
learning are presented along with several directions for future research. Questions of dissemination
and implementation may be particularly important to explore as machine learning improves in its
ability to automate assessment of psychotherapy process and outcome.
Public Significance Statement
Our study suggests that client-rated therapeutic alliance can be predicted using session content
through machine learning models, albeit modestly.
Keywords: machine learning, natural language processing, methodology, artificial intelligence,
therapeutic alliance
Supplemental materials:
New directions in science are launched by new tools much more than
by new concepts. The effect of a concept-driven revolution is to
explain old things in new ways. The effect of a tool driven revolution
is to discover new things that have to be explained. (Freeman Dyson,
1998, pp. 50–51)
Whether or not we know it, and certainly whether or not we like
it, machine learning (ML) is transforming modern life. From eerily
prescient Google search suggestions or Amazon product recom-
mendations to iPhones capable of understanding spoken language
(i.e., Siri), ML undergirds many of the most commonplace tech-
nologies of industrialized society. Manifestations range from the
seemingly benign or mundane to the perhaps more pernicious (e.g.,
targeted advertising). These contemporary conveniences are based
on a family of quantitative methods that are rapidly changing
science and technology and fall under the general umbrella of
artificial intelligence. The term artificial intelligence has been
defined as “the study of agents that receive percepts from the
environment and perform actions” (Russell & Norvig, 2016, p.
viii). Early work on artificial intelligence dates back to the 1950s
(e.g., Turing, 1950). ML combines pattern recognition and statis-
tical inference and plays an integral role within the inner workings
of artificial intelligence. ML can be defined as “the study of
computer algorithms capable of learning to improve their perfor-
mance of a task on the basis of their own previous experience”
(Mjolsness & DeCoste, 2001, p. 2051).
The ways that ML has impacted scientific research and industry are
hard to overstate (Jordan & Mitchell, 2015; Mjolsness & DeCoste,
2001; Stead, 2018). Evidence for the widespread relevance of ML
dates back several decades (e.g., detecting fraudulent credit card
transactions; Mitchell, 1997). More recent ML-based innovations
in medicine include detection of diabetic retinopathy (Gulshan et al.,
2016), informing cancer treatment decision making (Bibault, Gi-
raud, & Burgun, 2016), and predicting disease outbreak (Chen,
Hao, Hwang, Wang, & Wang, 2017). Innovations based on ML are
occurring in basic science as well (e.g., materials science; Butler,
Davies, Cartwright, Isayev, & Walsh, 2018). While not all ML
applications in science and technology have gone smoothly (e.g.,
Google Flu consistently overestimating flu occurrence; Lazer,
Kennedy, King, & Vespignani, 2014), the potential is unequivocal.
Efforts to apply ML within mental health care are also underway
(for a recent scoping review, see Shatte, Hutchinson, & Teague,
2019). Examples include the use of passive sensing to predict
psychosis (e.g., data collected from sensors built into modern
smartphones; Insel, 2017; Wang et al., 2016), analysis of speech
signals to infer symptoms of depression (France, Shiavi, Silverman,
Silverman, & Wilkes, 2000; Moore, Clements, Peifer, &
Weisser, 2008), prediction of treatment dropout from ecological
momentary assessment (Lutz et al., 2018), and the use of conver-
sational agents (i.e., computers) for clinical assessment and even
treatment (Miner, Milstein, & Hancock, 2017). While not incor-
porated in most settings, these ML-based innovations could dra-
matically change how mental health treatment and psychotherapy,
in particular, is provided. Importantly, once an ML algorithm has
been appropriately trained, it can be deployed at scale without
additional human judgment.
The Need for Innovation in Psychotherapy
Psychotherapy is in need of innovation. For one, mental health
care matters: mental health conditions are extremely common and
associated with enormous economic and social costs (Substance
Abuse and Mental Health Services Administration, 2014; Whiteford
et al., 2013). Psychotherapy is a frontline treatment approach
(Cuijpers et al., 2014), with efficacy similar to psychotropic med-
ications and with potentially longer lasting benefits and fewer side
effects (Berwian, Walter, Seifritz, & Huys, 2017). Yet despite
enormous investment in psychotherapy in terms of therapist and
client time and health care dollars (Olfson & Marcus, 2010), what
actually happens in psychotherapy is largely unknown (i.e., is
unobserved). Psychotherapy research remains heavily reliant on
retrospective client or therapist self-report (e.g., Elliott, Bohart,
Watson, & Murphy, 2018; Flückiger, Del Re, Wampold, & Horvath,
2018), limiting our understanding of actual therapist-client
interactions that drive treatment. We do know that treatment out-
comes vary widely, related to client (Lambert & Barley, 2001;
Thompson, Goldberg, & Nielsen, 2018), therapist (Baldwin &
Imel, 2013; Johns, Barkham, Kellett, & Saxon, 2019), relationship
(e.g., therapeutic alliance; Flückiger et al., 2018), and treatment-
specific factors.
One source of variability may be treatment quality. To date,
however, there are no established and routinely implemented
methods for quality control. The absence of quality control limits
clinical training, supervision, and the development of therapist
expertise (Tracey, Wampold, Lichtenberg, & Goodyear, 2014);
decreases the ability to demonstrate quality to payers (Fortney et
al., 2017); slows scientific progress in determining which treat-
ments are likely to succeed and why; and restricts efforts to
improve service delivery (Fairburn & Cooper, 2011). For these
reasons, psychotherapy researchers have developed numerous ob-
server rating systems to evaluate aspects of treatment quality (e.g.,
adherence and competence; Goldberg, Baldwin, et al., 2019;
Webb, DeRubeis, & Barber, 2010). Behavioral coding has been
invaluable in allowing researchers to understand what occurs in the
moment between therapists and clients that may contribute to
therapeutic change. However, human-coded rating systems are
labor intensive, expensive to implement, and not widely used in
community-based therapy (Fairburn & Cooper, 2011). Clients may
also be asked to provide evaluation of treatment quality (e.g.,
measures of satisfaction, therapeutic alliance; Flückiger et al.,
2018). When used regularly, these kinds of measures, while robust
predictors of outcome (Flückiger et al., 2018), increase burden on
clients and providers, are at risk for response set biases (e.g., social
desirability) and random error, and have known psychometric
limitations (e.g., ceiling effects; Tryon, Blackwell, & Hammel, 2008).
The New Tools of Psychotherapy Research
Recent methodological advances may be quickly changing our
ability to process the complex data of psychotherapy (Imel, Cap-
erton, Tanana, & Atkins, 2017) and could allow automated assess-
ment of treatment quality along with other outcome and process
variables. Two related innovations include the development of
natural language processing (NLP) and ML. As spoken language
forms a key component of most psychotherapies, the ability to
rapidly and reliably process speech (or text) data may allow
routine assessment of treatment quality and evaluation of numer-
ous other constructs of interest. Several recent proof-of-concept
examples have appeared in the literature, including using NLP and
ML to reliably code motivational interviewing treatment fidelity
(Atkins, Steyvers, Imel, & Smyth, 2014; Imel et al., in press), to
differentiate classes of psychotherapy (e.g., cognitive-behavioral
therapy and psychodynamic psychotherapy; Imel, Steyvers, &
Atkins, 2015), and to identify linguistic behaviors of effective
counselors in text-based crisis counseling (Althoff, Clark, & Les-
kovec, 2016).
The current study extends these efforts further by employing
NLP and ML to predict one of the most studied process variables
in psychotherapy: the therapeutic alliance (Flückiger et al., 2018).
This was examined within the context of a large, naturalistic
psychotherapy dataset drawn from a university counseling center.
Session recordings were available for 1,235 sessions of 386 clients
seen by 40 therapists. NLP and ML methods were used to predict
client-rated alliance from session recordings.
Alliance is used as a test case to demonstrate the potential
applicability of NLP and ML for several reasons. First, alliance is
important for effective psychotherapy, based on its robust relation-
ship with outcome (Flückiger et al., 2018). Second, alliance, unlike
other more objective linguistic features (e.g., ratio of open and
closed questions in motivational interviewing adherence coding;
Miller, Moyers, Ernst, & Amrhein, 2003), requires a potentially
higher order of processing to assess (e.g., through the cognitive
and affective system of a client, therapist, or observer providing
alliance ratings). This additional level of abstraction likely makes
automated prediction more difficult, but also more widely relevant
if it can be accomplished. Third, alliance represents a relatively old
concept (Bordin, 1979; Greenson, 1965) that may be less viable for
concept-driven innovations (Dyson, 1998). New tools, however,
could drive innovation in this area. There are also important open
questions related to alliance, such as the proportion and cause of
therapist and client contributions to alliance (Baldwin, Wampold,
& Imel, 2007), the source of unreliability in alliance ratings across
rating perspectives (i.e., client, therapist, and observer; Tichenor &
Hill, 1989), the state- versus traitlike qualities of alliance (Zilcha-
Mano, 2017), the potentially causal nature of alliance as a driver of
symptom change (Falkenström, Granström, & Holmqvist, 2013;
Flückiger et al., 2018; Zilcha-Mano & Errázuriz, 2017), and ways
to include alliance assessment in routine clinical care without
increasing participant burden (Duncan et al., 2003; Goldberg,
Rowe, et al., 2019). While NLP and ML are likely not a panacea for
resolving all outstanding debates regarding alliance, they may be
useful research tools. Theoretically, these questions could be ad-
dressed more thoroughly if ML enabled alliance assessment on a
much larger scale, particularly if ML models were built in a way
to minimize construct irrelevant variance (e.g., social desirability).
Ultimately, assessment of alliance could be automated using ML,
providing clients and therapists with ongoing information about
this aspect of therapeutic process without the drawbacks (e.g., time
required, psychometric issues) of repeated self-report assessment.
Such technology could also be used to assess alliance directly from
session transcripts or recordings.
Prior to presenting a preliminary attempt at assessing alliance
using NLP and ML, it is worth introducing basic concepts involved
in each methodology. This is, of course, intended to be a cursory
treatment and interested readers are encouraged to review sources
cited below.
Basics of NLP
NLP is a subfield of computer science and linguistics focused on
the interaction between machines and humans through language
(Jurafsky & Martin, 2014). NLP aims to understand human com-
munication by processing and analyzing large quantities of textual
data. Popular applications of NLP include machine translation
(e.g., Google Translate), question-answering systems, or sentiment
analysis (e.g., extraction of sentiments within social media).
Typically, NLP applications start with a collection of raw text
documents (i.e., a language corpus). From this corpus, the first step
is to extract or estimate quantitative features from the text. One of
the most widely used NLP features is the bag-of-words represen-
tation (BoW). In BoW, each document is represented by counts of
its unique words, without regard to the ordering of these words.
Conceptually, BoW is a large crosstabulation table of words by
documents. Other common text features include N-grams (Shannon,
1948), which are short multiword phrases with N elements
(e.g., bigrams include 2-word phrases); dictionary-based features,
such as those provided by Linguistic Inquiry and Word Count
(LIWC; Pennebaker, Boyd, Jordan, & Blackburn, 2015) or the
General Inquirer (Stone, Bales, Namenwirth, & Ogilvie, 1962);
and dialogue acts (Okada et al., 2016), which try to capture a
high-level interaction between participants in a conversation (i.e.,
“statement,” “question,” etc.). More recently, linguistic units are
converted to a vector-space representation of either word (Mikolov,
Sutskever, Chen, Corrado, & Dean, 2013; Pennington, Socher, &
Manning, 2014) or sentence (Pagliardini, Gupta, & Jaggi, 2017)
embeddings, which capture the semantic context. Words (or sentences)
that appear in similar contexts appear closer to each other
in vector space, and semantic relationships are represented by the
operations of addition and subtraction (e.g., v(king) − v(man) +
v(woman) ≈ v(queen), where v(w) represents the vector for
word w).
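As a minimal sketch of the bag-of-words and N-gram features described above (not the feature extraction used in the current study), the following Python snippet counts unigrams and bigrams for two invented utterances. The tokenizer, the function names, and the example sentences are illustrative assumptions.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-word phrases in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, max_n=2):
    """Count unigrams up to max_n-grams, ignoring where in the document they occur."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Two invented "documents" (hypothetical utterances, not study data).
documents = [
    "I feel that you really understand me",
    "I feel that we are working on my goals",
]
bags = [bag_of_ngrams(doc) for doc in documents]

# The vocabulary is the union of all features; each document becomes one
# row of counts in a documents-by-features table.
vocabulary = sorted(set().union(*bags))
table = [[bag[feature] for feature in vocabulary] for bag in bags]
```

Each row of `table` is one document and each column one unigram or bigram, which corresponds to the crosstabulation view of BoW; word order beyond the bigram window is discarded.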
Basics of ML
The human brain has a remarkable ability to learn and recognize
patterns from its surrounding environment. ML comprises a set of
computational techniques simulating this capability (Haykin,
2009). As opposed to knowledge-based approaches, where a hu-
man designs an algorithm having specific rules in mind, ML is
typically based on data-driven methods and on statistical inference.
ML algorithms derive prediction rules from (typically) large
amounts of data.
Two major paradigms in ML are unsupervised and supervised
learning (Murphy, 2012). Similar to cluster analysis, unsupervised
learning does not involve an outcome to predict but rather focuses
on finding structure within a given set of data. Supervised learning
is similar to regression modeling, in which an outcome (either
discrete or continuous) is associated with a set of input data, and
the ML algorithm is tasked with finding an optimal mapping
function between the input data and the outcome (e.g., linking
linguistic content with alliance ratings). Once such a mapping has
been learned, it can be used to predict outcomes for new data.
Since the goal of ML is to apply the algorithm on previously
unseen data, ML analyses train algorithms on a subset of “training
data” but are evaluated on a separate subset of “test data.” Typical
supervised learning algorithms include support vector machines,
regularized linear or logistic regression, and decision trees (Mur-
phy, 2012). Recently, there has been rapid development and in-
creased focus on artificial neural networks and deep learning
techniques (Goodfellow, Bengio, & Courville, 2016).
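The train/test logic of supervised learning can be illustrated with a short sketch. The data below are simulated, and ordinary least squares is used only as a stand-in for the learning algorithm; none of the values come from the current study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (an assumption for illustration): 200 cases, 5 input features,
# and a continuous outcome that depends linearly on the inputs plus noise.
n_cases, n_features = 200, 5
X = rng.normal(size=(n_cases, n_features))
true_weights = np.array([1.0, -0.5, 0.0, 0.25, 0.0])
y = X @ true_weights + rng.normal(scale=0.5, size=n_cases)

# Hold out the last 50 cases as "test data"; the model never sees them during training.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]


def add_intercept(features):
    """Prepend a column of ones so the model can learn an intercept."""
    return np.column_stack([np.ones(len(features)), features])


# Learn the mapping from inputs to outcome on the training data only
# (ordinary least squares here; other supervised learners follow the same pattern).
weights, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

# Apply the learned mapping to previously unseen data.
y_pred = add_intercept(X_test) @ weights
test_mse = np.mean((y_test - y_pred) ** 2)

# Reference point: always predicting the training-set mean.
baseline_mse = np.mean((y_test - y_train.mean()) ** 2)
```

Comparing `test_mse` with `baseline_mse` indicates whether the learned mapping carries information beyond the average outcome when applied to data it was not trained on.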
Participants and Setting
Data were collected at the counseling center of a large, Western
university. The counseling center provides approximately 10,000
sessions per year, with treatment focused on concerns common
among undergraduate and graduate students (e.g., depression, anx-
iety, substance use, academic concerns, relationship concerns;
Benton, Robertson, Tseng, Newton, & Benton, 2003). Treatment is
provided by a combination of licensed permanent staff (including
social workers, psychologists, and counselors) as well as trainees
pursuing masters- or doctoral-level mental health degrees (e.g.,
masters of social work, doctorate in counseling/clinical psychology).
Data were collected between September 11, 2017 and December
11, 2018. Both clients and therapists provided consent for audio
recording of sessions and for use of recordings for the current
study. Recordings were made from microphones installed in clinic
offices and archived on clinic servers. Two microphones were
hung from the ceiling in each room. One cardioid choir mic was
hung to capture voice anywhere in the room and a second choir
mic pointed in the direction where the therapist generally sits. In
order for sessions to be recorded, clinicians had to start and stop
recordings (i.e., sessions were not recorded automatically). All
recordings were from individual therapy sessions (approximately
50 min in length). All audio recordings with associated alliance
ratings were used (i.e., no exclusions were made). Alliance is
assessed routinely in the clinic, with no standardized instructions
regarding how therapists use these ratings in therapy.
The current study was integrated into the partner clinic with
minimum modifications to the existing clinic workflow. One fea-
ture of the workflow is collecting alliance ratings prior to sessions,
rather than asking clients to complete measures both before (e.g.,
symptom ratings) and after (e.g., alliance ratings) session. When
making alliance ratings prior to session, clients were asked to
reflect on their experience of alliance at their previous session (i.e.,
time − 1). In all models, alliance ratings were associated with the
session they were intended to represent (e.g., ratings made prior to
Session 2 were associated with Session 1). No alliance ratings
were made prior to the initial session. Study procedures were
approved by the relevant institutional review board.
Clients were, on average, 23.77 years old (SD = 4.86). The
majority of the sample identified as female (n = 214, 55.4%), with
the remainder identifying as male (n = 158), nonbinary (n = 5),
genderqueer (n = 1), gender neutral (n = 3), female-to-male
transgender (n = 1), and questioning (n = 2), with two choosing
not to respond. The client sample predominantly identified as
White (n = 294, 76.2%), with the remainder identifying as Latinx
(n = 33), Asian American (n = 28), African American (n = 5),
Pacific Islander (n = 2), Middle Eastern (n = 1), and multiracial
(n = 21), with two choosing not to respond.
Demographic data were available from 26 of the 40 included
therapists. Therapists were, on average, 35.15 years old (SD =
14.04). The majority identified as female (n = 17, 65.4%), with the
remainder identifying as male (n = 7) or genderqueer (n = 1).
The majority identified as White (n = 15, 57.7%), with the
remainder identifying as Latinx (n = 4), Asian American (n = 3),
African American (n = 2), Middle Eastern (n = 1), and multiracial
Therapeutic alliance was assessed using a previously validated
(Imel, Hubbard, Rutter, & Simon, 2013) four-item version of the
Working Alliance Inventory—Short Form Revised (Hatcher &
Gillaspy, 2006) representing the bond, task, and goal dimensions
of alliance. Items included “_________ and I are working towards
mutually agreed upon goals” (goal), “I believe the way we are
working on my problem is correct” (task), “I feel that _________
appreciates me” (bond), and “_________ really understands me”
(bond). Items were rated on a 1 (Never) to 7 (Always) scale. A total
score was computed by averaging across the four items. Internal
consistency reliability was adequate in the current sample (α =
.90). As noted above, ratings were made prior to each session
(starting with the second session) asking clients to reflect back on
their experience of alliance in the previous session. Although
alliance can be rated from various perspectives (e.g., client, ther-
apist, observer; Flückiger et al., 2018), the current study employed
client-rated alliance due to its robust link with treatment outcome,
ease of data collection, and ecological validity (i.e., the experience
of alliance largely exists in the subjective experience of the client).
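The scoring described above can be sketched in a few lines. The ratings below are invented, and the reliability function assumes that the reported α refers to Cronbach's alpha computed with the standard formula; neither is taken from the study's own scripts.

```python
import numpy as np

# Hypothetical responses: rows are clients, columns are the four items
# (goal, task, bond, bond), each rated from 1 (Never) to 7 (Always).
ratings = np.array([
    [6, 6, 7, 6],
    [5, 4, 5, 5],
    [7, 7, 7, 6],
    [3, 4, 3, 4],
    [6, 5, 6, 6],
])

# Total score: the mean of the four items for each client.
total_scores = ratings.mean(axis=1)


def cronbach_alpha(items):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of the summed score)."""
    n_items = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    sum_score_variance = items.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1 - item_variances.sum() / sum_score_variance)


alpha = cronbach_alpha(ratings)
```

Averaging rather than summing keeps the total score on the original 1 to 7 metric, which makes predicted and observed alliance values directly comparable.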
Data Analysis
For this study, we used 1,235 recorded sessions together with
client-reported alliance, assessed prior to the subsequent session
occurring between the same therapist and client. Audio recordings
were processed through a speech pipeline to generate automatic
speech-to-text transcriptions. The automatic speech recognition
made use of the open-source, freely available Kaldi software
(Povey et al., 2011). Components of the pipeline along with their
corresponding accuracy (vs. human transcription) using data from
the current study include: (a) a voice activity detector, where
speech segments are detected over silence or noise (unweighted
average recall = 82.7%); (b) a speaker diarization system, where
the speech is clustered into speaker-homogeneous groups (i.e.,
Speaker A, Speaker B; diarization error rate = 6.4%); (c) a speaker
role recognizer, where each group is assigned the label “therapist”
or “client” (misclassification rate = 0.0%); and (d) an automatic
speech recognizer, which transduces speech to text (word error
rate = 36.43%). The modules of the speech pipeline have been
adapted with the Kaldi speech recognition toolkit (Povey et al.,
2011) using psychotherapy sessions provided by the same coun-
seling center, but not used for the alliance prediction, thus not
inducing bias. A similar system architecture is described in Xiao et
al. (2016) and Flemotomos et al. (2019).
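To make the word error rate reported for the speech recognizer concrete, the sketch below computes it with the conventional definition: the word-level edit distance between a reference (human) transcript and an automatic transcript, divided by the number of reference words. This is an illustrative implementation with invented sentences, not the scoring tool used with Kaldi in the current study.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions + deletions + insertions) divided by reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous_row[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1      # reference word missing from hypothesis
            insertion = current_row[j - 1] + 1  # extra word in hypothesis
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# Invented example: a human transcript versus an automatic transcript.
human = "i feel like we are working on the right things"
automatic = "i feel like we working on the write things"
wer = word_error_rate(human, automatic)
```

In this example one word is dropped and one is substituted out of ten reference words, so `wer` is 0.2. Because insertions are counted, the measure can exceed 1 for very poor transcripts.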
Linguistic features were extracted from resulting transcripts,
independently for therapist and client text. We report results using
unigrams and bigrams (i.e., 1- and 2-word pairings) weighted by
the term frequency-inverse document frequency (tf-idf; Salton &
McGill, 1986) or sentence (Sent2vec) embeddings (Pagliardini et
al., 2017). Tf-idf weighting accounts for the frequency with which
words appear within a given document (i.e., session), while also
considering their frequency within the larger corpus of text (i.e., all
sessions). This gives less commonly used words (e.g., suicide)
more weight than commonly used words (e.g., the). Thus, less
common words are treated as more important. Tf-idf weighting
was calculated across all sessions in the train set and applied to the
test set. As described earlier, Sent2vec maps sentences to vectors
of real numbers. Using Sent2vec, the session is represented as the
mean of its sentence embeddings. Models used linear regression
with L2-norm regularization (i.e., ridge regression; Hoerl & Kennard,
1970), a method designed for highly correlated
features, which are common in NLP data.
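The feature extraction and modeling steps in this paragraph can be sketched with scikit-learn. The session texts and ratings are invented, the default tokenizer settings are used, and the penalty strength is the library default rather than a value reported here; the random array at the end merely stands in for real Sent2vec embeddings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Invented therapist text and alliance ratings (hypothetical, not study data).
train_sessions = [
    "i think that is right and it fits your values",
    "maybe we can think about the situation with your husband",
    "yeah yeah sure let us talk about coping and motivation",
    "i just want to check how you are feeling about counseling",
]
train_alliance = np.array([6.5, 6.0, 4.5, 4.75])
test_sessions = ["i think we can talk about your values and the situation"]

# Unigrams and bigrams with tf-idf weights. The weights are learned from the
# training sessions only and then applied unchanged to the test session.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_sessions)
X_test = vectorizer.transform(test_sessions)

# Ridge regression (linear regression with an L2 penalty). alpha=1.0 is the
# scikit-learn default, not a value reported in the paper.
model = Ridge(alpha=1.0)
model.fit(X_train, train_alliance)
predicted_alliance = model.predict(X_test)

# Sentence-embedding alternative: represent a session as the mean of its
# sentence embeddings. The random array is a placeholder for real embeddings.
sentence_embeddings = np.random.default_rng(0).normal(size=(12, 8))  # 12 sentences, 8 dimensions
session_vector = sentence_embeddings.mean(axis=0)
```

Fitting the vectorizer on the training sessions alone keeps document frequencies from the test set out of the features, so the test set remains independent.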
To estimate the performance of our method, experiments were
run using a 10-fold cross-validation: data is split into 10 parts, with
nine parts used for training at each iteration (train), and one for
evaluation (test). This is commonly used in ML and allows esti-
mation of the extent to which model results based on the training
set (train) will generalize to an independent sample (test). Train
and test sets were constructed so as not to share therapists between
them, as shared therapists could artificially inflate the model’s
accuracy. The algorithm is therefore expected to learn patterns of
words related to alliance ratings in general instead of capitalizing
on therapist-specific characteristics.
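One way to build folds that do not share therapists is scikit-learn's `GroupKFold`, shown below on simulated data. Whether the original analysis used this particular utility is an assumption; the session counts, therapist identifiers, and features are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Hypothetical layout: 40 therapists with 5 sessions each (200 sessions in total).
n_therapists, sessions_per_therapist = 40, 5
therapist_ids = np.repeat(np.arange(n_therapists), sessions_per_therapist)
n_sessions = len(therapist_ids)
session_features = rng.normal(size=(n_sessions, 5))
alliance = rng.uniform(1, 7, size=n_sessions)

# GroupKFold keeps all sessions from a given therapist in the same fold,
# so no therapist appears in both the training and the test set.
splitter = GroupKFold(n_splits=10)
folds = list(splitter.split(session_features, alliance, groups=therapist_ids))

# Check the property directly for every fold.
no_shared_therapists = all(
    set(therapist_ids[train_index]).isdisjoint(therapist_ids[test_index])
    for train_index, test_index in folds
)
```

Each session appears in exactly one test fold, and `no_shared_therapists` confirms that the model is always evaluated on therapists it has not seen during training.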
We employed two commonly used metrics of accuracy: mean
square error (MSE) and Spearman’s rank correlation (ρ). These
metrics reflect the accuracy of the ML algorithm when applied to
the test set. Specifically, mean squared error is the average of the
squared differences between the predictions and the true values
and is useful for comparing models, though its absolute value is
not interpretable. Spearman’s rank correlation measures the strength
of association between two variables, ranging from −1 to 1, with
higher values preferred.
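Both metrics are simple to compute. The sketch below implements them in plain Python on invented observed and predicted ratings, treating Spearman's ρ as the Pearson correlation of the ranks (with tied values given their average rank); the function names are illustrative.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between predictions and true values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


def spearman_rho(x, y):
    """Spearman's rank correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented test-set ratings and model predictions (hypothetical values).
observed = [6.5, 5.0, 5.5, 4.0, 6.0]
predicted = [5.8, 5.4, 5.3, 5.1, 5.6]

mse = mean_squared_error(observed, predicted)
rho = spearman_rho(observed, predicted)
```

For these invented values, `mse` is 0.412 and `rho` is 0.9. A model that predicts the same value for every session has no variation in its ranks, so ρ is undefined for it and only MSE can be compared.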
Computer Software
Self-report data were processed within the R statistical environment
(R Core Team, 2018). NLP and ML were conducted using the
Python programming language (Python Software Foundation,
2019). Models used the “scikit-learn” toolkit (Pedregosa et al.,
2011) and the “sklearn.linear_model.Ridge” function (Hoerl &
Kennard, 1970; see Table 1 in the online supplemental materials
for syntax). Sent2vec was implemented using the method developed
by Pagliardini et al. (2017), and N-grams were obtained using the
text feature extraction in “scikit.”
The time required for running
the speech pipeline and ML models can vary. In the current data,
the speech pipeline required approximately 30 min per 50-min
session using one core of an AMD Opteron Processor 6276 (2.3
GHz). The 10-fold cross-validation models took approximately 10
min on a MacBook Pro with 2.8 GHz Intel Core i7, 16 GB RAM,
and 2133 MHz LPDDR3.
The sample included a total of 1,235 sessions with recordings
and associated alliance ratings (provided at the subsequent session;
n = 386 clients; 40 therapists). Clients had, on average, 3.20
sessions in the data set (SD = 2.50, range = 1 to 13) and therapists
had 30.88 (SD = 32.97, range = 1 to 131). Sessions represented
a variety of points in treatment, with a mean session number of
5.31 (SD = 3.37, range = 1 to 23). Across the 1,235 alliance
ratings, the mean rating was 5.47 (SD = 0.83, median = 5.5,
range = 1.75 to 6.50; see Figure 1 in the online supplemental
materials). Ratings showed the typical negative skew found in the
assessment of alliance (Tryon et al., 2008).
ML model results are presented in Table 1. Models are shown
using either therapist or client text as the input. Results are also
separated by feature extraction method (tf-idf, Sent2vec). The
baseline model reflects accuracy of the average rating (i.e., 5.47)
and is useful to evaluate model performance.
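The baseline can be sketched in a few lines. The ratings below are hypothetical, and NumPy is used here only for convenience.

```python
import numpy as np

# Hypothetical alliance ratings (not study data).
ratings = np.array([5.5, 6.0, 4.5, 6.5, 5.0, 5.25])

# The baseline ignores session text and always predicts the mean rating.
baseline_predictions = np.full_like(ratings, ratings.mean())

baseline_mse = np.mean((ratings - baseline_predictions) ** 2)
```

Because the baseline always predicts the mean, its MSE equals the variance of the ratings. This is consistent with the values reported here: an SD of 0.83 implies a variance of about 0.69, which matches the baseline MSE in Table 1. A constant prediction also carries no rank information, so the baseline cannot order sessions by alliance.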
The predictions of three out of the four models are significantly
better than chance (Spearman’s ρ > .00, p < .01). The model that
used therapist text and extracted features using tf-idf performed
best overall, with MSE = 0.67 and ρ = 0.15, p < .001. For
illustrative purposes only, we extracted the 15 unigrams/bigrams
that were most positively or negatively correlated with alliance
ratings in our best performing model. As these features represent
only a small portion of the corresponding model, they should not
be viewed as a replacement for the full model. The 15 most
positively correlated unigrams/bigrams were: group, really, hus-
band, right, think, phone, values, maybe, divorce, got, yeah, situ-
ation, um right, don think, max. The 15 most negatively correlated
unigrams/bigrams were: counseling, yeah yeah, going, sure, cop-
ing, just want, friends, motivation, feeling, Monday, huh yeah, oh,
physical, pretty, time.
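One way such a list could be produced is sketched below. The text above describes the n-grams as most correlated with alliance ratings; this sketch instead ranks n-grams by their weight in a fitted linear model, which is an assumption about the procedure and only one possible reading. The n-grams, weights, and the helper `top_features` are hypothetical.

```python
def top_features(feature_names, weights, k=15):
    """Return the k most positive and k most negative (name, weight) pairs."""
    ranked = sorted(zip(feature_names, weights), key=lambda pair: pair[1])
    most_negative = ranked[:k]
    most_positive = ranked[::-1][:k]
    return most_positive, most_negative


# Hypothetical n-grams and weights (not the study's fitted values).
names = ["really", "yeah yeah", "think", "coping", "maybe"]
weights = [0.40, -0.35, 0.22, -0.10, 0.05]

positive, negative = top_features(names, weights, k=2)
```

With scikit-learn, the names could be taken from `TfidfVectorizer.get_feature_names_out()` and the weights from the `coef_` attribute of a fitted `Ridge` model.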
Readers interested in working with text data in Python are encouraged
to read the “scikit” and Kaldi tutorials (
This document is copyrighted by the American Psychological Association or one of its allied publishers.
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
The current study introduces two related quantitative methods—
NLP and ML—that have the potential to significantly expand
methodological tools available to psychotherapy researchers and
clinicians. The prediction of client-rated therapeutic alliance from
session recordings was used as a test case for these methods due to
the importance of alliance in psychotherapy and the potential
contribution of technologies able to reliably automate alliance
assessment. Results presented here suggest that ML models modestly
predict alliance ratings (ρ = .15). That is to say, there was
linguistic signal indicative of the strength of the alliance that is
detectable through ML, supporting the notion that ML may be a
useful tool for examining alliance in future studies.
It is worth contextualizing these results within the broader field
of speech signal processing and NLP as well as prior work spe-
cifically within the domain of psychotherapy research. An important
feature of the alliance, and part of the motivation to examine alliance,
is its greater degree of abstraction from the actual linguistic context of
a psychotherapy session. Compare alliance with another commonly
studied psychotherapy process variable—motivational interviewing
fidelity codes. Motivational interviewing codes are primarily lin-
guistic in nature (e.g., open vs. closed question; Miller et al., 2003)
and can be reliably coded by trained human raters and ML algo-
rithms at approximately similar levels (e.g., s.75 for use of
open questions over a session of motivational interviewing; Atkins
et al., 2014). Importantly, aspects of motivational interviewing
fidelity that show lower interrater reliability among human raters
(e.g., empathy) are also more difficult to predict via ML (e.g., s
.25 for talk turns and .00 for sessions; Atkins et al., 2014).
Alliance, in contrast to most motivational interviewing fidelity
dimensions, requires in-depth processing by humans (i.e., client,
therapist, or observer) and is presumably influenced by a variety of
unobservable, nonlinguistic factors. It is exactly this nonlinguistic,
internal processing that may be more difficult for ML models to
replicate. This highlights a truism of NLP methodologies: behav-
iors more distal from linguistic content that are more difficult for
human raters to rate reliably will also be more difficult for ML
models to predict. Even more abstracted aspects of treatment,
such as treatment outcome, may therefore be yet more
challenging to predict using ML.
Practical Suggestions
Given these potential limitations, we offer six practical considerations
that may increase the viability of ML to contribute to
psychotherapy research. Several of these are fundamental
principles of ML reviewed previously but are worth highlighting
because many readers may not be familiar with them.
1. ML may be most promising for predicting observable
linguistic behaviors. For efforts employing ML using text
data, it may be valuable to start with observable behav-
iors that humans can code reliably using only text data
(e.g., treatment fidelity; Atkins et al., 2014). Human
reliability provides an estimate of the upper limit to
reliability likely to be achieved using ML models. Behaviors
for which humans have difficulty reaching consensus
will likely be more challenging for ML models as well.
2. ML models should be trained using human coding as the
gold standard. Related to the previous suggestion, it may
be prudent to develop ML models based on behaviors
that are observable and to use human-based ratings as the
standard for training ML algorithms. Thankfully, prom-
ising observer-rated measures of alliance and other psy-
chotherapy processes (e.g., empathy, treatment fidelity)
have been developed that may serve as a basis for future
ML psychotherapy research. While this has been done in
previous work on motivational interviewing (Atkins et
al., 2014; Xiao et al., 2016), this was not used in the
current study due both to resource limitations and an
interest in attempting to predict client (rather than ob-
server) ratings. However, ML models could be con-
structed predicting observer-rated alliance, which may be
less prone to client response set biases (e.g., social de-
sirability). While models using human coding as the basis
are a promising starting point, it may also be useful to
develop models attempting to predict more diffuse con-
structs that are not reliably rated by observers (e.g.,
treatment outcome).
3. ML models should be tested using large data sets. One of
the distinct advantages of ML is its potential to process
large amounts of data, an impractical task when using
human coders. However, for the development of reliable
ML algorithms, large amounts of training data are ideal.
The actual amount of data necessary varies widely de-
pending on the nature of the ML task, but data sets of
10,000 cases or more are commonly used in NLP appli-
cations. Given advances in NLP, researchers and clini-
cians who have access to high fidelity session recordings
may be able to convert existing recordings to text data for
ML models.
4. Develop models using a training set and test models
using a test set. Similar to the rationale for employing
separate samples for exploratory and confirmatory factor
analysis (Gerbing & Hamilton, 1996), evaluation of ML
algorithms requires separate samples. It is possible to get
perfect accuracy within a training set, but this in no way
indicates that results will be perfectly accurate in a future
data set (i.e., for prediction). The need for separate samples
echoes the need for large data sets when conducting ML.

Table 1
Results From Machine Learning Prediction Model

Input text   Feature extraction method   MSE    ρ     p
Therapist    tf-idf                       .67   .15   < .001
Therapist    Sent2vec                    3.34   .08   .003
Client       tf-idf                       .69   .11   .001
Client       Sent2vec                    3.67   .01   .800
Baseline     Average                      .69   .00   n/a

Note. Models employed unigrams and bigrams (i.e., 1- and 2-word pairings)
and a linear regression with L2-norm regularization (i.e., ridge
regression; Hoerl & Kennard, 1970). Models were evaluated using 10-fold
cross-validation with nine parts used for model training and one used for
evaluation. Therapist = therapist speech; Client = client speech;
Baseline = model results if model always predicts the mean alliance rating
(i.e., 5.47); MSE = mean square error; ρ = Spearman’s rank order correlation;
tf-idf = term frequency-inverse document frequency weighting based on
(inverse) frequency of occurrence within the document and larger corpus;
Sent2vec = sentence embeddings used to map sentences to vectors of real
numbers.

5. Develop interdisciplinary collaborations. Most psychother-
apy researchers are not trained in ML during graduate
school. As these models depart in some important ways
from traditional quantitative methods used in psychology
(e.g., regression and analysis of variance), it may be vital for
researchers interested in ML to build collaborations with
colleagues more versed in the intricacies of ML. Research-
ers with expertise in processing linguistic data, with back-
grounds in computer science and engineering, for example,
may be ideal complements to the clinical and context ex-
pertise brought by psychologists. Of course, interdisciplin-
ary collaborations involve their own complexity, with re-
searchers working across disciplinary cultures, practices,
and standards.
6. Have reasonable expectations and avoid the risk of “al-
chemy.” A final suggestion is that those interested in pur-
suing ML-based psychotherapy research have reasonable
expectations about the promise of these methods, and the
speed with which they will become viable tools. One con-
cern is that ML-based models simply replicate the human
biases in the patient-rated measures: if the model accurately
learns the human rating, it will also include ceiling effects,
social desirability, and other potentially construct-irrelevant
variance. In addition, we encourage that ML not be
viewed as a form of alchemy (Hutson, 2018) in which ML
becomes a quasimagical black box for researchers and consumers
of research. ML research, like other research meth-
odologies, is likely to benefit from transparency, humility,
and replication (Open Science Collaboration, 2015) along
with a healthy dose of skepticism.
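The fourth suggestion, that perfect accuracy in a training set says little about accuracy in new data, can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are random noise generated for illustration, and unregularized least squares is used here in place of the ridge regression employed in the study so that the overfitting is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 training and 20 test cases, 50 pure-noise features.
n_train, n_test, n_features = 20, 20, 50
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = rng.normal(size=n_train)
y_test = rng.normal(size=n_test)

# Unregularized least squares; with more features than cases it can
# reproduce the training outcomes exactly even though no signal exists.
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((X_train @ weights - y_train) ** 2)
test_mse = np.mean((X_test @ weights - y_test) ** 2)
```

The training error is essentially zero while the test error is not, even though the features carry no information about the outcome. Only the held-out error reflects how the model would perform for prediction.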
Future Directions
Consistent with these practical suggestions, future work should
continue to explore important psychotherapy process and outcome
variables using linguistic, paralinguistic (e.g., prosody, pitch), and
nonverbal therapy behaviors. Ideally this is done using large data
sets (e.g., 10,000 sessions or more). The current study focused on
alliance, but future work could use similar methods to predict
treatment outcome (e.g., Hamilton Rating Scale of Depression;
Hamilton, 1960), multicultural competence (Tao, Owen, Pace, &
Imel, 2015), empathy (Imel et al., 2014), interpersonal skill (Anderson,
Ogles, Patterson, Lambert, & Vermeersch, 2009), treatment
fidelity (e.g., Cognitive Therapy Rating Scale; Creed et al.,
2016; Goldberg, Baldwin, et al., 2019), and other variables previously
assessed using observer ratings (e.g., innovative moments;
Gonçalves, Ribeiro, Mendes, Matos, & Santos, 2011).
Development will also ideally occur in tandem with attention to
measurement and known issues in psychotherapy research. For
example, future work should consider likely bias in the measure-
ment of alliance. Clients whose ratings are invariant across ses-
sions (e.g., consistently provided alliance ratings at the ceiling of
the measure) could be removed from ML models, perhaps even-
tually providing models that better predict the correlates of alliance
(e.g., treatment retention) than self-report. Or ML models could be
used to determine when collecting self-report alliance data would
provide information beyond what analysis of session content could
provide (e.g., models predicting discrepancies between ML-based
and self-report alliance ratings). It may also be worthwhile to
attempt to predict therapist-level alliance scores using session
content and ratings aggregated across multiple clients.
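The idea of setting aside clients whose ratings never vary can be sketched as follows. The client identifiers and ratings are hypothetical, and the rule used here (at least two sessions, all with the same rating) is an assumption about how invariance might be defined.

```python
from collections import defaultdict

# Hypothetical (client_id, alliance_rating) pairs, one per session.
session_ratings = [
    ("c1", 6.5), ("c1", 6.5), ("c1", 6.5),
    ("c2", 5.0), ("c2", 5.75), ("c2", 6.0),
    ("c3", 4.5),
]

ratings_by_client = defaultdict(list)
for client_id, rating in session_ratings:
    ratings_by_client[client_id].append(rating)

# Flag clients with at least two sessions whose ratings never change.
invariant_clients = sorted(
    client_id
    for client_id, ratings in ratings_by_client.items()
    if len(ratings) >= 2 and len(set(ratings)) == 1
)

# Sessions that would remain available for model training.
retained = [(c, r) for c, r in session_ratings if c not in invariant_clients]
```

Clients with a single session are kept in this sketch because one rating cannot show whether a client's responses vary.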
The current cross-validation design allowed no therapist to
appear in both the train and test sets. Conceptually, this ML
approach is trying to discover a universal model for mapping
language to alliance, and as such, it is the hardest and most
conservative modeling approach. Alternative strategies would al-
low therapists to be in both train and test sets, which allows a
model to learn individual-specific mappings of text to alliance to
support prediction of future alliance scores for either therapist or
client. It could be valuable to explore these additional models in
future work.
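A cross-validation design in which no therapist appears in both the train and test sets can be sketched with scikit-learn's `GroupKFold`. The therapist identifiers, the placeholder features, and the choice of four folds are assumptions for illustration. The study used 10-fold cross-validation, and its exact splitting code is not shown here.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical layout: 12 sessions delivered by 4 therapists (not study data).
therapist_ids = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
features = np.zeros((len(therapist_ids), 1))  # placeholder session features
ratings = np.zeros(len(therapist_ids))        # placeholder alliance ratings

# Grouping by therapist keeps all of a therapist's sessions in one fold.
splitter = GroupKFold(n_splits=4)
folds = list(splitter.split(features, ratings, groups=therapist_ids))

for train_idx, test_idx in folds:
    train_therapists = set(therapist_ids[train_idx])
    test_therapists = set(therapist_ids[test_idx])
    # No therapist contributes sessions to both sides of a split.
    assert train_therapists.isdisjoint(test_therapists)
```

The alternative strategy described above, in which therapists may appear in both sets, would correspond to splitting sessions without regard to the therapist grouping.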
Provided ML models continue to improve in their ability to
detect important aspects of psychotherapy, questions of dissemi-
nation and implementation will become increasingly central. Many
potentially valuable technologies have existed for years (e.g.,
models detecting depression symptoms via speech features; France
et al., 2000), yet are not widely implemented. There are, of course,
numerous reasons that innovations may not be adopted, and con-
siderable scholarship focused on precisely this research-to-practice
impasse (e.g., Wandersman et al., 2008). Part of the solution to
bringing ML-based technologies to market may require research-
ers moving outside of the traditional academic boundaries and
developing collaborations with industry. For clinicians and researchers
alike, there may be discomfort with the notion of partnering
with for-profit entities, along with fears of disruptions to the
objectivity that forms the theoretical backbone of both science and
practice (DeAngelis, 2000). While these concerns may be valid,
these partnerships may play a central role in bringing novel tech-
nologies such as those based on ML to the therapists and clients
who could benefit from them.
Gaining buy-in from clinicians is another dissemination and
implementation barrier. Clinician discomfort discussed in relation
to measurement-based care (e.g., Boswell, Kraus, Miller, & Lambert,
2015; Fortney et al., 2017; Goldberg et al., 2016) may very
well be magnified when clinicians are asked to routinely record
therapy sessions. Discomfort may be further magnified knowing
that these recordings will subsequently be analyzed by a computer
algorithm to determine treatment quality, therapeutic alliance, or
outcome. Sensitivity to these and other dissemination and imple-
mentation issues will be crucial for moving this work forward.
A final future direction to mention is the importance of ultimately
evaluating whether ML-based feedback, be it focused on
alliance, fidelity, or any other aspect of treatment, actually provides
benefits. The benefit of interest may depend on the stakeholder:
for payers, this may involve demonstrating the quality of
services; for clinicians, this may involve demonstrating improved
client outcomes; and for researchers, this may involve demonstrat-
ing reliability and validity with reduced cost of research team time
and money. It is likely these metrics will ultimately determine
whether ML can transform psychotherapy.
While promising, the current study has several important limi-
tations. The first is the relatively modest sample size. While large
by human coding standards, the current number of sessions eval-
uated is well below the samples often used for ML. As noted
previously, ML models improve with larger amounts of training
data. Thus, the available sample size may have reduced the ability
to predict alliance ratings from session recordings.
Another limitation is related to the available speech signal
processing technology. In particular, existing NLP technologies
have known limitations, including inaccuracy in transcription (i.e.,
misinterpreting spoken words) and errors in assigning speech to a
given speaker (i.e., diarization). These factors introduce error
variability into the text data which functions to reduce statistical
power and the accuracy of the ML models.
A third key limitation is related to the assessment of alliance.
For one, ratings were made retrospectively (i.e., about a prior
session). Collecting ratings at time points more distant from the
actual session may have reduced linkages between ratings and
session content and thereby decreased the signal available for
detection (i.e., exerting a conservative rather than liberal bias on
our ability to predict alliance ratings from session content). Sim-
ilarly, there was evidence that alliance ratings in the current study
suffered from range restriction due to the well-documented ceiling
effects for ratings of alliance (Tryon et al., 2008). Range restriction
also may have decreased statistical power and the ability to reli-
ably predict alliance ratings (Cohen, Cohen, West, & Aiken,
2003). For this reason, it may be useful to examine alliance in
other contexts in which ratings may be more variable (e.g., clients
with more severe personality psychopathology). Lastly, alliance
was assessed only by clients. While relevant and ecologically
valid, accuracy may have been improved for predicting observer-
rated alliance in which observers and ML algorithms had access to
the same information (i.e., session text).
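The effect of range restriction on an observed association can be illustrated with a small simulation. The data are generated for illustration only, the true correlation of .50 is an arbitrary assumption, and keeping only the upper half of the outcome is a simplified stand-in for ratings that cluster near the ceiling of a measure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictor and outcome with a true correlation of about .50.
n = 20_000
predictor = rng.normal(size=n)
outcome = 0.5 * predictor + np.sqrt(1 - 0.5**2) * rng.normal(size=n)

full_r = np.corrcoef(predictor, outcome)[0, 1]

# Keep only cases in the top half of the outcome, mimicking ratings that
# cluster near the top of a scale.
keep = outcome > np.median(outcome)
restricted_r = np.corrcoef(predictor[keep], outcome[keep])[0, 1]
```

The correlation in the restricted sample is smaller than in the full sample even though the underlying relationship is unchanged. This is the sense in which range restriction exerts a conservative bias on prediction.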
Clinical Vignette
The algorithm developed in the current study is only a first
attempt at predicting alliance ratings using ML, but these initial
results suggest a potential future for using these technologies in
clinical research and practice. We imagine a future application in
the following vignette. This example indicates how ML-generated
analytics derived directly from the session encounter can be used
as another source of information for the therapist to reflect on their
work and potentially improve the process of therapy.
Sandra is a 43-year-old, married, African American, cisgender
female who has been struggling with social anxiety since adoles-
cence. She is a school librarian and the mother of two teenage
sons. She has recently begun working with a psychologist, Dr.
Martinez, due to “increasing stress and anxiety” at work which is
beginning to spill over into Sandra’s family life. She reports she
has trouble “asserting herself and expressing her needs” at home
and at work.
During the intake session, Dr. Martinez shares with Sandra that
the clinic has been using a recording platform that can provide Dr.
Martinez with information about how therapy might be going, in
particular, feedback on the therapy “relationship.” Sandra provides
her consent for use of the platform. Therapy starts out smoothly,
with Sandra sharing more about the difficulties she is experienc-
ing, which in recent months have included periodic panic attacks
in social situations. Dr. Martinez, who primarily operates from a
cognitive-behavioral therapy perspective, introduces exposure therapy
as a treatment approach for reducing her symptoms.
During the fifth session, Dr. Martinez initiates a conversation
about Sandra’s progress in treatment. Sandra reports that therapy is
going “just fine” and she apologizes for not having had the time to
complete the exposure exercises Dr. Martinez had recommended.
Dr. Martinez reflects that she knows it can be challenging to make
the time for engaging in therapy “homework” and that the expo-
sures themselves can be unpleasant. Sandra quickly assures Dr.
Martinez that she will try to do a better job making time for
Through the treatment, Dr. Martinez has been reviewing ses-
sions and automated feedback on the quality of her relationship
with Sandra and has noticed that the alliance scores generated by
the system have been low in the past two sessions. Although
Sandra indicated in session that treatment was going fine, the
alliance algorithm was built using observer-rated alliance that is
less contaminated with self-report biases (e.g., social desirability).
Dr. Martinez uses this opportunity to discuss the automated feed-
back with Sandra:
You know Sandra, I was reviewing some feedback I received on our
sessions last week, and it suggested that it might be smart for me to
check in with you again on how things are going. I know you said
things are fine, but I can’t help but wonder if there’s something I’m
missing. I’d really like to know.
At this point, Sandra notes that she has been having trouble with
Dr. Martinez’s therapeutic approach. Sandra shares that she has
been having significant difficulties in her marriage recently and
has experienced several racial microaggressions at work that have
contributed to her anxiety. Sandra notes that she was hoping to
discuss these events in therapy but was not sure how to bring them
up, given Dr. Martinez’s emphasis on exposure therapy and San-
dra’s difficulty completing her exposure exercises. Dr. Martinez
expresses her appreciation to Sandra for sharing this. They begin
a discussion of ways to refocus treatment to include these addi-
tional dimensions.
The current study introduced and attempted to model ML as a
statistical approach that may be relevant for addressing important
questions about psychotherapy. Just as ML is centrally involved in
numerous cultural, technological, and social changes, it may also
play a leading role in future innovation within psychotherapy
research and practice. Our prediction of therapeutic alliance dis-
cussed here is one of several recent examinations of potential
synergy between ML and psychotherapy. As available sample
sizes grow and technology evolves, it may well be that ML
algorithms can be developed to even more reliably detect treatment
features like alliance from session recordings. Clearly such tech-
nologies could dramatically revolutionize training and provision of
clinical services. In a way, these methods, while heavily reliant on
computers and artificial intelligence, may prove crucial in helping
human researchers and clinicians unravel the dizzying complexity
of the human interaction that is psychotherapy.
Althoff, T., Clark, K., & Leskovec, J. (2016). Large-scale analysis of
counseling conversations: An application of natural language processing
to mental health. Transactions of the Association for Computational
Linguistics, 4, 463–476.
Anderson, T., Ogles, B. M., Patterson, C. L., Lambert, M. J., & Ver-
meersch, D. A. (2009). Therapist effects: Facilitative interpersonal skills
as a predictor of therapist success. Journal of Clinical Psychology, 65,
Atkins, D. C., Steyvers, M., Imel, Z. E., & Smyth, P. (2014). Scaling up the
evaluation of psychotherapy: Evaluating motivational interviewing fi-
delity via statistical text classification. Implementation Science, 9, 49.
Baldwin, S. A., & Imel, Z. E. (2013). Therapist effects: Findings and
methods. In M. J. Lambert (Ed.), Bergin and Garfield’s handbook of
psychotherapy and behavior change (6th ed., pp. 258–297). Hoboken,
NJ: Wiley.
Baldwin, S. A., Wampold, B. E., & Imel, Z. E. (2007). Untangling the
alliance-outcome correlation: Exploring the relative importance of therapist
and patient variability in the alliance. Journal of Consulting and
Clinical Psychology, 75, 842–852.
Benton, S. A., Robertson, J. M., Tseng, W. C., Newton, F. B., & Benton,
S. L. (2003). Changes in counseling center client problems across 13
years. Professional Psychology: Research and Practice, 34, 66–72.
Berwian, I. M., Walter, H., Seifritz, E., & Huys, Q. J. (2017). Predicting
relapse after antidepressant withdrawal - a systematic review. Psychological
Medicine, 47, 426–437.
Bibault, J. E., Giraud, P., & Burgun, A. (2016). Big data and machine learning
in radiation oncology: State of the art and future prospects. Cancer Letters,
382, 110–117.
Bordin, E. S. (1979). The generalizability of the psychoanalytic concept of
the working alliance. Psychotherapy: Theory, Research & Practice, 16,
Boswell, J. F., Kraus, D. R., Miller, S. D., & Lambert, M. J. (2015).
Implementing routine outcome monitoring in clinical practice: Benefits,
challenges, and solutions. Psychotherapy Research, 25, 6–19. http://dx
Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O., & Walsh, A.
(2018). Machine learning for molecular and materials science. Nature,
559, 547–555.
Chen, M., Hao, Y., Hwang, K., Wang, L., & Wang, L. (2017). Disease
prediction by machine learning over big data from healthcare commu-
nities. IEEE Access: Practical Innovations, Open Solutions, 5, 8869 –
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple
regression/correlation analysis for the behavioral sciences (3rd ed.).
Mahwah, NJ: Erlbaum.
Creed, T. A., Frankel, S. A., German, R. E., Green, K. L., Jager-Hyman, S.,
Taylor, K. P., . . . Beck, A. T. (2016). Implementation of transdiagnostic
cognitive therapy in community behavioral health: The Beck Community
Initiative. Journal of Consulting and Clinical Psychology, 84,
1116–1126.
Cuijpers, P., Sijbrandij, M., Koole, S. L., Andersson, G., Beekman, A. T.,
& Reynolds, C. F., III. (2014). Adding psychotherapy to antidepressant
medication in depression and anxiety disorders: A meta-analysis. World
Psychiatry, 13, 56–67.
DeAngelis, C. D. (2000). Conflict of interest and the public trust. Journal
of the American Medical Association, 284, 2237–2238.
Duncan, B. L., Miller, S. D., Sparks, J. A., Claud, D. A., Reynolds, L. R.,
& Johnson, L. D. (2003). The Session Rating Scale: Preliminary psy-
chometric properties of a “working” alliance measure. Journal of Brief
Therapy, 3, 3–12.
Dyson, F. J. (1998). Imagined worlds (Vol. 6). Cambridge, MA: Harvard
University Press.
Elliott, R., Bohart, A. C., Watson, J. C., & Murphy, D. (2018). Therapist
empathy and client outcome: An updated meta-analysis. Psychotherapy,
55, 399–410.
Fairburn, C. G., & Cooper, Z. (2011). Therapist competence, therapy
quality, and therapist training. Behaviour Research and Therapy, 49,
Falkenström, F., Granström, F., & Holmqvist, R. (2013). Therapeutic
alliance predicts symptomatic improvement session by session. Journal
of Counseling Psychology, 60, 317–328.
Flemotomos, N., Martinez, V., Chen, Z., Singla, K., Peri, R., Ardulov, V.,
& Narayanan, S. (2019). A speech and language pipeline for quality
assessment of recorded psychotherapy sessions. Manuscript in preparation.
Flückiger, C., Del Re, A. C., Wampold, B. E., & Horvath, A. O. (2018).
The alliance in adult psychotherapy: A meta-analytic synthesis. Psychotherapy,
55, 316–340.
Fortney, J. C., Unützer, J., Wrenn, G., Pyne, J. M., Smith, G. R., Schoenbaum,
M., . . . Harbin, H. T. (2017). A tipping point for measurement-based
care. Psychiatric Services, 68, 179–188.
France, D. J., Shiavi, R. G., Silverman, S., Silverman, M., & Wilkes, D. M.
(2000). Acoustical properties of speech as indicators of depression and
suicidal risk. IEEE Transactions on Biomedical Engineering, 47, 829 –
Gerbing, D. W., & Hamilton, J. G. (1996). Viability of exploratory factor
analysis as a precursor to confirmatory factor analysis. Structural Equa-
tion Modeling, 3, 62–72.
Goldberg, S. B., Babins-Wagner, R., Rousmaniere, T., Berzins, S., Hoyt,
W. T., Whipple, J. L., . . . Wampold, B. E. (2016). Creating a climate for
therapist improvement: A case study of an agency focused on outcomes
and deliberate practice. Psychotherapy, 53, 367–375.
Goldberg, S. B., Baldwin, S. A., Merced, K., Caperton, D., Imel, Z. E., Atkins,
D. C., & Creed, T. (2019). The structure of competence: Evaluating the
factor structure of the Cognitive Therapy Rating Scale. Behavior Therapy.
Advance online publication.
Goldberg, S. B., Rowe, G., Malte, C. A., Ruan, H., Owen, J. J., & Miller,
S. D. (2019). Routine monitoring of therapeutic alliance to predict
treatment engagement in a Veterans Affairs substance use disorders
clinic. Psychological Services. Advance online publication. http://dx.doi
Gonçalves, M. M., Ribeiro, A. P., Mendes, I., Matos, M., & Santos, A.
(2011). Tracking novelties in psychotherapy process research: The in-
novative moments coding system. Psychotherapy Research, 21, 497–
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cam-
bridge, MA: MIT Press.
Greenson, R. R. (1965). The working alliance and the transference neurosis.
The Psychoanalytic Quarterly, 34, 155–181.
Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanas-
wamy, A., . . . Webster, D. R. (2016). Development and validation of a
deep learning algorithm for detection of diabetic retinopathy in retinal
fundus photographs. Journal of the American Medical Association, 316,
Hamilton, M. (1960). A rating scale for depression. Journal of Neurology,
Neurosurgery and Psychiatry, 23, 56–62.
Hatcher, R. L., & Gillaspy, J. A. (2006). Development and validation of a
revised short version of the Working Alliance Inventory. Psychotherapy
Research, 16, 12–25.
Haykin, S. S. (2009). Neural networks and learning machines (3rd ed.).
Upper Saddle River, NJ: Pearson.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation
for nonorthogonal problems. Technometrics, 12(1), 55–67.
Hutson, M. (2018). Has artificial intelligence become alchemy? Science,
360, 478.
Imel, Z. E., Barco, J. S., Brown, H. J., Baucom, B. R., Baer, J. S., Kircher,
J. C., & Atkins, D. C. (2014). The association of therapist empathy and
synchrony in vocally encoded arousal. Journal of Counseling Psychology,
61, 146–153.
Imel, Z. E., Caperton, D. D., Tanana, M., & Atkins, D. C. (2017). Technology-enhanced
human interaction in psychotherapy. Journal of Counseling Psychology,
64, 385–393.
Imel, Z. E., Hubbard, R. A., Rutter, C. M., & Simon, G. (2013). Patient-rated
alliance as a measure of therapist performance in two clinical
settings. Journal of Consulting and Clinical Psychology, 81, 154–165.
Imel, Z. E., Pace, B. T., Soma, C. S., Tanana, M., Gibson, J., Hirsch, T.,
. . . Atkins, D. A. (in press). Initial development and evaluation of an
automated, interactive, web-based therapist feedback system for moti-
vational interviewing fidelity. Psychotherapy.
Imel, Z. E., Steyvers, M., & Atkins, D. C. (2015). Computational psychotherapy
research: Scaling up the evaluation of patient-provider interactions.
Psychotherapy, 52, 19–30.
Insel, T. R. (2017). Digital phenotyping. Journal of the American Medical
Association, 318, 1215–1216.
Johns, R. G., Barkham, M., Kellett, S., & Saxon, D. (2019). A systematic
review of therapist effects: A critical narrative update and refinement to
review. Clinical Psychology Review, 67, 78–93.
Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspec-
tives, and prospects. Science, 349, 255–260.
Jurafsky, D., & Martin, J. H. (2014). Speech and language processing (2nd
ed.). London, UK: Pearson.
Lambert, M. J., & Barley, D. E. (2001). Research summary on the therapeutic
relationship and psychotherapy outcome. Psychotherapy: Theory, Research,
Practice, Training, 38, 357–361.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). Big data. The
parable of Google Flu: Traps in big data analysis. Science, 343, 1203–
Lutz, W., Schwartz, B., Hofmann, S. G., Fisher, A. J., Husen, K., & Rubel,
J. A. (2018). Using network analysis for the prediction of treatment
dropout in patients with mood and anxiety disorders: A methodological
proof-of-concept study. Scientific Reports, 8, 7819.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. (2013).
Distributed representations of words and phrases and their composition-
ality. Proceedings of Advances in Neural Information Processing Sys-
tems, 26, 3111–3119.
Miller, W. R., Moyers, T. B., Ernst, D., & Amrhein, P. (2003). Manual for
the motivational interviewing skills code v. 2.0. Retrieved from http://
Miner, A. S., Milstein, A., & Hancock, J. T. (2017). Talking to machines about personal mental health problems. Journal of the American Medical Association, 318, 1217–1218.
Mitchell, T. M. (1997). Does machine learning really work? AI Magazine, 18, 11–20.
Mjolsness, E., & DeCoste, D. (2001). Machine learning for science: State of the art and future prospects. Science, 293, 2051–2055. http://dx.doi
Moore, E., II, Clements, M. A., Peifer, J. W., & Weisser, L. (2008). Critical analysis of the impact of glottal features in the classification of clinical depression in speech. IEEE Transactions on Biomedical Engineering, 55, 96–107.
Murphy, K. P. (2012). Machine learning: A probabilistic perspective. Cambridge, MA: MIT Press.
Okada, S., Ohtake, Y., Nakano, Y. I., Hayashi, Y., Huang, H. H., Takase, Y., & Nitta, K. (2016). Estimating communication skills using dialogue acts and nonverbal features in multiple discussion datasets. Proceedings of the 18th ACM International Conference on Multimodal Interaction (pp. 169–176). New York, NY: ACM.
Olfson, M., & Marcus, S. C. (2010). National trends in outpatient psychotherapy. The American Journal of Psychiatry, 167, 1456–1463. http://
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716.
Pagliardini, M., Gupta, P., & Jaggi, M. (2017). Unsupervised learning of sentence embeddings using compositional n-gram features. CoRR, arXiv:1703.02507. Retrieved from
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. (2015). The development and psychometric properties of LIWC2015. Austin: University of Texas at Austin. Technical Report.
Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). Stroudsburg, PA: Association for Computational Linguistics.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., . . . Silovsky, J. (2011). The Kaldi speech recognition toolkit. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Big Island, Hawaii: IEEE Signal Processing Society.
Python Software Foundation. (2019). Python language reference (Version 3.7.2) [Computer software]. Retrieved from
R Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from
Russell, S. J., & Norvig, P. (2016). Artificial intelligence: A modern approach (3rd ed.). Essex, UK: Pearson Education.
Salton, G., & McGill, M. J. (1986). Introduction to modern information retrieval. New York, NY: McGraw-Hill.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379–423.
Shatte, A. B. R., Hutchinson, D. M., & Teague, S. J. (2019). Machine learning in mental health: A scoping review of methods and applications. Psychological Medicine, 49, 1426–1448.
Stead, W. W. (2018). Clinical implications and challenges of artificial intelligence and deep learning. Journal of the American Medical Association, 320, 1107–1108.
Stone, P. J., Bales, R. F., Namenwirth, J. Z., & Ogilvie, D. M. (1962). The general inquirer: A computer system for content analysis and retrieval based on the sentence as a unit of information. Behavioral Science, 7, 484–498.
Substance Abuse and Mental Health Services Administration. (2014). Projections of national expenditures for treatment of mental and substance use disorders, 2010–2020. Rockville, MD: Author.
Tao, K. W., Owen, J., Pace, B. T., & Imel, Z. E. (2015). A meta-analysis of multicultural competencies and psychotherapy process and outcome. Journal of Counseling Psychology, 62, 337–350.
Thompson, M. N., Goldberg, S. B., & Nielsen, S. L. (2018). Patient financial distress and treatment outcomes in naturalistic psychotherapy. Journal of Counseling Psychology, 65, 523–530.
Tichenor, V., & Hill, C. E. (1989). A comparison of six measures of working alliance. Psychotherapy: Theory, Research, Practice, Training, 26, 195–199.
Tracey, T. J. G., Wampold, B. E., Lichtenberg, J. W., & Goodyear, R. K. (2014). Expertise in psychotherapy: An elusive goal? American Psychologist, 69, 218–229.
Tryon, G. S., Blackwell, S. C., & Hammel, E. F. (2008). The magnitude of client and therapist working alliance ratings. Psychotherapy: Theory, Research, Practice, Training, 45, 546–551.
Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59, 433–460.
Wandersman, A., Duffy, J., Flaspohler, P., Noonan, R., Lubell, K., Stillman, L., . . . Saul, J. (2008). Bridging the gap between prevention research and practice: The interactive systems framework for dissemination and implementation. American Journal of Community Psychology, 41(3–4), 171–181.
Wang, R., Aung, M. S., Abdullah, S., Brian, R., Campbell, A. T., Choudhury, T., . . . Tseng, V. W. (2016, September). CrossCheck: Toward passive sensing and detection of mental health changes in people with schizophrenia. Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (pp. 886–897). New York, NY: Association for Computing Machinery.
Webb, C. A., DeRubeis, R. J., & Barber, J. P. (2010). Therapist adherence/competence and treatment outcome: A meta-analytic review. Journal of Consulting and Clinical Psychology, 78, 200–211.
Whiteford, H. A., Degenhardt, L., Rehm, J., Baxter, A. J., Ferrari, A. J., Erskine, H. E., . . . Vos, T. (2013). Global burden of disease attributable to mental and substance use disorders: Findings from the Global Burden of Disease Study 2010. The Lancet, 382, 1575–1586.
Xiao, B., Huang, C., Imel, Z. E., Atkins, D. C., Georgiou, P., & Narayanan, S. S. (2016). A technology prototype system for rating therapist empathy from audio recordings in addiction counseling. PeerJ Computer Science, 2, e59.
Zilcha-Mano, S. (2017). Is the alliance really therapeutic? Revisiting this question in light of recent methodological advances. American Psychologist, 72, 311–325.
Zilcha-Mano, S., & Errázuriz, P. (2017). Early development of mechanisms of change as a predictor of subsequent change and treatment outcome: The case of working alliance. Journal of Consulting and Clinical Psychology, 85, 508–520.
Received March 9, 2019
Revision received July 8, 2019
Accepted August 8, 2019
Natural language processing (NLP) is a subfield of machine learning that may facilitate the evaluation of therapist–client interactions and provide feedback to therapists on client outcomes on a large scale. However, there have been limited studies applying NLP models to client-outcome prediction that have (a) used transcripts of therapist–client interactions as direct predictors of client-symptom improvement, (b) accounted for contextual linguistic complexities, and (c) used best practices in classical training and test splits in model development. Using 2,630 session recordings from 795 clients and 56 therapists, we developed NLP models that directly predicted client symptoms of a given session based on session recordings of the previous session (Spearman’s ρ = .32, p < .001). Our results highlight the potential for NLP models to be implemented in outcome-monitoring systems to improve quality of care. We discuss implications for future research and applications.
Full-text available
The Cognitive Therapy Rating Scale (CTRS) is an observer-rated measure of cognitive behavioral therapy (CBT) treatment fidelity. Although widely used, the factor structure and psychometric properties of the CTRS are not well established. Evaluating the factorial validity of the CTRS may increase its utility for training and fidelity monitoring in clinical practice and research. The current study used multilevel exploratory factor analysis to examine the factor structure of the CTRS in a large sample of therapists (n = 413) and observations (n = 1,264) from community-based CBT training. Examination of model fit and factor loadings suggested that three within-therapist factors and one between-therapist factor provided adequate fit and the most parsimonious and interpretable factor structure. The three within-therapist factors included items related to (a) session structure, (b) CBT-specific skills and techniques, and (c) therapeutic relationship skills, although three items showed some evidence of cross-loading. All items showed moderate to high loadings on the single between-therapist factor. Results support continued use of the CTRS and suggest factors that may be a relevant focus for therapists, trainers, and researchers.
Full-text available
Measurement-based care (MBC) can improve mental health treatment outcomes and is a priority within the Department of Veterans Affairs (VA). However, to date, MBC efforts within the VA have focused on assessment of psychological symptoms to the exclusion of psychotherapy process variables such as the therapeutic alliance that may predict treatment response. This quality improvement project involved the implementation of routine monitoring of alliance within a VA substance use disorder (SUD) clinic predominantly serving veterans with serious mental illness. Alliance ratings were provided by 98 veterans following group therapy sessions. Low alliance ratings were used by the clinicians (n = 4) leading the groups (n = 9) as opportunities to discuss veterans' treatment experience and increase engagement. Using multilevel models that accounted for the nested nature of the data and veteran demographics, alliance ratings showed a small increase over time (B = 0.075, p < .001, f2 = 0.033). In addition, maximum alliance rating (i.e., patients' highest rating of alliance across all observations) was significantly but modestly associated with attendance at both MBC group sessions and all SUD-related visits in the 3 months following the initial alliance rating (Bs = 0.96 and 1.79; ps = .006 and .004; f2s = 0.079 and 0.088, respectively). Average alliance rating, however, was not associated with treatment attendance (ps > .050). Findings suggest that assessment of alliance is feasible within a VA SUD clinic and may provide information signaling risk for disengagement that could be used for increasing treatment engagement. (PsycINFO Database Record (c) 2019 APA, all rights reserved).
Full-text available
Here we summarize recent progress in machine learning for the chemical sciences. We outline machine-learning techniques that are suitable for addressing research questions in this domain, as well as future directions for the field. We envisage a future in which the design, synthesis, characterization and application of molecules and materials is accelerated by artificial intelligence.
Full-text available
Although psychotherapy is on the whole an effective health care practice, treatment efficacy for patients with varying levels of reported financial distress is less clear. The purpose of this study was to examine the impact of patient self-reported financial distress on psychotherapy outcomes using a large, naturalistic psychotherapy dataset of college students who sought psychotherapy services (n = 5,078 patients, n = 238 therapists). Multilevel models accounted for the nesting of patients within therapists and treatment outcome was assessed using the Outcome Questionnaire-45. Patients on the whole showed treatment effects in the moderate to large range (d = 0.73). However, patients with higher financial distress at baseline were more likely to drop out of treatment after 1 session and, when controlling for baseline severity, had worse outcomes at the end of treatment. Though the effects were small, these findings held when controlling for age, gender, and treatment length. Further, the relationship between baseline financial distress and treatment retention (but not treatment outcome) varied between therapists, though the effects were also small. Patients’ financial distress specifically and social class more generally may be patient contributors to psychotherapy outcome (and therapist effects) that warrant further attention.
Full-text available
Machine learning needs more rigor, scientists argue.
Background This paper aims to synthesise the literature on machine learning (ML) and big data applications for mental health, highlighting current research and applications in practice. Methods We employed a scoping review methodology to rapidly map the field of ML in mental health. Eight health and information technology research databases were searched for papers covering this domain. Articles were assessed by two reviewers, and data were extracted on the article's mental health application, ML technique, data type, and study results. Articles were then synthesised via narrative review. Results Three hundred papers focusing on the application of ML to mental health were identified. Four main application domains emerged in the literature, including: (i) detection and diagnosis; (ii) prognosis, treatment and support; (iii) public health, and; (iv) research and clinical administration. The most common mental health conditions addressed included depression, schizophrenia, and Alzheimer's disease. ML techniques used included support vector machines, decision trees, neural networks, latent Dirichlet allocation, and clustering. Conclusions Overall, the application of ML to mental health has demonstrated a range of benefits across the areas of diagnosis, treatment and support, research, and clinical administration. With the majority of studies identified focusing on the detection and diagnosis of mental health conditions, it is evident that there is significant room for the application of ML to other areas of psychology and mental health. The challenges of using ML techniques are discussed, as well as opportunities to improve and advance the field.
Artificial intelligence (AI) and deep learning are entering the mainstream of clinical medicine. For example, in December 2016, Gulshan et al¹ reported development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. An accompanying editorial by Wong and Bressler² pointed out limits of the study, the need for further validation of the algorithm in different populations, and unresolved challenges (eg, incorporating the algorithm into clinical work flows and convincing clinicians and patients to “trust a ‘black box’”). Sixteen months later, the Food and Drug Administration (FDA)³ permitted marketing of the first medical device to use AI to detect diabetic retinopathy. FDA reduced the risk of releasing the device by limiting the indication for use to screening adults who do not have visual symptoms for greater than mild retinopathy, to refer them to an eye care specialist.
Objective: To review the therapist effects literature since Baldwin and Imel's (2013) review. Method: Systematic literature review of three databases (PsycINFO, PubMed and Web of Science) replicating Baldwin and Imel (2013) search terms. Weighted averages of therapist effects (TEs) were calculated, and a critical narrative review of included studies conducted. Results: Twenty studies met inclusion criteria (3 RCTs; 17 practice-based) with 19 studies using multilevel modeling. TEs were found in 19 studies. The TE range for all studies was 0.2% to 29% (weighted average = 5%). For RCTs, the range was 1% to 29% (weighted average = 8.2%). For practice-based studies, the range was 0.2% to 21% (weighted average = 5%). The university counseling subsample yielded a lower TE (2.4%) than in other groupings (i.e., primary care, mixed clinical settings, and specialist/focused settings). Therapist sample sizes remained lower than recommended, and few studies appeared to be designed specifically as TE studies, with too few examples of maximising the research potential of large routine patient datasets. Conclusions: Therapist effects are a robust phenomenon although considerable heterogeneity exists across studies. Patient severity appeared related to TE size. TEs from RCTs were highly variable. Using an overall therapist effects statistic may lack precision, and TEs might be better reported separately for specific clinical settings.