Machine Learning and Natural Language Processing in Psychotherapy
Research: Alliance as Example Use Case
Simon B. Goldberg
University of Wisconsin–Madison
Nikolaos Flemotomos and Victor R. Martinez
University of Southern California
Michael J. Tanana and Patty B. Kuo
University of Utah
Brian T. Pace
University of Utah and Veterans Affairs Palo Alto Health
Care System, Palo Alto, California
Jennifer L. Villatte
University of Washington
Panayiotis G. Georgiou
University of Southern California
Jake Van Epps and Zac E. Imel
University of Utah
Shrikanth S. Narayanan
University of Southern California
David C. Atkins
University of Washington
Artificial intelligence generally and machine learning specifically have become deeply woven into
the lives and technologies of modern life. Machine learning is dramatically changing scientific
research and industry and may also hold promise for addressing limitations encountered in mental
health care and psychotherapy. The current paper introduces machine learning and natural language
processing as related methodologies that may prove valuable for automating the assessment of
meaningful aspects of treatment. Prediction of therapeutic alliance from session recordings is used
as a case in point. Recordings from 1,235 sessions of 386 clients seen by 40 therapists at a university
counseling center were processed using automatic speech recognition software. Machine learning
algorithms learned associations with client ratings of therapeutic alliance exclusively from session linguistic content. Using a portion of the data to train the model, machine learning algorithms modestly predicted alliance ratings from session content in an independent test set (Spearman’s ρ = .15, p < .001). These results highlight the potential to harness natural language processing and machine learning to predict a key psychotherapy process variable that is relatively distal from linguistic content. Six practical suggestions for conducting psychotherapy research using machine learning are presented along with several directions for future research. Questions of dissemination and implementation may be particularly important to explore as machine learning improves in its ability to automate assessment of psychotherapy process and outcome.
Editor’s Note. Sigal Zilcha-Mano served as the action editor for this
article.—DMK Jr.
Simon B. Goldberg, Department of Counseling Psychology, Univer-
sity of Wisconsin–Madison; Nikolaos Flemotomos, Department of Elec-
trical Engineering, University of Southern California; Victor R. Martinez,
Department of Computer Science, University of Southern California; Mi-
chael J. Tanana, College of Social Work, University of Utah; Patty B. Kuo,
Department of Educational Psychology, University of Utah; Brian T. Pace,
Department of Educational Psychology, University of Utah, and Veterans
Affairs Palo Alto Health Care System, Palo Alto, California; Jennifer L.
Villatte, Department of Psychiatry and Behavioral Sciences, University of
Washington; Panayiotis G. Georgiou, Department of Electrical Engineering,
University of Southern California; Jake Van Epps, University of Utah Coun-
seling Center, University of Utah; Zac E. Imel, Department of Educational
Psychology, University of Utah; Shrikanth S. Narayanan, Department of
Electrical Engineering, University of Southern California; David C. Atkins,
Department of Psychiatry and Behavioral Sciences, University of Washington.
Michael J. Tanana, David C. Atkins, Shrikanth S. Narayanan, and Zac E.
Imel are cofounders with equity stake in a technology company, Lyssn.io,
focused on tools to support training, supervision, and quality assurance of
psychotherapy and counseling. Shrikanth S. Narayanan is chief scientist
and co-founder with equity stake of Behavioral Signals, a technology
company focused on creating technologies for emotional and behavioral
machine intelligence. The remaining authors report no conflicts of interest.
Portions of the data presented in this article were reported at the North
American Society for Psychotherapy Research meeting in Park City, UT in
September 2018. Funding was provided by the National Institutes of
Health/National Institute on Alcohol Abuse and Alcoholism (Award R01/
AA018673). Support for this research was also provided by the University
of Wisconsin-Madison, Office of the Vice Chancellor for Research and
Graduate Education with funding from the Wisconsin Alumni Research
Foundation.
Correspondence concerning this article should be addressed to Simon B.
Goldberg, Department of Counseling Psychology, University of Wisconsin–
Madison, 335 Education Building, 1000 Bascom Mall, Madison, WI 53703.
E-mail: sbgoldberg@wisc.edu
Journal of Counseling Psychology
© 2020 American Psychological Association 2020, Vol. 67, No. 4, 438–448
ISSN: 0022-0167 http://dx.doi.org/10.1037/cou0000382
Public Significance Statement
Our study suggests that client-rated therapeutic alliance can be predicted using session content
through machine learning models, albeit modestly.
Keywords: machine learning, natural language processing, methodology, artificial intelligence,
therapeutic alliance
Supplemental materials: http://dx.doi.org/10.1037/cou0000382.supp
New directions in science are launched by new tools much more than
by new concepts. The effect of a concept-driven revolution is to
explain old things in new ways. The effect of a tool-driven revolution is to discover new things that have to be explained. (Freeman Dyson, 1998, pp. 50–51)
Whether or not we know it, and certainly whether or not we like
it, machine learning (ML) is transforming modern life. From eerily
prescient Google search suggestions or Amazon product recom-
mendations to iPhones capable of understanding spoken language
(i.e., Siri), ML undergirds many of the most commonplace tech-
nologies of industrialized society. Manifestations range from the
seemingly benign or mundane to the perhaps more pernicious (e.g.,
targeted advertising). These contemporary conveniences are based
on a family of quantitative methods that are rapidly changing
science and technology and fall under the general umbrella of
artificial intelligence. The term artificial intelligence has been
defined as “the study of agents that receive percepts from the
environment and perform actions” (Russell & Norvig, 2016, p. viii). Early work on artificial intelligence dates back to the 1950s
(e.g., Turing, 1950). ML combines pattern recognition and statis-
tical inference and plays an integral role within the inner workings
of artificial intelligence. ML can be defined as “the study of
computer algorithms capable of learning to improve their perfor-
mance of a task on the basis of their own previous experience”
(Mjolsness & DeCoste, 2001, p. 2051).
The ways that ML has impacted scientific research and industry are hard to overstate (Jordan & Mitchell, 2015; Mjolsness & DeCoste, 2001; Stead, 2018). Evidence for the widespread relevance of ML
dates back several decades (e.g., detecting fraudulent credit card
transactions; Mitchell, 1997). More recent ML-based innovations
in medicine include detection of diabetic retinopathy (Gulshan et al.,
2016), informing cancer treatment decision making (Bibault, Gi-
raud, & Burgun, 2016), and predicting disease outbreak (Chen,
Hao, Hwang, Wang, & Wang, 2017). Innovations based on ML are
occurring in basic science as well (e.g., materials science; Butler,
Davies, Cartwright, Isayev, & Walsh, 2018). While not all ML
applications in science and technology have gone smoothly (e.g.,
Google Flu consistently overestimating flu occurrence; Lazer,
Kennedy, King, & Vespignani, 2014), the potential is unequivocal.
Efforts to apply ML within mental health care are also underway
(for a recent scoping review, see Shatte, Hutchinson, & Teague,
2019). Examples include the use of passive sensing to predict
psychosis (e.g., data collected from sensors built into modern
smartphones; Insel, 2017;Wang et al., 2016), analysis of speech
signals to infer symptoms of depression (France, Shiavi, Silver-
man, Silverman, & Wilkes, 2000;Moore, Clements, Peifer, &
Weisser, 2008), prediction of treatment dropout from ecological
momentary assessment (Lutz et al., 2018), and the use of conver-
sational agents (i.e., computers) for clinical assessment and even
treatment (Miner, Milstein, & Hancock, 2017). While not incor-
porated in most settings, these ML-based innovations could dra-
matically change how mental health treatment and psychotherapy,
in particular, is provided. Importantly, once an ML algorithm has
been appropriately trained, it can be deployed at scale without
additional human judgment.
The Need for Innovation in Psychotherapy
Psychotherapy is in need of innovation. For one, mental health
care matters: mental health conditions are extremely common and
associated with enormous economic and social costs (Substance
Abuse and Mental Health Services Administration, 2014;Whit-
eford et al., 2013). Psychotherapy is a frontline treatment approach
(Cuijpers et al., 2014), with efficacy similar to psychotropic med-
ications and with potentially longer lasting benefits and fewer side
effects (Berwian, Walter, Seifritz, & Huys, 2017). Yet despite
enormous investment in psychotherapy in terms of therapist and
client time and health care dollars (Olfson & Marcus, 2010), what
actually happens in psychotherapy is largely unknown (i.e., is
unobserved). Psychotherapy research remains heavily reliant on
retrospective client or therapist self-report (e.g., Elliott, Bohart,
Watson, & Murphy, 2018;Flückiger, Del Re, Wampold, & Hor-
vath, 2018), limiting our understanding of actual therapist-client
interactions that drive treatment. We do know that treatment out-
comes vary widely, related to client (Lambert & Barley, 2001;
Thompson, Goldberg, & Nielsen, 2018), therapist (Baldwin &
Imel, 2013;Johns, Barkham, Kellett, & Saxon, 2019), relationship
(e.g., therapeutic alliance; Flückiger et al., 2018), and treatment-
specific factors.
One source of variability may be treatment quality. To date,
however, there are no established and routinely implemented
methods for quality control. The absence of quality control limits
clinical training, supervision, and the development of therapist
expertise (Tracey, Wampold, Lichtenberg, & Goodyear, 2014);
decreases the ability to demonstrate quality to payers (Fortney et
al., 2017); slows scientific progress in determining which treat-
ments are likely to succeed and why; and restricts efforts to
improve service delivery (Fairburn & Cooper, 2011). For these
reasons, psychotherapy researchers have developed numerous ob-
server rating systems to evaluate aspects of treatment quality (e.g.,
adherence and competence; Goldberg, Baldwin, et al., 2019;
Webb, DeRubeis, & Barber, 2010). Behavioral coding has been
invaluable in allowing researchers to understand what occurs in the
moment between therapists and clients that may contribute to
therapeutic change. However, human-coded rating systems are
labor intensive, expensive to implement, and not widely used in
community-based therapy (Fairburn & Cooper, 2011). Clients may
also be asked to provide evaluation of treatment quality (e.g.,
measures of satisfaction, therapeutic alliance; Flückiger et al.,
2018). While these measures are robust predictors of outcome (Flückiger et al., 2018), their regular use increases burden on clients and providers, is at risk for response set biases (e.g., social desirability) and random error, and is subject to known psychometric limitations (e.g., ceiling effects; Tryon, Blackwell, & Hammel, 2008).
The New Tools of Psychotherapy Research
Recent methodological advances may be quickly changing our
ability to process the complex data of psychotherapy (Imel, Cap-
erton, Tanana, & Atkins, 2017) and could allow automated assess-
ment of treatment quality along with other outcome and process
variables. Two related innovations include the development of
natural language processing (NLP) and ML. As spoken language
forms a key component of most psychotherapies, the ability to
rapidly and reliably process speech (or text) data may allow
routine assessment of treatment quality and evaluation of numer-
ous other constructs of interest. Several recent proof-of-concept
examples have appeared in the literature, including using NLP and
ML to reliably code motivational interviewing treatment fidelity
(Atkins, Steyvers, Imel, & Smyth, 2014;Imel et al., in press), to
differentiate classes of psychotherapy (e.g., cognitive-behavioral
therapy and psychodynamic psychotherapy; Imel, Steyvers, &
Atkins, 2015), and to identify linguistic behaviors of effective
counselors in text-based crisis counseling (Althoff, Clark, & Les-
kovec, 2016).
The current study extends these efforts further by employing
NLP and ML to predict one of the most studied process variables
in psychotherapy: the therapeutic alliance (Flückiger et al., 2018).
This was examined within the context of a large, naturalistic
psychotherapy dataset drawn from a university counseling center.
Session recordings were available for 1,235 sessions of 386 clients
seen by 40 therapists. NLP and ML methods were used to predict
client-rated alliance from session recordings.
Alliance is used as a test case to demonstrate the potential
applicability of NLP and ML for several reasons. First, alliance is
important for effective psychotherapy, based on its robust relation-
ship with outcome (Flückiger et al., 2018). Second, alliance, unlike
other more objective linguistic features (e.g., ratio of open and
closed questions in motivational interviewing adherence coding;
Miller, Moyers, Ernst, & Amrhein, 2003), requires a potentially
higher order of processing to assess (e.g., through the cognitive
and affective system of a client, therapist, or observer providing
alliance ratings). This additional level of abstraction likely makes
automated prediction more difficult, but also more widely relevant
if it can be accomplished. Third, alliance represents a relatively old
concept (Bordin, 1979;Greenson, 1965) that may be less viable for
concept-driven innovations (Dyson, 1998). New tools, however,
could drive innovation in this area. There are also important open
questions related to alliance, such as the proportion and cause of
therapist and client contributions to alliance (Baldwin, Wampold,
& Imel, 2007), the source of unreliability in alliance ratings across
rating perspectives (i.e., client, therapist, and observer; Tichenor &
Hill, 1989), the state- versus traitlike qualities of alliance (Zilcha-
Mano, 2017), the potentially causal nature of alliance as a driver of
symptom change (Falkenström, Granström, & Holmqvist, 2013;
Flückiger et al., 2018;Zilcha-Mano & Errázuriz, 2017), and ways
to include alliance assessment in routine clinical care without
increasing participant burden (Duncan et al., 2003;Goldberg,
Rowe, et al., 2019). While NLP and ML are likely not a panacea for
resolving all outstanding debates regarding alliance, they may be
useful research tools. Theoretically, these questions could be ad-
dressed more thoroughly if ML enabled alliance assessment on a
much larger scale, particularly if ML models were built in a way
to minimize construct irrelevant variance (e.g., social desirability).
Ultimately, assessment of alliance could be automated using ML,
providing clients and therapists with ongoing information about
this aspect of therapeutic process without the drawbacks (e.g., time
required, psychometric issues) of repeated self-report assessment.
Such technology could also be used to assess alliance directly from
session transcripts or recordings.
Prior to presenting a preliminary attempt at assessing alliance
using NLP and ML, it is worth introducing basic concepts involved
in each methodology. This is, of course, intended to be a cursory
treatment and interested readers are encouraged to review sources
cited below.
Basics of NLP
NLP is a subfield of computer science and linguistics focused on
the interaction between machines and humans through language
(Jurafsky & Martin, 2014). NLP aims to understand human com-
munication by processing and analyzing large quantities of textual
data. Popular applications of NLP include machine translation
(e.g., Google Translate), question-answering systems, or sentiment
analysis (e.g., extraction of sentiments within social media).
Typically, NLP applications start with a collection of raw text
documents (i.e., a language corpus). From this corpus, the first step
is to extract or estimate quantitative features from the text. One of
the most widely used NLP features is the bag-of-words represen-
tation (BoW). In BoW, each document is represented by counts of
its unique words, without regard to the ordering of these words.
Conceptually, BoW is a large crosstabulation table of words by
documents. Other common text features include N-grams (Shannon, 1948), which are short multiword phrases with N elements (e.g., bigrams include 2-word phrases); dictionary-based features, such as those provided by Linguistic Inquiry and Word Count (LIWC; Pennebaker, Boyd, Jordan, & Blackburn, 2015) or the
General Inquirer (Stone, Bales, Namenwirth, & Ogilvie, 1962);
and dialogue acts (Okada et al., 2016), which try to capture a
high-level interaction between participants in a conversation (i.e.,
“statement,” “question,” etc.). More recently, linguistic units are converted to a vector-space representation of either word (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013; Pennington, Socher, & Manning, 2014) or sentence (Pagliardini, Gupta, & Jaggi, 2017) embeddings, which capture semantic context. Words (or sentences) that appear in similar contexts appear closer to each other in vector space, and semantic relationships are represented by the operations of addition and subtraction (e.g., v(king) − v(man) + v(woman) ≈ v(queen), where v(w) represents the vector for word w).
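To make the bag-of-words and tf-idf ideas above concrete, the following minimal sketch applies scikit-learn's vectorizers to a tiny invented corpus. The example sentences, and the use of scikit-learn at this point in the exposition, are illustrative assumptions rather than part of the study's pipeline.

```python
# Illustrative only: bag-of-words and tf-idf features for a tiny, invented corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "I feel my therapist really understands me",
    "we are working towards mutually agreed upon goals",
    "I am not sure this approach is working for me",
]

# Bag-of-words: each document becomes a vector of word counts, ignoring word order.
bow = CountVectorizer()
X_counts = bow.fit_transform(corpus)   # sparse document-by-word count matrix
print(bow.get_feature_names_out())     # the vocabulary (columns of the matrix)

# Unigrams and bigrams weighted by tf-idf, so terms common across the whole
# corpus receive less weight than terms concentrated in a few documents.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.shape)                   # documents x (unigram + bigram) features
```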
Basics of ML
The human brain has a remarkable ability to learn and recognize
patterns from its surrounding environment. ML comprises a set of
computational techniques simulating this capability (Haykin,
2009). As opposed to knowledge-based approaches, where a hu-
man designs an algorithm having specific rules in mind, ML is
typically based on data-driven methods and on statistical inference.
ML algorithms derive prediction rules from (typically) large
amounts of data.
Two major paradigms in ML are unsupervised and supervised
learning (Murphy, 2012). Similar to cluster analysis, unsupervised
learning does not involve an outcome to predict but rather focuses
on finding structure within a given set of data. Supervised learning
is similar to regression modeling, in which an outcome (either
discrete or continuous) is associated with a set of input data, and
the ML algorithm is tasked with finding an optimal mapping
function between the input data and the outcome (e.g., linking
linguistic content with alliance ratings). Once such a mapping has
been learned, it can be used to predict outcomes for new data.
Since the goal of ML is to apply the algorithm on previously
unseen data, ML analyses train algorithms on a subset of “training
data” but are evaluated on a separate subset of “test data.” Typical
supervised learning algorithms include support vector machines,
regularized linear or logistic regression, and decision trees (Mur-
phy, 2012). Recently, there has been rapid development and in-
creased focus on artificial neural networks and deep learning
techniques (Goodfellow, Bengio, & Courville, 2016).
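A short sketch on synthetic data may help readers see the distinction drawn above. The arrays, cluster count, and model choices below are invented for illustration and are not drawn from the study.

```python
# Illustrative contrast between unsupervised and supervised learning on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # 200 hypothetical "sessions", 10 features
y = 0.5 * X[:, 0] + rng.normal(size=200)  # outcome loosely tied to the first feature

# Unsupervised: find structure without any outcome, akin to cluster analysis.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Supervised: learn a mapping from inputs to an outcome on training data,
# then evaluate on held-out test data the model has never seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = Ridge().fit(X_train, y_train)
print(model.score(X_test, y_test))        # R^2 on the unseen test data
```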
Method
Participants and Setting
Data were collected at the counseling center of a large, Western
university. The counseling center provides approximately 10,000
sessions per year, with treatment focused on concerns common
among undergraduate and graduate students (e.g., depression, anx-
iety, substance use, academic concerns, relationship concerns;
Benton, Robertson, Tseng, Newton, & Benton, 2003). Treatment is
provided by a combination of licensed permanent staff (including
social workers, psychologists, and counselors) as well as trainees
pursuing masters- or doctoral-level mental health degrees (e.g.,
masters of social work, doctorate in counseling/clinical psychol-
ogy).
Data were collected between September 11, 2017 and December
11, 2018. Both clients and therapists provided consent for audio
recording of sessions and for use of recordings for the current
study. Recordings were made from microphones installed in clinic
offices and archived on clinic servers. Two microphones were
hung from the ceiling in each room. One cardioid choir mic was
hung to capture voice anywhere in the room and a second choir
mic pointed in the direction where the therapist generally sits. In
order for sessions to be recorded, clinicians had to start and stop
recordings (i.e., sessions were not recorded automatically). All
recordings were from individual therapy sessions (approximately
50 min in length). All audio recordings with associated alliance
ratings were used (i.e., no exclusions were made). Alliance is
assessed routinely in the clinic, with no standardized instructions
regarding how therapists use these ratings in therapy.
The current study was integrated into the partner clinic with
minimum modifications to the existing clinic workflow. One fea-
ture of the workflow is collecting alliance ratings prior to sessions,
rather than asking clients to complete measures both before (e.g.,
symptom ratings) and after (e.g., alliance ratings) session. When
making alliance ratings prior to session, clients were asked to
reflect on their experience of alliance at their previous session (i.e.,
time −1). In all models, alliance ratings were associated with the
session they were intended to represent (e.g., ratings made prior to
Session 2 were associated with Session 1). No alliance ratings
were made prior to the initial session. Study procedures were
approved by the relevant institutional review board.
Clients were, on average, 23.77 years old (SD = 4.86). The majority of the sample identified as female (n = 214, 55.4%), with the remainder identifying as male (n = 158), nonbinary (n = 5), genderqueer (n = 1), gender neutral (n = 3), female-to-male transgender (n = 1), and questioning (n = 2), with two choosing not to respond. The client sample predominantly identified as White (n = 294, 76.2%), with the remainder identifying as Latinx (n = 33), Asian American (n = 28), African American (n = 5), Pacific Islander (n = 2), Middle Eastern (n = 1), and multiracial (n = 21), with two choosing not to respond.
Demographic data were available from 26 of the 40 included therapists. Therapists were, on average, 35.15 years old (SD = 14.04). The majority identified as female (n = 17, 65.4%), with the remainder identifying as male (n = 7) or genderqueer (n = 1). The majority identified as White (n = 15, 57.7%), with the remainder identifying as Latinx (n = 4), Asian American (n = 3), African American (n = 2), Middle Eastern (n = 1), and multiracial (n = 1).
Measures
Therapeutic alliance was assessed using a previously validated
(Imel, Hubbard, Rutter, & Simon, 2013) four-item version of the
Working Alliance Inventory—Short Form Revised (Hatcher &
Gillaspy, 2006) representing the bond, task, and goal dimensions
of alliance. Items included “_________ and I are working towards
mutually agreed upon goals” (goal), “I believe the way we are
working on my problem is correct” (task), “I feel that _________
appreciates me” (bond), and “_________ really understands me”
(bond). Items were rated on a 1 (Never) to 7 (Always) scale. A total score was computed by averaging across the four items. Internal consistency reliability was adequate in the current sample (α = .90). As noted above, ratings were made prior to each session
(starting with the second session) asking clients to reflect back on
their experience of alliance in the previous session. Although
alliance can be rated from various perspectives (e.g., client, ther-
apist, observer; Flückiger et al., 2018), the current study employed
client-rated alliance due to its robust link with treatment outcome,
ease of data collection, and ecological validity (i.e., the experience
of alliance largely exists in the subjective experience of the client).
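As a concrete illustration of this scoring, the sketch below computes the four-item total score and Cronbach's alpha for a handful of made-up responses. The item values are hypothetical, and the authors' actual scoring code is not reported here.

```python
# Illustrative scoring of the four-item alliance measure; responses are made up.
import numpy as np

items = np.array([
    [6, 7, 6, 7],   # one row per rating occasion, one column per item (1-7 scale)
    [5, 5, 6, 5],
    [7, 7, 7, 6],
    [4, 5, 4, 4],
], dtype=float)

total_score = items.mean(axis=1)   # total score = average of the four items

# Cronbach's alpha from item variances and the variance of the summed score.
k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)
total_var = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(total_score, round(alpha, 2))
```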
Data Analysis
For this study, we used 1,235 recorded sessions together with
client-reported alliance, assessed prior to the subsequent session
occurring between the same therapist and client. Audio recordings
were processed through a speech pipeline to generate automatic
speech-to-text transcriptions. The automatic speech recognition
made use of the open-source, freely available Kaldi software
(Povey et al., 2011). Components of the pipeline along with their
corresponding accuracy (vs. human transcription) using data from
the current study include: (a) a voice activity detector, where speech segments are detected over silence or noise (unweighted average recall = 82.7%); (b) a speaker diarization system, where the speech is clustered into speaker-homogeneous groups (i.e., Speaker A, Speaker B; diarization error rate = 6.4%); (c) a speaker role recognizer, where each group is assigned the label “therapist” or “client” (misclassification rate = 0.0%); and (d) an automatic speech recognizer, which transduces speech to text (word error rate = 36.43%). The modules of the speech pipeline have been
adapted with the Kaldi speech recognition toolkit (Povey et al.,
2011) using psychotherapy sessions provided by the same coun-
seling center, but not used for the alliance prediction, thus not
inducing bias. A similar system architecture is described in Xiao et
al. (2016) and Flemotomos et al. (2019).
Linguistic features were extracted from resulting transcripts,
independently for therapist and client text. We report results using
unigrams and bigrams (i.e., 1- and 2-word pairings) weighted by
the term frequency-inverse document frequency (tf-idf; Salton &
McGill, 1986) or sentence (Sent2vec) embeddings (Pagliardini et
al., 2017). Tf-idf weighting accounts for the frequency with which
words appear within a given document (i.e., session), while also
considering its frequency within the larger corpus of text (i.e., all
sessions). This gives less commonly used words (e.g., suicide) more weight than commonly used words (e.g., the). Thus, less common words are treated as more important. Tf-idf weighting
was calculated across all sessions in the train set and applied to the
test set. As described earlier, Sent2vec maps sentences to vectors
of real numbers. Using Sent2vec, the session is represented as the
mean of its sentence embeddings. Models used linear regression with L2-norm regularization (i.e., ridge regression; Hoerl & Kennard, 1970), a method designed for highly correlated features, as is often the case with NLP data.
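A minimal sketch of this tf-idf-plus-ridge setup is given below. The transcript strings, alliance values, and regularization strength are hypothetical placeholders; the authors' actual syntax appears in the online supplemental materials.

```python
# Hedged sketch of tf-idf features (unigrams and bigrams) feeding a ridge regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

session_texts = [                 # hypothetical per-session transcripts
    "so what brings you in today ...",
    "last week we talked about the exposure exercise ...",
    "i have been feeling more anxious at work ...",
]
alliance = [5.5, 6.0, 4.75]       # hypothetical client-rated alliance (1-7 scale)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams, tf-idf weighted
    Ridge(alpha=1.0),                     # L2-regularized linear regression
)
model.fit(session_texts, alliance)
print(model.predict(["i feel like my therapist really understands me"]))
```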
To estimate the performance of our method, experiments were
run using a 10-fold cross-validation: data is split into 10 parts, with
nine parts used for training at each iteration (train), and one for
evaluation (test). This is commonly used in ML and allows esti-
mation of the extent to which model results based on the training
set (train) will generalize to an independent sample (test). Train
and test sets were constructed so as not to share therapists between
them, as shared therapists could artificially inflate the model’s
accuracy. The algorithm is therefore expected to learn patterns of
words related to alliance ratings in general instead of capitalizing
on therapist-specific characteristics.
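One way to implement such therapist-disjoint folds is scikit-learn's GroupKFold with therapist identifiers as the grouping variable, as sketched below using three folds and invented placeholder data (the study itself used 10 folds and real sessions).

```python
# Sketch of therapist-disjoint cross-validation; data and identifiers are invented.
import numpy as np
from sklearn.model_selection import GroupKFold

texts = np.array(["session one ...", "session two ...", "session three ...",
                  "session four ...", "session five ...", "session six ..."])
alliance = np.array([5.5, 6.0, 4.75, 5.0, 6.5, 5.25])
therapist_ids = np.array([1, 1, 2, 2, 3, 3])   # grouping variable

gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(texts, alliance, groups=therapist_ids):
    # Fit on the train fold and evaluate on the test fold; no therapist
    # contributes sessions to both folds.
    assert set(therapist_ids[train_idx]).isdisjoint(set(therapist_ids[test_idx]))
```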
We employed two commonly used metrics of accuracy: mean squared error (MSE) and Spearman’s rank correlation (ρ). These metrics reflect the accuracy of the ML algorithm when applied to
the test set. Specifically, mean squared error is the average of the
squared differences between the predictions and the true values
and is useful for comparing models, though its absolute value is
not interpretable. Spearman’s rank correlation measures the strength of association between two variables, ranging from −1 to 1, with higher values preferred.
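The two metrics might be computed as in the brief sketch below; the predicted and true values are invented for illustration.

```python
# Illustrative computation of the two evaluation metrics on made-up values.
from scipy.stats import spearmanr
from sklearn.metrics import mean_squared_error

y_true = [5.5, 6.0, 4.75, 5.0, 6.5]   # hypothetical observed alliance ratings
y_pred = [5.2, 5.8, 5.1, 5.4, 6.0]    # hypothetical model predictions

mse = mean_squared_error(y_true, y_pred)   # average squared prediction error
rho, p_value = spearmanr(y_true, y_pred)   # rank-order association, -1 to 1
print(round(mse, 3), round(rho, 3), round(p_value, 3))
```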
Computer Software
Self-report data were processed within the R statistical environ-
ment (R Core Team, 2018). NLP and ML were conducted using the
Python programming language (Python Software Foundation,
2019). Models used the “scikit-learn” toolkit (Pedregosa et al.,
2011) and the “sklearn.linear_model.Ridge” function (Hoerl &
Kennard, 1970; see Table 1 in the online supplemental materials
for syntax). Sent2vec was implemented using the method devel-
oped by Pagliardini et al. (2017) and N-grams obtained using the text feature extraction in “scikit.”¹ The time required for running
the speech pipeline and ML models can vary. In the current data,
the speech pipeline required approximately 30 min per 50-min
session using one core of an AMD Opteron Processor 6276 (2.3
GHz). The 10-fold cross-validation models took approximately 10
min on a MacBook Pro with 2.8 GHz Intel Core i7, 16 GB RAM,
and 2133 MHz LPDDR3.
Results
The sample included a total of 1,235 sessions with recordings and associated alliance ratings (provided at the subsequent session; n = 386 clients; 40 therapists). Clients had, on average, 3.20 sessions in the data set (SD = 2.50, range = 1 to 13) and therapists had 30.88 (SD = 32.97, range = 1 to 131). Sessions represented a variety of points in treatment, with a mean session number of 5.31 (SD = 3.37, range = 1 to 23). Across the 1,235 alliance ratings, the mean rating was 5.47 (SD = 0.83, median = 5.5, range = 1.75 to 6.50; see Figure 1 in the online supplemental
materials). Ratings showed the typical negative skew found in the
assessment of alliance (Tryon et al., 2008).
ML model results are presented in Table 1. Models are shown
using either therapist or client text as the input. Results are also
separated by feature extraction method (tf-idf, Sent2vec). The
baseline model reflects accuracy of the average rating (i.e., 5.47)
and is useful to evaluate model performance.
The predictions of three out of the four models are significantly
better than chance (Spearman’s ρ > .00, p < .01). The model that used therapist text and extracted features using tf-idf performed best overall, with MSE = 0.67 and ρ = 0.15, p < .001. For
illustrative purposes only, we extracted the 15 unigrams/bigrams
that were most positively or negatively correlated with alliance
ratings in our best performing model. As these features represent
only a small portion of the corresponding model, they should not
be viewed as a replacement for the full model. The 15 most
positively correlated unigrams/bigrams were: group, really, hus-
band, right, think, phone, values, maybe, divorce, got, yeah, situ-
ation, um right, don think, max. The 15 most negatively correlated
unigrams/bigrams were: counseling, yeah yeah, going, sure, cop-
ing, just want, friends, motivation, feeling, Monday, huh yeah, oh,
physical, pretty, time.
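For readers curious how such an illustration might be produced, one plausible approach is to correlate each tf-idf feature with the alliance ratings and sort, as sketched below. This is a guess at a workable procedure rather than the authors' documented method, and the transcripts and ratings are hypothetical placeholders.

```python
# Hedged sketch: rank unigram/bigram features by their correlation with alliance.
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import TfidfVectorizer

session_texts = ["we set goals together ...", "i skipped the homework ...",
                 "my husband and i argued ...", "coping has been hard ..."]
alliance = np.array([6.5, 4.5, 5.5, 4.0])   # hypothetical ratings

vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(session_texts).toarray()
terms = vec.get_feature_names_out()

# Correlate each feature column with the alliance ratings, then sort.
corrs = np.array([spearmanr(X[:, j], alliance)[0] for j in range(X.shape[1])])
order = np.argsort(corrs)
print("most negatively correlated:", terms[order[:5]])
print("most positively correlated:", terms[order[-5:]])
```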
¹ Readers interested in working with text data in Python are encouraged to read the “scikit” and Kaldi tutorials (https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html; https://kaldi-asr.org/doc/kaldi_for_dummies.html).
Discussion
The current study introduces two related quantitative methods—
NLP and ML—that have the potential to significantly expand
methodological tools available to psychotherapy researchers and
clinicians. The prediction of client-rated therapeutic alliance from
session recordings was used as a test case for these methods due to
the importance of alliance in psychotherapy and the potential
contribution of technologies able to reliably automate alliance
assessment. Results presented here suggest that ML models mod-
estly predict alliance ratings (ρ = .15). That is to say, there was
linguistic signal indicative of the strength of the alliance that is
detectable through ML, supporting the notion that ML may be a
useful tool for examining alliance in future studies.
It is worth contextualizing these results within the broader field
of speech signal processing and NLP as well as prior work spe-
cifically within the domain of psychotherapy research. An important
feature of the alliance, and part of the motivation to examine alliance,
is its greater degree of abstraction from the actual linguistic context of
a psychotherapy session. Compare alliance with another commonly
studied psychotherapy process variable—motivational interviewing
fidelity codes. Motivational interviewing codes are primarily lin-
guistic in nature (e.g., open vs. closed question; Miller et al., 2003)
and can be reliably coded by trained human raters and ML algo-
rithms at approximately similar levels (e.g., ρs > .75 for use of
open questions over a session of motivational interviewing; Atkins
et al., 2014). Importantly, aspects of motivational interviewing
fidelity that show lower interrater reliability among human raters
(e.g., empathy) are also more difficult to predict via ML (e.g., ρs ≈ .25 for talk turns and .00 for sessions; Atkins et al., 2014).
Alliance, in contrast to most motivational interviewing fidelity
dimensions, requires in-depth processing by humans (i.e., client,
therapist, or observer) and is presumably influenced by a variety of
unobservable, nonlinguistic factors. It is exactly this nonlinguistic,
internal processing that may be more difficult for ML models to
replicate. This highlights a truism of NLP methodologies: behav-
iors more distal from linguistic content that are more difficult for
human raters to rate reliably will also be more difficult for ML
models to predict. This may make even more abstracted aspects of treatment, such as treatment outcome, yet more challenging to predict using ML.
Practical Suggestions
Given these potential limitations, there are six practical consid-
erations offered here that may increase the viability of ML to
contribute to psychotherapy research. Several of these are funda-
mental principles of ML reviewed previously but are worth high-
lighting due to the possibility that many readers may not be
familiar with them.
1. ML may be most promising for predicting observable
linguistic behaviors. For efforts employing ML using text
data, it may be valuable to start with observable behav-
iors that humans can code reliably using only text data
(e.g., treatment fidelity; Atkins et al., 2014). Human
reliability provides an estimate of the upper limit to
reliability likely to be achieved using ML models. Be-
haviors for which humans have difficulty reaching con-
sensus will likely be more challenging for ML models as
well.
2. ML models should be trained using human coding as the
gold standard. Related to the previous suggestion, it may
be prudent to develop ML models based on behaviors
that are observable and to use human-based ratings as the
standard for training ML algorithms. Thankfully, prom-
ising observer-rated measures of alliance and other psy-
chotherapy processes (e.g., empathy, treatment fidelity)
have been developed that may serve as a basis for future
ML psychotherapy research. While this has been done in
previous work on motivational interviewing (Atkins et
al., 2014;Xiao et al., 2016), this was not used in the
current study due both to resource limitations and an
interest in attempting to predict client (rather than ob-
server) ratings. However, ML models could be con-
structed predicting observer-rated alliance, which may be
less prone to client response set biases (e.g., social de-
sirability). While models using human coding as the basis
are a promising starting point, it may also be useful to
develop models attempting to predict more diffuse con-
structs that are not reliably rated by observers (e.g.,
treatment outcome).
3. ML models should be tested using large data sets. One of
the distinct advantages of ML is its potential to process
large amounts of data, an impractical task when using
human coders. However, for the development of reliable
ML algorithms, large amounts of training data are ideal.
The actual amount of data necessary varies widely de-
pending on the nature of the ML task, but data sets of
10,000 cases or more are commonly used in NLP appli-
cations. Given advances in NLP, researchers and clini-
cians who have access to high fidelity session recordings
may be able to convert existing recordings to text data for
ML models.
4. Develop models using a training set and test models
using a test set. Similar to the rationale for employing
Table 1
Results From Machine Learning Prediction Model

Model        Feature extraction method    MSE     ρ       p
Therapist    tf-idf                        .67    .15    <.001
             Sent2vec                     3.34    .08     .003
Client       tf-idf                        .69    .11    <.001
             Sent2vec                     3.67    .01     .800
Baseline     Average                       .69    .00     n/a

Note. Models employed unigrams and bigrams (i.e., 1- and 2-word pairings) and a linear regression with L2-norm regularization (i.e., ridge regression; Hoerl & Kennard, 1970). Models were evaluated using 10-fold cross-validation with nine parts used for model training and one used for evaluation. Therapist = therapist speech; Client = client speech; Baseline = model results if the model always predicts the mean alliance rating (i.e., 5.47); MSE = mean squared error; ρ = Spearman's rank order correlation; tf-idf = term frequency-inverse document frequency weighting based on (inverse) frequency of occurrence within the document and larger corpus; Sent2vec = sentence embeddings used to map sentences to vectors of real numbers.
separate samples for exploratory and confirmatory factor
analysis (Gerbing & Hamilton, 1996), evaluation of ML
algorithms requires separate samples. It is possible to get
perfect accuracy within a training set, but this in no way
indicates that results will be perfectly accurate in a future
data set (i.e., for prediction). The need for separate samples
echoes the need for large data sets when conducting ML.
5. Develop interdisciplinary collaborations. Most psychother-
apy researchers are not trained in ML during graduate
school. As these models depart in some important ways
from traditional quantitative methods used in psychology
(e.g., regression and analysis of variance), it may be vital for
researchers interested in ML to build collaborations with
colleagues more versed in the intricacies of ML. Research-
ers with expertise in processing linguistic data, with back-
grounds in computer science and engineering, for example,
may be ideal complements to the clinical and context ex-
pertise brought by psychologists. Of course, interdisciplin-
ary collaborations involve their own complexity, with re-
searchers working across disciplinary cultures, practices,
and standards.
6. Have reasonable expectations and avoid the risk of “al-
chemy.” A final suggestion is that those interested in pur-
suing ML-based psychotherapy research have reasonable
expectations about the promise of these methods, and the
speed with which they will become viable tools. One con-
cern is that ML-based models simply replicate the human
biases in the patient-rated measures: if the model accurately
learns the human rating, it will also include ceiling effects,
social desirability, and other potentially construct-irrelevant
variance. In addition, it is encouraged that ML not be
viewed as a form of alchemy (Hutson, 2018) in which ML
becomes a quasimagical black box for researchers and con-
sumers of research. ML research, like other research meth-
odologies, is likely to benefit from transparency, humility,
and replication (Open Science Collaboration, 2015) along
with a healthy dose of skepticism.
Future Directions
Consistent with these practice suggestions, future work should
continue to explore important psychotherapy process and outcome
variables using linguistic, paralinguistic (e.g., prosody, pitch), and
nonverbal therapy behaviors. Ideally this is done using large data
sets (e.g., Ns > 10,000 sessions). The current study focused on
alliance, but future work could use similar methods to predict
treatment outcome (e.g., Hamilton Rating Scale of Depression;
Hamilton, 1960), multicultural competence (Tao, Owen, Pace, &
Imel, 2015), empathy (Imel et al., 2014), interpersonal skill (An-
derson, Ogles, Patterson, Lambert, & Vermeersch, 2009), treat-
ment fidelity (e.g., Cognitive Therapy Rating Scale; Creed et al.,
2016;Goldberg, Baldwin, et al., 2019), and other variables previ-
ously assessed using observer ratings (e.g., innovative moments;
Gonçalves, Ribeiro, Mendes, Matos, & Santos, 2011).
Development will also ideally occur in tandem with attention to
measurement and known issues in psychotherapy research. For
example, future work should consider likely bias in the measure-
ment of alliance. Clients whose ratings are invariant across ses-
sions (e.g., consistently provided alliance ratings at the ceiling of
the measure) could be removed from ML models, perhaps even-
tually providing models that better predict the correlates of alliance
(e.g., treatment retention) than self-report. Or ML models could be
used to determine when collecting self-report alliance data would
provide information beyond what analysis of session content could
provide (e.g., models predicting discrepancies between ML-based
and self-report alliance ratings). It also may be worthwhile at-
tempting to predict therapist-level alliance scores using session
content and ratings aggregated across multiple clients.
The current cross-validation design allowed no therapist to
appear in both the train and test sets. Conceptually, this ML
approach is trying to discover a universal model for mapping
language to alliance, and as such, it is the hardest and most
conservative modeling approach. Alternative strategies would al-
low therapists to be in both train and test sets, which allows a
model to learn individual-specific mappings of text to alliance to
support prediction of future alliance scores for either therapist or
client. It could be valuable to explore these additional models in
future work.
Provided ML models continue to improve in their ability to
detect important aspects of psychotherapy, questions of dissemi-
nation and implementation will become increasingly central. Many
potentially valuable technologies have existed for years (e.g.,
models detecting depression symptoms via speech features; France
et al., 2000), yet are not widely implemented. There are, of course,
numerous reasons that innovations may not be adopted, and con-
siderable scholarship focused on precisely this research-to-practice
impasse (e.g., Wandersman et al., 2008). Part of the solution to
bringing ML-based technologies to market may require research-
ers moving outside of the traditional academic boundaries and
developing collaborations with industry. For clinicians and re-
searchers alike, there may be discomfort with the notion of partnering with for-profit entities, given fears of disrupting the objectivity that forms the theoretical backbone of both science and practice (DeAngelis, 2000). While these concerns may be valid,
these partnerships may play a central role in bringing novel tech-
nologies such as those based on ML to the therapists and clients
who could benefit from them.
Gaining buy-in from clinicians is another dissemination and
implementation barrier. Clinician discomfort discussed in relation
to measurement-based care (e.g., Boswell, Kraus, Miller, & Lam-
bert, 2015;Fortney et al., 2017;Goldberg et al., 2016) may very
well be magnified when clinicians are asked to routinely record
therapy sessions. Discomfort may be further magnified knowing
that these recordings will subsequently be analyzed by a computer
algorithm to determine treatment quality, therapeutic alliance, or
outcome. Sensitivity to these and other dissemination and imple-
mentation issues will be crucial for moving this work forward.
A final future direction to mention is the importance of ulti-
mately evaluating whether ML-based feedback— be it focused on
alliance, fidelity, or any other aspect of treatment—actually pro-
vides benefits. The benefit of interest may depend on the stake-
holder: for payers, this may involve demonstrating the quality of
services; for clinicians, this may involve demonstrating improved
client outcomes; and for researchers, this may involve demonstrat-
ing reliability and validity with reduced cost of research team time
and money. It is likely these metrics will ultimately determine
whether ML can transform psychotherapy.
Limitations
While promising, the current study has several important limi-
tations. The first is the relatively modest sample size. While large
by human coding standards, the current number of sessions eval-
uated is well below the samples often used for ML. As noted
previously, ML models improve with larger amounts of training
data. Thus, the available sample size may have reduced the ability
to predict alliance ratings from session recordings.
Another limitation is related to the available speech signal
processing technology. In particular, existing NLP technologies
have known limitations, including inaccuracy in transcription (i.e.,
misinterpreting spoken words) and errors in assigning speech to a
given speaker (i.e., diarization). These factors introduce error
variability into the text data which functions to reduce statistical
power and the accuracy of the ML models.
A third key limitation is related to the assessment of alliance.
For one, ratings were made retrospectively (i.e., about a prior
session). Collecting ratings at time points more distant from the
actual session may have reduced linkages between ratings and
session content and thereby decreased the signal available for
detection (i.e., exerting a conservative rather than liberal bias on
our ability to predict alliance ratings from session content). Sim-
ilarly, there was evidence that alliance ratings in the current study
suffered from range restriction due to the well-documented ceiling
effects for ratings of alliance (Tryon et al., 2008). Range restriction
also may have decreased statistical power and the ability to reli-
ably predict alliance ratings (Cohen, Cohen, West, & Aiken,
2003). For this reason, it may be useful to examine alliance in
other contexts in which ratings may be more variable (e.g., clients
with more severe personality psychopathology). Lastly, alliance
was assessed only by clients. While client ratings are relevant and ecologically valid, accuracy may have been higher when predicting observer-rated alliance, in which case observers and ML algorithms would have access to the same information (i.e., session text).
Clinical Vignette
The algorithm developed in the current study is only a first
attempt at predicting alliance ratings using ML, but these initial
results suggest a potential future for using these technologies in
clinical research and practice. We imagine a future application in
the following vignette. This example indicates how ML-generated
analytics derived directly from the session encounter can be used
as another source of information for the therapist to reflect on their
work and potentially improve the process of therapy.
Sandra is a 43-year-old, married, African American, cisgender
female who has been struggling with social anxiety since adoles-
cence. She is a school librarian and the mother of two teenage
sons. She has recently begun working with a psychologist, Dr.
Martinez, due to “increasing stress and anxiety” at work which is
beginning to spill over into Sandra’s family life. She reports she
has trouble “asserting herself and expressing her needs” at home
and at work.
During the intake session, Dr. Martinez shares with Sandra that
the clinic has been using a recording platform that can provide Dr.
Martinez with information about how therapy might be going, in
particular, feedback on the therapy “relationship.” Sandra provides
her consent for use of the platform. Therapy starts out smoothly,
with Sandra sharing more about the difficulties she is experienc-
ing, which in recent months have included periodic panic attacks
in social situations. Dr. Martinez, who primarily operates from a
cognitive-behavioral therapy perspective, introduces exposure therapy
as a treatment approach for reducing her symptoms.
During the fifth session, Dr. Martinez initiates a conversation
about Sandra’s progress in treatment. Sandra reports that therapy is
going “just fine” and she apologizes for not having had the time to
complete the exposure exercises Dr. Martinez had recommended.
Dr. Martinez reflects that she knows it can be challenging to make
the time for engaging in therapy “homework” and that the expo-
sures themselves can be unpleasant. Sandra quickly assures Dr.
Martinez that she will try to do a better job making time for
exposures.
Throughout the treatment, Dr. Martinez has been reviewing ses-
sions and automated feedback on the quality of her relationship
with Sandra and has noticed that the alliance scores generated by
the system have been low in the past two sessions. Although
Sandra indicated in session that treatment was going fine, the
alliance algorithm was built using observer-rated alliance that is
less contaminated with self-report biases (e.g., social desirability).
Dr. Martinez uses this opportunity to discuss the automated feed-
back with Sandra:
You know Sandra, I was reviewing some feedback I received on our
sessions last week, and it suggested that it might be smart for me to
check in with you again on how things are going. I know you said things are fine, but I can’t help but wonder if there’s something I’m missing. I’d really like to know.
At this point, Sandra notes that she has been having trouble with
Dr. Martinez’s therapeutic approach. Sandra shares that she has
been having significant difficulties in her marriage recently and
has experienced several racial microaggressions at work that have
contributed to her anxiety. Sandra notes that she was hoping to
discuss these events in therapy but was not sure how to bring them
up, given Dr. Martinez’s emphasis on exposure therapy and San-
dra’s difficulty completing her exposure exercises. Dr. Martinez
expresses her appreciation to Sandra for sharing this. They begin
a discussion of ways to refocus treatment to include these addi-
tional dimensions.
Conclusion
The current study introduced, and attempted to demonstrate, ML as a statistical approach that may be relevant for addressing important
questions about psychotherapy. Just as ML is centrally involved in
numerous cultural, technological, and social changes, it may also
play a leading role in future innovation within psychotherapy
research and practice. Our prediction of therapeutic alliance dis-
cussed here is one of several recent examinations of potential
synergy between ML and psychotherapy. As available sample
sizes grow and technology evolves, it may well be that ML
algorithms can be developed to even more reliably detect treatment
features like alliance from session recordings. Clearly such tech-
nologies could dramatically revolutionize training and provision of
clinical services. In a way, these methods, while heavily reliant on
computers and artificial intelligence, may prove crucial in helping
human researchers and clinicians unravel the dizzying complexity
of the human interaction that is psychotherapy.
References
Althoff, T., Clark, K., & Leskovec, J. (2016). Large-scale analysis of
counseling conversations: An application of natural language processing
to mental health. Transactions of the Association for Computational
Linguistics, 4, 463– 476. http://dx.doi.org/10.1162/tacl_a_00111
Anderson, T., Ogles, B. M., Patterson, C. L., Lambert, M. J., & Ver-
meersch, D. A. (2009). Therapist effects: Facilitative interpersonal skills
as a predictor of therapist success. Journal of Clinical Psychology, 65,
755–768. http://dx.doi.org/10.1002/jclp.20583
Atkins, D. C., Steyvers, M., Imel, Z. E., & Smyth, P. (2014). Scaling up the
evaluation of psychotherapy: Evaluating motivational interviewing fi-
delity via statistical text classification. Implementation Science, 9, 49.
http://dx.doi.org/10.1186/1748-5908-9-49
Baldwin, S. A., & Imel, Z. E. (2013). Therapist effects: Findings and
methods. In M. J. Lambert (Ed.), Bergin and Garfield’s handbook of
psychotherapy and behavior change (6th ed., pp. 258 –297). Hoboken,
NJ: Wiley.
Baldwin, S. A., Wampold, B. E., & Imel, Z. E. (2007). Untangling the
alliance-outcome correlation: Exploring the relative importance of ther-
apist and patient variability in the alliance. Journal of Consulting and
Clinical Psychology, 75, 842–852.
Benton, S. A., Robertson, J. M., Tseng, W. C., Newton, F. B., & Benton,
S. L. (2003). Changes in counseling center client problems across 13
years. Professional Psychology: Research and Practice, 34, 66 –72.
http://dx.doi.org/10.1037/0735-7028.34.1.66
Berwian, I. M., Walter, H., Seifritz, E., & Huys, Q. J. (2017). Predicting
relapse after antidepressant withdrawal - a systematic review. Psychological
Medicine, 47, 426 – 437. http://dx.doi.org/10.1017/S0033291716002580
Bibault, J. E., Giraud, P., & Burgun, A. (2016). Big data and machine learning
in radiation oncology: State of the art and future prospects. Cancer Letters,
382, 110 –117. http://dx.doi.org/10.1016/j.canlet.2016.05.033
Bordin, E. S. (1979). The generalizability of the psychoanalytic concept of
the working alliance. Psychotherapy: Theory, Research & Practice, 16,
252–260. http://dx.doi.org/10.1037/h0085885
Boswell, J. F., Kraus, D. R., Miller, S. D., & Lambert, M. J. (2015).
Implementing routine outcome monitoring in clinical practice: Benefits,
challenges, and solutions. Psychotherapy Research, 25, 6 –19. http://dx
.doi.org/10.1080/10503307.2013.817696
Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O., & Walsh, A.
(2018). Machine learning for molecular and materials science. Nature,
559, 547–555. http://dx.doi.org/10.1038/s41586-018-0337-2
Chen, M., Hao, Y., Hwang, K., Wang, L., & Wang, L. (2017). Disease
prediction by machine learning over big data from healthcare commu-
nities. IEEE Access: Practical Innovations, Open Solutions, 5, 8869 –
8879. http://dx.doi.org/10.1109/ACCESS.2017.2694446
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple
regression/correlation analysis for the behavioral sciences (3rd ed.).
Mahwah, NJ: Erlbaum.
Creed, T. A., Frankel, S. A., German, R. E., Green, K. L., Jager-Hyman, S.,
Taylor, K. P.,...Beck, A. T. (2016). Implementation of transdiagnostic
cognitive therapy in community behavioral health: The Beck Commu-
nity Initiative. Journal of Consulting and Clinical Psychology, 84,
1116 –1126. http://dx.doi.org/10.1037/ccp0000105
Cuijpers, P., Sijbrandij, M., Koole, S. L., Andersson, G., Beekman, A. T.,
& Reynolds, C. F., III. (2014). Adding psychotherapy to antidepressant
medication in depression and anxiety disorders: A meta-analysis. World
Psychiatry, 13, 56 – 67. http://dx.doi.org/10.1002/wps.20089
DeAngelis, C. D. (2000). Conflict of interest and the public trust. Journal
of the American Medical Association, 284, 2237–2238. http://dx.doi.org/
10.1001/jama.284.17.2237
Duncan, B. L., Miller, S. D., Sparks, J. A., Claud, D. A., Reynolds, L. R.,
& Johnson, L. D. (2003). The Session Rating Scale: Preliminary psy-
chometric properties of a “working” alliance measure. Journal of Brief
Therapy, 3, 3–12.
Dyson, F. J. (1998). Imagined worlds (Vol. 6). Cambridge, MA: Harvard
University Press.
Elliott, R., Bohart, A. C., Watson, J. C., & Murphy, D. (2018). Therapist
empathy and client outcome: An updated meta-analysis. Psychotherapy,
55, 399–410. http://dx.doi.org/10.1037/pst0000175
Fairburn, C. G., & Cooper, Z. (2011). Therapist competence, therapy
quality, and therapist training. Behaviour Research and Therapy, 49,
373–378. http://dx.doi.org/10.1016/j.brat.2011.03.005
Falkenström, F., Granström, F., & Holmqvist, R. (2013). Therapeutic
alliance predicts symptomatic improvement session by session. Journal
of Counseling Psychology, 60, 317–328. http://dx.doi.org/10.1037/
a0032258
Flemotomos, N., Martinez, V., Chen, Z., Singla, K., Peri, R., Ardulov, V.,
& Narayanan, S. (2019). A speech and language pipeline for quality
assessment of recorded psychotherapy sessions. Manuscript in prepara-
tion.
Flückiger, C., Del Re, A. C., Wampold, B. E., & Horvath, A. O. (2018).
The alliance in adult psychotherapy: A meta-analytic synthesis. Psycho-
therapy, 55, 316 –340. http://dx.doi.org/10.1037/pst0000172
Fortney, J. C., Unützer, J., Wrenn, G., Pyne, J. M., Smith, G. R., Schoe-
nbaum, M., . . . Harbin, H. T. (2017). A tipping point for measurement-
based care. Psychiatric Services, 68, 179–188. http://dx.doi.org/10
.1176/appi.ps.201500439
France, D. J., Shiavi, R. G., Silverman, S., Silverman, M., & Wilkes, D. M.
(2000). Acoustical properties of speech as indicators of depression and
suicidal risk. IEEE Transactions on Biomedical Engineering, 47, 829–837. http://dx.doi.org/10.1109/10.846676
Gerbing, D. W., & Hamilton, J. G. (1996). Viability of exploratory factor
analysis as a precursor to confirmatory factor analysis. Structural Equa-
tion Modeling, 3, 62–72. http://dx.doi.org/10.1080/10705519609540030
Goldberg, S. B., Babins-Wagner, R., Rousmaniere, T., Berzins, S., Hoyt,
W. T., Whipple, J. L., . . . Wampold, B. E. (2016). Creating a climate for
therapist improvement: A case study of an agency focused on outcomes
and deliberate practice. Psychotherapy, 53, 367–375. http://dx.doi.org/
10.1037/pst0000060
Goldberg, S. B., Baldwin, S. A., Merced, K., Caperton, D., Imel, Z. E., Atkins,
D. C., & Creed, T. (2019). The structure of competence: Evaluating the
factor structure of the Cognitive Therapy Rating Scale. Behavior Therapy.
Advance online publication. http://dx.doi.org/10.1016/j.beth.2019.05.008
Goldberg, S. B., Rowe, G., Malte, C. A., Ruan, H., Owen, J. J., & Miller,
S. D. (2019). Routine monitoring of therapeutic alliance to predict
treatment engagement in a Veterans Affairs substance use disorders
clinic. Psychological Services. Advance online publication. http://dx.doi
.org/10.1037/ser0000337
Gonçalves, M. M., Ribeiro, A. P., Mendes, I., Matos, M., & Santos, A.
(2011). Tracking novelties in psychotherapy process research: The in-
novative moments coding system. Psychotherapy Research, 21, 497–
509.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cam-
bridge, MA: MIT Press.
Greenson, R. R. (1965). The working alliance and the transference neurosis.
The Psychoanalytic Quarterly, 34, 155–181. http://dx.doi.org/10.1080/
21674086.1965.11926343
Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanas-
wamy, A., . . . Webster, D. R. (2016). Development and validation of a
deep learning algorithm for detection of diabetic retinopathy in retinal
fundus photographs. Journal of the American Medical Association, 316,
2402–2410. http://dx.doi.org/10.1001/jama.2016.17216
Hamilton, M. (1960). A rating scale for depression. Journal of Neurology,
Neurosurgery and Psychiatry, 23, 56–62. http://dx.doi.org/10.1136/jnnp.23
.1.56
Hatcher, R. L., & Gillaspy, J. A. (2006). Development and validation of a
revised short version of the Working Alliance Inventory. Psychotherapy
Research, 16, 12–25. http://dx.doi.org/10.1080/10503300500352500
Haykin, S. S. (2009). Neural networks and learning machines (3rd ed.).
Upper Saddle River, NJ: Pearson.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estima-
tion for nonorthogonal problems. Technometrics, 12(1), 55–67.
Hutson, M. (2018). Has artificial intelligence become alchemy? Science,
360, 478. http://dx.doi.org/10.1126/science.360.6388.478
Imel, Z. E., Barco, J. S., Brown, H. J., Baucom, B. R., Baer, J. S., Kircher,
J. C., & Atkins, D. C. (2014). The association of therapist empathy and
synchrony in vocally encoded arousal. Journal of Counseling Psychol-
ogy, 61, 146 –153. http://dx.doi.org/10.1037/a0034943
Imel, Z. E., Caperton, D. D., Tanana, M., & Atkins, D. C. (2017). Technology-
enhanced human interaction in psychotherapy. Journal of Counseling Psy-
chology, 64, 385–393. http://dx.doi.org/10.1037/cou0000213
Imel, Z. E., Hubbard, R. A., Rutter, C. M., & Simon, G. (2013). Patient-
rated alliance as a measure of therapist performance in two clinical
settings. Journal of Consulting and Clinical Psychology, 81, 154–165.
http://dx.doi.org/10.1037/a0030903
Imel, Z. E., Pace, B. T., Soma, C. S., Tanana, M., Gibson, J., Hirsch, T.,
. . . Atkins, D. C. (in press). Initial development and evaluation of an
automated, interactive, web-based therapist feedback system for moti-
vational interviewing fidelity. Psychotherapy.
Imel, Z. E., Steyvers, M., & Atkins, D. C. (2015). Computational psycho-
therapy research: Scaling up the evaluation of patient-provider interac-
tions. Psychotherapy, 52, 19–30. http://dx.doi.org/10.1037/a0036841
Insel, T. R. (2017). Digital phenotyping. Journal of the American Medical
Association, 318, 1215–1216. http://dx.doi.org/10.1001/jama.2017.11295
Johns, R. G., Barkham, M., Kellett, S., & Saxon, D. (2019). A systematic
review of therapist effects: A critical narrative update and refinement to
review. Clinical Psychology Review, 67, 78–93. http://dx.doi.org/10.1016/
j.cpr.2018.08.004
Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspec-
tives, and prospects. Science, 349, 255–260. http://dx.doi.org/10.1126/
science.aaa8415
Jurafsky, D., & Martin, J. H. (2014). Speech and language processing (2nd
ed.). London, UK: Pearson.
Lambert, M. J., & Barley, D. E. (2001). Research summary on the therapeutic
relationship and psychotherapy outcome. Psychotherapy: Theory, Research,
Practice, Training, 38, 357–361. http://dx.doi.org/10.1037/0033-3204.38.4
.357
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). Big data. The
parable of Google Flu: Traps in big data analysis. Science, 343, 1203–
1205. http://dx.doi.org/10.1126/science.1248506
Lutz, W., Schwartz, B., Hofmann, S. G., Fisher, A. J., Husen, K., & Rubel,
J. A. (2018). Using network analysis for the prediction of treatment
dropout in patients with mood and anxiety disorders: A methodological
proof-of-concept study. Scientific Reports, 8, 7819. http://dx.doi.org/10
.1038/s41598-018-25953-0
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013).
Distributed representations of words and phrases and their composition-
ality. Proceedings of Advances in Neural Information Processing Sys-
tems, 26, 3111–3119.
Miller, W. R., Moyers, T. B., Ernst, D., & Amrhein, P. (2003). Manual for
the motivational interviewing skills code v. 2.0. Retrieved from http://
casaa.unm.edu/codinginst.html
Miner, A. S., Milstein, A., & Hancock, J. T. (2017). Talking to machines about
personal mental health problems. Journal of the American Medical Asso-
ciation, 318, 1217–1218. http://dx.doi.org/10.1001/jama.2017.14151
Mitchell, T. M. (1997). Does machine learning really work? AI Magazine,
18, 11–20.
Mjolsness, E., & DeCoste, D. (2001). Machine learning for science: State
of the art and future prospects. Science, 293, 2051–2055. http://dx.doi
.org/10.1126/science.293.5537.2051
Moore, E., II, Clements, M. A., Peifer, J. W., & Weisser, L. (2008). Critical
analysis of the impact of glottal features in the classification of clinical
depression in speech. IEEE Transactions on Biomedical Engineering,
55, 96–107. http://dx.doi.org/10.1109/TBME.2007.900562
Murphy, K. P. (2012). Machine learning: A probabilistic perspective.
Cambridge, MA: MIT Press.
Okada, S., Ohtake, Y., Nakano, Y. I., Hayashi, Y., Huang, H. H., Takase,
Y., & Nitta, K. (2016). Estimating communication skills using dialogue
acts and nonverbal features in multiple discussion datasets. Proceedings
of the 18th ACM International Conference on Multimodal Interaction
(pp. 169–176). New York, NY: ACM.
Olfson, M., & Marcus, S. C. (2010). National trends in outpatient psycho-
therapy. The American Journal of Psychiatry, 167, 1456–1463. http://
dx.doi.org/10.1176/appi.ajp.2010.10040570
Open Science Collaboration. (2015). Estimating the reproducibility of
psychological science. Science, 349, aac4716. http://dx.doi.org/10.1126/
science.aac4716
Pagliardini, M., Gupta, P., & Jaggi, M. (2017). Unsupervised learning of
sentence embeddings using compositional n-gram features. arXiv:1703.02507. http://dx.doi.org/10.18653/v1/N18-1049
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B.,
Grisel, O., . . . Vanderplas, J. (2011). Scikit-learn: Machine learning in
Python. Journal of Machine Learning Research, 12, 2825–2830.
Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. (2015). The
development and psychometric properties of LIWC2015 (Technical report). Austin: University of Texas at Austin.
Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors
for word representation. Proceedings of the 2014 Conference on Empir-
ical Methods in Natural Language Processing (EMNLP) (pp. 1532–
1543). Stroudsburg, PA: Association for Computational Linguistics.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N.,
. . . Silovsky, J. (2011). The Kaldi speech recognition toolkit. IEEE 2011
Workshop on Automatic Speech Recognition and Understanding. Big
Island, Hawaii: IEEE Signal Processing Society.
Python Software Foundation. (2019). Python language reference (Version
3.7.2) [Computer software]. Retrieved from http://www.python.org
R Core Team. (2018). R: A language and environment for statistical
computing. Vienna, Austria: R Foundation for Statistical Computing.
Retrieved from https://www.R-project.org/
Russell, S. J., & Norvig, P. (2016). Artificial intelligence: A modern
approach (3rd ed.). Essex, UK: Pearson Education.
Salton, G., & McGill, M. J. (1986). Introduction to modern information
retrieval. New York, NY: McGraw-Hill.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell
System Technical Journal, 27, 379–423. http://dx.doi.org/10.1002/j.1538-
7305.1948.tb01338.x
Shatte, A. B. R., Hutchinson, D. M., & Teague, S. J. (2019). Machine
learning in mental health: A scoping review of methods and applications.
Psychological Medicine, 49, 1426–1448. http://dx.doi.org/10.1017/S0033291719000151
Stead, W. W. (2018). Clinical implications and challenges of artificial
intelligence and deep learning. Journal of the American Medical Asso-
ciation, 320, 1107–1108. http://dx.doi.org/10.1001/jama.2018.11029
Stone, P. J., Bales, R. F., Namenwirth, J. Z., & Ogilvie, D. M. (1962). The
general inquirer: A computer system for content analysis and retrieval
based on the sentence as a unit of information. Behavioral Science, 7,
484–498. http://dx.doi.org/10.1002/bs.3830070412
Substance Abuse and Mental Health Services Administration. (2014).
Projections of national expenditures for treatment of mental and sub-
stance use disorders, 2010–2020. Rockville, MD: Author.
Tao, K. W., Owen, J., Pace, B. T., & Imel, Z. E. (2015). A meta-analysis of
multicultural competencies and psychotherapy process and outcome. Jour-
nal of Counseling Psychology, 62, 337–350. http://dx.doi.org/10.1037/
cou0000086
Thompson, M. N., Goldberg, S. B., & Nielsen, S. L. (2018). Patient
financial distress and treatment outcomes in naturalistic psychotherapy.
Journal of Counseling Psychology, 65, 523–530. http://dx.doi.org/10.1037/
cou0000264
Tichenor, V., & Hill, C. E. (1989). A comparison of six measures of
working alliance. Psychotherapy: Theory, Research, Practice, Training,
26, 195–199. http://dx.doi.org/10.1037/h0085419
Tracey, T. J. G., Wampold, B. E., Lichtenberg, J. W., & Goodyear, R. K.
(2014). Expertise in psychotherapy: An elusive goal? American Psychol-
ogist, 69, 218–229. http://dx.doi.org/10.1037/a0035099
Tryon, G. S., Blackwell, S. C., & Hammel, E. F. (2008). The magnitude of
client and therapist working alliance ratings. Psychotherapy: Theory, Re-
search, Practice, Training, 45, 546–551. http://dx.doi.org/10.1037/a0014338
Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59,
433–460. http://dx.doi.org/10.1093/mind/LIX.236.433
Wandersman, A., Duffy, J., Flaspohler, P., Noonan, R., Lubell, K., Still-
man, L., . . . Saul, J. (2008). Bridging the gap between prevention
research and practice: The interactive systems framework for dissemi-
nation and implementation. American Journal of Community Psychol-
ogy, 41(3–4), 171–181. http://dx.doi.org/10.1007/s10464-008-9174-z
Wang, R., Aung, M. S., Abdullah, S., Brian, R., Campbell, A. T., Choud-
hury, T., . . . Tseng, V. W. (2016, September). CrossCheck: Toward
passive sensing and detection of mental health changes in people with
schizophrenia. Proceedings of the 2016 ACM International Joint Con-
ference on Pervasive and Ubiquitous Computing (pp. 886–897). New
York, NY: Association for Computing Machinery.
Webb, C. A., DeRubeis, R. J., & Barber, J. P. (2010). Therapist adherence/
competence and treatment outcome: A meta-analytic review. Journal of
Consulting and Clinical Psychology, 78, 200–211. http://dx.doi.org/10
.1037/a0018912
Whiteford, H. A., Degenhardt, L., Rehm, J., Baxter, A. J., Ferrari, A. J.,
Erskine, H. E., . . . Vos, T. (2013). Global burden of disease attributable
to mental and substance use disorders: Findings from the Global Burden
of Disease Study 2010. The Lancet, 382, 1575–1586. http://dx.doi.org/
10.1016/S0140-6736(13)61611-6
Xiao, B., Huang, C., Imel, Z. E., Atkins, D. C., Georgiou, P., & Narayanan,
S. S. (2016). A technology prototype system for rating therapist empathy
from audio recordings in addiction counseling. PeerJ Computer Science,
2, e59. http://dx.doi.org/10.7717/peerj-cs.59
Zilcha-Mano, S. (2017). Is the alliance really therapeutic? Revisiting this
question in light of recent methodological advances. American Psychol-
ogist, 72, 311–325. http://dx.doi.org/10.1037/a0040435
Zilcha-Mano, S., & Errázuriz, P. (2017). Early development of mechanisms of
change as a predictor of subsequent change and treatment outcome: The
case of working alliance. Journal of Consulting and Clinical Psychology,
85, 508 –520. http://dx.doi.org/10.1037/ccp0000192
Received March 9, 2019
Revision received July 8, 2019
Accepted August 8, 2019