Running Head: PERSONALITY ASSESSMENT THROUGH AN AI CHATBOT
How Well Can an AI Chatbot Infer Personality? Examining Psychometric Properties of
Machine-inferred Personality Scores
Jinyan Fan1*+, Tianjun Sun2*, Jiayi Liu1+, Teng Zhao1, Bo Zhang3,4,
Zheng Chen5, Melissa Glorioso1, and Elissa Hack6
1 Department of Psychological Sciences, Auburn University
2 Department of Psychological Sciences, Kansas State University
3 School of Labor and Employment Relations, University of Illinois at Urbana Champaign
4 Department of Psychology, University of Illinois at Urbana Champaign
5 School of Information Systems & Management, Muma College of Business, University of South
Florida – St. Petersburg
6 Department of Behavioral Sciences & Leadership, United States Air Force Academy
Author Note
*Jinyan Fan and Tianjun Sun contributed equally to this manuscript.
Melissa Glorioso is now at the Army Research Institute for the Behavioral and Social Sciences.
We thank Michelle Zhou, Huahai Yang, and Wenxi Chen of Juji, Inc. for assistance with
machine-learning based model building. We thank Andrew Speer, Louis Hickman, Filip Lievens,
Emily Campion, Peter Chen, Alan Walker, and Jesse Michel for providing valuable feedback on
an earlier version of the article.
Earlier versions of parts of the paper were presented at the Society for Industrial and
Organizational Psychology 2018 and 2022 conferences.
The views expressed are those of the authors and do not reflect the official policy or position of
the U.S. Air Force, Department of Defense, or the U.S. Government.
+Correspondence concerning this article should be addressed to Jinyan Fan, Department of
Psychological Sciences, Auburn University, 225 Thach Hall, Auburn, AL 36849, United States,
email: Jinyan.Fan@auburn.edu, or to Jiayi Liu, Department of Psychological Sciences, Auburn
University, 102A Thach Hall, Auburn, AL 36849, United States, email: jzl0217@auburn.edu.
Manuscript Status: accepted at the Journal of Applied Psychology, January 4, 2023.
This paper is not the copy of record and may not exactly replicate the final, authoritative version
of the article.
Abstract
The present study explores the plausibility of measuring personality indirectly through an
Artificial Intelligence (AI) chatbot. This chatbot mines various textual features from users’ free
text responses collected during an online conversation/interview, and then uses machine learning
algorithms to infer personality scores. We comprehensively examine the psychometric properties
of the machine-inferred personality scores, including reliability (internal consistency, split-half,
and test-retest), factorial validity, convergent and discriminant validity, and criterion-related
validity. Participants were undergraduate students (n = 1,444) enrolled in a large southeastern
public university in the U.S. who completed a self-report Big-Five personality measure (IPIP-
300) and engaged with an AI chatbot for approximately 20 to 30 minutes. In a subsample (n =
407), we obtained participants’ cumulative grade point averages (GPAs) from the University
Registrar and had their peers rate their college adjustment. In an additional sample (n = 61), we
obtained test-retest data. Results indicated that machine-inferred personality scores (a) had
overall acceptable reliability at both the domain and facet levels, (b) yielded a comparable factor
structure to self-reported questionnaire-derived personality scores, (c) displayed good convergent
validity but relatively poor discriminant validity (averaged convergent correlations = .48 vs.
averaged machine-score correlations = .35 in the test sample), (d) showed low criterion-related
validity, and (e) exhibited incremental validity over self-reported questionnaire-derived
personality scores in some analyses. In addition, there was strong evidence for cross-sample
generalizability of psychometric properties of machine scores. Theoretical implications, future
research directions, and practical considerations are discussed.
Keywords: chatbot, personality, artificial intelligence, machine learning, psychometric properties
How Well Can an AI Chatbot Infer Personality? Examining Psychometric Properties of
Machine-Inferred Personality Scores
During the last three decades, personality measures have been established as a useful
talent assessment tool due to the findings that (a) personality scores are predictive of important
organizational outcomes (e.g., Hurtz & Donovan, 2000; Judge & Bono, 2001), and (b)
personality scores typically do not result in racial adverse impact (e.g., Foldes et al., 2008).
While scholars and practitioners alike appreciate the utility of understanding individuals’
behavioral characteristics in organizational settings, debates have revolved around how to
measure personality more effectively and efficiently. Self-report personality measures, often
used in talent assessment practice, have been criticized for (a) modest criterion-related validity
(Morgeson et al., 2007), (b) susceptibility to faking or response distortion, particularly within
selection contexts (Ziegler et al., 2012), (c) idiosyncratic interpretation of items due to individual
differences in cross-situational behavioral consistency (Hauenstein et al., 2017), and (d) the
tedious testing experience where test-takers have to respond to many items in one sitting.
Recently, an innovative approach to personality assessment has emerged. This approach
was originally developed by computer scientists and has now made its way into the applied
psychology field. It is generally referred to as artificial intelligence (AI)-based personality
assessment. This new form of assessment can be distinguished from traditional assessment in
three ways: technologies, types of data, and algorithms (Tippins et al., 2021). Data collected via
diverse technological platforms (e.g., social media and video interviews) have been used to
obtain an assortment of personality-relevant data (digital footprints) such as facial expression
(Suen et al., 2019), smartphone data (Chittaranjan et al., 2013), interview responses (Hickman et
al., 2022), and online chat scripts (Li et al., 2017).
The third area in which AI-based personality assessments are unique is its algorithms. AI
is a broad term that refers to the science and engineering of making intelligent systems or
machines (especially computer programs) that mimic human intelligence to perform tasks
and can iteratively improve themselves based on the information they collect (McCarthy &
Wright, 2004). Machine learning (ML) is a subset of AI, which focuses on building computer
algorithms that automatically learn or improve performance based on the data they consume
(Mitchell, 1997). In some more complex work, the term deep learning (DL) may also be
referenced. DL is a subset of ML, referring to neural-network-based ML algorithms that are
composed of multiple processing layers to learn the representations of data with multiple levels
of abstraction and mimic how a biological brain works (LeCun et al., 2015). The present article
refers to personality assessment tools that can predict personality traits using digital footprints as
the ML approach.
The ML approach to personality assessment typically entails two major stages: (a) model
training and (b) model application. In the model training stage, researchers attempt to build
predictive models using a large sample of individuals. The predictors are potentially trait-
relevant features, extracted through analyzing digital footprints generated by individuals such as
a corpus of texts, social media “likes,” sound/voice memos, micro facial expressions, etc. The
criteria (ground truth) are the same individuals’ questionnaire-derived personality scores, either
self-reported (e.g., Golbeck et al., 2011; Gou et al., 2014), other-rated (e.g., Chen et al., 2017;
Harrison et al., 2019), or both (e.g., Hickman et al., 2022). Next, researchers try to establish
empirical links between the predictors (features) and the criteria, often via linear regressions,
support vector machines, tree-based analyses, or neural networks, resulting in estimated model
parameters (e.g., regression coefficients).1 To avoid model overfitting, within-sample k-fold
cross-validation is routinely conducted (Bleidorn & Hopwood, 2019). In addition, an
independent test sample is often arranged for cross-sample validation and model testing. In the
model application stage, the trained model is applied to automatically predict the personality of
new individuals who do not have the questionnaire-derived personality data. Specifically, the
computer algorithm first analyzes new individuals’ digital footprints, extracts features, obtains
feature scores, and then uses feature scores and established model parameters to calculate
predicted personality scores.
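To make the two-stage workflow concrete, the following minimal sketch (in Python with scikit-learn) illustrates the general logic; the variable names and the ridge model are illustrative placeholders rather than the algorithm used in the present study, which is described in the Method section.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Placeholder data: digital-footprint features and questionnaire-derived
# "ground truth" scores for a training sample, plus footprints of new
# individuals who have no questionnaire data.
rng = np.random.default_rng(0)
features_train = rng.normal(size=(500, 100))
trait_scores_train = rng.normal(size=500)
features_new = rng.normal(size=(50, 100))

# Stage 1: model training (within-sample k-fold cross-validation helps
# guard against overfitting; default scoring for regressors is R-squared).
model = Ridge(alpha=1.0)
cv_scores = cross_val_score(model, features_train, trait_scores_train, cv=5)
model.fit(features_train, trait_scores_train)

# Stage 2: model application -- infer scores for new individuals.
predicted_scores = model.predict(features_new)
```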
The ML approach can be thought of as an indirect measurement of personality using a
large number of features with empirically derived model parameters to score personality
automatically (Hickman et al., 2022; Park et al., 2015). These features can be based on lingual or
other behaviors, such as interaction logs or facial expressions. Model parameters indicate the
influence of features on personality “ground truth” as the prediction targets. This new approach
boasts at least two advantages over traditional assessment methods, particularly self-report
questionnaires (Mulfinger et al., 2020). The first advantage lies in its efficiency. For instance, it
is possible to use the same set of digital footprints to train a series of models to infer scores on
numerous individual difference variables such as personality traits, cognitive ability, values,
1 A support vector machine (SVM) has the objective of finding a hyperplane (which can be multi-dimensional) in a high-dimensional space (e.g., a regression or classification model with many features or predictors) that distinctly separates the observations, such that the hyperplane has the maximum margin (i.e., the distance between the hyperplane and the nearest data points of each class); maximizing this margin allows future observations to be classified, and their values predicted, with greater confidence. Neural networks (or artificial neural networks, as often referred to in computational
sciences to be distinguished from neural networks in the biological brain) can be considered sets of algorithms that
are designed—loosely after the information processing of the human brain—to recognize patterns. All patterns
recognized—be it from sounds, images, text, or others—are numerical and contained in vectors to be stored and
managed in an information layer (or multiple processing layers), and various follow-up tasks (e.g., clustering) can
take place on another layer on top of the information layer(s).
career interests, etc. It is resource intensive to train these different models; however, once
trained, various individual differences can be automatically and simultaneously inferred with a
single set of digital footprint inputs. This would greatly shorten the assessment time in general,
which should be appealing to both test-takers and sponsoring organizations. Second, the testing
experience tends to be less tedious (Kim et al., 2019). If individuals’ social media content is
utilized to infer personality, individuals do not need to go through the assessment process at all.
If video interview or online chat is used, individuals may feel they have more opportunities to
perform, and thus should enjoy the assessment more than, for instance, completing a self-report
personality inventory (McCarthy et al., 2017).
Despite the above potential advantages, the ML approach to personality assessment faces
several challenges and needs to sufficiently address many issues before it can be used in practice
for talent assessment. For instance, earlier computer algorithms required users (e.g., job
applicants) to share their social media content, which did not fare well due to apparent privacy
concerns and potential legal ramifications (Oswald et al., 2020). In response, organizations have
begun to use automated video interviews (AVIs; e.g., Hickman et al., 2022; Leutner et al., 2021)
or text-based interviews (e.g., Zhou et al., 2019; Völkel et al., 2020) to extract trait-relevant
features. The present study uses the text-based interview method (also known as the AI chatbot)
to collect textual information from users. Strategies such as AVIs and AI chatbots may gain
wider acceptance in applied settings as job applicants are less likely to refuse an interview
request from a hiring organization from which they hope to receive a job offer.
Another critical issue that remains largely unaddressed is a striking lack of extensive
examinations of the psychometric properties of machine-inferred personality scores (Bleidorn &
Hopwood, 2019; Hickman et al., 2022). Although numerous computer algorithms have been
developed to infer personality scores, few validation efforts have gone beyond demonstrating the
convergence between questionnaire-derived and machine-inferred personality scores.
The purpose of the present study is to explore the plausibility of measuring personality
through an AI chatbot. More importantly, we extensively examine psychometric properties of
machine-inferred personality scores at both facet and domain levels including reliability (internal
consistency, split-half, and test-retest), factorial validity, convergent and discriminant validity,
and criterion-related validity. Such a comprehensive examination has been extremely rare in the
ML literature but is sorely needed. Although an unpublished doctoral dissertation (Sun, 2021)
provided initial promising evidence for the utility of the AI chatbot method for personality
inference, more empirical research is warranted.
In the current study, we have chosen to use self-reported questionnaire-derived
personality scores as ground truth when building predictive models. Self-report personality
measures are the most widely used method of personality assessment in practice, with an
impressive body of validity evidence. Although self-report personality measures may be prone to
social desirability or faking, our study context is research-focused instead of selection-focused,
and thus faking is unlikely to be a serious concern. In what follows we first introduce the AI-
powered, text-based interview system (AI chatbot) used in our research. We then briefly discuss
how we examine the psychometric properties of machine-inferred personality scores, and then
present a large-scale empirical study.
AI Chatbot and Personality Inference
An AI chatbot is an Artificial Intelligence system that often utilizes a combination of
technologies, such as deep learning for natural language processing (NLP), symbolic machine
learning for pattern recognition, and predictive analytics for user insights inference to enable the
personalization of conversation experiences and improve chatbot performance as it is exposed to
more human interactions (IBM, n.d.). Unlike automated video interview systems (e.g., Hickman
et al., 2022; Leutner et al., 2021), which mostly entail one-way communications, an AI chatbot
engages with users through two-way communications (e.g., Zhou et al., 2019).
While there are many chatbot platforms commercially available (e.g., IBM Watson
Assistant, Google Dialogflow, and Microsoft Power Virtual Agents), we opted to use Juji’s AI
chatbot platform (https://juji.io) as our study platform for three reasons. First, unlike many other
platforms that require writing computer programs (e.g., via Application Programming Interfaces)
to train and customize advanced chatbot functionalities, such as dialog management, Juji’s
platform enables non-IT professionals, such as applied psychologists and HR professionals, to
create, customize, and manage an AI chatbot to conduct virtual interviews, without writing code.
Second, Juji’s platform is publicly accessible (currently with no- or low-cost academic use
options), which enables other scholars to conduct similar research studies and/or replicate our
study. Third, scholars have successfully used the Juji chatbot platform to conduct various scientific studies, including studies of team creativity (e.g., Hwang & Won, 2021), personality assessment (e.g., Völkel et al., 2020), and public opinion elicitation (e.g., Jiang et al., 2023).
To enable readers to better understand what is behind the scenes of the Juji AI platform
and facilitate others in selecting comparable chatbot platforms for conducting studies like ours,
in the next section we provide a high-level, non-technical explanation of the Juji AI chatbot
platform. Note that the key structure and functions of this specific chatbot platform also outline the key components required to support any capable conversational AI agent (e.g., Jayaratne & Jayatilleke, 2020), thus allowing our methods and results to be generalized.2
2 We declare no financial conflict of interest, nor advisory board affiliations, etc. with Juji, Inc.
As shown in Figure 1, at the bottom level, several machine learning models are used,
including data-driven machine learning models and symbolic AI models, to support the entire
chatbot system. At the middle level, specialty engines are built to facilitate two-way
conversations and user insights inferences. Specifically, an NLP engine for conversation
interprets highly diverse and complex user natural language inputs during a conversation. Based
on the interpretation results, an Active Listening Conversation Engine, which is powered by
social-emotional intelligence of carrying out an empathetic and effective conversation, decides
how to best respond to users and guide the conversation forward. For example, it may decide to
engage users in small talk, provide empathetic comments such as paraphrasing, verbalizing
emotions, and summarizing user input, as well as handle diverse user interruptions, such as
digressing from a conversation and trying to dodge a question (Xiao et al., 2020). The
Personality Inference Engine takes in the conversation script from a user and performs NLP on
the script to extract textual features, which are used as predictors, with the same user’s
questionnaire-based personality scores being used as the criteria. ML models can then be built to
automatically infer personality using statistical methods such as linear regressions.
To facilitate the customization of an AI chatbot, a set of reusable AI components are pre-
built, from conversation topics to AI assistant templates, which enables automated generation of
an AI assistant/chatbot (see the top part in Figure 1). Specifically, a chatbot designer (a
researcher or an HR manager) uses a graphical user interface to specify what the AI chatbot
should do, such as its main conversation flow or the Q&As to be supported; the AI generator then automatically generates a draft AI chatbot based on these specifications. The generated AI chatbot is then compiled by the AI compiler to become a live chatbot that can engage with users in a conversation, which is managed by the AI runtime component to ensure a smooth conversation.
Examining Psychometric Properties of Machine-Inferred Personality Scores
Bleidorn and Hopwood (2019) suggested three general classes of evidence for the
construct validity of machine-inferred personality scores based on Loevinger’s (1957) original
framework. We rely on this framework to organize the current examination of psychometric
properties of machine scores. Table 1 summarizes relevant ML research in this area.
Substantive Validity
According to Bleidorn and Hopwood (2019), the first general class of evidence for
construct validity is substantive validity, which has often been operationalized as content
validity, defined as the extent to which test items sufficiently sample the conceptual domain of
the construct but do not tap into other constructs. Establishing the content validity of machine-
inferred personality scores proves quite challenging. This is because the ML approach is based
on features identified empirically in a data-driven manner (Hickman et al., 2022; Park et al.,
2015). As such, we typically do not know a priori which features (“items”) should predict which
personality traits and why; furthermore, these features are diverse, heterogeneous, and in large
quantity (Bleidorn & Hopwood, 2019).
Although several previous ML studies have partially established content validity of
machine-inferred personality scores (e.g., Hickman et al., 2022; Kosinski et al., 2013; Park et al.,
2015; Yarkoni, 2010), overall, the evidence for content validity has been very limited. Because
the current AI chatbot system uses a DL model that mines textual features purely based on
mathematics, we could not examine content validity directly. However, we looked at content
validity indirectly (see the Supplemental Analysis section).
Structural Validity
The second general class of evidence for construct validity is structural validity, which
focuses on the internal characteristics of test scores (Bleidorn & Hopwood, 2019). There are
three major categories of structural validity: reliability, generalizability, and factorial validity.
Reliability. Internal consistency (Cronbach’s α) typically does not apply to machine-
inferred personality scores because mined features (often in the hundreds) are empirically
derived in a purely data-driven manner and are unlikely to be homogenous “items,” thus having
very low item-total correlations (Hickman et al., 2022). However, Cronbach’s α can be estimated
at the personality domain level, treating facet scores as “items.” Test-retest reliability, on the
other hand, can be readily calculated at both domain and facet levels. We located several
empirical studies (e.g., Gow et al., 2016; Harrison et al., 2019; Hickman et al., 2022; Li et al.,
2020; Park et al., 2015) reporting reasonable test-retest reliability of machine-inferred
personality scores. A third type of reliability index, split-half reliability, cannot be calculated
directly, since there is no “item” in machine-inferred personality assessment. However, ML
scholars have found a way to overcome this difficulty. Specifically, scholars first randomly split
the corpus of text provided by participants into halves with roughly the same number of words or
sentences, then apply the trained model to predict personality scores based on the respective
segments of words separately. Split-half reliability is calculated as the correlations between these
two sets of machine scores with the Spearman-Brown correction. A few studies (e.g., Hoppe et
al., 2018; Wang & Chen, 2020; Youyou et al., 2015) reported reasonable split-half reliability of
machine-inferred scores with averaged split-half reliability across Big Five domains ranging
from .59 to .71. In the present study, we estimated test-retest and split-half reliability for
machine-inferred personality facet scores. We also estimated Cronbach’s αs of machine-inferred
personality domain scores, treating facet scores under respective domains as “items.”
Generalizability. Generalizability refers to the extent to which the trained model may be
applied to different contexts (e.g., different samples, different models trained on different sets of
digital footprints, etc.) and still yield comparable personality scores (Bleidorn & Hopwood,
2019). Our review of the literature reveals that very few ML studies have examined the issue of
generalizability. One important exception is a study done by Hickman et al. (2022) who obtained
four different samples, trained predictive models on samples 1-3 individually, and then applied
trained models to separate samples. Hickman et al. reported mixed findings regarding the
generalizability of machine-inferred personality scores.
In the present study, we focused on cross-sample generalizability, looking at many
aspects of cross-sample generalizability including reliability (internal consistency at the domain
level and split-half at the facet level), factor structure, and convergence and discrimination
relations at both the latent and manifest variable levels (to be discussed subsequently).
Factorial validity. Factorial validity is established if machine-inferred personality facet
scores may recover the Big Five factor structure (Costa & McCrae, 1992; Goldberg, 1993) as
rendered by self-reported questionnaire-derived personality facet scores (i.e., same factor loading
patterns and similar magnitude of factor loadings). To our best knowledge, no empirical studies
have examined the factorial validity of machine-inferred personality scores, primarily because in
almost all empirical studies researchers have trained models to predict personality domain scores
rather than facet scores.3 In the present study, we overcome this limitation by building predictive
models at the facet level.
External Validity
3 Speer (2021) examined the factor structure of machine-inferred dimension-level performance scores, but we are
interested in the factor structure of machine-inferred personality scores.
The third general class of evidence for construct validity is external validity, which
focuses on the correlation patterns between test scores and external, theoretically relevant
variables (Bleidorn & Hopwood, 2019). Within this class of evidence, researchers typically look
at convergent validity, discriminant validity, criterion-related validity, and incremental validity.
Convergent validity. Within the ML approach, convergent validity refers to the
magnitude of correlations between machine-inferred and questionnaire-derived personality
scores of the same personality traits. Because most computer algorithms treat the latter as ground
truth and aim to maximize its prediction, it is not surprising that convergent validity of machine-
inferred personality scores has been routinely examined, which has yielded several meta-
analyses (e.g., Azucar et al., 2018; Sun, 2021; Tay et al., 2020). These meta-analyses reported
modest-to-moderate convergent validity across Big Five domains, ranging from .20s to .40s.
Discriminant validity. In contrast to the heavy attention given to convergent validity,
very few empirical studies have examined the discriminant validity of machine-inferred
personality scores (Sun, 2021). A measure with good discriminant validity should demonstrate
that correlations between different measures of the same constructs (convergent relations) are
much stronger than correlations between measures of different constructs using the same
methods (discriminant relations; Cronbach & Meehl, 1955). Researchers usually rely on the
multi-trait multi-method (MTMM) matrix to investigate convergent and discriminant validity.
We identified four empirical studies that examined the discriminant validity of machine-
inferred personality scores (Harrison et al., 2019; Hickman et al., 2022; Marinucci et al., 2018;
Park et al., 2015), with findings suggesting relatively poor discriminant validity. For instance,
Park et al. (2015) showed that the average correlations among Big Five domain scores were
significantly higher when measured by the ML method than by self-report questionnaires (mean r = .29 vs. mean r = .19). One possible reason for relatively poor discriminant validity is that there are
typically many features (easily in the hundreds) in predictive models inferring personality. As a
result, it is common that models predicting different personality traits share many common
features, which would inflate correlations between machine-inferred personality scores.
In the present study, since we built predictive models at the facet level, we were able to
examine convergent and discriminant validity of personality domain scores at both the manifest
and latent variable levels. Analyzing the MTMM structure of latent scores allowed for
disentangling trait, method, people-specific, and measurement error influences on personality
scores. One advantage of the latent variable approach is that it separates the method variance
from the measurement error variance. As a result, correlations among personality domain scores
should be more accurate at the latent variable level than at the manifest variable level.
Criterion-related and incremental validity. Given the modest-to-moderate correlation
between machine-inferred and self-reported questionnaire-based scores (typically in the .20 - .40
range as reported in several meta-analyses; e.g., Sun, 2021; Tay et al., 2020) and the similarly
modest correlations between self-reported questionnaire-derived personality scores and job
performance (meta-analytic r = .18 for personality scores overall; Sackett et al., 2022), one
could expect that machine-inferred personality scores would exhibit some criterion-related
validity—operationalized as the cross product of the above two coefficients (e.g., .30 × .18
= .054)—if treating the portion of machine-inferred scores that converges with self-reported
scores as predictive of performance criteria. However, it may be argued that criterion-related
validity, in this case, would be too small to be practically useful.
Yet, we want to offer another reasoning approach: the criterion-related validity of
machine-inferred personality scores does not have to only come from the portion converging
with self-reported questionnaire-derived personality scores. In other words, the part of the
variance in machine-inferred personality scores that is unshared by self-reported questionnaire-
derived personality scores might still be criterion-relevant. Conceptually, it is plausible that
these two types of personality scores capture related but distinct aspects of personality. For
instance, self-reported questionnaire-derived personality scores represent one’s self-reflection on
typical and situation-consistent behaviors (Pervin, 1994), with much of the nuances in actual
behaviors perhaps lost. In contrast, the ML approach extracts many linguistic cues that may
capture the nuances in behaviors above and beyond typical behaviors captured by self-reported
questionnaire-derived personality scores. Although the exact nature of the unique aspects of
personalities captured by the ML approach is unclear, their criterion relevance is not only a
theoretical question but also an empirical one. Thus, if our above reasoning is correct, we would
expect machine-inferred personality scores to exhibit not only criterion-related validity but also
incremental validity over self-reported questionnaire-derived personality scores.
Although a few empirical studies (e.g., Kulkarni et al., 2018; Park et al., 2015; Youyou et
al., 2015) have reported that machine-inferred personality scores predicted some demographic
variables (e.g., network size, field of study, political party affiliation, etc.) and self-reported
outcomes (e.g., life satisfaction, physical health, and depression), there has been a lack of
evidence that machine-inferred personality scores can predict organizationally relevant non-self-
report criteria, such as performance and turnover. The reasons self-report criteria are less
desirable include: (a) they are rarely used in talent assessment practice, and (b) machine-inferred
personality scores based on self-report models and self-report criteria share the same source,
which might inflate criterion-related validity (Park et al., 2015).
Interestingly, we located a handful of studies in the field of strategic management that
documented criterion-related validity of machine-inferred CEO personality scores (e.g., Gow et
al., 2016; Harrison et al., 2019, 2020; Wang & Chen, 2020). For example, Harrison et al. (2019)
obtained 207 CEOs’ spoken or written texts (e.g., transcripts of earnings calls), extracted textual
features, had experts (trained psychology doctoral students) rate these CEOs’ personalities based
on their texts, and then built predictive models accordingly. Harrison et al. then applied the
trained model to estimate Big Five domain scores for 3,449 CEOs of 2,366 S&P 1500 firms
between 1996 and 2014 and then linked them to firm strategy changes (e.g., advertising intensity,
R&D intensity, inventory levels, etc.) and firm performance (return on assets). These authors
reported that machine-inferred CEO Openness scores were positively related to firm strategic
changes and that the relationship was stronger when the firm’s performance of the previous year
was low vs. high. Unfortunately, studies in this area typically obtained expert-rated personality
scores only for small samples of CEOs (n < 300) during modeling building, and thus were unable
to examine incremental validity in much larger holdout samples.
Thus, to our best knowledge, even in the broad ML literature, there has not been any
empirical evidence for incremental validity of machine-inferred personality scores over
questionnaire-derived personality scores beyond self-report criteria. The present study addressed
this major limitation by investigating criterion-related and incremental validity of machine-
inferred personality scores using two non-self-report performance criteria: (a) objective
cumulative grade point average (GPA) and (b) peer-rated college adjustment.
Method
Transparency and Openness
We describe our sampling plan, all data exclusions, manipulations, and all measures in
the study, and we adhered to the Journal of Applied Psychology methodological checklist. Raw
data and computer code for model building are not available due to their proprietary nature, but
processed data (self-reported questionnaire-derived and machine-inferred personality scores) are
available at https://osf.io/w73qn/?view_only=165d652ef809442fbab46f815c57f467. Other
associated research materials are included in the online supplemental materials. Data were
analyzed using Python 3 (Van Rossum & Drake, 2009); the scikit-learn package (Pedregosa et
al., 2011); R (version 4.0.0); the package psych, version 2.2.5 (Revelle, 2022); IBM SPSS
Statistics (version 27), and Mplus 8.5 (Muthén & Muthén, 1998-2017). The study design and its
analysis were not preregistered.
Sample and Procedure
Participants were 1,957 undergraduate students enrolled in various psychology courses,
recruited from the subject pool (the SONA system) operated by the Department of Psychological
Sciences at a large Southeastern public university (Auburn University IRB protocol 16-354 MR
1609, Examining the Relationships among Daily Word Usage through Online Media,
Personalities, and College Performance; Auburn University IRB protocol 18-410 EP 1902,
Linking Online Interview/Chat Response Scripts to Self-reported Personality Scores). To ensure
a large enough sample size, we collected data over several semesters. We obtained a training
sample (n = 1,477) and a separate test sample (n = 480). The training sample contained data
collected in fall 2018 (n = 531 in the lab), spring 2019 (n = 202 in the lab and n = 396 online),
and spring 2020 (n = 348 online), and was used to build predictive models. The separate test
sample contained data collected in spring 2017 (n = 480 in the lab) including the criteria data.
The test sample was used for cross-sample validation and model testing purposes.4
During spring 2017 data collection, when participants arrived at the lab, they first
completed an online Big Five personality measure (IPIP-300) on Qualtrics and then were
directed to the AI firm’s virtual conversation platform, where they engaged with the AI chatbot
for approximately 20 – 30 minutes.5 The chatbot asked participants several open-ended questions
organized around a series of topical areas, which were generic in nature (see the online
supplemental materials A for a list of interview questions used in the present study), and thus
probably should be considered as unstructured interview questions. In other words, although the
same questions were given to all participants, they were not designed to measure anything specific.
The questions were displayed in a chat box on the computer screen, and participants typed their
responses into the chat box. Participants were allowed to ask the chatbot questions.
After the virtual conversation, participants completed an enrollment certification request
form, which gave us permission to obtain their SAT or/and ACT scores and cumulative GPA
from the University Registrar. Next, participants were asked to provide the names and email
addresses of three peers who knew their college life well. Immediately after the lab session, the
experimenter randomly selected one of the three peers and sent out an invitation email for a peer
4 Juji developed an algorithm to infer personality scores prior to our study. We initially applied their original
algorithm to spring 2017 data to calculate machine-inferred personality scores; however, results showed that
machine-inferred scores did not significantly predict the criteria (GPA and peer-rated college adjustment). After a
discussion with the AI firm’s developers, we identified potential flaws in their original model-building strategy,
which did not use self-reported personality scores as ground truth. Thus, we decided to collect online chat data and
self-reported personality data during the subsequent semesters to enable the AI firm to train new predictive models
using self-reported personality scores as ground truth.
5 The 25th, 50th, and 75th percentiles of time spent on a virtual conversation were 18, 22, and 30 minutes, respectively.
We include two examples of chatbot conversation scripts across five participants in the online supplemental
materials A. Out of professional courtesy, the AI firm (Juji, Inc.) provides a free chatbot demo link:
https://juji.ai/pre-chat/62f571ec-11d7-4d1e-9336-37066dfa0f48, with these instructions: (a) use a computer (not a
cell phone) for the online chat; (b) use Google Chrome as the browser; (c) responses should be in sentences (rather
than words); (d) finish the entire chat in one sitting; and (e) if enough inputs are provided, the algorithm will infer
and show your personality scores on the screen toward the end of the online chat.
evaluation with a survey link. One week later, a reminder email was sent to the peer. If the first
peer failed to respond, the experimenter randomly selected and invited another peer. The process
continued until all three peers were exhausted. It turned out that eight participants received
multiple peer ratings and we used the first peer’s ratings in subsequent analyses. Peers were sent
a $5 check for their time (10-15 minutes). We assigned a participant ID to each participant,
which was used as the identifier to link his/her self-reported questionnaire-derived personality
data, virtual conversation scripts, peer ratings, and cumulative GPAs.
Out of 480 participants in the spring 2017 sample, 73 participants either failed to enter or mis-typed their participant IDs, or chose to withdraw from the study, and were excluded. Thus,
we obtained matched data for 407 participants, among whom 75.2% were female, and the mean
age was 19.38 years. Out of 407 participants, we were able to obtain 379 participants’ SAT/ACT
scores and cumulative GPA from the University Registrar and 301 participants’ peer ratings.
Two hundred and eighty-nine participants had both SAT/ACT scores and peer ratings.
For data collected after spring 2017, 733 participants came to our lab and went through
the same study procedure as those in spring 2017 (the test sample) but without collecting the
criteria data. In addition, 744 completed the study online via a Qualtrics link for the IPIP-300 and
the AI firm’s virtual conversation system for the online chat. No criteria data were collected for
the online participants, either. Based on the same matching procedure as the test sample, we were
able to match the personality data and the virtual conversation scripts for 1,037 out of 1,477
participants in the training sample among whom 76.5% were female with an average age of
19.90 years. With respect to race, 80% were White, 5% were Black, 3% were Hispanic, 2% were
Asian, and 10% did not disclose their race. These participants’ data were used subsequently to
build predictive models. For the entire sample, the median number of sentences provided by
participants through the chatbot was 40 sentences, ranging from 26 to 112 sentences.
To examine test-retest reliability, we obtained another independent sample. Participants
were 74 undergraduate students enrolled in one of the two sections of a psychological statistics
course at the same university in fall 2021 (Auburn University protocol 21-351 EX 2111,
Examining Test-retest Reliability of Machine-inferred Personality Scores). Participants were
invited to engage with the chatbot twice with the same set of interview questions as the main
study in exchange for extra course credit. Sixty-one participants completed both online chats,
with time lapses ranging from 3 to 35 days with an average time lapse of 22 days.
Measures
Self-reported questionnaire-derived personalities. The 300-item IPIP personality
inventory (Goldberg, 1999) was used. The IPIP-300 was designed to measure 30 facet scales in
the Big Five framework, modeling after the NEO Personality Inventory (NEO PI-R; Costa &
McCrae, 1992). Items were rated on a 5-point scale ranging from 1 (strongly disagree) to 5
(strongly agree). These 30 facet scales, Big Five domain scales, and their reliabilities in the
current samples (both the training and test samples) are presented in Tables 2 and 3. There was
no missing data at the item level.
Machine-inferred personalities. The machine learning algorithm the AI firm helped
develop was used to estimate the 30 personality facet scores based on mined textual features of
participants’ virtual conversation scripts. See the Analytic Strategies section for NLP and model
building details. Reliabilities of machine-inferred 30 facet scores and Big Five domain scores6 in
the training and test samples are presented in Tables 2 and 3, respectively.
6 We used two methods to calculate machine-inferred personality domain scores. For the first method, we calculated domain scores as averaged machine-inferred facet scores in respective domains. For the second method, we first calculated self-reported domain scores based on self-reported facet scores and then built five predictive models with self-reported domain scores as ground truth. Next, we applied the trained model to predict machine-inferred domain scores in both training and test samples. It turned out that both methods resulted in similar results and identical statistical conclusions. We therefore present the results based on the first method in the current article and refer readers to results based on the second method in the online supplemental material A (Tables S1 – S3).
Objective college performance. Participants’ cumulative GPAs were obtained from the
University Registrar.
Peer-rated college adjustment. Scholars have advocated expanding the conceptual
space of college performance beyond GPA to include alternative dimensions such as social
responsibility (Borman & Motowidlo, 1993) and adaptability and life skills (Pulakos et al.,
2000). Oswald et al. (2004) proposed a 12-dimension college adjustment model that covers
intellectual behaviors, interpersonal behaviors, and intrapersonal behaviors. Oswald et al.
developed a 12-item behaviorally anchored rating scale (BARS) to assess college students’
adjustment in these 12 dimensions; for instance, (a) knowledge, learning, and mastery of general
principles; (b) continuous learning, intellectual interest, and curiosity; (c) artistic cultural
appreciation and curiosity, etc. For each dimension, peers were presented its name, definition,
and two brief examples and were then asked to rate their friend’s adjustment on this dimension
using a seven-point scale ranging from 1 (strongly disagree) to 7 (strongly agree). There was no missing data at the item level. Cronbach’s α was .87 in the test sample. Following Oswald et al., overall
peer-rating scores (the sum of scores on the 12 dimensions) were used in subsequent analyses. In
the current test sample, GPA and peer-rated college adjustment were modestly correlated (r
= .21), suggesting that the two criteria represent distinct, yet somewhat overlapping conceptual
domains.
Control variables. We obtained participants’ ACT and/or SAT scores from the Registrar
along with their cumulative GPA. We converted SAT scores into ACT scores using the 2018
ACT-SAT concordance table (ACT, 2018). When examining the criterion-related validity of
personality scores, we controlled for ACT scores, which served as proxies for cognitive ability.
Analytic Strategies
Natural language processing (NLP). The conversation text from each participant was
first segmented into single sentences. Then, sentence embedding (encoding) was performed on
each sentence using the Universal Sentence Encoder (USE; Cer et al., 2018), which is a DL
model that was trained and optimized for greater-than-word-length text (e.g., sentences, phrases,
and paragraphs). A sentence embedding yields a list of values, usually in an ordered vector, that
numerically represent the meanings of a sentence by machine understanding. The USE takes in a
sentence string and outputs a 512-dimension vector. The USE model adopted in this study was
pre-trained with a deep averaging network (DAN) encoder on a variety of large data sources and
a variety of tasks by the Google USE team to capture semantic textual information such that
sentences with similar meanings should have embeddings close together in the embedding space.
The resultant model (a DL network) is then fine-tuned by adjusting parameters for the training
data. Given the advantage of capturing contextual meanings, the USE is commonly used for text
classification, semantic similarity, clustering, and other natural language tasks. To obtain text
features for predictive models, we averaged each participant’s sentence embeddings across
sentences, resulting in 512 feature scores for each participant. The same average sentence vector
was used as the predictor in all subsequent model buildings. The practice of averaging
embeddings is common in NLP research and has been shown to yield excellent model
performance in various language tasks with much higher model training efficiency than other
types of models, including some more sophisticated ones (Joulin et al., 2016; Yang et al., 2019).
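As a rough illustration of this feature-extraction step, the sketch below loads the publicly released USE module from TensorFlow Hub, embeds each sentence, and averages the resulting 512-dimensional vectors for one participant. The module URL and the example sentences are assumptions for illustration and may not match the exact model version or preprocessing used by the AI firm; running the sketch requires the tensorflow and tensorflow_hub packages.

```python
import numpy as np
import tensorflow_hub as hub

# Load the publicly released USE module (a DAN-based encoder). The exact
# model version used in the study is not known to us; this URL is an
# illustrative assumption.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def participant_features(sentences):
    """Embed each sentence (512-dim vector) and average across sentences."""
    embeddings = embed(sentences).numpy()   # shape: (n_sentences, 512)
    return embeddings.mean(axis=0)          # shape: (512,)

# Example: one participant's segmented conversation text (hypothetical).
sentences = ["I enjoy meeting new people.",
             "I usually plan my week well in advance."]
features = participant_features(sentences)  # 512 feature scores per person
```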
Model building. When multicollinearity among predictors is high and/or when there are
many predictors relative to sample size—which is a typical situation ML scholars face—ordinary least squares regression estimators remain unbiased but tend to have large variances (Zou & Hastie, 2005). Regularized regression methods are used to help address the bias-variance
tradeoff. For instance, ridge regression penalizes large β’s by imposing the same amount of
shrinkage across β’s, referred to as the L2 penalty (Hoerl & Kennard, 1988). LASSO regression
shrinks some β’s to zero with varying amounts of shrinkage across β’s, referred to as the L1
penalty (Tibshirani, 1996). Elastic net regression combines ridge regression and LASSO
regression using two hyperparameters: alpha (α) and lambda (λ) (Zou & Hastie, 2005).7 The
alpha (α) parameter determines the relative weights given to the L1 vs. L2 penalty. When α
ranges between 0 and .5, elastic net behaves more like ridge regression (L2 penalty; at α = 0, it
becomes completely ridge regression). When α ranges between .5 and 1, elastic net behaves
more like LASSO regression (L1 penalty; at α = 1, it becomes completely LASSO regression).
The lambda (λ) parameter determines how severely regression weights are penalized. We built
predictive models on the training sample (n = 1,037) using elastic net regression via the scikit-
learn package in Python 3 (Pedregosa et al., 2011). Five-fold cross-validation was used to help
tune the model’s hyperparameters (α and λ).8 Once hyperparameters were tuned, the model was
7 The loss function in elastic net regression is $\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}+\lambda\left[(1-\alpha)\sum_{j=1}^{p}\beta_j^{2}+\alpha\sum_{j=1}^{p}\left|\beta_j\right|\right]$, in which the first term is the OLS loss function (sum of squared residuals), the second term is the L2 penalty (ridge regression), and the third term is the L1 penalty (LASSO regression), with α indicating the relative weights given to the L1 vs. L2 penalty and λ indicating the amount of penalty. Please refer to the online supplemental materials A for a non-technical explanation of how elastic net regression works.
8 Specifically, the training sample was split into five equally sized partitions to build different combinations of
training and validation datasets for better estimation of the model’s out-of-sample performance. A model was fit
using all subsets except the first fold, and then the model was applied to the first fold to examine model performance
(i.e., prediction error in the validation dataset in the current case). Then the first subset was returned to the training
sample and the second subset was used as the hold-out sample in the 2nd round of cross-validation. The procedure
was repeated until the kth round of cross-validation was completed. In each round of cross-validation, a series of
possible values of hyperparameters (α and λ) were explored and corresponding model performance indices were
obtained. Then, the model performance indices associated with the same set of hyperparameter values were averaged across five cross-validation trials, and hyperparameter values associated with the best overall model performance were chosen as the optimal hyperparameters, thus completing the hyperparameter tuning process.
then fitted to the entire training sample with the hyperparameters fixed to their optimal values to
obtain the final model parameters for the training sample. The elastic net has been considered the
optimal modeling solution for linear relationships (Putka et al., 2018). Next, we applied the
trained models to predict personality scores for 1,037 and 407 participants in the training and test
samples, respectively. We trained 30 predictive models.
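A minimal scikit-learn sketch of this facet-level model-building step is shown below with placeholder data rather than the study's feature matrix. Note the naming clash: scikit-learn's l1_ratio corresponds to the α described above, and its internally searched alpha grid corresponds to λ. In the study's design this procedure would be repeated once per facet, yielding the 30 models.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(1)
X = rng.normal(size=(1037, 512))   # placeholder for averaged sentence embeddings
y = rng.normal(size=1037)          # placeholder for one self-reported facet score

# l1_ratio plays the role of the paper's alpha (L1 vs. L2 mix); the grid of
# penalty strengths searched internally plays the role of the paper's lambda.
model = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
                     n_alphas=50, cv=5, max_iter=10000, random_state=1)
model.fit(X, y)   # 5-fold CV tunes the hyperparameters, then refits on all data

machine_facet_scores = model.predict(X)   # apply to training (or test) features
```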
Reliability. Internal consistency and test-retest reliability are straightforward to estimate.
For split-half reliability, we randomly divided participants’ conversation scripts into two halves
with an equal number of sentences in each, applied the trained model to obtain two sets of
machine scores, and then calculated their correlations with the Spearman-Brown correction. To
obtain more consistent results, we shuffled the sentence order for each participant before each
split-half trial and reported split-half reliabilities averaged over 20 trials.
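The sketch below illustrates this split-half procedure; embed_and_average and trained_model are hypothetical stand-ins for the embedding step and a fitted facet model, not the study's actual code.

```python
import numpy as np

def split_half_reliability(sentences_per_person, embed_and_average,
                           trained_model, n_trials=20, seed=0):
    """Shuffled split-half reliability with Spearman-Brown correction,
    averaged over n_trials random splits (placeholder helper functions)."""
    rng = np.random.default_rng(seed)
    corrected = []
    for _ in range(n_trials):
        half1_scores, half2_scores = [], []
        for sentences in sentences_per_person:
            order = rng.permutation(len(sentences))   # shuffle sentence order
            mid = len(order) // 2
            first = [sentences[i] for i in order[:mid]]
            second = [sentences[i] for i in order[mid:]]
            x1 = embed_and_average(first)[None, :]    # 1 x 512 feature row
            x2 = embed_and_average(second)[None, :]
            half1_scores.append(trained_model.predict(x1)[0])
            half2_scores.append(trained_model.predict(x2)[0])
        r = np.corrcoef(half1_scores, half2_scores)[0, 1]
        corrected.append(2 * r / (1 + r))             # Spearman-Brown correction
    return float(np.mean(corrected))
```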
Factorial validity. As independent-cluster structures with zero cross-loadings are too
ideal to be true for personality data (Hopwood & Donnellan, 2010) and forcing non-zero cross-
loadings to zero is detrimental in multiple ways (Zhang et al., 2021), a method that allows cross-
loadings would be more appropriate. In addition, since we have two types (sets) of personality
scores, it would be more informative if both types of scores are modeled within the same model
so that correlations among latent factors derived from self-reported questionnaire-derived and
machine-inferred scores can be directly estimated. Therefore, the set exploratory structural
equation modeling (Set-ESEM; Marsh et al., 2020) is an excellent option. In Set-ESEM, two (or
more) sets of constructs are modeled within a single model such that cross-loadings are allowed
for factors within the same set but are constrained to be zero for constructs in different sets.
Set-ESEM overcomes the limitations of ESEM (i.e., lack of parsimony and potential in
confounding constructs) and represents a middle ground between the flexibility of exploratory
factor analysis (EFA) or full ESEM and the parsimony of confirmatory factor analysis (CFA) or
SEM, as Set-ESEM aims to achieve a balance between CFA and full ESEM in terms of goodness-
of-fit, parsimony, and factor structure definability (i.e., specifications of empirical item-factor
mappings corresponding to a priori theories; Marsh et al., 2020). Target rotation was used
because we have some prior knowledge about which personality domain (factor) each facet
should belong to. The Set-ESEM analyses were conducted using Mplus 8.5 (Muthén & Muthén,
1998-2017).
Tucker’s Congruence Coefficients (TCCs) were used to quantify the similarity between
factors assessed with a self-report questionnaire and factors inferred by the machine. TCC has
been shown as a useful index for representing the similarity of factor loading patterns of
comparison groups (Lorenzo-Seva & ten Berge, 2006): A TCC value in the range of .85 — .94
corresponds to a fair similarity, while a TCC value at or higher than .95 can be seen as evidence
that the two factors under comparison are identical. TCCs were calculated using the package
psych, version 2.2.5 (Revelle, 2022) in R, version 4.0.0 (R Core Team, 2020). Given that TCC
focuses primarily on the overall pattern (profile similarity), we also examined absolute
agreement of magnitude. Specifically, we plotted factor loadings from machine-inferred facet scores
against those from self-reported questionnaire-derived facet scores. We also calculated the Root
Mean Square Error (RMSE) for each domain in both the training and testing samples. We further
added two lines in the plots to show the range where the difference in factor loadings is less
than .10. If most dots fall within the range formed by the two lines, it means that factor loadings
were also similar in magnitude across the two measurement approaches.
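For readers who prefer to compute these indices outside the psych package, the sketch below implements the standard TCC formula and the loading RMSE directly in Python; the loading values shown are hypothetical, not study results.

```python
import numpy as np

def tucker_congruence(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two factor loading vectors."""
    a, b = np.asarray(loadings_a, float), np.asarray(loadings_b, float)
    return float(np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)))

def loading_rmse(loadings_a, loadings_b):
    """Root mean square difference between two sets of factor loadings."""
    a, b = np.asarray(loadings_a, float), np.asarray(loadings_b, float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Hypothetical loadings of six facets on one domain factor (not study values).
self_report_loadings = [0.72, 0.65, 0.58, 0.70, 0.61, 0.55]
machine_loadings     = [0.68, 0.60, 0.50, 0.66, 0.57, 0.49]
print(tucker_congruence(self_report_loadings, machine_loadings),
      loading_rmse(self_report_loadings, machine_loadings))
```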
Convergent and discriminant validity. We used Woehr et al.’s (2012) convergence,
discrimination, and method variance indices calculated based on the MTMM matrix to examine
convergent and discriminant validity. The convergence index (C1) was calculated as the average
of the monotrait-heteromethod correlations. Conceptually, C1 indicates the proportion of
expected observed variance in trait-method units attributable to the person main effects and
shared variance specific to traits. A positive and large C1 indicates strong convergent validity
and serves as the benchmark to examine discriminant indices (Woehr et al., 2012).
The first discriminant index (D1) was calculated by subtracting the average of absolute
heterotrait-heteromethod correlations from C1, where the former is conceptualized as the
proportion of expected observed variance attributable to the person main effects. Thus, a positive
and large D1 indicates that a much higher proportion of expected observed variance can be
attributed to traits vs. person, thus high discriminant validity. The second discriminant index
(D2) is calculated by subtracting the average of absolute heterotrait-monomethod correlations
from C1. Conceptually, D2 compares the proportion of expected observed variance attributable
to traits vs. methods. A positive and large D2 indicates high discriminant validity, in that trait-
specific variance dominates method-specific variance (Woehr et al., 2012). We also calculated
D2a, a variant of D2 calculated using only the machine monomethod correlations (i.e., C1 minus the average of absolute heterotrait-machine-method correlations). This is done considering previous
empirical studies showing that the machine method tends to pose a major threat to the
discriminant validity of machine-inferred personality scores (Park et al., 2015; Tay et al., 2020).
We also calculated the average of absolute heterotrait-machine method correlations and the
average of absolute heterotrait-self-report method correlations. If the former is substantially
Personality Assessment through an AI Chatbot 27
larger than the latter, it suggests machine-inferred scores tend to have relatively poorer
discriminant validity than self-reported questionnaire-derived scores.
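For concreteness, the sketch below shows how C1, D1, D2, and D2a can be computed from a 10 × 10 MTMM correlation matrix (five traits × two methods). It is an illustrative Python implementation under an assumed trait and method ordering and uses synthetic data, not the study's matrices.

```python
import numpy as np

# Synthetic stand-in for the 10 x 10 MTMM correlation matrix, with rows/columns
# ordered as self-report O, C, E, A, N followed by machine-inferred O, C, E, A, N.
rng = np.random.default_rng(4)
scores = rng.normal(size=(300, 10))          # placeholder trait-method unit scores
R = np.corrcoef(scores, rowvar=False)

n_traits = 5
mono_hetero, hetero_hetero, hetero_mono, hetero_machine = [], [], [], []
for i in range(2 * n_traits):
    for j in range(i + 1, 2 * n_traits):
        same_trait = (i % n_traits) == (j % n_traits)
        same_method = (i // n_traits) == (j // n_traits)
        if same_trait and not same_method:
            mono_hetero.append(R[i, j])            # convergent validities
        elif not same_trait and not same_method:
            hetero_hetero.append(abs(R[i, j]))
        elif not same_trait and same_method:
            hetero_mono.append(abs(R[i, j]))
            if i // n_traits == 1:                 # machine-method block only
                hetero_machine.append(abs(R[i, j]))

C1 = np.mean(mono_hetero)                  # convergence
D1 = C1 - np.mean(hetero_hetero)           # discrimination relative to person effects
D2 = C1 - np.mean(hetero_mono)             # discrimination relative to method effects
D2a = C1 - np.mean(hetero_machine)         # D2 computed from the machine-method block only
print(f"C1={C1:.2f}, D1={D1:.2f}, D2={D2:.2f}, D2a={D2a:.2f}")
```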
Criterion-related validity. To examine criterion-related validity, we first examined the partial correlations between machine-inferred personality domain scores and the two external non-self-report criteria (GPA and peer-rated college adjustment), controlling for ACT scores. We then ran 10 sets of regression analyses with the two external criteria as separate outcomes and each of the Big Five domain scores as separate personality predictors. Specifically, the criterion was first regressed on ACT scores (step 1), then on the self-reported questionnaire-derived personality domain score (step 2), and then on the corresponding machine-inferred personality domain score (step 3). We were aware that the typical analytic strategy for examining criterion-related validity entails entering all five personality domain scores simultaneously in the respective steps of the regression models. However, we were concerned that, given the poor discriminant validity of machine scores (i.e., the high correlations among them), such a strategy might mask the effects of machine scores on the criteria.
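The sketch below illustrates the three-step hierarchical structure for one domain-criterion pair (e.g., Conscientiousness predicting GPA). The data are synthetic and the variable names are placeholders; this is not the analysis script used in the study.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in data for ACT scores, self-reported and machine-inferred
# Conscientiousness, and cumulative GPA.
rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "act": rng.normal(25, 4, n),
    "self_C": rng.normal(0, 1, n),
    "machine_C": rng.normal(0, 1, n),
})
df["gpa"] = 0.05 * df["act"] + 0.20 * df["self_C"] + 0.10 * df["machine_C"] + rng.normal(0, 0.5, n)

y = df["gpa"]
step1 = sm.OLS(y, sm.add_constant(df[["act"]])).fit()                         # step 1: ACT only
step2 = sm.OLS(y, sm.add_constant(df[["act", "self_C"]])).fit()               # step 2: + self-report domain score
step3 = sm.OLS(y, sm.add_constant(df[["act", "self_C", "machine_C"]])).fit()  # step 3: + machine-inferred score

print(f"Incremental R-squared at step 3: {step3.rsquared - step2.rsquared:.3f}")
```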
Results
Reliability
Tables 2 and 3 present facet-level and domain-level reliabilities of self-reported
questionnaire-derived and machine-inferred personality scores, respectively. Several
observations are noteworthy. First, at the facet level, self-reported questionnaire-derived
personality scores showed good internal consistency and comparable split-half reliabilities,
which is not surprising, as the IPIP-300 is a well-established personality inventory.
The second observation is that at the facet level, split-half reliabilities of machine-inferred personality scores, albeit somewhat lower than those of self-reported questionnaire-derived personality scores, were overall in the acceptable range. Averaged split-half reliabilities for facet scores in the training and test samples were .68 and .63 for Openness facets; .67 and .68 for Conscientiousness facets; .64 and .64 for Extraversion facets; .73 and .68 for Agreeableness facets; and .60 and .57 for Neuroticism facets. These results were comparable to those reported in similar ML studies (e.g., Hoppe et al., 2018; Wang & Chen, 2020; Youyou et al., 2015). Further, split-half reliabilities were comparable between the training and test samples (averaged split-half reliabilities of .66 and .64, respectively), suggesting good cross-sample generalizability.
The third observation is that at the facet level, test-retest reliabilities of machine-inferred personality scores were comparable to their split-half reliabilities. Averaged test-retest reliabilities in the test sample were .67 for Openness facets, .59 for Conscientiousness facets, .66 for Extraversion facets, .63 for Agreeableness facets, and .58 for Neuroticism facets, with an average of .63 across all facet scales. The modest retest sample size (n = 61) rendered wide 95% confidence intervals (CIs) for the test-retest reliabilities, with CI widths ranging from .21 to .41 (average = .31). Thus, the above findings should be interpreted with caution. We also note that the test-retest reliabilities of machine-inferred personality scores were lower than those of self-reported questionnaire-derived personality scores, which are estimated to average around .80 according to a meta-analysis (Gnambs, 2014).
The fourth observation is that at the domain level, machine-inferred personality scores demonstrated somewhat higher internal consistency reliabilities than self-reported questionnaire-derived personality domain scores when facet scores were treated as "items." In the training sample, averaged Cronbach's αs across all Big Five domains for self-reported questionnaire-derived and machine-inferred personality scores were .80 and .88, respectively. In the test sample, averaged Cronbach's αs were .79 and .88, respectively. This pattern was somewhat unexpected. One possible explanation is that the many significant features shared by the predictive models for facets within the same Big Five domain might have inflated the domain-level Cronbach's αs of machine scores. Note that the Cronbach's αs of machine-inferred domain scores were virtually identical between the training and test samples (averaged αs = .88 and .88, respectively), indicating excellent cross-sample generalizability.
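For clarity, domain-level α here treats the six facet scores within a domain as "items." A minimal sketch of that computation on synthetic facet scores follows.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_persons x n_items) score matrix."""
    items = np.asarray(items)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Synthetic machine-inferred facet scores for one domain: six facets ("items")
# for 400 participants sharing a common domain-level factor.
rng = np.random.default_rng(2)
domain_factor = rng.normal(0, 1, 400)
facet_scores = domain_factor[:, None] + rng.normal(0, 0.7, size=(400, 6))
print(round(cronbach_alpha(facet_scores), 2))
```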
Based on the above findings, there is promising evidence that machine-inferred personality domain scores demonstrated excellent internal consistency. Further, machine-inferred personality facet scores exhibited overall acceptable split-half and test-retest reliabilities, albeit lower than those of self-reported questionnaire-derived facet scores. In addition, both domain-level internal consistency and facet-level split-half reliabilities of machine-inferred personality scores showed strong cross-sample generalizability.
Factorial Validity
To assess Set-ESEM model fit, the chi-square goodness-of-fit statistic, the maximum-likelihood-based Tucker-Lewis Index (TLI), the Comparative Fit Index (CFI), the root mean square error of approximation (RMSEA), and the standardized root mean square residual (SRMR) were calculated. For the training sample: χ2 = 11370.56, df = 1435, p < .01, CFI = .87, TLI = .84, RMSEA = .08, SRMR = .03; for the test sample: χ2 = 5795.88, df = 1435, p < .01, CFI = .85, TLI = .81, RMSEA = .09, SRMR = .04. Although most of these model fit indices did not meet the commonly used rules of thumb (Hu & Bentler, 1999), we consider them adequate given the complexity of personality structure (Hopwood & Donnellan, 2010) and because they resemble values reported in the literature (e.g., Booth & Hughes, 2014; Zhang et al., 2020).
Table 4 presents the Set-ESEM factor loadings for the self-reported questionnaire-derived personality scores and the machine-inferred personality scores in the test sample. (The factor loadings in the training sample can be found in Table S4 in the online supplemental materials A.) The overall patterns of facets loading onto their corresponding Big Five domains are clear. For the self-reported questionnaire-derived scores, all facets loaded most strongly on their designated factors, with overall low cross-loadings, except for the activity level facet of the Extroversion factor (which had its highest loading on Conscientiousness) and the feelings facet of the Openness factor (which had its highest loading on Neuroticism). For the machine-inferred personality scores, the overall Big Five patterns were recovered clearly as well, with the exceptions, once again, of the activity level and feelings facets bearing their highest loadings on non-target factors (though their loadings on the target factors were moderately high). In general, the machine-inferred personality scores largely replicated the patterns and structure observed for the self-reported questionnaire-derived personality scores.
Results further indicated that the TCCs for Extroversion, Agreeableness, Conscientiousness, Neuroticism, and Openness were .98 and .97, .98 and .97, .98 and .98, .98 and .98, and .95 and .94 in the training and test samples, respectively. As such, factor profile similarity between the self-reported questionnaire-derived and machine-inferred facet scores was confirmed in both the training and test samples. Moreover, the plots in Figure 2 clearly show that factor loadings from the two measurement approaches are similar to one another in terms of both rank order and magnitude, as most dots fall close to the line Y = X. In addition, RMSEs were small in general (.08 to .11 in the training sample and .09 to .14 in the test sample).
Thus, there is promising evidence that the machine-inferred personality facet scores recovered the underlying Big Five structure, and that factor loading patterns and magnitudes are similar across the two measurement approaches. The factorial validity of machine-inferred personality scores is therefore established in our samples. In addition, given the similar model fit indices, similar factor loading patterns and magnitudes, and similarly high TCCs between the training and test samples, cross-sample generalizability of factorial validity seems promising.
Convergent and Discriminant Validity
Tables 5 and 6 present MTMM matrices of latent and manifest Big Five domain scores,
respectively. As can be seen in Table 7, at the latent variable level, the convergence indices (C1s) were relatively large (.59 and .48), meaning that 59% and 48% of the observed variance can be attributed to person main effects and trait-specific variance in the training and test samples, respectively. The first discrimination indices (D1s: .48 and .38) indicate that 48% and 38% of the observed variance can be attributed to trait-specific variance in the training and test samples, respectively. Contrasting these values with the C1s suggests that most of the convergence is contributed by trait-specific variance. The second discrimination indices (D2s: .43 and .33) were positive and moderate, indicating that the percentage of shared variance specific to traits is 43 and 33 percentage points higher than the percentage of shared variance specific to methods in the training and test samples, respectively. D2a, calculated using only the machine monomethod correlations, was .39 and .28 in the training and test samples, respectively, suggesting that the machine method tended to yield somewhat higher levels of method variance. In addition, the average absolute inter-correlations of self-reported questionnaire-derived domain scores were .12 and .11 in the training and test samples, respectively, whereas the average absolute inter-correlations of machine-inferred domain scores were .20 and .20. Taken together, machine-inferred latent personality domain scores demonstrated excellent convergent validity but somewhat weaker discriminant validity.
Table 7 shows that at the manifest variable level, the convergence index (C1) was .57 and .46 in the training and test samples, indicating good convergent validity. D1 was .40 and .31 in the training and test samples, respectively, suggesting that most of the convergence is contributed by trait-specific variance. D2 was .30 and .19 in the training and test samples, respectively, showing that the percentage of shared variance specific to traits is substantially higher than the percentage of shared variance specific to methods. D2a was .24 and .11 in the training and test samples, respectively. The magnitude of D2a in the test sample suggests that the percentage of shared variance specific to traits is only slightly higher than the percentage of shared variance specific to the machine method. Indeed, Table 6 shows that four out of ten heterotrait-machine monomethod correlations exceeded the C1 (.46). In addition, the average absolute inter-correlations of self-reported questionnaire-derived domain scores were .20 and .19 in the training and test samples, respectively, whereas those of machine-inferred domain scores were .33 and .35 (Δs = .13 and .16). Thus, machine-inferred manifest domain scores demonstrated excellent convergent validity but weaker discriminant validity. The poor discriminant validity in the test sample is particularly concerning.
Table 7 also indicates that D2a was .39 vs. .24 at the latent vs. manifest variable level in the training sample (ΔD2a = .15) and .28 vs. .11 in the test sample (ΔD2a = .17). These patterns suggest that the discriminant validity of machine-inferred personality domain scores was somewhat higher at the latent variable level than at the manifest variable level.
Based on the above findings, there is evidence that machine-inferred latent personality domain scores displayed excellent convergent validity and somewhat weaker discriminant validity. Although machine-inferred manifest personality domain scores also showed good convergent validity, their discriminant validity was less impressive, particularly in the test sample. Nevertheless, overall, our results show improvements in differentiating among traits in machine-inferred personality scores compared to existing machine learning applications (e.g., Hickman et al., 2022; Marinucci et al., 2018; Park et al., 2015). Table 7 also shows that C1, D1, and D2 (D2a) dropped by around .10 from the training to the test sample at both the latent and manifest variable levels. Thus, there is evidence for reasonable levels of cross-sample generalizability of convergent and discriminant relations.
Criterion-Related Validity
Table 6 reports the bivariate correlations between manifest personality domain scores and the two criteria, cumulative GPA and peer-rated college adjustment. We also calculated partial correlations controlling for ACT scores. Controlling for ACT scores, GPA was significantly correlated with four machine-inferred domain scores: Openness (r = ‒.11), Conscientiousness (r = .13), Extroversion (r = .12), and Neuroticism (r = ‒.12); peer-rated college adjustment was significantly correlated with four machine-inferred domain scores: Conscientiousness (r = .17), Extroversion (r = .18), Agreeableness (r = .12), and Neuroticism (r = ‒.11). Thus, machine-inferred domain scores demonstrated initial evidence of low levels of criterion-related validity. The correlations between self-reported questionnaire-derived domain scores and the two criteria were largely consistent with previous research (e.g., McAbee & Oswald, 2013; Oswald et al., 2004).
Next, we conducted 10 sets of hierarchical regression analyses (five domains × two
criteria) to examine the incremental validity of machine-inferred domain scores.9 Table 8
presents the results of these regression analyses. Several observations are noteworthy. First,
whereas ACT scores were a significant predictor of cumulative GPA, they did not predict peer-rated college adjustment at all, suggesting that the former criterion has a strong cognitive
connotation, whereas the latter criterion does not. Second, after controlling for ACT scores, self-
reported questionnaire-derived personality domain scores exhibited modest incremental validity
on the two criteria, with four domain scores being significant predictors of cumulative GPA and
three domain scores being significant predictors of peer-rated college adjustment.
Third, after controlling for ACT and self-reported questionnaire-derived personality domain scores, machine-inferred personality domain scores, overall, failed to explain additional variance in the two criteria; however, there were three important exceptions. Specifically, in one set of regression analyses involving Extroversion scores as the predictor and GPA as the criterion, the machine-inferred Extroversion scores explained an additional 3% of the variance in cumulative GPA (β = .18, p < .001). Interestingly, the regression coefficient of self-reported questionnaire-derived Extroversion scores became more negative from step 2 to step 3 (with β changing from ‒.09 [p = .038] to ‒.17 [p < .001]), suggesting a potential suppression effect (Paulhus et al., 2004). In a second set of regression analyses, involving Extroversion scores as the predictor and peer-rated college adjustment as the criterion, machine-inferred Extroversion scores explained an additional 3% of the variance in the criterion (β = .18, p < .001), with self-reported questionnaire-derived Extroversion scores being a non-significant predictor in step 2 (β = .07, p = .217) and step 3 (β = ‒.002, p = .980). In a third set of regression analyses, involving Neuroticism scores as the predictor and cumulative GPA as the criterion, machine-inferred Neuroticism scores explained an additional 1% of the variance in the criterion (β = ‒.13, p < .001), with self-reported questionnaire-derived Neuroticism scores being a non-significant predictor in step 2 (β = .03, p = .549) and step 3 (β = .08, p = .099).
9 We also ran 60 sets of regression analyses (30 facets × two criteria) to examine incremental validity of machine-inferred personality facet scores. The results are presented in online supplemental material A (Table S5).
Based on the above findings, there is some evidence that machine-inferred personality domain scores had low criterion-related validity that was overall comparable to that of self-reported questionnaire-derived personality domain scores. The only exception was that machine-inferred Conscientiousness domain scores had noticeably lower criterion-related validity than self-reported questionnaire-derived Conscientiousness domain scores. There is also preliminary evidence that machine-inferred personality domain scores had incremental validity over ACT scores and self-reported questionnaire-derived domain scores in some analyses.
Supplemental Analyses
Robustness checking. During the review process, an issue was raised about whether the
reliability and validity of machine-inferred personality scores might be compromised among
participants who provided fewer inputs than others during the virtual conversation. We thus
conducted additional analyses to examine the robustness of our findings. Based on the number of sentences participants in the test sample provided during the online chat, we formed three nested test sets: (1) the bottom 1/3 of participants (n = 165), (2) the bottom 2/3 of participants (n = 285), and (3) all participants (n = 407). We then re-ran all analyses on the two smaller test sets. Results indicated that, barring a couple of exceptions, the reliability and validity of machine-inferred personality scores were very similar across the test sets, providing strong evidence for the robustness of our findings with respect to the volume of input participants provided.10
10 We thank Associate Editor Fred Oswald for encouraging us to examine the robustness of our findings. Interested readers are referred to the online supplemental materials A (Tables S6 – S15) for detailed analysis results.
Exploring content validity of machine scores. We also conducted supplemental
analyses that allowed for an indirect and partial examination of the content validity issue.
Following Park et al.’s (2015) approach, we used the scikit-learn package in Python (Pedregosa
et al., 2011) to count one-, two-, and three-word phrases (i.e., n-grams with n = 1, 2, or 3) in the text. Words and phrases that occurred in fewer than 1% of the conversation scripts were removed from the analysis. This created a document-term matrix populated by the counts of the remaining phrases. After identifying the n-grams and their frequency scores for each participant, we calculated the correlations between the machine-inferred personality facet scores derived from the chatbot and the frequencies of language features in the test sample. If machine-inferred personality scores have content validity, they should correlate significantly with language features known to reflect the corresponding personality facet. For each personality facet, we selected the 100 most positively and 100 most negatively correlated phrases. Comprehensive lists of all language features and correlations can be found in the online supplemental material B. The results show that the phrases most strongly correlated with predictions of each personality facet were largely consistent with the characteristics of that facet. For example, high machine-inferred scores on the artistic interest facet were associated with phrases reflecting music and art (e.g., music, art, poetry) and exploration (e.g., explore, creative, reading), whereas low machine-inferred scores on this facet were associated with phrases reflecting enjoyment of sports and outdoor activities (e.g., football, game, sports). These supplemental analyses suggest that our predictive models captured aspects of language that predict specific personality facets, thereby providing partial evidence for the content validity of machine-inferred personality scores.
Discussion
The purpose of the present study was to explore the feasibility of measuring personality
indirectly through an AI chatbot, with a particular focus on the psychometric properties of
machine-inferred personality scores. Our AI chatbot approach is different from (a) earlier
approaches that relied on individuals’ willingness to share their social media content (Youyou et
al., 2015), which is not a given in talent management practice, and (b) automated video interview
systems (e.g., Hickman et al., 2022), in that our approach allows for two-way communications
and thus resembles more natural conversations. This ambitious study involved approximately 1,500 participants, adopted a design that allowed for examining cross-sample generalizability, built predictive models at the personality facet level, and used non-self-report criteria; its results showed some promise for the AI chatbot approach to personality assessment.
Specifically, we found that machine-inferred personality scores (a) had overall acceptable
reliability at both the domain and facet levels, (b) yielded a comparable factor structure to self-
reported questionnaire-derived personality scores, (c) displayed good convergent validity but
relatively poor discriminant validity, (d) showed low criterion-related validity, and (e) exhibited
incremental validity in some analyses. In addition, there is strong evidence for cross-sample
generalizability of various aspects of psychometric properties of machine scores.
Important Findings and Implications
Several important findings and their implications warrant further discussion. Regarding
reliability, our finding that the average test-retest reliability of all facet machine scores was .63 in a small independent sample (with an average time lapse of 22 days) compared favorably to the .50 average test-retest reliability (with an average time lapse of 15.6 days) reported by Hickman et al. (2022), but unfavorably to both Harrison et al.'s (2019) .81 average based on other-report models (with a one-year time lapse) and Park et al.'s (2015) .70 average based on self-report models (with a six-month time lapse). Given that both Harrison et al. and Park et al. relied on a much larger body of text obtained from participants' social media, it appears that it is not the sample size or the length of the time lapse, but rather the amount of text, that determines the magnitude of test-retest reliability of machine-inferred personality scores. Another possible
explanation is that in Hickman et al. and our studies, participants were asked to engage in the
same task (online chat or video interviews) twice with relatively short time lapses, which might
have evoked a practice × participant interaction effect. That is, participants might understand the
questions better and might provide better-quality responses during the second interview or
conversation; however, such a practice effect is unlikely to be uniform across participants, thus
resulting in lowered test-retest reliability. In contrast, Harrison et al. and Park et al.’s studies
relied on participants’ social media content over time and thus were immune from the practice
effect. However, one may counter that there might be some similar (repetitive) content on social
media, which might have resulted in overestimated test-retest reliability of machine scores.
Regarding discriminant validity, one interesting finding is that discriminant validity
seemed higher at the latent variable level than at the manifest variable level in both the training
and test samples. One possible explanation is that Set-ESEM accounted for potential cross-
loadings, whereas manifest variables did not. There is evidence that omitted cross-loadings
inflate correlations among latent factors (e.g., Asparouhov & Muthén, 2009).
Regarding criterion-related validity, despite the low criterion-related validity of machine-
inferred personality scores in predicting GPA and peer-rated college adjustment, our findings are
comparable to criterion-related validities of self-reported questionnaire-derived personality
domain scores reported in an influential meta-analysis (Hurtz & Donovan, 2000); for instance,
for Conscientiousness, r = .14 and .17 vs. .14, and for Neuroticism, r = ‒.13 and ‒.15 vs. ‒.09. Further, eight out of ten criterion-related validities of machine scores fell within the 40th (r = .12) to 60th (r = .20) percentile range of the empirically derived effect size distribution for correlations between psychological characteristics and performance (cf. Bosco et al., 2015).
The finding that machine-inferred personality scores exhibited incremental validity in three regression analyses suggests that the part of the variance in machine scores that is not shared with self-reported questionnaire-derived personality scores can be criterion-relevant. However, we know very little about the exact nature of this criterion-relevant variance; for instance, does it capture personality-relevant information or some sort of biasing factor? In the former case, we speculate that the unshared variance might have captured the reputation component of personality (as our daily conversations clearly influence how we are viewed by others), which has consistently been shown to contribute substantially to the prediction of performance criteria (e.g., Connelly et al., 2022; Connelly & Ones, 2010). However, no empirical studies have tested this speculation. This lack of understanding reflects a general issue in the ML literature: predictive utility is often established first by practitioners, with theoretical understanding lagging behind and awaiting scientific attention. We thus call for future research to close this "scientist-practitioner gap."
Another important finding is that the current study, which was based on self-report models, yielded better psychometric properties of machine-inferred personality scores (e.g., substantially higher convergent validity and cross-sample generalizability) than many similar ML studies that were also based on self-report models (e.g., Hickman et al., 2022; Marinucci et al., 2018). We offer several tentative explanations that might help reconcile this inconsistency. First, we would like to rule out data leakage as a contributor to our better findings. There are two main types of data leakage (Cook, 2021): (a) target leakage and (b) train-test contamination. Target leakage happens when the training model contains predictors that are updated or created after the target value is realized, for instance, when the algorithm for inferring personality scores is constantly being updated based on new data. Because this AI firm's algorithm for inferring personality scores is static rather than updated on the fly, target leakage can be ruled out. The second type of data leakage, train-test contamination, happens when researchers do not carefully separate the training data from the test data, for instance, when preprocessing is fitted or models are trained using both the training and the test data; this results in overfitting. In our study, however, the training and test samples were kept separate, and the test data were excluded from all model-building activities, including the fitting of preprocessing steps. Therefore, we are confident that data leakage cannot explain our superior findings on the psychometric properties of machine scores.
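As an illustration of the train-test separation described above, the sketch below fits all preprocessing inside a pipeline on the training data only. It is a generic scikit-learn pattern with placeholder data and a TF-IDF featurizer, not the vendor's actual modeling pipeline.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import ElasticNet

# Placeholder split; the test texts are never used while fitting anything.
texts_train = ["i enjoy planning ahead", "parties are the best part of my week",
               "i like quiet evenings with a book", "meeting new people energizes me"]
y_train = [4.1, 2.9, 3.8, 2.5]          # e.g., self-reported facet scores (training targets)
texts_test = ["i keep a detailed to-do list"]

# The vectorizer and scaler are fitted inside the pipeline on the training data only,
# which is what prevents train-test contamination.
model = make_pipeline(
    TfidfVectorizer(),
    StandardScaler(with_mean=False),   # sparse input, so no mean-centering
    ElasticNet(alpha=0.1),
)
model.fit(texts_train, y_train)
print(model.predict(texts_test))       # the test data enter only at prediction/evaluation
```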
We attribute our success to three factors. The first factor is sample size. The sample we used to train our predictive models (n = 1,037) was larger than the samples used in many similar ML studies. Larger samples may help detect more subtle trait-relevant features and capture more complex relationships during model training, making the trained models more accurate (Hickman et al., 2022). The second factor is the data collection method. The AI firm's chatbot system allows for two-way communication, engaging users in small talk, providing empathetic comments, and managing user digressions, all of which should lead to higher-quality data. The third factor concerns the NLP method used. The present study used a sentence embedding technique, the USE, which is a DL-based NLP method. The USE goes beyond the simple count-based representations used in previous ML studies, such as bag-of-words and Linguistic Inquiry and Word Count (LIWC; Pennebaker et al., 2015), and retains contextual information in language as well as relations across whole sentences (Cer et al., 2018). There is consistent empirical evidence that DL-based NLP techniques tend to outperform n-gram and lexical methods (e.g., Guo et al., 2021; Mikolov, 2012).
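For readers unfamiliar with this kind of pipeline, the sketch below shows one common way USE sentence embeddings can be fed into an elastic net model. The TensorFlow Hub URL points to the publicly released USE; the vendor's actual feature extraction and model-fitting details are proprietary, so this is an illustrative approximation with placeholder data.

```python
import tensorflow_hub as hub
from sklearn.linear_model import ElasticNet

# Load the publicly released Universal Sentence Encoder (512-dimensional embeddings).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Placeholder data: chatbot responses and self-reported facet scores (training targets).
responses = [
    "I usually plan my week in advance and keep a detailed to-do list.",
    "I tend to leave assignments until the last minute.",
    "I double-check my work before submitting it.",
    "Deadlines sneak up on me more often than I would like.",
]
facet_scores = [4.3, 2.1, 4.0, 2.5]

X = embed(responses).numpy()                     # shape: (n_participants, 512)
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, facet_scores)
print(model.predict(embed(["I organize my tasks carefully."]).numpy()))
```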
Contributions and Strengths
The present study makes several important empirical contributions to personality science and the ML literature. First, in the most general sense, the current study represents the most comprehensive examination to date of the psychometric properties of machine-inferred personality scores. Our findings, taken as a whole, greatly enhance confidence in the ML approach to personality assessment. Second, the current study demonstrates, for the first time, that machine-inferred personality facet scores are structurally equivalent to self-reported questionnaire-derived personality facet scores. Third, to the best of our knowledge, the present study is the first in the broad ML literature to show incremental validity of machine-inferred personality scores over questionnaire-derived personality scores with non-self-report criteria. Admittedly, scholars in other fields such as strategic management (e.g., Harrison et al., 2019; Wang & Chen, 2020) and marketing (e.g., Liu et al., 2021; Shumanov & Johnson, 2021) have reported that machine-inferred personality scores predicted non-self-report criteria. Further, there have been trends toward using ML in organizational research to predict non-self-report criteria (e.g., Putka et al., 2022; Sajjadiani et al., 2019; Spisak et al., 2019). However, none of these studies reported incremental validity of machine-inferred personality scores beyond self-reported questionnaire-derived personality scores. This is significant because establishing the incremental validity of machine-inferred scores is a precondition for the ML approach to personality assessment, and for any other new talent signals, to gain legitimacy in talent management practice (Chamorro-Premuzic et al., 2016).
Two methodological strengths of the present study should also be noted. First, building
predictive models at the personality facet level opened opportunities to examine a few important
research questions such as internal consistency at the domain level, factorial validity, and
convergent and discriminant validity at the latent variable level. These questions cannot be
investigated when predictive models are built at the domain level, which is the case for most ML
studies. Second, the current research design with a training sample and an independent test
sample allowed us to examine numerous aspects of cross-sample generalizability including
reliabilities, factorial validity, and convergent and discriminant validity.
Study Limitations
Several study limitations should be kept in mind when interpreting our results. The first
limitation concerns the generalizability of our models and findings. Our samples consisted of
young, predominantly female, college-educated people, and as a result, our models and findings
might not generalize to working adults. In the current study, we used the AI firm’s default
interview questions to build and test predictive models. Given that the AI firm’s chatbot system
allows for tailor-making conversation topics, interview questions, and their temporal order, we
do not know to what extent predictive models built based on different sets of interview questions
would yield similar machine scores. Further, the current study context was a non-selection,
research context, and it is unclear whether our findings might generalize to selection contexts
where applicants are motivated to fake and thus might provide quite different responses during
virtual conversations. In addition, both the USE and the elastic net analyses would be difficult to replicate in their exact form. For instance, using any pre-trained model other than the
USE (e.g., the Bidirectional Encoder Representations from Transformers, or BERT; Devlin et al.,
2019) would produce a different dimensional arrangement of vector representations. Therefore,
we call for future research to examine the cross-model, cross-method, cross-population, and
cross-context generalizability of machine-inferred personality scores.
The second limitation is that the quality of the predictive models we built might have been hampered by several factors: for instance, some participants might not have responded truthfully to the self-report personality inventory (IPIP-300); the USE might have inappropriately encoded regional dialects not well represented in the training data; some participants were much less verbally expressive than others, yielding less text; and some participants were less able or interested in contributing to the virtual conversations, to name just a few. In addition, models built in high-stakes vs. low-stakes situations might yield different parameters. Future model-building efforts should take these factors into account.
The third limitation is that we were not able to examine the content validity of machine scores directly. A major advantage of DL models lies in their accuracy and generalizability. However, as of now, these DL models, including the USE used in the current study, are not very interpretable and have weak theoretical commitments, as the DL-derived features do not have substantive meanings. We thus encourage computer scientists and psychologists to work together to figure out the substantive meanings of the high-dimensional vectors extracted from various DL models, which would allow for a fuller and more direct investigation of the content validity of machine-inferred personality scores. This effort aligns with the growing research area of explainable AI in ML (Tippins et al., 2021).
Future Research Directions
Despite some promising findings, we recommend several future research directions to
further advance the ML approach to personality assessment. First, the default interview questions in the AI firm's chatbot system should probably be considered semi-structured in nature: although all participants went through the same interview questions, the questions were designed to engage users, with no explicit intention to solicit personality-related information. It thus remains to be seen whether more structured interview questions designed to systematically tap into various personality traits, and the predictive models built accordingly, might yield more accurate and valid machine-inferred personality scores. For instance, it is possible to develop 30 interview questions, each targeting one of the 30 personality facets. We are optimistic about such an approach for the following reasons. First, when interview questions are built to inquire about respondents' thoughts, feelings, and behaviors associated with specific personality facets and domains, the language data will be contextually grounded and trait-driven. As a result, NLP algorithms should be able to capture not only the linguistic cues but also the trait-relevant content of respondents' narratives. For instance, when assessing Conscientiousness, a question about personal work style should prompt a respondent to type an answer using more work-relevant words and phrases and to depict their work style in a way that allows algorithms to extract relevant features more accurately. This should improve the predictive accuracy of the predictive models, resulting in better convergent validity.
At the same time, when different questions are asked that tap into different personality
facets or domains, text content unique to specific facets or domains is expected to be solicited.
Questions representing different personality facets should contribute uniquely to the numerical
representations of texts, as the semantics of the texts would cluster differently due to facet or
domain differences. As a result, in the prediction algorithms, clusters involving language features
that are more relevant to a personality facet or domain would carry more weight (i.e., predictive
power) for that specific trait and less so for others. In other words, language features mined this
way are less likely to overlap and should result in improved discriminant validity. However, one may counter that structured interview questions might create a stronger situation that restricts variance in responses during the online chat, resulting in worse outcomes than unstructured interview questions. In any event, future research is needed to address this
important research question.
Second, given the recent advancements in personality psychology showing that self-
reported and other-reported questionnaire-derived personality scores seem to capture distinct
aspects of personality, with self-reports tapping into identity and other-reports tapping into
reputation (e.g., McAbee & Connelly, 2016), it may be profitable to develop two parallel sets of predictive models trained on self-reports and other-reports, respectively. In a recent study, Connelly et al. (2022) empirically showed that the reputation component (assessed through other-reports) of Conscientiousness and Agreeableness dominated the prediction of several performance criteria, whereas the identity component (assessed through self-reports) generally did not predict performance criteria. These findings suggest that machine-inferred personality scores based on other-reports have a high potential to demonstrate incremental validity over self-reported questionnaire-derived scores and over machine scores based on self-reports. Future research
should empirically investigate this possibility.
Third, future research is also needed to examine whether the ML approach to personality
assessment is resistant to faking. There are two different views. The first view argues that
machine-inferred personality scores cannot be faked. This is mainly because the ML approach
uses a large quantity (usually in the hundreds or thousands or even more) of empirically derived
features that bear no substantive meaning, making it practically impossible to memorize and then
fake these features. The counterargument, however, is that in high-stakes selection situations, job
applicants may still engage in impression management in their responses when engaging with an
AI chatbot. Some of the sentiments, positive emotions, and word usage faked by job applicants
are likely to be captured by the NLP techniques and then factored into machine-inferred
personality scores. In other words, faking is still highly possible within the ML approach. Which
of the above two views is correct is ultimately an empirical question. We thus call for future
empirical research to examine the fakability of machine-inferred personality scores.
Fourth, future research should also examine criterion-related and incremental validity of
machine-inferred personality scores in selection contexts. We speculate that in selection
contexts, self-reported questionnaire-derived personality scores should have weaker criterion-
related validity due to applicant faking that introduces irrelevant variance (e.g., Lanyon et al.,
2014), whereas the criterion-related validity of machine-inferred scores would probably be maintained, given their presumed resistance to faking. As such, it is possible that machine scores are more likely to
demonstrate incremental validity in selection contexts than in non-selection contexts such as the
current one. We also encourage future researchers to use more organizationally relevant criteria
such as task performance, organizational citizenship behaviors, counterproductive work
behaviors, and turnover when examining the criterion-related validity of machine scores.
Fifth, future research should address the low discriminant validity of machine scores, as
high intercorrelations among machine-inferred domain scores tend to reduce the validity of a
composite. We suggest that the relatively poor discriminant validity of machine scores might be attributable to the fact that existing ML algorithms typically emphasize single-target optimization. Improvement may be possible through a multi-task framework. For instance, for predicting multi-dimensional constructs (e.g., personality facet or domain scores), an ideal approach may be the simultaneous optimization of multiple targets. A potential solution is to integrate target matrices (multi-dimensional facet/domain scores as ground-truth vectors) into a multi-task learning framework (i.e., a variation of transfer learning in which models are built simultaneously to perform a set of related tasks; e.g., Ruder, 2017).
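As one simple instantiation of this idea, scikit-learn's MultiTaskElasticNet fits all Big Five targets jointly with a shared sparsity pattern across tasks. The sketch below uses random placeholder data and is only meant to illustrate the multi-target setup.

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet

# Placeholder data: 200 participants, 512 text-embedding features,
# and 5 targets (self-reported Big Five domain scores).
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 512))
Y = rng.normal(size=(200, 5))

# One model predicts all five domains jointly; the mixed L1/L2 penalty is applied
# across tasks, so features are selected (or dropped) for all traits together.
model = MultiTaskElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, Y)
print(model.predict(X[:3]).round(2))   # predicted Big Five scores for three participants
```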
Finally, as mentioned earlier, one major advantage of the AI chatbot approach to
personality assessment is a less tedious testing experience. However, such an advantage is
assumed rather than empirically verified. Thus, future research is needed to compare applicant
perceptions of traditional personality assessments and of the AI chatbot approach.
Practical Considerations
Despite the initial promise of the AI chatbot approach to personality assessment, we believe
that a series of practical issues need to be sufficiently addressed before we could recommend its
implementation in applied settings. First, the AI chatbot system used in this study allows users to
tailor-make the conversation agenda. This makes sense, as organizations need to design
interview questions according to specific positions. However, there is little empirical evidence
within the chatbot approach that different sets of interview questions would yield similar
machine scores for the same individuals. In this regard, we agree with Hickman et al. (2022) that
vendors need to provide potential client organizations with enough information about the training
data such as sample demographics and the list of interview questions the models are built on.
Second, there is no empirical evidence supporting the assumption that machine scores
based on the chatbot approach are resistant to applicant faking. This is an important feature that
must be established if we are to implement such an approach in real-world selection contexts.
Third, although it has been well established that self-reported questionnaire-derived personality
scores are unlikely to result in adverse impact, there is no evidence that machine scores based on
the chatbot approach are also immune from adverse impact. It is entirely possible that certain
language features might be associated with group membership and thus need to be removed from
the predictive models. We were unable to examine this issue in the current study since the
undergraduate student population of this university is majority White (80%). Fourth, although
the present study shows that machine scores based on the chatbot approach had small criterion-
related validity, we still need more evidence for criterion-related validity in industrial
organizations with expanded criterion domains.
Fifth, the evidence for the robustness of our findings with respect to the volume of input participants provide during the online chat is encouraging and should be appealing to organizations interested in implementing the AI chatbot approach in selection practice. However, it is probably necessary to identify the lower limit (minimum number of sentences) needed to maintain good psychometric properties of machine-inferred personality scores. Sixth, considering recent research showing that other-reported questionnaire-derived personality scores have unique predictive power for job performance (e.g., Connelly et al., 2022), it is worthwhile to supplement the current predictive models based on self-reports with parallel models based on other-reports. Once other-report models are built, the AI chatbot system may provide more useful personality information for test-takers, thus further realizing the practical value of the ML approach to personality assessment.
Finally, the ML-based chatbot approach to personality assessment comes with potential
challenges related to professional ethics and ethical AI (e.g., concerns involving the fair and
responsible use of AI). For instance, is it ethical or legally defensible to have job applicants go
through a chatbot interview without telling them that their textual data will be mined for
selection-relevant information? Organizations might also decide to transform recorded in-person
interviews into texts, apply predictive models such as the one used in the present study, and
obtain machine-inferred personality scores for talent management purposes. AI chatbots may
soon be an enormously popular tool for initial recruiting and screening processes, and corporate
actors may not hesitate to repurpose harvested data for new applications.
References
ACT, Inc. (2018). Guide to the 2018 ACT® /SAT® Concordance. Retrieved from
https://www.act.org/content/dam/act/unsecured/documents/ACT-SAT-Concordance-
Information.pdf
Asparouhov, T., & Muthén, B. (2009). Exploratory structural equation modeling. Structural
Equation Modeling: A Multidisciplinary Journal, 16, 397–438.
https://doi.org/10.1080/10705510903008204
Azucar, D., Marengo, D., & Settanni, M. (2018). Predicting the Big 5 personality traits from
digital footprints on social media: A meta-analysis. Personality and Individual
Differences, 124, 150–159. https://doi.org/10.1016/j.paid.2017.12.018
Bleidorn, W., & Hopwood, C. J. (2019). Using machine learning to advance personality
assessment and theory. Personality and Social Psychology Review, 23, 190–203.
https://doi.org/10.1177/1088868318772990
Booth, T., & Hughes, D. J. (2014). Exploratory structural equation modeling of personality
data. Assessment, 21, 260–271. https://doi.org/10.1177/1073191114528029
Borman, W. C., & Motowidlo, S. J. (1993). Expanding the criterion domain to include elements
of contextual performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in
organizations. Jossey-Bass.
Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2015). Correlational effect
size benchmarks. Journal of Applied Psychology, 100, 431–449.
https://doi.org/10.1037/a0038047
Cer, D., Yang, Y., Kong, S.-Y., Hua, N., Limtiaco, N., John, R., Constant, N., Guajardo-
Cespedes, M., Yuan, S., Tar, C., Sung, Y.-H., Strope, B., & Kurzweil, R. (2018).
Universal Sentence Encoder. ArXiv. https://arxiv.org/abs/1803.11175
Chamorro-Premuzic, T., Winsborough, D., Sherman, R. A., & Hogan, R. (2016). New talent
signals: Shiny new objects or a brave new world? Industrial and Organizational
Psychology: Perspectives on Science and Practice, 9, 621–640.
https://doi.org/10.1017/iop.2016.6
Chen, L., Zhao, R., Leong, C. W., Lehman, B., Feng, G., & Hoque, M. (2017). Automated video
interview judgment on a large-sized corpus collected online. In 2017 7th international
conference on affective computing and intelligent interaction, ACII 2017 (pp. 504–509).
IEEE. https://doi.org/10.1109/ACII.2017.8273646
Chittaranjan, G., Blom, J., & Gatica-Perez, D. (2013). Mining large-scale smartphone data for
personality studies. Personal and Ubiquitous Computing, 17, 433–450.
https://doi.org/10.1007/s00779-011-0490-1
Connelly, B. S., McAbee, S. T., Oh, I.-S., Jung, Y., & Jung, C.-W. (2022). A multirater
perspective on personality and performance: An empirical examination of the trait–
reputation–identity model. Journal of Applied Psychology, 107, 1352–1368.
https://doi.org/10.1037/apl0000732
Connelly, B. S., & Ones, D. S. (2010). An other perspective on personality: Meta-analytic
integration of observers’ accuracy and predictive validity. Psychological Bulletin, 136,
1092-1122. https://doi.org/10.1037/a0021212
Cook, A. (2021, November 9). Data leakage. Kaggle. Retrieved March 14, 2022, from
https://www.kaggle.com/alexisbcook/data-leakage
Costa, P. T., Jr., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and
NEO Five-Factor Inventory (NEO-FFI) Professional Manual. Odessa, FL: Psychological
Assessment Resources.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological
Bulletin, 52, 281–302. https://doi.org/10.1037/h0040957
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–
4186.
Foldes, H., Duehr, E. E., & Ones, D. S. (2008). Group differences in personality: Meta-analyses
comparing five U.S. racial groups. Personnel Psychology, 61, 579–616.
https://doi.org/10.1111/j.1744-6570.2008.00123.x
Gnambs, T. (2014). A meta-analysis of dependability coefficients (test-retest reliabilities) for
measures of the Big Five. Journal of Research in Personality, 52, 20–28.
https://doi.org/10.1016/j.jrp.2014.06.003
Golbeck, J., Robles, C., & Turner, K. (2011, May). Predicting personality with social media. In
Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems –
CHI’ 11, Vancouver, BC, 253–262. https://doi.org/10.1145/1979742.1979614
Goldberg, L. R. (1993). The structure of phenotypic personality traits. American Psychologist,
48, 26–34. https://doi.org/10.1037/0003-066X.48.1.26
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring the
lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, &
F. Ostendorf (Eds.), Personality Psychology in Europe (Vol. 7, pp. 7–28). Tilburg, The
Netherlands: Tilburg University Press.
Gou, L., Zhou, M. X., & Yang, H. (2014, April). KnowMe and ShareMe: Understanding
automatically discovered personality traits from social media and user sharing
preference. In Proceedings of the SIGCHI Conference on Human Factors in Computing
System – CHI’ 14, Toronto, ON, 955–964. https://doi.org/10.1145/2556288.2557398
Gow, I. D., Kaplan, S. N., Larcker, D. F., & Zakolyukina, A. A. (2016). CEO personality and
firm policies [Working paper 22435]. Cambridge, MA: National Bureau of Economic
Research. https://doi.org/10.3386/w22435
Guo, F., Gallagher, C. M., Sun, T., Tavoosi, S., & Min, H. (2021). Smarter people analytics with
organizational text data: Demonstrations using classic and advanced NLP
models. Human Resource Management Journal, 2021, 1–16.
https://doi.org/10.1111/1748-8583.12426
Harrison, J. S., Thurgood, G. R., Boivie, S., & Pfarrer, M. D. (2019). Measuring CEO
personality: Developing, validating, and testing a linguistic tool. Strategic Management
Journal, 40, 1316–1330. https://doi.org/10.1002/smj.3023
Harrison, J. S., Thurgood, G. R., Boivie, S., & Pfarrer, M. D. (2020). Perception is reality: How
CEOs’ observed personality influences market perceptions of firm risk and shareholder
returns. Academy of Management Journal, 63, 1166–1195.
https://doi.org/10.5465/amj.2018.0626
Hauenstein, N. M. A., Bradley, K. M., O’Shea, P. G., Shah, Y. J., & Magill, D. P. (2017).
Interactions between motivation to fake and personality item characteristics: Clarifying the
process. Organizational Behavior and Human Decision Processes, 138, 74–92.
https://doi.org/10.1016/j.obhdp.2016.11.002
Hickman, L., Bosch, N., Ng, V., Saef, R., Tay, L., & Woo, S. E. (2022). Automated video
interview personality assessments: Reliability, validity, and generalizability
investigations. Journal of Applied Psychology, 107, 1323–1351.
https://doi.org/10.1037/apl0000695
Hoerl, A., & Kennard, R. (1988). Ridge regression. In Encyclopedia of Statistical Sciences, Vol.
8, pp. 129-136. New York: Wiley.
Hoppe, S., Loetscher, T., Morey, S. A., & Bulling, A. (2018). Eye movements during everyday
behavior predict personality traits. Frontiers in Human Neuroscience, 12, 105.
https://doi.org/10.3389/fnhum.2018.00105
Hopwood, C. J., & Donnellan, M. B. (2010). How should the internal structure of personality
inventories be evaluated? Personality and Social Psychology Review, 14, 332–346.
https://doi.org/10.1177/1088868310361240
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:
Conventional criteria versus new alternatives. Structural Equation Modeling: A
Multidisciplinary Journal, 6, 1–55. https://doi.org/10.1080/10705519909540118
Hurtz, G. M., & Donovan, J. J. (2000). Personality and job performance: The Big Five
revisited. Journal of Applied Psychology, 85, 869–879. https://doi.org/10.1037/0021-
9010.85.6.869
Hwang, A. H. C., & Won, A. S. (2021, May). IdeaBot: Investigating social facilitation in human-
machine team creativity. In Proceedings of the 2021 CHI Conference on Human Factors
in Computing Systems (pp. 1–16). https://doi.org/10.1145/3411764.3445270
IBM. (n.d.). What is a chatbot? Retrieved December 28, 2022, from
https://www.ibm.com/topics/chatbots
Jayaratne, M., & Jayatilleke, B. (2020). Predicting personality using answers to open-ended
interview questions. IEEE Access 8, 115345–115355.
https://doi.org/10.1109/ACCESS.2020.3004002
Jiang, Z., Rashik, M., Panchal, K., Jasim, M., Sarvghad, A., Riahi, P., DeWitt, E., Thurber, F., & Mahyar, N. (2023). CommunityBots: Creating and evaluating a multi-agent chatbot platform for public input elicitation. Accepted for the Proceedings of ACM CSCW 2023.
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text
classification. arXiv. https://doi.org/10.48550/arXiv.1607.01759
Judge, T. A., & Bono, J. E. (2001). Relationship of core self-evaluations traits—self-esteem,
generalized self-efficacy, locus of control, and emotional stability—with job satisfaction
and job performance: A meta-analysis. Journal of Applied Psychology, 86, 80–92.
https://doi.org/10.1037/0021-9010.86.1.80
Kim, S., Lee, J., & Gweon, G. (2019, May). Comparing data from chatbot and web surveys:
Effects of platform and conversational style on survey response quality. In Proceedings
of the 2019 CHI conference on human factors in computing systems (pp. 1–12).
https://doi.org/10.1145/3290605.3300316
Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable
from digital records of human behavior. Proceedings of the National Academy of
Sciences, 110, 5802–5805. https://doi.org/10.1073/pnas.1218772110
Kulkarni, V., Kern, M. L., Stillwell, D., Kosinski, M., Matz, S., Ungar, L., Skiena, S., &
Schwartz, H. A. (2018). Latent human traits in the language of social media: An open-
vocabulary approach. PLoS ONE, 13(11), e0201703.
https://doi.org/10.1371/journal.pone.0201703
Lanyon, R. I., Goodstein, L. D., & Wershba, R. (2014). 'Good Impression' as a moderator in
employment-related assessment. International Journal of Selection and Assessment, 22,
52–61. https://doi.org/10.1111/ijsa.12056
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
https://doi.org/10.1038/nature14539
Leutner, K., Liff, J., Zuloaga, L., & Mondragon, N. (2021). Hirevue’s assessment science [White
paper]. Hirevue.com. https://webapi.hirevue.com/wp-content/uploads/2021/03/HireVue-
Assessment-Science-whitepaper-2021.pdf
Li, J., Zhou, M. X., Yang, H., & Mark, G. (2017, March). Confiding in and listening to virtual
agents: The effect of personality. Paper presented at the 22nd annual meeting of the
intelligent user interfaces community, Limassol, Cyprus.
https://doi.org/10.1145/3025171.3025206
Li, W., Wu, C., Hu, X., Chen, J., Fu, S., Wang, F., & Zhang, D. (2020). Quantitative personality
predictions from a brief EEG recording. IEEE Transactions on Affective Computing.
https://doi.org/10.1109/TAFFC.2020.3008775
Liu, A. X., Li, Y., & Xu, S. X. (2021). Assessing the unacquainted: Inferred reviewer personality
and review helpfulness. MIS Quarterly, 45, 1113–1148.
https://doi.org/10.25300/MISQ/2021/14375
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological
Reports, 3, 635–694. https://doi.org/10.2466/pr0.1957.3.3.635
Lorenzo-Seva, U., & ten Berge, J. M. (2006). Tucker's congruence coefficient as a meaningful
index of factor similarity. Methodology, 2, 57–64. https://doi.org/10.1027/1614-
2241.2.2.57
Marinucci, A., Kraska, J., & Costello, S. (2018). Recreating the Relationship between Subjective
Wellbeing and Personality Using Machine Learning: An Investigation into Facebook
Online Behaviours. Big Data and Cognitive Computing, 2, 29.
https://doi.org/10.3390/bdcc2030029
Marsh, H. W., Guo, J., Dicke, T., Parker, P. D., & Craven, R. G. (2020). Confirmatory factor
analysis (CFA), exploratory structural equation modeling (ESEM), and Set-ESEM:
Optimal balance between goodness of fit and parsimony. Multivariate Behavioral
Research, 55, 102–111. https://doi.org/10.1080/00273171.2019.1602503
McAbee, S. T., & Connelly, B. S. (2016). A multi-rater framework for studying personality: The
Trait-Reputation-Identity Model. Psychological Review, 123, 569–591.
https://doi.org/10.1037/rev0000035
McAbee, S. T., & Oswald, F. L. (2013). The criterion-related validity of personality measures for
predicting GPA: A meta-analytic validity competition. Psychological Assessment, 25,
532–544. https://doi.org/10.1037/a0031748
McCarthy, J., & Wright, P. (2004). Technology as experience. Interactions, 11, 42–43.
https://dl.acm.org/doi/pdf/10.1145/1015530.1015549
McCarthy, J. M., Bauer, T. N., Truxillo, D. M., Anderson, N. R., Costa, A. C., & Ahmed, S. M.
(2017). Applicant perspectives during selection: A review addressing “So What?,”
“What’s New?,” and “Where to Next?” Journal of Management, 43, 1693–1725.
https://doi.org/10.1177/0149206316681846
Mikolov, T. (2012). Statistical language models based on neural networks. Presentation at
Google, Mountain View.
Mitchell, T. M. (1997). Machine learning. McGraw-Hill, Inc.
Morgeson, F. P., Campion, M. A., Dipboye, R. L., Hollenbeck, J. R., Murphy, K., & Schmitt, N.
(2007). Reconsidering the use of personality tests in personnel selection contexts.
Personnel Psychology, 60, 683–729. https://doi.org/10.1111/j.1744-6570.2007.00089.x
Mulfinger, E., Wu, F., Alexander, L., III, & Oswald, F. L. (2020, February). AI technologies in
talent management systems: It glitters but is it gold? [Poster Presentation]. Work in the
21st Century: Automation, Workers, and Society, Houston, TX.
Muthén, L. K., & Muthén, B. O. (1998-2017). Mplus User’s Guide. Eighth Edition. Los Angeles,
CA: Muthén & Muthén.
Oswald, F. L., Behrend, T. S., Putka, D. J., & Sinar, E. (2020). Big data in industrial-
organizational psychology and human resource management: forward progress for
organizational research and practice. Annual Review of Organizational Psychology and
Organizational Behavior, 7, 505–533. https://doi.org/10.1146/annurev-orgpsych-032117-
104553
Oswald, F. L., Schmitt, N., Kim, B. H., Ramsay, L. J., & Gillespie, M. A. (2004). Developing a
biodata measure and situational judgment inventory as predictors of college student
performance. Journal of Applied Psychology, 89, 187–207. https://doi.org/10.1037/0021-
9010.89.2.187
Park, G., Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Kosinski, M., Stillwell, D. J., Ungar, L.
H., & Seligman, M. E. (2015). Automatic personality assessment through social media
language. Journal of Personality and Social Psychology, 108, 934–952.
https://doi.org/10.1037/pspp0000020
Paulhus, D., Robins, R., Trzesniewski, K., & Tracy, J. (2004). Two replicable suppressor
situations in personality research. Multivariate Behavioral Research, 39, 303–328.
https://doi.org/10.1207/s15327906mbr3902_7
Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. (2015). The development and
psychometric properties of LIWC2015. Austin, TX: LIWC.net.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D.,
Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in
Python. Journal of Machine Learning Research, 12, 2825–2830.
Pervin, L. A. (1994). Further reflections on current trait theory. Psychological Inquiry, 5, 169–
178. https://doi.org/10.1207/s15327965pli0502_1
Pulakos, E. D., Arad, S., Donovan, M. A., & Plamondon, K. E. (2000). Adaptability in the
workplace: Development of a taxonomy of adaptive performance. Journal of Applied
Psychology, 85(4), 612–624. https://doi.org/10.1037/0021-9010.85.4.612
Putka, D. J., Beatty, A. S., & Reeder, M. C. (2018). Modern prediction methods: New
perspectives on a common problem. Organizational Research Methods, 21, 689-732.
https://doi.org/10.1177/1094428117697041
Putka, D. J., Oswald, F. L., Landers, R. N., Beatty, A. S., McCloy, R. A., & Yu, M. C. (2022).
Evaluating a natural language processing approach to estimating KSA and interest job
analysis ratings. Journal of Business and Psychology. https://doi.org/10.1007/s10869-
022-09824-0
Revelle, W. (2022). Psych: Procedures for Psychological, Psychometric, and Personality
Research. Northwestern University, Evanston, Illinois, USA, http://CRAN.R-
project.org/package=psych. Version = 2.2.5
Ruder, S. (2017). An overview of multi-task learning in deep neural networks. arXiv.
https://arxiv.org/abs/1706.05098
Sackett, P. R., Zhang, C., Berry, C. M., & Lievens, F. (2022). Revisiting meta-analytic estimates
of validity in personnel selection: Addressing systematic overcorrection for restriction of
range. Journal of Applied Psychology, 107, 2040–2068.
https://doi.org/10.1037/apl0000994
Sajjadiani, S., Sojourner, A. J., Kammeyer-Mueller, J. D., & Mykerezi, E. (2019). Using
machine learning to translate applicant work history into predictors of performance and
turnover. Journal of Applied Psychology, 104, 1207–1225.
Shumanov, M., & Johnson, L. (2021). Making conversations with chatbots more personalized.
Computers in Human Behavior, 117, 106627. https://doi.org/10.1016/j.chb.2020.106627
Speer, A. B. (2021). Scoring dimension-level job performance from narrative comments:
Validity and generalizability when using natural language processing. Organizational
Research Methods, 24, 572–594. https://doi.org/10.1177/1094428120930815
Spisak, B. R., van der Laken, P. A., & Doornenbal, B. M. (2019). Finding the right fuel for the
analytical engine: Expanding the leader trait paradigm through machine learning? The
Leadership Quarterly, 30, 417–426.
Suen, H. -Y., Huang, K. -E., & Lin, C. -L. (2019). TensorFlow-based automatic personality
recognition used in asynchronous video interviews. IEEE Access, 7, 61018-61023.
https://doi.org/10.1109/ACCESS.2019.2902863
Sun, T. (2021). Artificial intelligence powered personality assessment: A multidimensional
psychometric natural language processing perspective [Doctoral dissertation, University
of Illinois Urbana-Champaign]. https://hdl.handle.net/2142/113136
Tay, L., Woo, S. E., Hickman, L., & Saef, R. M. (2020). Psychometric and Validity Issues in
Machine Learning Approaches to Personality Assessment: A Focus on Social Media Text
Mining. European Journal of Personality, 34, 826–844. https://doi.org/10.1002/per.2290
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society, Series B, 58, 267-288.
Tippins, N. T., Oswald, F. L., & McPhail, S. M. (2021). Scientific, legal, and ethical concerns
about AI-based personnel selection tools: a call to action. Personnel Assessment and
Decisions, 7, 1–22. https://doi.org/10.25035/pad.2021.02.001
Van Rossum, G., & Drake, F. L. (2009). Python 3 Reference Manual. Scotts Valley, CA:
CreateSpace.
Völkel, S. T., Haeuslschmid, R., Werner, A., Hussmann, H., & Butz, A. (2020). How to Trick
AI: Users' Strategies for Protecting Themselves from Automatic Personality Assessment.
In Proceedings of the 2020 CHI conference on human factors in computing systems (pp.
1–15). https://doi.org/10.1145/3313831.3376877
Wang, S., & Chen, X. (2020). Recognizing CEO personality and its impact on business
performance: Mining linguistic cues from social media. Information & Management, 57,
103173. https://doi.org/10.1016/j.im.2019.103173
Woehr, D. J., Putka, D. J., & Bowler, M. C. (2012). An examination of g-theory methods for
modeling multitrait-multimethod data: Clarifying links to construct validity and
confirmatory factor analysis. Organizational Research Methods, 15, 134–161.
https://doi.org/10.1177/1094428111408616
Xiao, Z., Zhou, M. X., Liao, Q. V., Mark, G., Chi, C., Chen, W., & Yang, H. (2020). Tell me
about yourself: Using an AI-powered chatbot to conduct conversational surveys with
open-ended questions. ACM Transactions on Computer-Human Interaction, 27, Article
15, 1–37. https://doi.org/10.1145/3381804
Yang, Y., Abrego, G. H., Yuan, S., Guo, M., Shen, Q., Cer, D., Sung, Y.-H., Strope, B., &
Kurzweil, R. (2019). Improving multilingual sentence embedding using bi-directional
dual encoder with additive margin softmax. arXiv.
https://doi.org/10.48550/arXiv.1902.08564.
Yarkoni, T. (2010). Personality in 100,000 words: A large-scale analysis of personality and word
use among bloggers. Journal of Research in Personality, 44, 363–373.
https://doi.org/10.1016/j.jrp.2010.04.001
Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are
more accurate than those made by humans. Proceedings of the National Academy of
Sciences, 112, 1036–1040. https://doi.org/10.1073/pnas.1418680112
Zhang, B., Luo, J., Chen, Y., Roberts, B., & Drasgow, F. (2020). The road less traveled: A cross-
cultural study of the negative wording factor in multidimensional scales. PsyArXiv.
https://doi.org/10.31234/osf.io/2psyq
Zhang, B., Luo, J., Sun, T., Cao, M., & Drasgow, F. (2021). Small but nontrivial: A comparison
of six strategies to handle cross-loadings in bifactor predictive models. Multivariate
Behavioral Research. Advanced online publication.
https://doi.org/10.1080/00273171.2021.1957664
Zhou, M. X., Chen, W., Xiao, Z., Yang, H., Chi, T., & Williams, R. (2019). Getting virtually
personal: Chatbots who actively listen to you and infer your personality. Paper presented
at 24th International Conference on Intelligent User Interfaces (IUI’19 Companion),
March 17-20, Marina Del Rey, CA, USA. https://doi.org/10.1145/3308557.3308667
Ziegler, M., MacCann, C., & Roberts, R. D. (Eds.). (2012). New Perspectives on Faking in
Personality Assessment. Oxford University Press.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of
the Royal Statistical Society. Series B, Statistical Methodology, 67, 301–320.
https://doi.org/10.1111/j.1467-9868.2005.00503.x
Table 1.
Summary of Empirical Research on Psychometric Properties of Machine-inferred Personality Scores

Aspects of Construct Validity | Prototypical Examples | Major Findings/Current Status

Substantive Validity
  Content validity | Hickman et al. (2022); Kosinski et al. (2013); Park et al. (2015); Yarkoni (2010) | Barring limited exceptions, significant relationships between digital features and questionnaire personality scores bear no substantive meanings.

Structural Validity
  Reliability
    Test-retest reliability | Harrison et al. (2019); Hickman et al. (2022); Li et al. (2020); Park et al. (2015) | Machine-inferred personality scores have comparable or slightly lower test-retest reliability than questionnaire personality scores.
    Split-half reliability | Hoppe et al. (2018); Wang & Chen (2020); Youyou et al. (2015) | Split-half reliabilities of machine-inferred personality scores range from the .40s to the .60s.
    Internal consistency | None | None
  Generalizability | Hickman et al. (2022) | Models trained on self-reports tend to exhibit poorer generalizability than models trained on interview-reports.
  Factorial validity | None | None

External Validity
  Convergent validity | Three meta-analyses: Azucar et al. (2018); Sun (2021); Tay et al. (2020) | Correlations between machine-inferred and questionnaire scores of the same personality traits range from the .20s to the .40s.
  Discriminant validity | Harrison et al. (2019); Hickman et al. (2022); Marinucci et al. (2018) | Correlations among machine-inferred scores tend to be similar to correlations between machine-inferred and questionnaire scores of the same traits.
  Criterion-related validitya | Gow et al. (2016); Harrison et al. (2019, 2020); Wang & Chen (2020) | Machine-inferred CEO personality trait scores predict various objective indicators of firm performance.
  Incremental validitya | None | None

Note. a We only considered studies using non-self-report performance criteria.
Table 2.
Reliabilities of Self-reported Questionnaire-derived and Machine-inferred Personality Facet Scores

Personality Facets | Self-report Coefficient Alpha (Training) | Self-report Split-half (Training) | Machine Split-half (Training) | Self-report Coefficient Alpha (Test) | Self-report Split-half (Test) | Machine Split-half (Test) | Machine Test-retest (n = 61)
Openness (Average) | .79 | .80 | .68 | .80 | .79 | .63 | .67
Imagination | .81 | .82 | .65 | .82 | .82 | .58 | .67
Art interest | .79 | .84 | .72 | .82 | .84 | .71 | .70
Feelings | .78 | .73 | .71 | .77 | .70 | .68 | .72
Adventure | .75 | .74 | .59 | .75 | .73 | .40 | .49
Intellectual | .81 | .78 | .69 | .83 | .78 | .70 | .69
Liberalism | .82 | .88 | .70 | .80 | .86 | .73 | .76
Conscientiousness (Average) | .83 | .86 | .67 | .81 | .85 | .68 | .59
Self-efficacy | .80 | .85 | .65 | .72 | .80 | .63 | .58
Orderliness | .86 | .88 | .68 | .88 | .89 | .67 | .63
Dutifulness | .76 | .82 | .83 | .74 | .80 | .83 | .63
Achievement | .84 | .85 | .65 | .82 | .84 | .67 | .59
Self-discipline | .87 | .89 | .59 | .87 | .90 | .59 | .62
Cautiousness | .82 | .86 | .62 | .85 | .88 | .66 | .48
Extraversion (Average) | .81 | .82 | .64 | .80 | .80 | .64 | .66
Friendliness | .88 | .92 | .70 | .87 | .91 | .70 | .68
Gregarious | .88 | .88 | .70 | .85 | .82 | .73 | .77
Assertiveness | .84 | .84 | .65 | .84 | .84 | .59 | .72
Activity level | .64 | .61 | .60 | .65 | .66 | .57 | .61
Excite seek | .81 | .80 | .53 | .80 | .77 | .54 | .59
Cheerfulness | .80 | .84 | .66 | .77 | .83 | .70 | .58
Agreeableness (Average) | .75 | .82 | .73 | .79 | .78 | .68 | .63
Trust | .82 | .86 | .71 | .84 | .84 | .67 | .57
Straightforward | .73 | .84 | .77 | .79 | .79 | .75 | .61
Altruism | .76 | .84 | .75 | .81 | .78 | .75 | .69
Cooperation | .71 | .80 | .76 | .77 | .75 | .68 | .61
Modesty | .77 | .76 | .62 | .77 | .79 | .49 | .63
Sympathy | .70 | .80 | .76 | .77 | .72 | .71 | .67
Neuroticism (Average) | .81 | .84 | .60 | .84 | .81 | .57 | .58
Anxiety | .77 | .84 | .61 | .84 | .77 | .54 | .68
Anger | .88 | .88 | .62 | .89 | .88 | .62 | .45
Depression | .86 | .90 | .60 | .89 | .86 | .58 | .66
Self-conscious | .75 | .85 | .61 | .83 | .79 | .55 | .68
Impulsiveness | .77 | .79 | .54 | .71 | .81 | .57 | .43
Vulnerability | .80 | .77 | .61 | .85 | .72 | .54 | .59

Note. Training sample n = 1037; test sample n = 407. Self-report = self-reported questionnaire-derived scores; Machine = machine-inferred scores. Average = averaged reliabilities of facet scales within a Big Five trait. Test-retest reliabilities of machine-inferred personality scores were based on an independent sample (n = 61) that was not part of the test sample (n = 407).
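For readers unfamiliar with the split-half coefficients reported above, a split-half reliability is conventionally obtained by correlating two half-scores and applying the Spearman-Brown correction. The sketch below illustrates the generic odd-even version on simulated data; the function name and the data are hypothetical, and the study's own splitting procedure for machine-inferred scores may partition responses or features differently.

```python
import numpy as np

def split_half_reliability(item_scores: np.ndarray) -> float:
    """Odd-even split-half reliability with Spearman-Brown correction.

    item_scores: respondents x items matrix (hypothetical data).
    """
    odd_half = item_scores[:, 0::2].mean(axis=1)   # mean of odd-numbered items
    even_half = item_scores[:, 1::2].mean(axis=1)  # mean of even-numbered items
    r_halves = np.corrcoef(odd_half, even_half)[0, 1]
    return 2 * r_halves / (1 + r_halves)           # Spearman-Brown step-up

# Purely illustrative simulated data (not the study's data)
rng = np.random.default_rng(0)
true_score = rng.normal(size=(407, 1))
items = true_score + rng.normal(scale=1.0, size=(407, 10))
print(round(split_half_reliability(items), 2))
```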
Table 3.
Reliabilities (Cronbach's Alpha) of Self-reported Questionnaire-derived and Machine-inferred Personality Domain Scores

Personality Domains | Self-report (Training) | Machine-inferred (Training) | Self-report (Test) | Machine-inferred (Test)
Openness | .67 | .76 | .72 | .76
Conscientiousness | .84 | .92 | .84 | .93
Extroversion | .84 | .90 | .81 | .90
Agreeableness | .79 | .92 | .76 | .91
Neuroticism | .86 | .90 | .83 | .88

Note. Training sample n = 1037; test sample n = 407. Self-report = self-reported questionnaire-derived scores. Cronbach's alphas were calculated by treating facet scores under respective personality traits as "items."
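Because the Table 3 note states that domain-level alphas treat the six facet scores within a trait as "items," the computation reduces to the standard Cronbach's alpha formula applied to a respondents-by-facets matrix. A minimal sketch follows; the function and variable names are hypothetical, not the authors' code.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a respondents x 'items' matrix.

    Per the Table 3 note, the 'items' here are the six facet scores
    within a Big Five trait.
    """
    k = scores.shape[1]                              # number of facets ("items")
    item_vars = scores.var(axis=0, ddof=1).sum()     # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)       # variance of the total score
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Illustrative call: `openness_facets` would be an n x 6 array of facet scores
# (a hypothetical name, not the study's actual data object).
# alpha_openness = cronbach_alpha(openness_facets)
```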
Table 4.
Rotated Factor Loading Matrices Based on Set-Exploratory Structural Equation Model for the Test Sample

Personality Facets | E (S) | A (S) | C (S) | N (S) | O (S) | E (M) | A (M) | C (M) | N (M) | O (M)
Friendliness (E1) | 0.81 | 0.21 | 0.03 | -0.13 | -0.03 | 0.68 | 0.43 | 0.10 | -0.17 | -0.20
Gregarious (E2) | 0.82 | 0.13 | -0.16 | -0.07 | -0.07 | 0.78 | 0.25 | -0.01 | -0.07 | -0.30
Assertiveness (E3) | 0.66 | -0.32 | 0.25 | -0.07 | 0.13 | 0.77 | -0.23 | 0.37 | -0.12 | 0.01
Activity Level (E4) | 0.27 | -0.18 | 0.53 | -0.04 | 0.03 | 0.34 | -0.17 | 0.70 | -0.20 | 0.11
Excitement Seeking (E5) | 0.55 | -0.11 | -0.32 | -0.08 | 0.21 | 0.75 | -0.34 | -0.21 | -0.10 | -0.04
Cheerfulness (E6) | 0.64 | 0.27 | -0.09 | -0.18 | 0.10 | 0.75 | 0.49 | -0.05 | -0.08 | 0.05
Trust (A1) | 0.34 | 0.57 | -0.05 | -0.16 | -0.18 | 0.38 | 0.71 | 0.06 | -0.19 | -0.16
Straightforward (A2) | -0.12 | 0.56 | 0.42 | -0.01 | 0.02 | -0.09 | 0.73 | 0.41 | -0.02 | 0.09
Altruism (A3) | 0.38 | 0.58 | 0.22 | 0.01 | 0.28 | 0.35 | 0.71 | 0.25 | -0.02 | 0.10
Cooperation (A4) | -0.10 | 0.79 | 0.05 | -0.14 | -0.06 | 0.06 | 0.91 | 0.09 | -0.04 | -0.04
Modesty (A5) | -0.48 | 0.44 | -0.12 | 0.05 | -0.11 | -0.64 | 0.49 | 0.01 | 0.22 | -0.10
Sympathy (A6) | 0.14 | 0.59 | 0.11 | 0.13 | 0.31 | 0.24 | 0.75 | 0.11 | 0.21 | 0.25
Self-Efficacy (C1) | 0.08 | -0.05 | 0.69 | -0.30 | 0.20 | 0.25 | 0.04 | 0.63 | -0.36 | 0.21
Orderliness (C2) | -0.01 | 0.12 | 0.64 | 0.15 | -0.27 | 0.11 | 0.05 | 0.90 | 0.17 | -0.22
Dutifulness (C3) | -0.13 | 0.38 | 0.57 | -0.15 | 0.13 | 0.02 | 0.50 | 0.54 | -0.20 | 0.22
Achievement (C4) | 0.24 | 0.00 | 0.71 | -0.07 | 0.09 | 0.27 | 0.01 | 0.78 | -0.13 | 0.10
Self-Discipline (C5) | 0.09 | 0.07 | 0.66 | -0.16 | -0.18 | 0.22 | -0.09 | 0.73 | -0.15 | -0.24
Cautiousness (C6) | -0.42 | 0.22 | 0.61 | -0.11 | -0.04 | -0.55 | 0.31 | 0.70 | -0.23 | 0.11
Anxiety (N1) | -0.01 | 0.05 | 0.17 | 0.93 | 0.03 | 0.06 | 0.19 | 0.18 | 0.95 | 0.05
Anger (N2) | 0.08 | -0.45 | 0.07 | 0.68 | 0.02 | 0.05 | -0.55 | -0.01 | 0.82 | -0.11
Depression (N3) | -0.30 | -0.01 | -0.23 | 0.58 | 0.13 | -0.35 | -0.14 | -0.24 | 0.56 | 0.22
Self-Conscious (N4) | -0.56 | 0.19 | -0.13 | 0.46 | -0.01 | -0.52 | 0.30 | -0.11 | 0.49 | 0.22
Impulsiveness (N5) | 0.27 | -0.07 | -0.46 | 0.35 | 0.03 | 0.23 | -0.17 | -0.46 | 0.53 | 0.09
Vulnerability (N6) | 0.05 | 0.13 | -0.14 | 0.81 | -0.14 | -0.02 | 0.29 | -0.15 | 0.84 | -0.13
Imagination (O1) | -0.03 | -0.03 | -0.24 | 0.00 | 0.68 | 0.05 | -0.02 | -0.26 | 0.11 | 0.84
Art_Interest (O2) | 0.07 | 0.25 | 0.02 | 0.14 | 0.64 | 0.15 | 0.36 | 0.13 | 0.27 | 0.68
Feelings (O3) | 0.32 | 0.24 | 0.15 | 0.50 | 0.43 | 0.37 | 0.43 | 0.25 | 0.54 | 0.37
Adventure (O4) | 0.17 | 0.04 | -0.17 | -0.27 | 0.49 | 0.20 | -0.07 | -0.19 | -0.45 | 0.55
Intellectual (O5) | -0.24 | -0.18 | 0.23 | -0.20 | 0.76 | -0.21 | -0.35 | 0.31 | -0.22 | 0.82
Liberalism (O6) | -0.23 | 0.03 | -0.20 | 0.02 | 0.37 | -0.37 | -0.13 | -0.23 | 0.09 | 0.58

Note. n = 407. E = Extroversion. A = Agreeableness. C = Conscientiousness. N = Neuroticism. O = Openness. (S) = self-reported questionnaire-derived personality scores. (M) = machine-inferred personality scores.
Table 5.
Latent Factor Correlations of Self-reported Questionnaire-derived and Machine-inferred Personality Domain Scores in the Training and Test Samples

Variables | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
1. Extroversion (S) | – | ‒.03 | .15 | ‒.26 | .16 | .60 | .10 | .18 | ‒.17 | ‒.14
2. Agreeableness (S) | ‒.02 | – | .20 | .05 | .08 | .08 | .61 | .18 | .13 | .07
3. Conscientiousness (S) | .06 | .19 | – | ‒.22 | .01 | .20 | .13 | .56 | ‒.23 | .00
4. Neuroticism (S) | ‒.15 | ‒.03 | ‒.26 | – | ‒.03 | ‒.12 | .16 | ‒.10 | .55 | .04
5. Openness (S) | .22 | .07 | .04 | ‒.06 | – | ‒.03 | .01 | .04 | .05 | .64
6. Extroversion (M) | .50 | .09 | .16 | ‒.10 | ‒.14 | – | .16 | .33 | ‒.28 | ‒.17
7. Agreeableness (M) | .07 | .45 | .08 | .23 | .03 | .18 | – | .32 | .20 | .09
8. Conscientiousness (M) | .08 | .19 | .50 | ‒.12 | ‒.04 | .30 | .35 | – | ‒.30 | .04
9. Neuroticism (M) | ‒.06 | .01 | ‒.24 | .38 | .08 | ‒.23 | .15 | ‒.37 | – | .07
10. Openness (M) | ‒.18 | .01 | .05 | .09 | .57 | ‒.22 | .09 | .05 | .05 | –

Note. Correlations above the diagonal are based on the training sample. Correlations below the diagonal are based on the test sample. For the training sample, n = 1037. For the test sample, n = 407. (S) = self-reported questionnaire-derived scores. (M) = machine-inferred scores.
Table 6.
Means, Standard Deviations, and Correlations among Study Variables in the Training and Test Samples at Manifest Variable Level

Variables | M | SD | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | M | SD
1. Openness (S) | 3.44 | .38 | – | .00 | .11** | .20** | .05 | .63** | .00 | ‒.12** | .10** | .15** | N/A | N/A | N/A | 3.43 | .36
2. Conscientiousness (S) | 3.59 | .43 | ‒.04 | – | .20** | .34** | ‒.43** | ‒.02 | .53** | .26** | .23** | ‒.31** | N/A | N/A | N/A | 3.60 | .44
3. Extroversion (S) | 3.49 | .42 | .24** | .07 | – | .04 | ‒.50** | ‒.13** | .28** | .59** | .14** | ‒.32** | N/A | N/A | N/A | 3.47 | .48
4. Agreeableness (S) | 3.61 | .34 | .15** | .40** | .05 | – | ‒.12** | .11** | .27** | .15** | .57** | .04 | N/A | N/A | N/A | 3.65 | .39
5. Neuroticism (S) | 2.77 | .46 | ‒.02 | ‒.44** | ‒.35** | ‒.17** | – | .12** | ‒.28** | ‒.28** | .03 | .55** | N/A | N/A | N/A | 2.85 | .53
6. Openness (M) | 3.42 | .13 | .58** | .00 | ‒.13** | .04 | .11* | – | .01 | ‒.26** | .17** | .25** | N/A | N/A | N/A | 3.44 | .16
7. Conscientiousness (M) | 3.60 | .13 | ‒.07 | .46** | .14** | .28** | ‒.21** | .03 | – | .57** | .54** | ‒.55** | N/A | N/A | N/A | 3.60 | .16
8. Extroversion (M) | 3.55 | .16 | ‒.21** | .22** | .45** | .17** | ‒.19** | ‒.30** | .56** | – | .36** | ‒.54** | N/A | N/A | N/A | 3.47 | .19
9. Agreeableness (M) | 3.67 | .13 | .06 | .15** | .09 | .42** | .10* | .17** | .57** | .42** | – | .07* | N/A | N/A | N/A | 3.65 | .17
10. Neuroticism (M) | 2.75 | .14 | .18** | ‒.31** | ‒.18** | ‒.07 | .40** | .24** | ‒.61** | ‒.53** | ‒.05 | – | N/A | N/A | N/A | 2.85 | .18
11. PRCA | 5.28 | .90 | ‒.01 | .25** | .08 | .14* | ‒.16** | ‒.02 | .17** | .18** | .12* | ‒.13* | – | N/A | N/A | N/A | N/A
12. Cumulative GPA | 3.18 | .59 | ‒.14* | .33** | ‒.13* | .14* | .02 | ‒.03 | .14* | .08 | .04 | ‒.15** | .21** | – | N/A | N/A | N/A
13. ACT | 26.28 | 3.67 | .08 | .05 | ‒.13* | .05 | ‒.003 | .19** | .04 | ‒.13* | ‒.03 | ‒.08 | ‒.01 | .50** | – | N/A | N/A

Note. n = 1037 for the training sample. n = 289 – 407 for the test sample. (S) = self-reported questionnaire-derived scores. (M) = machine-inferred scores. PRCA = peer-rated college adjustment. GPA = grade point average. Statistics above the diagonal are for the training sample. Statistics below the diagonal are for the test sample.
* p < .05. ** p < .01.
Table 7.
Multitrait-Multimethod Statistics for Machine-Inferred Personality Domain Scores

Model / Sample | C1 | D1 | D2 | D2a | MV | MVa
Latent Personality Domain Scores (self-report models)
  Training Sample (n = 1,037) | .59 | .48 | .43 | .39 | .05 | .09
  Test Sample (n = 407) | .48 | .38 | .33 | .28 | .05 | .10
Manifest Personality Domain Scores (self-report models)
  Training Sample (n = 1,037) | .57 | .40 | .30 | .24 | .10 | .16
  Test Sample (n = 407) | .46 | .31 | .19 | .11 | .12 | .20
Park et al. (2015; self-report models, test sample) | .38 | .27 | .15 | .10 | .11 | .16
Hickman et al. (2022)
  Self-report models (in the training sample) | .12 | .04 | -.05 | .01 | .09 | .04
  Interviewer-report models (in the training sample) | .40 | .20 | .09 | .09 | .11 | .12
  Interviewer-report models (in the test sample) | .37 | .17 | .06 | .07 | .10 | .10
Harrison et al. (2019; other-report models, split-half validation)a | .65 | .24 | .24 | | |
Marinucci et al. (2018; self-report models; in the training sample) | .21 | -.03 | -.07 | | |

Note. a Unlike many other machine learning studies, Harrison et al. (2019) split CEOs' text into two halves; they built predictive models based on the first half of the text and then tested them on the other half.
C1 = convergence index (average of monotrait-heteromethod correlations). D1 = discrimination index 1 (C1 – average of heterotrait-heteromethod correlations). D2 = discrimination index 2 (C1 – average of heterotrait-monomethod correlations). D2a = discrimination index 2 calculated using only machine-method heterotrait-monomethod correlations. MV = method variance (average of heterotrait-monomethod correlations – average of heterotrait-heteromethod correlations). MVa = method variance due to the machine method.
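The indices defined in the note can be reproduced from a 10 x 10 multitrait-multimethod correlation matrix (five traits by two methods), such as the latent correlations in Table 5. The minimal sketch below is one reading of those definitions rather than the authors' code: it averages absolute values of the heterotrait correlations, which appears consistent with the reported values (for example, a latent test-sample MV of about .05), and it computes MVa from the machine-method block only.

```python
import numpy as np

def mtmm_indices(R: np.ndarray, k: int = 5) -> dict:
    """MTMM summary indices as defined in the Table 7 note.

    R: (2k x 2k) correlation matrix, first k variables = questionnaire
    (self-report) trait scores, last k = machine-inferred scores, in the
    same trait order. Heterotrait averages use absolute correlations;
    the MVa formula is an assumption about "method variance due to the
    machine method."
    """
    off = ~np.eye(k, dtype=bool)                        # off-diagonal mask
    cross = R[:k, k:]                                   # heteromethod block
    self_block, machine_block = R[:k, :k], R[k:, k:]    # monomethod blocks

    c1 = np.diag(cross).mean()                          # monotrait-heteromethod
    het_het = np.abs(cross[off]).mean()                 # heterotrait-heteromethod
    het_mono = np.abs(np.concatenate(
        [self_block[off], machine_block[off]])).mean()  # heterotrait-monomethod
    het_mono_m = np.abs(machine_block[off]).mean()      # machine method only

    return {"C1": c1, "D1": c1 - het_het, "D2": c1 - het_mono,
            "D2a": c1 - het_mono_m, "MV": het_mono - het_het,
            "MVa": het_mono_m - het_het}

# Example: R could be assembled from the test-sample (below-diagonal)
# latent correlations reported in Table 5.
```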
Table 8.
Regression of Cumulative GPA and Peer-rated College Adjustment on ACT Scores and Self-reported Questionnaire-derived and Machine-inferred Personality Scores

Predictors | Cumulative GPA (n = 379), Step 1 | Step 2 | Step 3 | Peer-rated college adjustment (n = 289), Step 1 | Step 2 | Step 3

ACT | .50** (.04) | .51** (.04) | .51** (.04) | ‒.01 (.09) | ‒.004 (.06) | ‒.002 (.03)
Openness (S) | | ‒.19** (.04) | ‒.20** (.05) | | ‒.02 (.06) | ‒.01 (.08)
Openness (M) | | | .03 (.06) | | | ‒.01 (.08)
R2 | .24** | .28** | .28** | .00 | .00 | .00
ΔR2 | | .04** | .00 | | .00 | .00

ACT | .50** (.04) | .48** (.04) | .48** (.04) | ‒.01 (.09) | ‒.02 (.06) | ‒.02 (.06)
Conscientiousness (S) | | .31** (.04) | .33** (.05) | | .27** (.06) | .24** (.06)
Conscientiousness (M) | | | ‒.04 (.05) | | | .06 (.06)
R2 | .24** | .34** | .34** | .00 | .07** | .08**
ΔR2 | | .10** | .00 | | .07** | .00

ACT | .50** (.04) | .48** (.04) | .50** (.04) | ‒.01 (.09) | .01 (.09) | .02 (.06)
Extroversion (S) | | ‒.09* (.05) | ‒.17** (.05) | | .07 (.06) | ‒.002 (.07)
Extroversion (M) | | | .18** (.05) | | | .18** (.06)
R2 | .24** | .25** | .28** | .00 | .01 | .03*
ΔR2 | | .01* | .03** | | .01 | .03**

ACT | .50** (.04) | .49** (.04) | .49** (.04) | ‒.01 (.09) | ‒.01 (.07) | ‒.01 (.06)
Agreeableness (S) | | .11* (.04) | .11* (.05) | | .15** (.06) | .13* (.06)
Agreeableness (M) | | | .01 (.05) | | | .07 (.06)
R2 | .24** | .26** | .26** | .00 | .02* | .03*
ΔR2 | | .02* | .00 | | .02** | .01

ACT | .50** (.04) | .50** (.04) | .48** (.04) | ‒.01 (.09) | ‒.01 (.06) | ‒.01 (.06)
Neuroticism (S) | | .03 (.05) | .08 (.05) | | ‒.16** (.06) | ‒.13 (.06)
Neuroticism (M) | | | ‒.13** (.05) | | | ‒.07 (.07)
R2 | .24** | .25** | .26** | .00 | .02* | .03*
ΔR2 | | .01 | .01** | | .02** | .00

Note. (S) = self-reported questionnaire-derived scores. (M) = machine-inferred scores. GPA = grade point average. ACT = American College Testing. * p < .05. ** p < .01.
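Table 8 follows a standard three-step hierarchical regression: ACT enters first, the self-reported trait second, and the machine-inferred trait third, with ΔR2 indexing the increment at each step. A minimal sketch of that workflow appears below, assuming a numeric pandas DataFrame with hypothetical column names; it illustrates the analysis structure, not the authors' code.

```python
import pandas as pd
import statsmodels.api as sm

def hierarchical_r2(df: pd.DataFrame, outcome: str, steps: list) -> list:
    """Return (R2, delta-R2) for each step of a hierarchical regression.

    `steps` is a list of predictor lists added cumulatively. Variables are
    z-scored so the coefficients are standardized, mirroring the betas in
    Table 8; column names are hypothetical placeholders.
    """
    z = (df - df.mean()) / df.std(ddof=1)     # column-wise standardization
    results, prev_r2 = [], 0.0
    for predictors in steps:
        X = sm.add_constant(z[predictors])
        fit = sm.OLS(z[outcome], X, missing="drop").fit()
        results.append((fit.rsquared, fit.rsquared - prev_r2))
        prev_r2 = fit.rsquared
    return results

# Example with hypothetical column names (delta-R2 at step 1 equals R2):
# steps = [["ACT"], ["ACT", "consc_self"], ["ACT", "consc_self", "consc_machine"]]
# r2_by_step = hierarchical_r2(df, "gpa", steps)
```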
Figure 1. Overview of an AI chatbot platform for building an effective chatbot and predicting personality, using Juji's virtual conversation system as a prototype.
Figure 2. Plots of factor loadings of facet scores of the two measurement approaches and Root Mean Squared Errors across Big Five domains. RMSE = Root Mean Squared Error. Train = training sample. Test = testing sample. SR = self-reported questionnaire-derived facet scores. ML = machine-inferred facet scores.
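The RMSE values in Figure 2 plausibly summarize, within each Big Five domain, the discrepancy between corresponding loadings of the self-report and machine-inferred solutions. The sketch below computes an RMSE of that kind for the Extroversion target loadings reported in Table 4; whether the authors' figure uses exactly this pairing (for example, whether cross-loadings are included) is an assumption.

```python
import numpy as np

def loading_rmse(sr_loadings, ml_loadings) -> float:
    """RMSE between two vectors of paired factor loadings, e.g., the
    target-factor loadings of the six facets within one Big Five domain."""
    sr, ml = np.asarray(sr_loadings), np.asarray(ml_loadings)
    return float(np.sqrt(np.mean((sr - ml) ** 2)))

# Extroversion target loadings from Table 4 (test sample), E1-E6
sr_E = [0.81, 0.82, 0.66, 0.27, 0.55, 0.64]   # self-report solution
ml_E = [0.68, 0.78, 0.77, 0.34, 0.75, 0.75]   # machine-inferred solution
print(round(loading_rmse(sr_E, ml_E), 2))
```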