To Trust, or Not to Trust? A Study of Human Bias in Automated Video Interview
Assessments
Chee Wee (Ben) Leong1, Katrina Roohr2, Vikram Ramanarayanan1, Michelle P. Martin-Raugh2,
Harrison Kell2, Rutuja Ubale1, Yao Qian1, Zydrune Mladineo1, Laura McCulla2
NLP, Speech, Dialogic & Multimodal Research1, Academic To Career Research2
Educational Testing Service, Princeton & San Francisco, USA
Abstract
Supervised systems require human labels for training.
But, are humans themselves always impartial during the
annotation process? We examine this question in the con-
text of automated assessment of human behavioral tasks.
Specifically, we investigate whether human ratings them-
selves can be trusted at their face value when scoring video-
based structured interviews, and whether such ratings can
impact machine learning models that use them as training
data. We present preliminary empirical evidence that in-
dicates there might be biases in such annotations, most of
which are visual in nature.
1. Introduction and Related Work
Structured interviews standardize the questions and/or
evaluation methods used during the interview process (i.e.,
each interview has the same questions in the same order
[12]). As demonstrated by extensive meta-analytic evi-
dence [9,23], structured interviews consistently outperform
unstructured interviews (i.e., an interview where questions
and evaluations are non-standardized). They are particu-
larly effective hiring tools that account for significant in-
cremental validity in predicting job performance over and
above other popular hiring methods such as personality and
intelligence tests [4]. Despite their pervasiveness and ef-
fectiveness, interviews are susceptible to human biases, a
source of measurement error. Biases occur when interview-
ers collect or evaluate non-job-related information about ap-
plicants [12], such as sex, age, race, or attractiveness. Re-
search shows that human interviewers tend to be less likely
to hire individuals that are older [14], of a different race
than themselves [13], perceived as unattractive [8], or of
a sex not stereotypically associated with a particular job
(e.g., male nurse; [5]), among many other biasing factors.
In this work, we differentiate human bias from human sub-
jectivity, where interviewers make different hiring recom-
mendations for the same interviewee due to their different
weightings of the job-related strengths and weaknesses of the ap-
plicant. In spite of mounting evidence of human bias in
industrial and organizational (I-O) psychology research, su-
pervised machine learning approaches to automated scor-
ing of video-based performance tasks (e.g. interviews, pre-
sentations, public speaking, etc.) have largely focused on
mitigating human subjectivity by creating comprehensive
rubrics [24] to guide precise scoring of performance con-
structs [1], enforcing the calibration of human raters prior to
scoring [1,16], encouraging inter-rater discussions and re-
views [15], enlisting multiple raters [17,21], including both
behavioral experts and laymen as raters [15], and averaging
ratings [15,21,2]. The work in [18] attempted to collect
demographic information such as age, gender and ethnicity
as part of the data collection survey but did not explore their
impact on automated scoring of the interviews. While the
authors in [15] claimed that additional Mechanical Turk workers
were enlisted as raters to remove bias stemming from the actual inter-
viewers’ interaction with the interviewees, no further
effort was made to remove or qualify the types
of biases involved. In this paper, we present a
study that motivates future efforts toward modeling fairness
in automated video interview assessments. We first provide
an overview of the video interview dataset used in our ex-
periments. Next, we explain the annotation scheme used
for generating the bias metadata vector, which is used to
(1) construct a standalone model, and, (2) augment a multi-
modal model for structured interview performance predic-
tion. Finally, we discuss some thoughts for future direc-
tions.
2. Dataset
Our video interview dataset is a corpus of monologic,
structured video interviews collected online through Ama-
zon Mechanical Turk, obtained from the authors of Chen et al. [2].
To our knowledge, this is the largest collection of struc-
tured interview responses simulating an actual hiring sce-
nario (i.e. hiring for an entry-level office position). It com-
prises 260 human interviewees with a total recording time
of 3784 minutes. Because the dataset was collected “in
the wild”, there are videos where faces cannot be detected
in dim lighting, or where the audio contains unexpected clips. These
represent around 10% of the dataset and unfortunately had
to be discarded since we could not extract reliable multi-
modal features from them. After pre-processing, we are left with 1887
(interviewee, video response) datapoints.
All responses were collected from participants across the
United States, who differed in gender, race, age, experience,
etc. Recording conditions also varied in terms of devices,
lighting, backgrounds, etc. All videos from interviewees
were recorded indoors. Chen et al. [2] developed a 7-point
Likert rating scale (1 = Strongly Disagree, 7 = Strongly
Agree) to score interview performance using overall hiring
recommendation guidelines proposed in [10]. During an-
notation, 5 human raters were asked to score each response
using the Likert scale, and their averaged ratings are used
as the ground-truth for the video. The same raters were
used for scoring all the video interviewees. Subsequently,
the authors used the median score of all ground-truths as a
threshold to separate all video responses into HIGH (above-
average) vs. LOW (below-average), and frame their ex-
periments as a binary classification task using multimodal
features to predict the outcome of each interviewee’s per-
formance. On the quality of human annotations, the au-
thors reported an intraclass correlation (ICC) of 0.79 (using
the two-way random average measure of consistency), and an
R_mean of 0.74, where R is the correlation coefficient of in-
dividual raters’ scores to the averaged scores. Note that,
for each interviewee, 8 different prompts (questions) were
attempted. The response length is limited to 2 min/prompt.
3. Coding Bias Metadata
Research in I-O psychology has identified several cat-
egories of human biases relevant for our work, which we
have listed here with the possible labels in parenthesis:
SETTING (full room visible, partial room visible, only
wall), WALL (blank, almost blank, with many items), GEN-
DER (male, female), APPEARANCE (very unattractive,
unattractive, average, attractive, very attractive), RACE
(White, African American, Asian, Hispanic, Other), AGE
(18-25, 26-35, 36-45, 46-55, 56-65, 65+), WEIGHT (very
thin, thin, average, overweight, obese), FACIAL STIGMA
(no, yes), ACCENT (American Typical, International, Do-
mestic (e.g. Southern, Bostonian, etc.)). Two human raters
trained in education and assessment research annotated each
video interviewee independently on each of the categories
using the appropriate label after going through an initial
calibration. The two raters differed in self-reported race,
accent, and age. Note that this bias metadata annotation
effort is independent of the hiring recommendation an-
notation in [2]. Additionally, we hypothesize that physical
backgrounds of interviewees may induce rater bias. Hence,
we introduce two categories coding for the visible environ-
ment (SETTING) and the physical state of the wall (WALL)
behind the interviewee. Cohen’s Kappa (κ) between the
two raters is reported for each category: SETTING (.62),
WALL (.76), GENDER (.99), APPEARANCE (.49), RACE
(.72), AGE (.69), WEIGHT (.70), FACIAL STIGMA (.13),
ACCENT (.32). Where the two raters disagreed on an ordinal
category, we adopted a consistent rule: average the two integer
codings and round to the nearest label (e.g. for APPEARANCE,
if the two coded labels are average (3) and attractive (4), the
average is 3.5 and the rounded label is attractive (4)). For
non-ordinal categories (i.e. GENDER, RACE, and ACCENT), we
always defer to rater 1’s label where there is a disagreement,
for the sake of consistency. However, such disagreement cases account
for only 2% of the total non-ordinal label pairings. Af-
ter the bias metadata vector for an interviewee is coded, the
same vector is used for all 8 interview prompts answered by
the same interviewee for experiments. We apply all labels in
the bias metadata as a single feature set to construct models
for predicting interviewee performance in a binary classi-
fication task (i.e. below-average or above-average). Since
each of the 8 prompts is different, and responses across
the 8 prompts are scored independently, each tuple (inter-
viewee, video response) can be treated as a single datapoint
for experimentation. Of the 254 unique interviewees in the
dataset, 152 (60%) have attained different classification
scores across the 8 prompts, suggesting a potentially signif-
icant variance in the ability to handle different prompts even
within an individual. We use stratified sampling in a 10-fold
cross-validation applied at the (interviewee,prompt)-level to
maintain distribution of the classes while sampling, and em-
ploy learners with proven effectiveness on relatively small-
to-medium sized datasets using the scikit-learn toolkit [20].
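As a minimal sketch of the annotation adjudication described above (variable names are illustrative; the half-up rounding follows the APPEARANCE example, and scikit-learn's cohen_kappa_score stands in for the per-category agreement computation):

```python
import math
from sklearn.metrics import cohen_kappa_score

def category_kappa(rater1_labels, rater2_labels):
    """Cohen's kappa between the two raters for one metadata category."""
    return cohen_kappa_score(rater1_labels, rater2_labels)

def adjudicate_ordinal(code1, code2):
    """Average the two integer codings and round to the nearest label.

    Halves are rounded up, matching the APPEARANCE example:
    average (3) and attractive (4) -> 3.5 -> attractive (4).
    """
    return math.floor((code1 + code2) / 2 + 0.5)

def adjudicate_nominal(rater1_label, rater2_label):
    """For GENDER, RACE, and ACCENT, defer to rater 1 on disagreement."""
    return rater1_label
```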
4. Modeling structured interview performance through the exclusive use of bias metadata
Figure 1. Feature category importance weighting using RandomForestClassifier (n=500), 10-fold cross-validated.
Prior to model building, each categorical label is converted
into one-hot encodings fitted on the entire dataset of 1887
(interviewee, prompt) tuples, where only one label in each
category per datapoint is activated, across all categories.
Consequently, each label is transformed into a numerical
value to facilitate experiments across a range of standardized
learners.
Figure 2. Model interpretation using LIME applied to a prediction
instance by RandomForestClassifier (n=500).
A stratified 10-fold cross-validation predictor
that respects the class distribution of the train fold (i.e. a
chance predictor), as well as a majority vote baseline, are
used as the baselines. For the former, we used the Strati-
fiedKFold function in [20], where the proportion of classes
in each train/test partition resembles that of the population,
and all datapoints are randomized and grouped into either
the train or test partition with no repeats per fold. The re-
sults of the 10-fold cross-validation are shown in Table 1.
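The evaluation setup can be sketched as follows, with scikit-learn's StratifiedKFold providing the class-preserving partitions and DummyClassifier standing in for the stratified-chance and majority-vote baselines; X is assumed to be the one-hot bias metadata matrix (a NumPy array) and y the binary labels (1 = above-average).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_recall_fscore_support

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

models = {
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=42),
    "baseline_stratified": DummyClassifier(strategy="stratified", random_state=42),
    "baseline_majority": DummyClassifier(strategy="most_frequent"),
}

for name, model in models.items():
    fold_scores = []
    for train_idx, test_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        p, r, f1, _ = precision_recall_fscore_support(
            y[test_idx], pred, average="binary", pos_label=1)
        fold_scores.append((p, r, f1))
    print(name, np.mean(fold_scores, axis=0))  # mean P, R, F1 over the 10 folds
```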
Additionally, Figure 1 shows the weight of each category
(averaged over the 10-folds) used in the Random Forest
model which scores the best performance (F1=.765) in our
combination experiments. The category weighting scheme
is based on permutation importance [19] rather than the
importance based on reduction in node impurity imple-
mented in scikit-learn, which has reliability concerns [25].
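A minimal sketch of permutation importance in the spirit of [19]: the one-hot columns belonging to a category are shuffled jointly and the resulting drop in held-out score is taken as that category's importance, so a random control column should score near zero. The fitted classifier, test matrix, and column indices are assumed inputs, and the choice of F1 as the scoring metric is an assumption.

```python
import numpy as np
from sklearn.metrics import f1_score

def category_permutation_importance(model, X_test, y_test,
                                    category_columns, n_repeats=10, seed=0):
    """Importance of one category = mean drop in F1 after shuffling its columns."""
    rng = np.random.default_rng(seed)
    baseline = f1_score(y_test, model.predict(X_test))
    drops = []
    for _ in range(n_repeats):
        X_perm = X_test.copy()
        row_order = rng.permutation(len(X_perm))
        # Shuffle all one-hot columns of the category together,
        # breaking their association with the label.
        X_perm[:, category_columns] = X_perm[row_order][:, category_columns]
        drops.append(baseline - f1_score(y_test, model.predict(X_perm)))
    return float(np.mean(drops))
```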
Note that category random is a control with randomly gen-
erated values for ensuring validity in the feature importance
computation: if a category has negative importance, remov-
ing it actually improves performance. Importance of the
FACIAL STIGMA category (0) is currently inconclusive,
due to overly skewed distributions (i.e. rater 1 and 2 an-
notated yes on only 3.1% and 1.9% of the datapoints re-
spectively). Otherwise, the most important label for mod-
eling in each category is as follows: SETTING (par-
tial room visible), WALL (almost blank), GENDER (fe-
male), APPEARANCE (unattractive), RACE (White), AGE
(26-35), WEIGHT (overweight), ACCENT (Typical).
Though encoded into numerical values for experimen-
tation, the original labels (e.g. male, obese, unattrac-
tive, etc.) associated with each category are human-
interpretable, hence lending themselves to an explanation
of whether any automated scoring model is behaving rea-
sonably as measured against established I-O psychology
findings. We take advantage of this property to fur-
ther validate our hypothesis through application of the Lo-
cal Interpretable Model-Agnostic Explanations (LIME) [22] toolkit to our
dataset, by examining datapoints where our model predic-
tions are confident. For a targeted datapoint and its pre-
diction, LIME perturbs the input to generate neighboring data-
points and learns an interpretable, high-fidelity
local model to help explain the prediction made by the in-
put model. For instance, Figure 2 shows our Random For-
est prediction p(class = “below average”) = 1 for a
specific (interviewee, prompt) datapoint, with the contribu-
tion of each categorical label accounting for each candidate
class prediction. Here, we note the interviewee being la-
beled with appearance=unattractive adds a probability of
14% to an unfavorable class prediction outcome without
regard for multimodal features extracted. Simultaneously,
the label accent=typical (a typical American accent) adds a
probability of 17% to a favorable class prediction. This dat-
apoint, and others we have examined, corroborate, to a degree
supported by the bias annotation agreement, the research findings
in I-O psychology that biases might influence interview
outcomes. To the best of our knowledge, this work is the
first to provide an empirical linkage between I-O psychol-
ogy and ML/AI modeling for structured video interviews.
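A sketch of how LIME can be applied to a single confidently predicted datapoint, assuming a fitted Random Forest clf, the one-hot training matrix X_train with its feature_names, a test matrix X_test, and an illustrative instance index i:

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=X_train,
    feature_names=feature_names,        # e.g. "appearance=unattractive", ...
    class_names=["below-average", "above-average"],
    categorical_features=list(range(X_train.shape[1])),  # all one-hot columns
    discretize_continuous=False,
)

# LIME perturbs the instance, queries the model on the perturbed neighbors,
# and fits a local interpretable surrogate whose weights attribute the
# prediction to individual labels such as accent=typical.
explanation = explainer.explain_instance(X_test[i], clf.predict_proba,
                                         num_features=10)
print(explanation.as_list())
```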
5. Multimodal Model Augmentation
We also experimented with multimodal model augmen-
tation (i.e. an approach of augmenting a multimodal model
with our bias metadata) to observe whether we can achieve
further model performance gains. For a fair comparison, we ob-
tained the original feature set used in [2], evaluated on the
same training/testing partitions, but opted to use a deep neu-
ral network (DNN) for modeling the task due to its proven
effectiveness for modeling other similar constructs (e.g. en-
gagement [6,29] and emotions [7,27]) of human behavior.
Learner                              P      R      F1
INDIVIDUAL LEARNERS
Logistic Regression                 .581   .567   .572
Nearest Neighbor                    .720   .707   .713
SVM (gamma=2, C=10)                 .742   .756   .748
Decision Tree                       .692   .732   .711
Random Forest (RF) (n=500)          .737   .750   .742
Multilayer Perceptron (alpha=1)     .617   .636   .623
Multimodal DNN                      .606   .915   .727
COMBINATIONS WITH DNN
Multimodal DNN & RF (AND)           .785   .701   .739
Multimodal DNN & RF (Stacking)      .746   .787   .765
Multimodal DNN & SVM (AND)          .790   .704   .742
Multimodal DNN & SVM (Stacking)     .755   .773   .763
BASELINES
Baseline (Stratified)               .512   .552   .531
Baseline (Majority vote)            .510   1.00   .675
Table 1. Mean 10-fold cross-validation precision, recall, and F1 (true
positive = above-average) of individual learners, selected combinations,
and baselines. All experiments are executed with the same random seed
and random state for replicability. Bold and underline indicate the
first- and second-ranked results per column; statistical significance at
p < .001 over the Multimodal DNN system is assessed using the Wilcoxon
signed-rank test.
Figure 3. Multimodal DNN model for achieving the state of the art in [2].
Multimodal feature extraction produces a 427-D VIDEO vector (OpenFace), a
34-D SPEECH vector (PyAudio), and a 10000-D TEXT vector (scikit-learn TF-IDF
“BOW” vectorizer); each modality feeds a deep neural network of 2D-convolution,
ReLU, GRU, attention, batch normalization, and dropout layers with a sigmoid
output, and the three models (ModelV, ModelS, ModelT) are stacked (3x) into an
ensemble that predicts the class: above-average or below-average.
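A minimal Keras sketch of the video branch of this architecture, using the filter count and kernel/stride settings reported in the text below; the input length, GRU width, loss function, attention layer, and the three-model ensemble are simplified or omitted, so this is an illustrative approximation rather than the authors' exact model.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative input: 300 frames x 427 OpenFace features, one channel.
video_in = layers.Input(shape=(300, 427, 1))

# 2D convolution over the (time, feature) plane: 20 filters with
# kernel and strides (15, 16), as reported for the video modality.
x = layers.Conv2D(20, kernel_size=(15, 16), strides=(15, 16),
                  activation="relu")(video_in)
x = layers.BatchNormalization()(x)

# Collapse the feature axis so the remaining time axis can feed a GRU.
x = layers.TimeDistributed(layers.Flatten())(x)
x = layers.GRU(64)(x)                     # illustrative recurrent width
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.5)(x)

# Binary output: above-average vs. below-average. Binary cross-entropy is
# used here for simplicity in place of the paper's F1-based loss.
out = layers.Dense(1, activation="sigmoid")(x)

model_v = keras.Model(video_in, out)
model_v.compile(optimizer="adam", loss="binary_crossentropy",
                metrics=["accuracy"])
```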
Given that we have only 1887 datapoints (1521/training,
366/testing), we train a deeper network but with dropouts
to minimize overfitting. In our DNN model, a 2D convolu-
tion layer is applied to the time-series data (i.e. video and
audio) to extract spatio-temporal features of interest within
video/audio segments. Given the different sampling rates
for video and audio feature extraction in [2], and their
different feature dimensions, we use different kernel sizes
and filters (i.e. audio: 30 filters, each with kernel and
strides (10,2); video: 20 filters, each with kernel and strides
(15,16)) in order to generate a somewhat balanced repre-
sentation of the two time-series modalities at the input level
before sending them deeper into the network for abstrac-
tion. GRUs are used as our recurrent units, favored over
LSTMs for their faster convergence in our experiments, with a
final attention mechanism applied similar to the one in [26].
Our DNN model, shown in Figure
3, is constructed and evaluated using Keras [3] with a Ten-
sorFlow backend, and its hyperparameters are tuned using
talos [11]. After 30 epochs in the same training partition,
we achieved a best-performing model of F1=0.70 on the
same test partition using the same multimodal feature set,
which is competitive to the model performance of F1=0.66
achieved in [2] using SVM. With confirmation on the ef-
fectiveness of our DNN model, we retrained it with the
same F1 loss function and metric on each of the 10 train
folds used in experiment 1. Next, we performed augmenta-
tion of the DNN model with the bias metadata using two
approaches: (1) a simple, intuitive element-wise AND con-
dition between predictions made by the DNN model and
the other learners, and (2) a stacked generalization method
[28] that uses the DNN prediction probabilities and com-
bines them with the raw bias metadata vector before applying
another learning algorithm to generate the final prediction.
A justification for the latter approach is that the shorter but
dense bias metadata vector may be masked by the much
larger set of sparse, time-series modal features, hence com-
bining them in the late fusion stage is more appropriate.
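The two augmentation strategies can be sketched as follows (variable names are illustrative, and the DNN probabilities used to fit the stacker should be produced out-of-fold to avoid leakage): the element-wise AND predicts above-average only when both models agree, while the stacked generalizer appends the DNN's predicted probability to the raw bias metadata vector before fitting a second-level learner.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# (1) Element-wise AND of the two binary predictions (1 = above-average).
dnn_pred = (dnn_probs_test > 0.5).astype(int)   # DNN predictions on the test fold
and_pred = dnn_pred & rf_pred                   # rf_pred: bias-metadata model output

# (2) Stacked generalization: late fusion of the DNN probability with the
# short, dense bias metadata vector, then a second-level learner.
stack_train = np.column_stack([dnn_probs_train, bias_meta_train])
stack_test = np.column_stack([dnn_probs_test, bias_meta_test])

stacker = RandomForestClassifier(n_estimators=500, random_state=42)
stacker.fit(stack_train, y_train)
stack_pred = stacker.predict(stack_test)
```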
Our results for the multimodal augmentation experiments
are shown in Table 1. While using the logical AND aug-
mentation generates a high-precision classifier, the stacked
generalization approach is better at achieving overall model
performance as measured by F1.
Inter-feature correlation:
As further investigation, we compute Cohen’s Kappa (κ)
between each of the bias metadata features and the multi-
modal DNN prediction output. The results indicate only
very slight agreement between the biases and the prediction
output, with most of the κ values centering at zero (GENDER
has the highest κ, at .04).
Feature importance: We again com-
pute feature importances (Random Forest, n=500), this time
using all available features. Consequently, the multimodal
DNN prediction alone accounts for the largest impor-
tance weighting (.09), which is more than the weighting as-
sumed by APPEARANCE (.06) that ranks second. These
findings indicate that the multimodal DNN model by itself
still accounts for substantial variance in the performance of a
joint model. Bias metadata features have a non-negligible,
combined importance weighting (.27), which is concerning
if they are indeed construct-irrelevant. However, it could
also be the case that some of these metadata features proxy
for a construct-relevant latent trait that is not captured in
the DNN features. We also note that these results reflect
a closed-pool-of-subjects setting where the system will not
see new subjects at test time, but only new interviews from
existing subjects. Whether the findings generalize to a case
with new interviewees, unseen at training time, is a question
left for future work.
6. Conclusion
The fact that demographic characteristics and other biasing
variables, despite having little correlation with construct-relevant
multimodal features, have a non-negligible im-
pact on modeling human behavioral tasks is worrisome, and
poses implications for tasks such as the automated assess-
ment of structured video interviews used in high-stakes em-
ployment settings where the decisions made using scoring
models need to be both fair and valid. Future work will
further explore whether there is a causal relationship be-
tween these biases and human scores. If so, we will focus
on debiasing techniques, possibly using avatars during hu-
man scoring that mirror an interviewee’s facial expressions
without carrying over the visual biases, or modeling an in-
dividual human rater’s biases so that they can be statistically
controlled for.
References
[1] L. Chen, G. Feng, J. Joe, C. W. Leong, C. Kitchen, and C. M.
Lee. Towards automated assessment of public speaking skills
using multimodal cues. In Proceedings of the 16th Inter-
national Conference on Multimodal Interaction, pages 200–
203. ACM, 2014. 1
[2] L. Chen, R. Zhao, C. W. Leong, B. Lehman, G. Feng, and
M. E. Hoque. Automated video interview judgment on a
large-sized corpus collected online. In 2017 Seventh Inter-
national Conference on Affective Computing and Intelligent
Interaction (ACII), pages 504–509. IEEE, 2017. 1,2,3,4
[3] F. Chollet et al. Keras. https://keras.io, 2015. 4
[4] J. M. Cortina, N. B. Goldstein, S. C. Payne, H. K. Davison,
and S. W. Gilliland. The incremental validity of interview
scores over and above cognitive ability and conscientious-
ness scores. Personnel Psychology, 53(2):325–351, 2000. 1
[5] H. K. Davison and M. J. Burke. Sex discrimination in sim-
ulated employment contexts: A meta-analytic investigation.
Journal of Vocational Behavior, 56(2):225–248, 2000. 1
[6] A. Dhall, A. Kaur, R. Goecke, and T. Gedeon. Emotiw 2018:
Audio-video, student engagement and group-level affect pre-
diction. In Proceedings of the 2018 on International Con-
ference on Multimodal Interaction, pages 653–656. ACM,
2018. 3
[7] D. Hazarika, S. Poria, R. Mihalcea, E. Cambria, and R. Zim-
mermann. Icon: Interactive conversational memory net-
work for multimodal emotion detection. In Proceedings of
the 2018 Conference on Empirical Methods in Natural Lan-
guage Processing, pages 2594–2604, 2018. 3
[8] M. Hosoda, E. F. Stone-Romero, and G. Coats. The ef-
fects of physical attractiveness on job-related outcomes: A
meta-analysis of experimental studies. Personnel psychol-
ogy, 56(2):431–462, 2003. 1
[9] A. I. Huffcutt, J. M. Conway, P. L. Roth, and U.-C. Klehe.
The impact of job complexity and study design on situational
and behavior description interview validity. International
Journal of Selection and Assessment, 12(3):262–273, 2004.
1
[10] A. I. Huffcutt, J. M. Conway, P. L. Roth, and N. J. Stone.
Identification and meta-analytic assessment of psychological
constructs measured in employment interviews. Journal of
Applied Psychology, 86(5):897, 2001. 2
[11] M. Kotila. Talos: Hyperparameter optimization for Keras,
2018. 4
[12] J. Levashina, C. J. Hartwell, F. P. Morgeson, and M. A. Cam-
pion. The structured employment interview: Narrative and
quantitative review of the research literature. Personnel Psy-
chology, 67(1):241–293, 2014. 1
[13] T.-R. Lin, G. H. Dobbins, and J.-L. Farh. A field study of
race and age similarity effects on interview ratings in con-
ventional and situational interviews. Journal of Applied Psy-
chology, 77(3):363, 1992. 1
[14] F. P. Morgeson, M. H. Reider, M. A. Campion, and R. A.
Bull. Review of research on age discrimination in the em-
ployment interview. Journal of Business and Psychology,
22(3):223–232, 2008. 1
[15] I. Naim, M. I. Tanveer, D. Gildea, and M. E. Hoque. Auto-
mated prediction and analysis of job interview performance:
The role of what you say and how you say it. In 2015 11th
IEEE International Conference and Workshops on Automatic
Face and Gesture Recognition (FG), volume 1, pages 1–6.
IEEE, 2015. 1
[16] L. S. Nguyen, D. Frauendorfer, M. S. Mast, and D. Gatica-
Perez. Hire me: Computational inference of hirability in
employment interviews based on nonverbal behavior. IEEE
transactions on multimedia, 16(4):1018–1031, 2014. 1
[17] L. S. Nguyen and D. Gatica-Perez. I would hire you in a
minute: Thin slices of nonverbal behavior in job interviews.
In Proceedings of the 2015 ACM on International Confer-
ence on Multimodal Interaction, pages 51–58. ACM, 2015.
1
[18] L. S. Nguyen and D. Gatica-Perez. Hirability in the wild:
Analysis of online conversational video resumes. IEEE
Transactions on Multimedia, 18(7):1422–1437, 2016. 1
[19] T. Parr. Feature importances for scikit random forests, 2018.
3
[20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,
V. Dubourg, et al. Scikit-learn: Machine learning in python.
Journal of machine learning research, 12(Oct):2825–2830,
2011. 2,3
[21] V. Ramanarayanan, C. W. Leong, L. Chen, G. Feng, and
D. Suendermann-Oeft. Evaluating speech, face, emotion
and body movement time-series features for automated mul-
timodal presentation scoring. In Proceedings of the 2015
ACM on International Conference on Multimodal Interac-
tion, pages 23–30. ACM, 2015. 1
[22] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should i trust
you?: Explaining the predictions of any classifier. In Pro-
ceedings of the 22nd ACM SIGKDD international confer-
ence on knowledge discovery and data mining, pages 1135–
1144. ACM, 2016. 3
[23] F. L. Schmidt and R. D. Zimmerman. A counterintuitive hy-
pothesis about employment interview validity and some sup-
porting evidence. Journal of Applied Psychology, 89(3):553,
2004. 1
[24] L. M. Schreiber, G. D. Paul, and L. R. Shibley. The devel-
opment and test of the public speaking competence rubric.
Communication Education, 61(3):205–233, 2012. 1
[25] C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn. Bias
in random forest variable importance measures: Illustrations,
sources and a solution. BMC bioinformatics, 8(1):25, 2007.
3
[26] R. Ubale, Y. Qian, and K. Evanini. Exploring end-to-end
attention-based neural networks for native language identifi-
cation. In 2018 IEEE Spoken Language Technology Work-
shop (SLT), pages 84–91. IEEE, 2018. 4
[27] V. Vielzeuf, C. Kervadec, S. Pateux, A. Lechervy, and F. Ju-
rie. An occam’s razor view on learning audiovisual emo-
tion recognition with small training sets. In Proceedings of
the 2018 on International Conference on Multimodal Inter-
action, pages 589–593. ACM, 2018. 3
[28] D. H. Wolpert. Stacked generalization. Neural networks,
5(2):241–259, 1992. 4
[29] J. Yang, K. Wang, X. Peng, and Y. Qiao. Deep recurrent
multi-instance learning with spatio-temporal features for en-
gagement intensity prediction. In Proceedings of the 2018 on
International Conference on Multimodal Interaction, pages
594–598. ACM, 2018. 3