The Validity of Assessment Centres for the Prediction of Supervisory Performance Ratings: A meta-analysis

Eran Hermelin, Filip Lievens and Ivan T. Robertson

Psychology Group, Manchester Business School, University of Manchester, Manchester M15 6PB, UK
Department of Personnel Management and Work and Organizational Psychology, Ghent University, Henri Dunantlaan 2, 9000 Ghent, Belgium
Robertson Cooper Ltd., Manchester, UK
The current meta-analysis of the selection validity of assessment centres aims to update an earlier meta-analysis of assessment centre validity. To this end, we retrieved 26 studies and 27 validity coefficients (N = 5850) relating the Overall Assessment Rating (OAR) to supervisory performance ratings. The current study obtained a corrected correlation of .28 between the OAR and supervisory job performance ratings (95% confidence interval .24 ≤ ρ ≤ .32). It is further suggested that this validity estimate is likely to be conservative given that assessment centre validities tend to be affected by indirect range restriction.
1. Introduction
In human resources management, assessment centres
essentially serve two purposes (Thornton, 1992).
Their first and most traditional purpose is selection
and promotion. In these assessment centre applica-
tions, the so-called Overall Assessment Rating (OAR)
plays a predominant role as selection and promotion
decisions are contingent upon it. Gaugler, Rosenthal,
Thornton, and Bentson (1987) reported findings sup-
porting the validity of predictions made on the basis of
the OAR. Specifically, their meta-analysis estimated the
mean operational validity of the OAR to be .36 with
respect to the criterion of job performance ratings. As
regards their second purpose, assessment centres
are increasingly used for developing managerial talent.
In developmental assessment centres, the focus shifts
from the OAR to assessment centre dimensions which
serve as the basis for providing participants with
detailed feedback about their strengths and weak-
nesses. Recently, Arthur, Day, McNelly, and Edens
(2003) reported evidence supporting the validity of
predictions made on the basis of assessment centre
dimensions. In particular, their meta-analysis focused on
six main assessment centre dimensions (consideration/
awareness of others, communication, drive, influencing
others, organizing and planning, and problem solving),
with validities varying from .25 to .39.
This paper is based in part on the Doctoral thesis submitted by the first author to the University of Manchester Institute of Science and Technology in fulfillment of the requirements for the degree of PhD in Management Science.

© 2007 The Authors. Journal compilation © 2007 Blackwell Publishing Ltd, 9600 Garsington Road, Oxford, OX4 2DQ, UK and 350 Main St., Malden, MA, 02148, USA

International Journal of Selection and Assessment, Volume 15, Number 4, December 2007

Thus, although a recent meta-analysis updated the
validity of predictions made on the basis of assessment
centre dimensions (Arthur et al., 2003), the results of
Gaugler et al. (1987) still serve as the ‘gold standard’ of
assessment centre validity estimation on the basis of the
OAR. However, since 1987, new validity studies have
been conducted that were obviously not included in the
Gaugler et al. (1987) meta-analysis. Accordingly, this
study meta-analysed studies that were not included in
the Gaugler et al. (1987) meta-analysis (from 1985
onward). In keeping with the Gaugler et al. study, we
focus on the selection validity of assessment centres
(i.e., their ability to select the best candidates for a given
job) instead of on the validity of assessment centre
dimensions (see Arthur et al., 2003). Supervisory
performance ratings served as the criterion measure
in the current meta-analysis.
2. Method
2.1. Database
We used a number of strategies to identify validity
studies potentially suited for inclusion in the current
meta-analysis. First, a computerized search of various
electronic databases was conducted (PsycInfo, Social
Sciences Citation Index, etc.). Second, a computerized
search of the British Psychological Society database of
UK-based Chartered Psychologists was undertaken and
academics and practitioners were contacted to identify
individuals who may have access to unpublished assessment centre validity data. Third, around 20 of the top companies in the FTSE 500 index and four of the United Kingdom’s largest occupational psychology firms were contacted.
2.2. Inclusion and coding criteria
We scrutinized the studies retrieved and included them
in our final database if they met the following four
inclusion criteria. First, studies were considered if they
referred to an ‘assessment centre’. An assessment
centre was defined on the basis of the following criteria:
(a) the selection procedure included two or more selection methods, at least one of which was an individual/group simulation exercise; (b) one or more assessors were required to directly observe assessees’ behaviour in at least one of the simulation exercises; (c) evaluation of assessees’ performance on the selection methods included in the selection procedure was (or could be) integrated into an OAR by a clinical or statistical integration procedure, or both; and (d) the selection procedure lasted for at least 2 hours.
As we wanted to update the prior meta-analysis of
the selection validity of assessment centres, a second
criterion specified that only studies published or com-
pleted from 1985 onwards were considered for inclu-
sion in the meta-analytic database.
Third, we used Borman’s (1991) definition of super-
visory job performance ratings, which he defined as ‘an
estimate of individuals’ performance made by a super-
visor’ (p. 280). This estimate could be either an overall
or a multi-dimensional performance evaluation. Hence,
ratings of potential, objective measures of performance,
performance tests, and ratings made by peers were excluded.
Fourth, studies had to provide sufficient information
to be coded. As most of the studies did not report all
the necessary information, the first author attempted
to contact the authors of these studies.
On the basis of these four inclusion criteria 26
studies with 27 non-overlapping validity coefficients
were included in the meta-analysis. The total N was
5850 (as compared with N = 4180 of Gaugler et al.,
1987). Of these studies, 23 had been published, one was
presented at an international conference, and two were
unpublished. The earliest study included was published
in 1985, whereas the most recent study was conducted
in 2005.
The coding of the 27 validity coefficients which
constituted the final meta-analytic dataset was con-
ducted separately by the first and second authors. On
the basis of a sample of studies coded by the authors, a
reliability check revealed that when both authors
entered a coding, their coding agreed in 85% of cases.
The full coding scheme is available from the first author.
At the end of this procedure, the separately coded
datasets were compared and any disagreements were
resolved between the two authors.
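A reliability check of this kind amounts to a simple percent-agreement statistic over the fields both coders entered. A hypothetical sketch in Python (the field names and values below are invented for illustration; they are not the authors' actual coding scheme):

```python
def percent_agreement(coder_a, coder_b):
    """Share of jointly coded fields (both coders entered a value) that agree."""
    both = [(a, b) for a, b in zip(coder_a, coder_b)
            if a is not None and b is not None]
    if not both:
        return None
    return sum(a == b for a, b in both) / len(both)

# Invented codings for one study's attributes (None = field left blank).
a = ["published", 1996, "clinical", None, 0.94]
b = ["published", 1996, "statistical", "managers", 0.94]
print(percent_agreement(a, b))  # 3 of the 4 jointly coded fields agree
```

Restricting the statistic to jointly coded fields mirrors the paper's wording that agreement was computed "when both authors entered a coding".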
2.3. Evaluation of publication bias
The data were subjected to two additional examina-
tions. First, to explore the possibility that assessment
centre validities were somehow related to extraneous
factors which could not be regarded as potential
moderators, corrected validities were correlated with
the study completion/publication date and with the
time interval between the collection of OARs and
supervisory performance ratings. The values of these
two correlations were r = .05 and .10, respectively.
Hence, there was no consistent relationship between
either of these two factors and assessment centre validity.
Second, a funnel plot was constructed to check the
distribution of corrected validities (see Egger, Smith,
Schneider, & Minder, 1997). The idea behind this
procedure is to plot corrected validities against sample
size, so as to examine whether sampled validities
appear to be free of publication bias. As the degree of
sampling error depends on sample size it is to be
expected that the spread of validities would become
progressively smaller as sample sizes increased, thereby
creating a scatter plot resembling a symmetrical in-
verted funnel. Should the sampling be biased, the funnel
plot would be asymmetrical (Egger et al., 1997). For the
studies included in this meta-analysis, the scatter plot of
validities resembled an inverted funnel with validity
coefficients based on small samples showing consider-
able variation, whereas those based on larger samples
tended to converge on the mean meta-analytic validity
coefficient. There was however a tendency for validities
not to be evenly distributed around the mean, with six
coefficients located under the meta-analytic mean, one
coefficient corresponding to the meta-analytic mean,
and 20 coefficients located above the meta-analytic
mean. There was a tendency for studies with larger
sample sizes to be more evenly distributed around the
meta-analytic mean than studies based on smaller
sample sizes. The study contributing the largest sample
size to the meta-analytic dataset (28% of total cases)
was positioned in the middle of the distribution of
validities and so did not skew the outcome of the meta-analysis.
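The funnel-shaped expectation follows from the fact that the sampling error of a correlation shrinks with sample size. As an illustrative sketch (the approximate large-sample standard-error formula is ours, not one the paper states):

```python
import math

def se_r(rho, n):
    """Approximate large-sample standard error of an observed correlation."""
    return (1 - rho ** 2) / math.sqrt(n - 1)

# Expected spread of observed validities around a mean observed r of .17,
# at sample sizes spanning the studies in this dataset (n = 25 to 1637).
for n in (25, 128, 420, 1637):
    print(n, round(se_r(0.17, n), 3))  # spread narrows as n grows
```

Plotting observed validities against n therefore produces the inverted funnel described above when no publication bias is present.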
2.4. Meta-analytic procedure
The dataset was analysed by following the Hunter and
Schmidt (1990) procedures for individually correcting
correlations for experimental artifacts. In instances in
which these were deemed not to be sufficiently detailed
for the purposes of the current study, advice was
solicited directly from the book’s second author
(F. Schmidt, personal communication, 2002). We were
able to obtain range restriction data for 20 out of the
27 validity coefficients included in the dataset. These 20
coefficients were hence individually coded for range
restriction. In the absence of specific information about the range restriction ratios of the remaining seven coefficients, they were assigned the mean of the range restriction ratios of the 20 individually coded coefficients (see Appendix A).
As reliabilities for supervisory performance ratings
were typically not mentioned in the studies included in
our meta-analytic dataset, we decided to use the best
available reliability estimates for supervisory perfor-
mance ratings. In fact, two large scale meta-analyses
found .52 to be the average criterion reliability estimate
for supervisory performance ratings (Salgado et al.,
2003; Viswesvaran, Ones, & Schmidt, 1996). Hence,
we decided to use the value of .52 as the criterion
reliability estimate for all 27 validity coefficients.
Although there now exist procedures to correct for
indirect range restriction (Hunter, Schmidt, & Le, 2006;
Schmidt, Oh, & Le, 2006), we were not able to perform
this correction as indirect range restriction data were
not available in the primary studies. We were therefore
unable to go beyond the standard practice of correcting
the magnitude of observed validities for the presence of
direct range restriction and criterion unreliability.
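The two-step correction just described can be sketched in Python. This is a minimal sketch, assuming the standard direct range restriction (Thorndike Case II) formula with u = restricted/unrestricted predictor SD and the fixed criterion reliability of .52; applied to the observed coefficients in Appendix A, it reproduces the corrected values reported there.

```python
import math

CRITERION_RELIABILITY = 0.52  # fixed estimate used for all 27 coefficients

def correct_validity(r_obs, u):
    """Correct an observed validity for direct range restriction
    (u = restricted/unrestricted predictor SD), then divide by the
    square root of criterion reliability to correct for attenuation."""
    r_rr = r_obs / math.sqrt(u ** 2 + r_obs ** 2 * (1 - u ** 2))
    r_full = r_rr / math.sqrt(CRITERION_RELIABILITY)
    return r_rr, r_full

# Observed r and u taken from Appendix A; the printed values match the
# corrected columns there (.25/.35, .30/.41, .19/.26).
for study, r, u in [("Dobson and Williams (1989)", 0.14, 0.55),
                    ("Feltham (1988)", 0.16, 0.52),
                    ("Binning et al. (1999)", 0.15, 0.80)]:
    r_rr, r_full = correct_validity(r, u)
    print(study, round(r_rr, 2), round(r_full, 2))
```

Note that when u = 1 (no restriction), the first step leaves the observed coefficient unchanged and only the attenuation correction applies.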
3. Results
3.1. Assessment centre validity
As shown in Table 1, the mean observed r based on a
total sample size of 5850 was .17. Correcting this
coefficient for direct range restriction in the predictor
variable increased its value to .20. When the coefficient
was also corrected for criterion unreliability, the popu-
lation estimate for the correlation between OARs and
supervisory performance ratings increased to .28 [95% confidence interval (CI) = .24 ≤ ρ ≤ .32]. Details of the distribution of artifacts used to individually correct observed validity coefficients are provided in Table 1,
which shows that 84% of variance in validity coefficients
may be explicable in terms of sampling error. Conse-
quently, once the variance theoretically contributed by
sampling error was removed, little unexplained variance remained, and the detection of potential moderator variables was therefore unlikely. Nevertheless, we tested for
various moderators (e.g., number of dimensions as-
sessed, number of different selection methods, type of
integration procedure used). As could be expected,
none of these moderators was significant.
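The variance decomposition in Table 1, and one way the 95% confidence interval can be reconstructed, are easy to verify numerically. The CI formula below (1.96 × √(s²/k) around the mean corrected validity) is our assumption for illustration; the paper does not spell out its CI computation.

```python
import math

# Quantities reported in Table 1.
k = 27                  # number of validity coefficients
rho = 0.28              # mean corrected validity
var_rc = 0.0123         # variance of corrected validities
var_sampling = 0.0104   # sampling-error variance

explained = var_sampling / var_rc          # share of variance due to sampling error
sd_rho = math.sqrt(var_rc - var_sampling)  # SD of corrected correlations

# Hypothetical reconstruction of the 95% CI around the mean corrected validity.
half_width = 1.96 * math.sqrt(var_rc / k)
ci = (round(rho - half_width, 2), round(rho + half_width, 2))
print(round(100 * explained, 1), round(sd_rho, 2), ci)  # matches Table 1 within rounding
```

Under these inputs the explained-variance share, the SD of .04, and the .24–.32 interval all agree with the reported results to the precision given in the paper.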
4. Discussion
The typically used meta-analytic estimate of the validity
of assessment centre OARs (Gaugler et al., 1987) is
based on studies conducted prior to 1985, some of
which are now over 50 years old. However, in the last
20 years, many new assessment centre validation
studies have been conducted. Although Arthur et al.
(2003) recently provided an updated estimate of the
validity of assessment centre dimensions, it is also
important to provide an updated estimate of assess-
ment centre OAR validity, as the OAR is almost always
used when assessment centres are used for selection
purposes (as opposed to developmental purposes).
Therefore, this study provides a meta-analytic update
to the old value obtained by Gaugler et al. (1987). The
current investigation is also based on a larger sample
size (N¼5850) than the sample size of 4180 used in the
Gaugler et al. (1987) meta-analysis.
The mean population estimate of the correlation
between assessment centre OARs and supervisory
performance ratings in the current study was ρ = .28 (95% CI = .24 ≤ ρ ≤ .32). Our estimate is thus significantly lower than the value of ρ = .36 (95% CI = .30 ≤ ρ ≤ .42) reported by Gaugler et al. (1987),
which lies outside the 95% CI fitted around our
estimated population value. A possible explanation for
this finding is that the participants of modern assess-
ment centres are subject to more pre-selection (given
that they are so costly) than was customary in earlier
assessment centres. This would result in more indirect
range restriction in the modern assessment centres
and consequently, in lower observed and corrected validities.
Unfortunately, we could not correct our data for
indirect range restriction because the required indirect
range restriction data were simply not reported in the
primary studies. Nevertheless, in ancillary analyses we
found some ‘indirect’ evidence of the impact of indirect
range restriction on assessment centre data. Specifi-
cally, six studies within the meta-analytic dataset re-
ported validities for cognitive ability tests that were
used in the same selection stage within/alongside the
assessment centre. The mean observed validity of these
cognitive ability tests with respect to the criterion of
job performance ratings was .10 (N = 1757). Thus, the
validity of cognitive ability tests used within or alongside
an assessment centre seemed to be much lower than the
observed meta-analytic validities for cognitive ability
tests as stand alone predictors (.24 and .22) reported by
Hunter (1983) and Schmitt, Gooding, Noe, and Kirsch
(1984) for US data. It is also much lower than the
observed meta-analytic validity for cognitive ability tests
as stand alone predictors (.29) on the basis of recent
European data (Salgado et al., 2003). Although this
comparison should be made with caution, it seems to
indicate that the depressed validity of cognitive ability
tests used within/alongside assessment centres might
also result from considerable indirect range restriction
on the predictor variable – most likely due to pre-
selection on cognitive factors.
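The mechanism can be illustrated with a small simulation (entirely hypothetical numbers, not this dataset): pre-selecting candidates on a screen that correlates with the predictor depresses the predictor–criterion correlation even though no one is selected directly on the predictor itself.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

random.seed(42)
# z: pre-selection composite (e.g. a cognitive screen); x: OAR-like predictor
# correlated .7 with z; y: job performance, correlated .5 with x.
z = [random.gauss(0, 1) for _ in range(20000)]
x = [0.7 * zi + math.sqrt(1 - 0.49) * random.gauss(0, 1) for zi in z]
y = [0.5 * xi + math.sqrt(1 - 0.25) * random.gauss(0, 1) for xi in x]

r_full = pearson(x, y)  # validity in the unselected applicant pool
keep = [i for i, zi in enumerate(z) if zi > 0]  # retain only the top half on z
r_restricted = pearson([x[i] for i in keep], [y[i] for i in keep])
print(round(r_full, 2), round(r_restricted, 2))  # restricted validity is lower
```

Because selection operates on z rather than on x, this is indirect (incidental) range restriction, and the standard direct-restriction correction would not fully recover the unrestricted validity.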
On a broader level, these results show that the
selection stage should always be taken into account
when reporting the validity of predictors (Hermelin & Robertson, 2001; see also Roth, Bobko, Switzer, & Dean, 2001). Hence, we urge future assessment centre
researchers to routinely report (1) the selection stage
within which assessment centres are used, (2) the pre-
selection ratio of assessment centre participants, and
(3) the correlation between the predictor composite
used in preliminary selection stages and the OAR used
in later stages. Only when this information becomes available will it be possible to examine more fully the
indirect range restriction issue in assessment centres
and to perform the corrections for indirect range
restriction [according to the procedures detailed in
Hunter et al. (2006) and Schmidt et al. (2006)].
In regard to the potential presence of moderator
variables, the current investigation suggests that very
little variance remains unaccounted for once sampling
error has been removed. This result contradicts the
notion that assessment centres should show consider-
able variation in validities given the wide variations in
their design and implementation. We believe that this finding is more likely due to little chance variation in the validity coefficients included in the dataset than to a genuine absence of moderator effects.
The following directions deserve attention in future
research on the predictive validity of assessment centres.
First, the criterion measures for validating assessment
centres should be broadened. Over the last decade, one of the major developments in criterion theory has been the distinction between task performance and citizenship
performance (Borman & Motowidlo, 1993). To our knowledge, no studies have linked assessment centre ratings to citizenship behaviours. This is surprising because one of the key advantages of assessment centres is that they are able to measure interpersonally oriented dimensions. Second, it is of great importance that future studies examine the incremental validity of assessment centres over and above so-called low-fidelity simulations such as situational judgment tests (McDaniel, Morgeson, Finnegan, Campion, & Braverman, 2001). These tests have gained in popularity because of their ease of administration in large groups and low costs. In addition, they seem to capture interpersonal aspects of the criterion space and have shown good predictive validity.

Table 1. Summary of meta-analysis results

Mean validity estimates for meta-analytic dataset:
N = 5850; K = 27; mean observed r = .17; r corrected for range restriction = .20; ρ (corrected for range restriction and criterion unreliability) = .28; 95% CI: .24 ≤ ρ ≤ .32.

Distribution of meta-analytic artifacts used to correct observed correlations:
u = .94; s²u = .016; c = .94; s²c = .015; ryy = .52; s²ryy = 0; b = .72; s²b = 0; A = .64.

Variance estimates for meta-analytic dataset:
s²rbc = .0123; s²erbc = .0104; s²res = .0019; explained variance = 84%; SDρ = .04.

Note: N, meta-analytic sample size; K, number of validity coefficients contributing to meta-analytic sample; r, mean sample-weighted observed r; u, mean range restriction on OARs; s²u, variance in range restriction on OARs; c, mean range restriction correction factor; s²c, variance in range restriction correction factor; ryy, mean criterion reliability estimate; s²ryy, variance in criterion reliability estimate; b, mean criterion unreliability correction factor; s²b, variance in criterion unreliability correction factor; A, mean artifact attenuation factor (bc); s²rbc, variance in validities corrected for range restriction and criterion unreliability; s²erbc, weighted mean sampling error variance estimated for validities corrected for range restriction and criterion unreliability; s²res, residual variance in validities corrected for range restriction and criterion unreliability; explained variance, percentage of variance explained by sampling error; SDρ, standard deviation of corrected correlations.
Acknowledgements

We would like to thank Frederik Anseel for his insightful comments on a previous version of this manuscript.
References

Arthur, W., Jr, Day, E.A., McNelly, T.L. and Edens, P.S. (2003) A Meta-Analysis of the Criterion-Related Validity of Assessment Center Dimensions. Personnel Psychology, 56, 125–154.

Borman, W.C. (1991) Job Behavior, Performance, and Effectiveness. In: Dunnette, M.D. and Hough, L.M. (eds), Handbook of Industrial and Organizational Psychology. Palo Alto, CA: Consulting Psychologists Press, pp. 269–313.

Borman, W.C. and Motowidlo, S.J. (1993) Expanding the Criterion Domain to Include Elements of Contextual Performance. In: Schmitt, N., Borman, W.C. and Associates (eds), Personnel Selection in Organizations. San Francisco: Jossey-Bass, pp. 71–98.

Egger, M., Smith, G.D., Schneider, M. and Minder, C. (1997) Bias in Meta-Analysis Detected by a Simple Graphical Test. British Medical Journal, 315, 629–634.

Gaugler, B.B., Rosenthal, D.B., Thornton, G.C. and Bentson, C. (1987) Meta-Analysis of Assessment Center Validity. Journal of Applied Psychology, 72, 493–511.

Hermelin, E. and Robertson, I.T. (2001) A Critique and Standardization of Meta-Analytic Validity Coefficients in Personnel Selection. Journal of Occupational and Organizational Psychology, 74, 253–277.

Hunter, J.E. (1983) Test Validation for 12,000 Jobs: An application of job classification and validity generalization analysis to the general aptitude test battery. USES Test Research Report No. 45, Division of Counseling and Test Development, Employment and Training Administration, US Department of Labor, Washington, DC.

Hunter, J.E. and Schmidt, F.L. (1990) Methods of Meta-Analysis: Correcting error and bias in research findings. Beverly Hills, CA: Sage.

Hunter, J.E., Schmidt, F.L. and Le, H. (2006) Implications of Direct and Indirect Range Restriction for Meta-Analysis Methods and Findings. Journal of Applied Psychology, 91,

McDaniel, M.A., Morgeson, F.P., Finnegan, E.B., Campion, M.A. and Braverman, E.P. (2001) Use of Situational Judgment Tests to Predict Job Performance: A clarification of the literature. Journal of Applied Psychology, 86, 730–740.

Roth, P.L., Bobko, P., Switzer, F.S. and Dean, M.A. (2001) Prior Selection Causes Biased Estimates of Standardized Ethnic Group Differences: Simulation and analysis. Personnel Psychology, 54, 591–617.

Salgado, J.F., Anderson, N., Moscoso, S., Bertua, C., De Fruyt, F. and Rolland, J.P. (2003) A Meta-Analytic Study of General Mental Ability Validity for Different Occupations in the European Community. Journal of Applied Psychology, 88,

Schmidt, F.L., Oh, I. and Le, H. (2006) Increasing the Accuracy of Corrections for Range Restriction: Implications for selection procedure validities and other research results. Personnel Psychology, 59, 281–305.

Schmitt, N., Gooding, R.Z., Noe, R.A. and Kirsch, M. (1984) Meta-Analyses of Validity Studies Published Between 1964 and 1982 and the Investigation of Study Characteristics. Personnel Psychology, 37, 407–422.

Thornton, G.C., III. (1992) Assessment Centers and Human Resource Management. Reading, MA: Addison-Wesley.

Viswesvaran, C., Ones, D.S. and Schmidt, F.L. (1996) Comparative Analysis of the Reliability of Job Performance Ratings. Journal of Applied Psychology, 81, 557–574.
References to articles included in meta-
analytic dataset
Anderson, L.R. and Thaker, J. (1985) Self-Monitoring and Sex as Related to Assessment Center Ratings and Job Performance. Basic and Applied Social Psychology, 6, 345–361.

Arthur, W. and Tubre, T. (2001) The Assessment Center Construct-Related Validity Paradox: An investigation of self-monitoring as a misspecified construct. Unpublished Manuscript.

Binning, J.F., Adorno, A.J. and LeBreton, J.M. (1999) Intraorganizational Criterion-Based Moderators of Assessment Center Validity. Paper presented at the Annual Conference of the Society for Industrial and Organizational Psychology, Atlanta, GA, April.

Bobrow, W. and Leonards, J.S. (1997) Development and Validation of an Assessment Center During Organizational Change. Journal of Social Behavior and Personality, 12, 217–

Burroughs, W.A. and White, L.L. (1996) Predicting Sales Performance. Journal of Business and Psychology, 11, 73–84.

Chan, D. (1996) Criterion and Construct Validation of an Assessment Centre. Journal of Occupational and Organizational Psychology, 69, 167–181.

Dayan, K., Kasten, R. and Fox, S. (2002) Entry-Level Police Candidate Assessment Center: An efficient tool or a hammer to kill a fly? Personnel Psychology, 55, 827–849.

Dobson, P. and Williams, A. (1989) The Validation of the Selection of Male British Army Officers. Journal of Occupational Psychology, 62, 313–325.

Feltham, R. (1988) Assessment Centre Decision Making: Judgmental vs. mechanical. Journal of Occupational Psychology,
Fleenor, J.W. (1996) Constructs and Developmental Assessment Centers: Further troubling empirical findings. Journal of Business and Psychology, 10, 319–333.

Fox, S., Levonai-Hazak, M. and Hoffman, M. (1995) The Role of Biodata and Intelligence in the Predictive Validity of Assessment Centres. International Journal of Selection and Assessment, 3, 20–28.

Goffin, R.D., Rothstein, M.G. and Johnston, N.G. (1996) Personality Testing and the Assessment Center: Incremental validity for managerial selection. Journal of Applied Psychology, 81, 746–756.

Goldstein, H.W., Yusko, K.P., Braverman, E.P., Smith, D.B. and Chung, B. (1998) The Role of Cognitive Ability in the Subgroup Differences and Incremental Validity of Assessment Center Exercises. Personnel Psychology, 51,

Gomez, J.J. and Stephenson, R.S. (1987) Validity of an Assessment Center for the Selection of School-Level Administrators. Educational Evaluation and Policy Analysis, 9, 1–7.

Higgs, M. (1996) The Value of Assessment Centres. Selection and Development Review, 12, 2–6.

Hoffman, C.C. and Thornton, G.C., III. (1997) Examining Selection Utility Where Competing Predictors Differ in Adverse Impact. Personnel Psychology, 50, 455–470.

Jones, R.G. and Whitmore, M.D. (1995) Evaluating Developmental Assessment Centers as Interventions. Personnel Psychology, 48, 377–388.

McEvoy, G.M. and Beatty, R.W. (1989) Assessment Centres and Subordinate Appraisals of Managers: A seven year examination of predictive validity. Personnel Psychology, 42,

Moser, K., Schuler, H. and Funke, U. (1999) The Moderating Effect of Raters’ Opportunities to Observe Ratees’ Job Performance on the Validity of an Assessment Centre. International Journal of Selection and Assessment, 7, 133–141.

Nowack, K.M. (1997) Congruence Between Self–Other Ratings and Assessment Center Performance. Journal of Social Behavior and Personality, 12, 145–166.

Pynes, J. and Bernardin, H.J. (1992) Entry-Level Police Selection: The assessment center is an alternative. Journal of Criminal Justice, 20, 41–55.

Robertson, I. (1999) Predictive Validity of the General Fast Stream Selection Process. Unpublished Validity Report, School of Management, UMIST.

Russell, C.J. and Domm, D.R. (1995) Two Field Tests of an Explanation of Assessment Centre Validity. Journal of Occupational and Organizational Psychology, 68, 25–47.

Schmitt, N., Schneider, J.R. and Cohen, S.A. (1990) Factors Affecting Validity of a Regionally Administered Assessment Center. Personnel Psychology, 43, 1–12.

Thomas, T., Sowinski, D., Laganke, J. and Goudy, K. (2005) Is the Assessment Center Validity Paradox Illusory? Paper presented at the 20th Annual Conference of the Society for Industrial and Organizational Psychology, Los Angeles, CA, April.

Tziner, A., Meir, E.I., Dahan, M. and Birati, A. (1994) An Investigation of the Predictive Validity and Economic Utility of the Assessment Center for the High-Management Level. Canadian Journal of Behavioral Science, 26, 228–245.
Appendix A
Table A1. Summary of validity results from studies included in meta-analysis
Author(s); observed overall validity coefficient; sample size; range restriction in predictor (u); validity coefficient corrected for direct range restriction; validity coefficient corrected for direct range restriction and criterion unreliability
Anderson and Thaker (1985) .43 25 1 .43 .60
Arthur and Tubre (2001) .23 70 .94 .24 .34
Binning, Adorno, and LeBreton (1999) .15 1637 .8 .19 .26
Bobrow and Leonards (1997) .23 71 1 .23 .32
Burroughs and White (1996) .49 29 1 .49 .68
Chan (1996) .06 46 1 .06 .08
Dayan, Kasten, and Fox (2002) .17 420 .94 .18 .25
Dobson and Williams (1989) .14 450 .55 .25 .35
Feltham (1988) .16 128 .52 .30 .41
Fleenor (1996) .24 85 1 .24 .33
Fox, Levonai-Hazak, and Hoffman (1995) .33 91 1 .33 .46
Goffin, Rothstein, and Johnston (1996) .30 68 .88 .34 .47
Goldstein, Yusko, Braverman, Smith, and Chung (1998) .18 633 1 .18 .25
Gomez and Stephenson (1987) .19 121 .94 .20 .28
Higgs (1996) .32 123 .94 .34 .47
Hoffman and Thornton (1997) .26 118 1 .26 .36
Jones and Whitmore (1995) .03 149 1 .03 .04
McEvoy and Beatty (1989) .28 48 1 .28 .39
Moser, Schuler, and Funke (1999) .37 144 1 .37 .51
Nowack (1997) .25 144 1 .25 .35
Pynes and Bernardin (1992) .23 68 1 .23 .32
Robertson (1999) .23 105 .94 .24 .34
Russell and Domm (1995) .22 140 1 .22 .31
Russell and Domm (1995) .23 172 1 .23 .32
Schmitt, Schneider and Cohen (1990) .08 402 .95 .08 .12
Thomas, Sowinski, Laganke, and Goudy (2005, April) .30 56 .94 .31 .43
Tziner, Meir, Dahan, and Birati (1994) .21 307 .94 .22 .31