Overlooking Overkill? Beyond the 1-to-5 Rating Scale

Robert B. Kaiser, Partner, and Robert E. Kaplan, Partner, Kaplan DeVries, Inc.

HUMAN RESOURCE PLANNING 28.3
[Editor’s Note: An earlier version of this
article was presented at the 19th annual
meeting of the Society for Industrial and
Organizational Psychology in Chicago,
Illinois, in April 2004. This article was
condensed from a fuller version that has
additional supporting statistical and anecdo-
tal analysis.]
Over two millennia ago, Aristotle (trans.
1982) wrote in his Ethics that what is good,
virtuous, and effective in thought and action
is difficult to achieve. He noted that ineffec-
tiveness is characterized either by
deficiency—too little of the prized behav-
ior—or by excess—too much of it. This old
and worthy idea, that deficiency and excess
constitute two fundamental classes of faulty
performance, strikes most people as common
sense. Nevertheless, the idea has somehow
been overlooked in the design of formal
systems and instruments commonly used to
assess the performance of managers.
The Problem
The method of choice for measuring
performance in organizations is the behav-
ioral rating scale (Murphy & Cleveland,
1995). First applied to the problem of psychological
measurement by Francis Galton
late in the 19th century (Aiken, 1996), rating
scales have evolved considerably over the last
hundred years. Their modern form can be
found in the now-ubiquitous 360° survey.
These instruments typically employ a varia-
tion on Rensis Likert’s (1932) solution for
measuring attitudes, the Likert-type scale. In
applying Likert’s method to the measurement
of performance, the “agree-disagree”
response format has been modified to take
one of two general forms.
Most common is the frequency type of
response scale (Leslie & Fleenor, 1998).
Rating formats of this “less-to-more” variety
require raters to indicate how often the
manager exhibits a particular behavior or
how characteristic a particular statement is
of that manager. Response options are
ordered categories anchored by adverbs such
as “never, sometimes, usually, often, always”
to convey how often the manager engages in
the described behavior. Or, to indicate how
characteristic the descriptor is of the manag-
er, the anchors might be something like “not
at all, to a little extent, to some extent, to a
great extent, to a very great extent.” These
scales carry the appearance of objectivity in
that it is assumed that raters use them to
merely describe the frequency of behavior
(Nathan & Alexander, 1988).
The second kind of response scale is the
evaluation type, in which the rater is asked to
judge how effectively the manager performs
the behavior, role, or function described
by the survey item. There are two general
classes of this “how well” variety of rating
format: evaluation of performance in
absolute terms and evaluation of perfor-
mance in relative terms. Absolute evaluation
scales contain response categories with
adjective anchors such as “ineffective, ade-
quate, good, effective, and exceptional.”
Relative evaluation scales require the respon-
dent to compare the ratee’s performance to
some reference group—for example, with
instructions and anchors such as “relative
to other managers at Acme, this manager’s
performance is: among the worst, below
average, average, above average, among the
best.”
The key distinction between frequency
and evaluation response scales is that the
former asks raters to describe performance
whereas the latter requires raters to judge
the quality of performance (Stockford &
Bissell, 1949). There is another difference
between these two types of scales: Each has
a unique limitation when it comes to capturing
excesses (see Note 1).
An Illustration
Consider Rick Strong, a fictitious senior
manager who resembles several executives
we’ve worked with over the years. A keen
analyzer of what works and what does not,
Rick is extremely results-oriented and consis-
tently achieves his objectives. Despite how
productive he and his unit are, his staff has
misgivings. In particular, they think Rick can
be critical, sometimes verging on abusive,
when they do not meet his lofty expectations.
Moreover, he is short on praise: you definitely
hear about it when you are not up to
snuff, but rarely do you get a “good going”
pat on the back. How would you rate
Rick on the items with the frequency and
evaluation response scales presented in
Exhibit 1?

EXHIBIT 1
Rating Rick Strong with a Frequency and Evaluation Scale

Frequency Scale options, left to right: Never, Rarely, Sometimes, Often, Always
Evaluation Scale options, left to right: Ineffective, Adequate, Effective, Very Effective, Outstanding
(X marks the rating given; O marks an option not chosen.)

Item                                                 Frequency    Evaluation
Does whatever it takes to get results.               O O O O X    O O X O O
Makes judgments: zeroes in on what is not working.   O O O O X    O X O O O
Shows appreciation: helps people feel good about
their contribution.                                  O X O O O    X O O O O

The frequency scale fails to distinguish
between very much and too much. There
is no question that Rick “always” does
whatever it takes and makes judgments, so
he gets the highest rating on these items. And
because “high” scores are taken to be ideal,
there is an unstated assumption here that
“more is better.” This is unfortunate because
it is widely understood that too much of a
good thing is not so good. That is how
strengths become weaknesses. But it is not
likely Rick will get the message in this case.
On the upside, the frequency scale does an
adequate job of capturing deficiencies: The
low rating on “shows appreciation” effec-
tively indicates something Rick needs to do
more often.
The evaluation scale introduces ambigui-
ty at the other end of the register. What does
Rick conclude from his merely “adequate”
score on “Zeroes in on what isn’t working”?
Is he not discriminating enough or is he
hypercritical? And a similar question can
arise about his score on “Shows appreciation.”
Does the low score indicate he does
not give enough praise or that he doles it out
indiscriminately? Thus, although high scores
on evaluation rating scales may reveal clear
strengths, low scores are unclear. They
muddle the distinction between deficiency
and excess. Our point with this illustration is
that the rating scales commonly used in practice
are not adequate for detecting excess, that is,
when strengths are overused. This despite the
widespread recognition that managers, the
intense and driven lot that they are, can get
into trouble by going overboard just as well
as they can by being deficient (Kaplan &
Kaiser, 2003a, 2003b; Lombardo &
Eichinger, 2000; McCall, 1998; McCall &
Lombardo, 1983).
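
To make the illustration concrete, the marks printed in Exhibit 1 can be decoded into their scale anchors. The sketch below is illustrative only: the names `decode` and `ambiguity` are hypothetical, and treating “Adequate” or lower as a “low” evaluation score is an assumption made for this sketch, not a rule taken from any instrument.

```python
# Response options in the order they appear in Exhibit 1.
FREQUENCY = ["Never", "Rarely", "Sometimes", "Often", "Always"]
EVALUATION = ["Ineffective", "Adequate", "Effective", "Very Effective", "Outstanding"]

# Mark strings as printed in Exhibit 1: five frequency marks, then five
# evaluation marks; "X" is the option chosen for Rick Strong.
EXHIBIT_1 = {
    "Does whatever it takes to get results": "OOOOXOOXOO",
    "Makes judgments": "OOOOXOXOOO",
    "Shows appreciation": "OXOOOXOOOO",
}


def decode(marks):
    """Return (frequency anchor, evaluation anchor) for a 10-character mark string."""
    return FREQUENCY[marks[:5].index("X")], EVALUATION[marks[5:].index("X")]


def ambiguity(frequency, evaluation):
    """List what each rating leaves unclear (thresholds are assumptions of this sketch)."""
    unclear = []
    if frequency == "Always":
        # The top of a less-to-more scale cannot separate "very much" from "too much".
        unclear.append("frequency: very much or too much?")
    if evaluation in ("Ineffective", "Adequate"):
        # A low "how well" rating does not say in which direction the manager is off.
        unclear.append("evaluation: deficiency or excess?")
    return unclear


for item, marks in EXHIBIT_1.items():
    frequency, evaluation = decode(marks)
    print(item, "|", frequency, "|", evaluation, "|", ambiguity(frequency, evaluation))
```

Under these assumptions, the “Makes judgments” item is flagged on both scales, which is the case the illustration dwells on.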
A Solution
The limitations of traditional rating scales
dawned on us in the early 1990s. The insight
came out of comprehensive assessments of
executives that involved extensive interviews
with coworkers past and present as well as a
battery of psychological tests and 360°
ratings. In the course of helping his clients
make sense of their data, Bob Kaplan
stumbled on the oversight (see Kaplan,
1996). He found himself remarking, “You
are a force to be reckoned with.” It followed
that he would sum up their shortcomings
with the phrase, “too forceful.” It was plain
as day in the interview data, whether direct
reports were bemoaning an autocratic style,
peers were complaining about never getting
a word in edgewise, or superiors were concerned
about an intense drive. Something just
did not add up: None of the 360° ratings
directly indicated overkill.
Looking for a way to correct for this
limitation of existing 360° instruments
(including his own, SKILLSCOPE® for
Managers (Kaplan, 1988)), Kaplan (1996)
devised what he called a “curvilinear” rating
scale. Low ratings were anchored with “too
little,” high ratings were anchored with
“too much.” And like Goldilocks’ favorite
porridge, the optimal rating, in the middle,
was anchored with “the right amount.” Rob
Kaiser has joined Kaplan in conducting
ongoing research and refining the new rating
scale and a prototype 360° questionnaire,
now called the Leadership Versatility Index®.
In its present form, the new response scale
looks like the one in Exhibit 2. Raters are
alerted that the scale is not simply less-to-more
where “more is better.” For instance, minus
scores on the deficiency side and plus scores
on the excess side call attention to these two
different types of performance problems.
According to recent developments in the
study of mental processes involved in making
ratings, the negative and positive numbers
(and the arrows) also convey to raters that
each side of the scale is distinct: Low is not a
lack of high, it is the opposite of it (Schwarz,
1999). The scale can be a powerful way to
tease apart the two types of ineffective performance
in developmental feedback.
The “too little/too much” response scale
combines elements of both the frequency and
evaluation format because it contains
descriptive (how much?) as well as judgmen-
tal (how well?) components. Also, this scale
appears to take context into account: It
implies a judgment of frequency relative to
this job in this organization at this time.
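
As a minimal sketch of how a rating on the -4 to +4 scale carries both a direction and a size, the function below splits a rating into those two parts. The function name `describe` and the choice to report distance from zero as a plain number are assumptions of this sketch; the instrument itself only supplies the anchors shown in Exhibit 2.

```python
def describe(rating):
    """Map a rating on the -4..+4 scale to (direction, distance from optimal)."""
    if rating not in range(-4, 5):
        raise ValueError("rating must be a whole number from -4 to +4")
    if rating < 0:
        direction = "too little"       # deficiency side of the scale
    elif rating > 0:
        direction = "too much"         # excess side of the scale
    else:
        direction = "the right amount" # optimal point in the middle
    return direction, abs(rating)


for rating in (-4, -1, 0, 1, 4):
    print(rating, describe(rating))
```

Unlike a 1-to-5 rating, the sign tells the recipient which kind of problem is being reported and the distance from zero says how far from optimal the rater judged the behavior to be.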
A project for a client led to the development
of another version of our rating scale.
Motorola Inc. commissioned us to help
develop a leadership model and attendant
performance measures to be used with its top
1,000 executives (Kaiser, et al., 2002).
Motorola approached us because senior
management was taken by our “too little/too
much” scale and wanted to employ it in their
tool. But there was also a need for a tradi-
tional effectiveness scale because the results
would be used both for development and for
administrative purposes and because the
company needed to compare scores directly
among individuals. Motorola therefore
decided to use two rating scales, an evalua-
tion scale and an adaptation of our new scale
designed to complement an evaluation scale,
the “do less/do more” scale shown in Exhibit
3. We describe later how this scale comple-
ments an evaluation response scale by clari-
fying the meaning of “less effective” ratings.
Benefits
Through our consulting practice and
program of basic research, we have found
several advantages of this new design for
response scales. These benefits accrue to
raters, feedback recipients, organizations,
and researchers. We also have some concerns
and questions that need to be addressed.
First, the benefits.
Benefits to Raters
In introducing this new approach to
groups of managers, we find two striking
results. First, the “too little/too much”
distinction is not hard to grasp: People
intuitively seem to understand it. Second,
some people report feeling less constrained in
making assessments using the new scale.
Others can see that the response scale adds
new possibilities but tend to be at a loss for
fully explaining how. When we ask them to
contrast this experience to their experience
with traditional scales, we hear things like:
“Well, I’m not always sure what a ‘3’ is supposed
to mean,” or “I usually use the middle
two values, but on this scale I couldn’t
because they weren’t always a strength; it
forced me to use more of the options.”
Sometimes we also hear: “This scale allowed
me to indicate, ‘yes, you are strong in that
area, but sometimes a little too strong.’”

EXHIBIT 2
The Implicitly Curvilinear, “Too Little/Too Much” Response Scale

Too little                The right amount                Too much
 -4     -3     -2     -1          0          +1     +2     +3     +4
Much                Barely                 Barely                Much
too                 too                    too                   too
little              little                 much                  much

Benefits to Feedback Recipients
There are two benefits of the new scale to
feedback recipients. One, what the results
mean is much clearer. Low scores on evalua-
tion scales are ambiguous, and high scores on
frequency scales do not draw the line
between plenty and too much, but the
“curvilinear” scale leaves little doubt what
the results mean when they are cast in terms
of “too little,” “the right amount,” and “too
much.” As one director of talent manage-
ment whose firm has adopted our model of
leadership and tool said: “There is a confi-
dence in interpreting results—you know
right away what to do about it, whether it’s
step up, tone down, or do more of the same.”
The second benefit is a better spread using
this response scale than using standard
response scales. In recent years, a common
complaint heard in organizations is that
“everyone gets high scores on everything.”
In other words, ratings do not appear to dis-
criminate within a person (that is, distinguish
between his or her strengths and weaknesses)
or between people (that is, distinguish
between higher and lower performers). No
doubt, one reason is that raters mostly use
only a portion of the typical five-point scale.
This is to be expected: Through “corporate
Darwinism” individuals selected into
management positions are the ones who
have the ability, motivation, and experience
to do the job (LeBreton, et al., 2003). Rating
distributions get heavily skewed toward
the top end, especially over time as junior
managers get better through experience.
The “too little/too much” scale also helps
spread scores out. First, because the optimal
score is in the middle of the scale, frequency
distributions tend to be relatively normal and
centered. Second, because deficiency and
excess are teased apart, there is a generous
spread in both directions surrounding
optimal. Finally, because the response scale
is effectively nine points (-4 to +4), nearly
double the typical scale (1 to 5), scores are
distributed over a wider range and differ-
ences are more readily apparent to the naked
eye (see Note 2). Thus, when it comes to making sense
of feedback results, the curvilinear scale
provides an advantage by spreading scores
out and by distinguishing between too little
and too much.
Benefits to Organizations
In our work with Motorola, we learned
firsthand how the idea of accounting for
overkill and an application of that idea in the
form of a performance-appraisal tool can
have an impact on an organization (Kaplan
& Kaiser, 2003a). Recall that we designed a
leadership model and tool for them that
involved two ratings for each item—an
absolute evaluation rating and a prescriptive
“do less/do more” rating. The first thing we
learned was how the basic idea of excess can
expand the language an organization uses to
discuss leadership and development. Second,
assessing individuals in terms of “too little
and too much” as well as absolute effec-
tiveness with an evaluation scale packs a
powerful one-two informational punch for
decision makers.
Senior leaders at Motorola wanted to
reflect the tensions and trade-offs inherent in
the business world in their model and mea-
sures of leadership. They were talking about
a kind of leadership that navigated the straits
and avoided crashing on one side or the
other: for instance, balancing vision with
execution and balancing “edge,” the tough
side of leadership, with empowering and
supporting people. The idea that problems
come in both flavors, deficiency and excess,
played naturally to this view: Out-of-balance
leadership could easily be described as too
much focus on execution, not enough vision;
too much pushing for results, not enough
support; and so on. By recognizing overkill
explicitly in their model, tools, and conversa-
tions, senior leaders at Motorola created a
leadership culture that was wary of excesses.
They also provided a new way to appreciate
agility and the daunting trade-offs with
which senior managers must contend.
One senior HR person remarked a few
years after launching the model and assess-
ment tools: “What’s most fascinating are
those cases where the person gets a relatively
high effectiveness rating on an item like
‘Expects a lot,’ but several coworkers also
indicate ‘do less.’ These tend to be the fast-
trackers who risk derailing because their
intensity can become too much. The level of
dialogue in these sessions is amazing. You
can see the light bulb go on.”
On a broader scale, weaving the idea of
overkill directly into the fabric of their
leadership model and 360° tools has opened
the door to capitalizing on other develop-
ments in the field. For instance, Motorola
has incorporated Eichinger and Lombardo’s
(2000) For Your Improvement (FYI) devel-
opment guide in their e-learning system. Not
coincidentally, FYI is one of the few
resources that explicitly address how
strengths become weaknesses through
overuse. The HR/OD team at Motorola has
mapped the behaviors assessed by the 360°
onto the dimensions in FYI so feedback
recipients have, literally at their fingertips,
tips on what to do about skills they lack as
well as those they have overdeveloped.
In addition to developmental applica-
tions, measuring behavior in terms of too
little and too much adds to the tool’s predic-
tive power. The “do less/do more” ratings
furnish information that is distinct from that
provided by the effectiveness ratings.
“Calibration” is an annual process by which
managers at Motorola get together and
decide where each of their subordinates falls
out in a forced distribution—least effective,
solidly effective, or most effective. To determine
the value-added of the “do less/do
more” scale, we first used ratings on the evaluation
scale to predict calibration ranks and
then tested whether the “do less/do more”
ratings add to the tool’s ability to predict.
We’ve been doing this analysis every year
since 2000 and have found that the “do
less/do more” ratings increase how well
scores on the 360° predict calibration rankings
by at least 25 percent; one year, they
enhanced predictive power by 55 percent.

EXHIBIT 3
The Prescriptive “Do Less/Do More” Scale for Supplementing Evaluation Scales

Do a lot less   Do less   Do a little less   Do the same   Do a little more   Do more   Do a lot more
     -3           -2             -1               0               +1             +2           +3

Our statistical analyses also revealed that
the “do less/do more” ratings help primarily
by clarifying the low-to-middling ratings on
the evaluation scale. Perhaps an example
will illustrate this best: On the item “Holds
people accountable,” one manager received
an effectiveness rating of 3, and no one indi-
cated do more or do less; another manager
also received an effectiveness rating of 3, but
five coworkers indicated “do more.”
Clearly, the former manager is in better
shape than the latter. In this way the “do
less/do more” scale helps the supervisor as
well as the manager receiving feedback
determine what to work on.
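
The two-manager example above can be sketched in a few lines. Both managers have the same effectiveness rating of 3 on “Holds people accountable,” and only the “do less/do more” ratings tell them apart. The number of raters (seven) and the individual values are invented for illustration, and `prescription_summary` is a hypothetical helper name.

```python
def prescription_summary(ratings):
    """Count 'do less', 'do the same', and 'do more' ratings on the -3..+3 scale."""
    summary = {"do less": 0, "do the same": 0, "do more": 0}
    for rating in ratings:
        if not -3 <= rating <= 3:
            raise ValueError("rating must be between -3 and +3")
        if rating < 0:
            summary["do less"] += 1
        elif rating > 0:
            summary["do more"] += 1
        else:
            summary["do the same"] += 1
    return summary


# Hypothetical data: same effectiveness rating, different prescriptions.
manager_a = {"effectiveness": 3, "prescriptions": [0, 0, 0, 0, 0, 0, 0]}
manager_b = {"effectiveness": 3, "prescriptions": [+1, +2, +1, +1, +2, 0, 0]}

summary_a = prescription_summary(manager_a["prescriptions"])
summary_b = prescription_summary(manager_b["prescriptions"])
print("Manager A:", summary_a)
print("Manager B:", summary_b)
```

With these invented ratings, the first manager’s coworkers all choose “do the same,” while five of the second manager’s coworkers choose some degree of “do more,” which is the distinction the evaluation rating alone cannot show.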
Benefits to Researchers
Finally, we have discovered at least two
benefits of the new response scale for stu-
dents of management. First, precisely because
the new response format was designed on a
curvilinear principle, it helps in detecting
curvilinear relationships between managerial
behavior and various criteria. Not surpris-
ingly, we routinely detect curvilinear relation-
ships between measures of effectiveness and
leadership dimensions measured with our
evaluation-of-frequency scale.
A second benefit is that the new response
scale clears up an anomaly in the body of
research on opposites in leadership (e.g.,
task-oriented versus people-oriented). In
recent years interest has increased in the
paradoxes that confront modern managers
and, by extension, in the notion of manager-
ial flexibility or versatility (Kaiser, et al.,
2005). One would expect a negative correla-
tion between opposites like short-term
orientation versus long-term orientation,
competition versus collaboration, autocratic
versus participative, and so forth. That is, we
would expect that doing too much on one
side in each pair of opposites would corre-
spond to doing too little of the other side
or that being more skilled at one would
correspond with being less skilled at the
other (Kaplan, 1996; Kaplan & Kaiser,
2003b). The research literature is clear
on this point: When measured with a
traditional response scale, correlations
between ratings on these theoretical oppo-
sites are actually positive, often on the order
of .50 or so. When opposites are measured
using the “too little/too much” scale, a very
different pattern emerges: we find negative
correlations around -.50. How to account for
the wildly discrepant results? We think the
difference comes from the type of response
scale employed: Traditional scales only cover
half the story by stopping short of excess; by
not allowing for the possibility of overkill,
they therefore cannot detect lopsidedness.
This statistical finding is not just a
researcher’s concern: It is also relevant to
practice. The positive correlation found using
traditional response scales means that most
managers get feedback that says: “The more
skilled you are at this, the more skilled you
are at its opposite too.” The negative corre-
lation for ratings on the new scale means
these managers hear: “The more you overuse
this skill, the more likely you under-use the
complementary skill,” thus pulling the lop-
sidedness of their leadership into sharp relief.
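
The sign flip described above can be illustrated with a small, self-contained sketch. All ratings below are invented for illustration and are not drawn from our data; the labels “forceful” and “enabling” stand in for any pair of opposites, and `pearson` is a plain implementation of the product-moment correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical ratings for six managers on a traditional 1-to-5 scale.
traditional_forceful = [3, 4, 5, 4, 5, 3]
traditional_enabling = [3, 4, 4, 3, 5, 3]

# Hypothetical ratings for six managers on the -4..+4 "too little/too much" scale.
tltm_forceful = [+2, -1, +3, 0, -2, +1]
tltm_enabling = [-2, +1, -3, 0, +1, -1]

r_traditional = pearson(traditional_forceful, traditional_enabling)
r_tltm = pearson(tltm_forceful, tltm_enabling)
print("traditional scale r:", round(r_traditional, 2))
print("too little/too much r:", round(r_tltm, 2))
```

In this made-up sample the traditional ratings correlate positively and the “too little/too much” ratings correlate negatively. The size of each coefficient depends entirely on the invented numbers; only the direction is meant to mirror the pattern we report.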
Concerns and Further
Development
Here are the major concerns that have
occurred to us or that have been raised by
our colleagues.
Some Things Cannot Be Overdone
This is something we frequently hear,
particularly from scholarly researchers. For
instance, some people claim that you cannot
be too smart. And in an age in which vision-
ary leadership is all the rage, some have
argued that today’s leaders cannot be too
strategic. We disagree with these claims on
the grounds of research. For example, after
studying three different samples of managers,
Ghiselli (1963, p. 898) concluded: “…the
relationship between intelligence and man-
agerial success is curvilinear with those indi-
viduals earning both low and very high
scores being less likely to achieve success in
managerial positions.” Similarly, in the 360°
data we collect, leaders do get faulted by
their coworkers for being too strategic: Too
much time on strategic planning, grandiose
visions that defy implementation, pushing
growth too far and too fast, and so on. With
regard to the larger claim that some things
simply cannot be taken too far, that may be
true. Some experts question even this moderate
stance. For instance, McCall (1998,
pp. 25-29) took the opposite view in a section
of High Flyers titled: “Every Strength Can
Be a Weakness.”
We do not know for sure that all leadership
behaviors can be overdone, but clearly
many can. A key lesson we have learned in
using the new response format is that items
must be phrased in a way that helps the
respondent easily see what “too much” of
that behavior might look like. Using items
that are value-laden will not work. For
instance, “Effectively makes her point to a
resistant audience” will not work because
one cannot be too effective. But “Persists in
trying to persuade people” does admit to
overdoing.
Difficulty Creating Scale Scores
Another limitation involves the computation
of scale scores across several items rated
on the -4 to +4 scale. The problem occurs
when some items are in the negative, “too
little” region, but others are in the positive,
“too much” region. The net effect is for the
scores to cancel each other out and to dilute
the average, bringing it closer to zero,
the optimal score, than ought to be the case. We have
yet to discover a satisfying solution to this
arithmetic problem. We simply suggest
caution with scale scores, recognizing that no
measure is perfect. For now we regard the
dilution that occurs from the way that
positive ratings and negative ratings cancel
each other out as a cost of making room to
detect overkill.
Sometimes a Linear, Absolute Measure Is Needed
One of the strengths of the new response
format is that it takes context into account to
some degree. This is especially helpful in
development: The focus on using the data
is specific to one person. But in other applications,
particularly administrative uses of
ratings where data is used to compare
people, this can be a drawback. For instance,
some academics have questioned whether
it makes sense to compare ratings for two
different people on the new scale. As the
argument goes, if the scale does assume
a great deal of context, then scores
between people in different contexts (e.g.,
different jobs, different organizations) are
not comparable.
We take these concerns seriously and have
begun a study aimed at investigating them;
however, at this point we are relatively
confident that comparing ratings for two or
more people on the new scale makes some
sense. Our confidence comes from a simple
empirical fact: Our cross-sectional research
consistently yields sizable correlations
between behaviors measured on the new
scale and external criteria (e.g., leader effec-
tiveness, subordinate satisfaction). If
between-person comparisons were invalid,
these correlations would equal zero.
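
Returning to the dilution described above under “Difficulty Creating Scale Scores,” a few invented ratings show how cancellation arises when items on the -4 to +4 scale are averaged. The mean absolute rating reported next to the signed mean is an assumption of this sketch, offered only to make the cancellation visible. It is not a solution we have adopted, and `scale_scores` is a hypothetical helper name.

```python
def scale_scores(item_ratings):
    """Return (signed mean, mean absolute rating) for items on the -4..+4 scale."""
    n = len(item_ratings)
    signed_mean = sum(item_ratings) / n                  # what a simple scale score reports
    mean_absolute = sum(abs(r) for r in item_ratings) / n  # distance from optimal, sign ignored
    return signed_mean, mean_absolute


# Hypothetical manager rated "too little" on two items and "too much" on two others.
lopsided = [-3, +3, -2, +2]
# Hypothetical manager rated "the right amount" on every item.
on_target = [0, 0, 0, 0]

print("lopsided:", scale_scores(lopsided))
print("on target:", scale_scores(on_target))
```

Both managers end up with a signed mean of zero, the optimal score, even though the first is rated well off optimal on every item. This is the reason for caution with scale scores.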
No Direct Comparisons Between
Alternative Response Scales
Astute methodologists will note we have
made several direct conceptual comparisons
between the new response format and tradi-
tional response formats, yet have only made
indirect empirical comparisons. Many of our
claims remain hypotheses about how the two
methods would compare directly.
Specifically, what is needed is an experimen-
tal study with a controlled design that
involves having the same respondents rate
the same target manager on a set of dimen-
sions, once with the new scale and once with
a traditional scale. A study like this
could provide control adequate to rule out
competing explanations for the observed
results, and could isolate the effects of each
type of response scale. We currently have
such a study under way.
Concluding Thought
We are optimistic about this innovation in
response scale technology, but only cautious-
ly so. There is still much to learn about how
best to apply the new scale in practice and in
research. We encourage other independent
research teams to conduct their own studies
of the strengths and limitations of this new
format. To that end, we would gladly share
whatever materials and thoughts interested
parties may need to get started.
NOTES
1. Modern approaches to leadership development
usually recognize how strengths can
become weaknesses when overused. This
idea has been widely disseminated in the
work of M. Lombardo and M. McCall
(Lombardo & Eichinger, 2000; McCall,
1998; McCall & Lombardo, 1983). The
idea that excesses constitute just as important
a class of performance issues as deficiencies
is rarely reflected in the design of
standard assessment tools. When it is taken
into account, it tends to be treated as an
afterthought or as a supplemental feature
rather than as integral to the design of the
measure. See examples in Leslie and Fleenor
(1998).
2. Although there is more variance in an
absolute sense with our new scales, this is
something of a methodological artifact
because our scale has nine intervals and
typical scales have only five intervals. The
average SD on our scale is .82, which is
about .09 units on the native scale (.82/9).
Typically, performance ratings on five-point
scales have an SD around .50 (.10 units on
the native scale). Thus, there is relatively
less variance on our scale, controlling for
number of response options. There is more
variance in absolute terms, which may be
more important given the near-universal
practice of providing 360° results as raw
scores, on the original metric established by
the response scale (Leslie & Fleenor, 1998).
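
The arithmetic in Note 2 can be checked directly. The sketch below simply divides each standard deviation by the number of response options, as the note does; the function name `sd_per_option` is a label invented for this sketch.

```python
def sd_per_option(sd, n_options):
    """Standard deviation expressed per response option on the scale."""
    return sd / n_options


ours = sd_per_option(0.82, 9)     # -4..+4 scale, nine options
typical = sd_per_option(0.50, 5)  # 1..5 scale, five options

print("new scale:", round(ours, 2))
print("five-point scale:", round(typical, 2))
print("more spread in raw units on the new scale:", 0.82 > 0.50)
print("less spread per option on the new scale:", ours < typical)
```

The raw SD is larger on the nine-point scale, while the per-option figure is slightly smaller, which is the distinction the note draws.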
REFERENCES
Aiken, L.R. (1996). Rating Scales and Checklists: Evaluating Behaviors, Personality, and Attitudes. New York: John Wiley & Sons.
Aristotle (undated). Nicomachean Ethics. Translated by H. Rackham (1982). Cambridge, MA: Harvard University Press.
Eichinger, R.W. & Lombardo, M.M. (2000). For Your Improvement. Minneapolis, MN: Lominger Limited, Inc.
Ghiselli, E.E. (1963). “The Validity of Management Traits in Relation to Occupational Level.” Personnel Psychology, 16, 109-113.
Kaiser, R.B., Craig, S.B., Kaplan, R.E., & McArthur (2002). “Practical Science and the Development of Motorola’s Leadership Standards.” In K.B. Brookhouse (Chair) Transforming Leadership at Motorola. Practitioner Forum presented at the 17th Annual Conference of the Society for Industrial and Organizational Psychology, Toronto, Ontario.
Kaiser, R.B., Lindberg, J.T., & Kaplan, R.E. (2005). “Assessing the Flexibility of Managers with Coworker Ratings: A Comparison of Methods.” Manuscript under review.
Kaplan, R.E. (1996). Forceful Leadership and Enabling Leadership: You Can Do Both. Greensboro, NC: Center for Creative Leadership.
Kaplan, R.E. (1988). SKILLSCOPE® for Managers. Greensboro, NC: Center for Creative Leadership.
Kaplan, R.E. & Kaiser, R.B. (2003a). “Developing Versatile Leadership.” MIT Sloan Management Review, 44, 19-26.
Kaplan, R.E. & Kaiser, R.B. (2003b). “Rethinking a Classic Distinction in Leadership: Implications for the Assessment and Development of Executives.” Consulting Psychology Journal: Research and Practice, 55, 15-25.
LeBreton, J.M., Burgess, J.R.D., Kaiser, R.B., Atchley, E.K., & James, L.R. (2003). “The Restriction of Variance Hypothesis and Interrater Reliability and Agreement: Are Ratings from Multiple Sources Really Dissimilar?” Organizational Research Methods, 6, 78-126.
Leslie, J.B., & Fleenor, J.W. (1998). Feedback to Managers: A Review and Comparison of Multi-Rater Instruments for Management Development. Greensboro, NC: Center for Creative Leadership.
Likert, R. (1932). “A Technique for the Measurement of Attitudes.” Archives of Psychology, 140, 44-53.
Lombardo, M.M. & Eichinger, R.W. (2000). The Leadership Machine. Minneapolis, MN: Lominger Limited, Inc.
McCall, M.W. Jr. (1998). High Flyers: Developing the Next Generation of Leaders. Boston, MA: Harvard Business School Press.
McCall, M.W. Jr. & Lombardo, M.M. (1983). Off the Track: Why and How Successful Executives Get Derailed. Greensboro, NC: Center for Creative Leadership.
Murphy, K.R., & Cleveland, J.N. (1995). Understanding Performance Appraisal: Social, Organizational, and Goal-Based Perspectives. Thousand Oaks, CA: Sage.
Nathan, B.R., & Alexander, R.A. (1988). “A Comparison of Criteria for Test Validation.” Personnel Psychology, 41, 517-535.
Schwarz, N. (1999). “Self-Reports: How the Questions Shape the Answers.” American Psychologist, 54, 93-105.
Stockford, L. & Bissell, H.W. (1949). “Factors Involved in Establishing a Merit-Rating Scale.” Personnel, 26, 94-116.
HUMAN RESOURCE PLANNING 28.3 11
... The general pattern is that leader characteristics and behavioral styles have detrimental effects, not only when they are underdeveloped but also when they are taken too far. In contrast, a level of behavior between deficiency and excess is associated with the highest levels of leadership effectiveness (Kaiser & Kaplan, 2005a). ...
... Both types of misestimation have to do with excessive levels of an otherwise desirable behavior. First, high Likert scale scores may not differentiate between doing something "a lot and well" and doing 162 Chapter 4 it "too much" (Kaiser & Kaplan, 2005a;2005b). Consider for instance rating a leader's behavior with the item "Takes a methodical approach to getting things done" . ...
... Taken together, using Likert scales one could erroneously conclude that (a) the leader's standing on a particular behavior is high without Measuring curvilinear effects 163 making a differentiation between "a lot" (scenario 1) and "too much" (scenario 2) (Kaiser & Kaplan, 2005a); or that (b) the leader's standing on a particular behavior is low while it is actually extremely high (scenario 4) confounding with leaders who are actually low on that behavior (scenario 3) (cf. Carter et al., 2014). ...
... Second, actual participants could rate their virtue-relevant behavioral expression on both Likert-type and perceived-optimality rating scales in situ using the experience sampling method (Conner et al., 2009). Rooted in Aristotle's doctrine of the golden mean, the too little/too much response format (TLTM; Kaiser & Kaplan, 2005; Vergauwe, Wille, Hofmans, Kaiser, & Fruyt, 2017) presents a rating scale anchored from -4 ("much too little") to 4 ("much too much"), with 0 denoting optimality ("The right amount"). The TLTM response format seems particularly well suited for assessing constructs that are defined by optimality or adaptation to circumstances in engendering positive outcomes. ...
... For example, TLTM response format scores on flexible leadership predict theoretically-relevant criteria (i.e., leader effectiveness) more strongly than Likert response format scores (Kaiser, Lindberg, & Craig, 2007; Kaiser & Overfield, 2010). In contrast, the standard Likert response format asks respondents to describe how characteristic an item of a target is in terms of frequency (e.g., "never" to "always") or agreement (e.g., "not at all" to "a great extent"), representing a "less-to-more" approach to measurement (Kaiser & Kaplan, 2005). This response format does not differentiate between engaging in relevant thoughts, feelings, and behaviors a lot versus doing them too much, harboring the unstated assumption that a greater degree or extent of relevant behavioral expression is better (Kaiser & Kaplan, 2005). ...
... In contrast, the standard Likert response format asks respondents to describe how characteristic an item of a target is in terms of frequency (e.g., "never" to "always") or agreement (e.g., "not at all" to "a great extent"), representing a "less-to-more" approach to measurement (Kaiser & Kaplan, 2005). This response format does not differentiate between engaging in relevant thoughts, feelings, and behaviors a lot versus doing them too much, harboring the unstated assumption that a greater degree or extent of relevant behavioral expression is better (Kaiser & Kaplan, 2005). Yet, it has been shown that even when rating behavioral expressions of "positive" individual differences people can and do make this distinction when given the opportunity. ...
A seemingly universal lesson is that anything taken to its extreme is detrimental. Indeed, there has been growing interest in testing this idea within psychology. These studies have often been framed in terms of Aristotle's doctrine of the golden mean or the idea that virtue lies between the vices of deficiency and excess. Recent explicit reviews of this hypothesis in the psychological literature have led to the paradoxical conclusion that one can have too much virtue (i.e., the too-much-of-a-good-thing effect), despite virtue being identified by the golden mean. We argue in this paper that this conclusion is due to a reductionist account of virtues in psychology and the resultant measurement of virtues as general dispositional tendencies in behavior. We review philosophical theory on the golden mean to show that the relationship between virtue and relevant behavior is fundamentally about situation-specific optimality. Using schematic models, we contrast the former measurement approach against the latter to explain the too-much-of-a-good-thing effect and further demonstrate why virtues cannot be properly measured as general tendencies in behavior. We conclude with methodological implications of our theory-informed approach to virtue measurement for research design, evaluation, and conceptualization.
... This rating scale format is presented in Figure 1. It ranges from -4 (much too little), to 0 (the right amount), to +4 (much too much) and was specifically developed to measure leader behaviors from a multi-source perspective (Kaiser & Kaplan, 2005a; Kaiser, Overfield, & Kaplan, 2010; Vergauwe, Wille, Hofmans, Kaiser, & De Fruyt, 2017). The scale was originally designed as a way to identify strengths that become weaknesses through overuse, a key dynamic identified in the original derailment studies at the Center for Creative Leadership (McCall & Lombardo, 1983). ...
... An example might help illustrate the point. Early studies of how the TLTM scale functioned differently from typical Likert-type, five-point rating scales used protocol analysis: raters were asked to think out loud as they rated a leader they knew well twice on the same set of leader behaviors, once using a five-point Likert-type scale and again using the TLTM scale (Kaiser & Kaplan, 2005a). This allowed for the analysis of the cognitive processes involved in using each type of rating scale. ...
... Indeed, using the TLTM scale, the tradeoff seems to be less systematic control and explicit consideration of all possible situational variables but higher fidelity and relevance to the present situation, at least as socially constructed. In the event that contextual specification and explication is required, one might consider asking raters to expressly clarify the contextual information they took into account when rating the leader (Kaiser & Kaplan, 2005a). ...
In their focal article, Reynolds, McCauley, Tsacoumis, and the Jeanneret Symposium Participants (2018) stress the importance of context in leadership assessment. For instance, they argue that senior executives work in a different context compared to lower-level managers and that this should be taken into account. A simple example is that the competency of strategic thinking is critical for executive performance but much less so, if at all, for front-line supervisors. The claim that context matters in leadership and in the assessment of leaders is easy to grasp but difficult to apply in practice.
... The general pattern is that leader characteristics and behavioral styles have detrimental effects not only when they are underdeveloped but also when they are taken too far. In contrast, a level of behavior between deficiency and excess is associated with the highest levels of leadership effectiveness (Kaiser & Kaplan, 2005b). ...
... Both types of misestimation have to do with excessive levels of an otherwise desirable behavior. First, high Likert scale scores may not differentiate between doing something a lot and well and doing it too much (Kaiser & Kaplan, 2005a, 2005b). Consider for instance rating a leader's behavior with the item "Takes a methodical approach to getting things done" (Kaiser, Overfield, & Kaplan, 2010). ...
... Taken together, using Likert scales one could erroneously conclude that (a) the leader's standing on a particular behavior is high without making a differentiation between a lot (Scenario 1) and too much (Scenario 2) (Kaiser & Kaplan, 2005b) or that (b) the leader's standing on a particular behavior is low while it is actually extremely high (Scenario 4)-confounding with leaders who are actually low on that behavior (Scenario 3) (cf. Carter et al., 2014). ...
This article describes the too little/too much (TLTM) scale as an innovation in rating scale methodology that may facilitate research on the too-much-of-a-good-thing effect. Two studies demonstrate how this scale can improve the ability to detect curvilinear relationships in leadership research. In Study 1, leaders were rated twice on a set of leader behaviors: once using a traditional 5-point Likert scale and once using the TLTM scale, which ranged from –4 (much too little) through 0 (the right amount) to +4 (much too much). Only linear effects were observed for the Likert ratings, while the TLTM ratings demonstrated curvilinear, inverted U-shaped relationships with performance. Segmented regressions indicated that Likert ratings provided variance associated with the too little range of the TLTM scale but not in the too much range. Further, the TLTM ratings added incremental validity over Likert ratings, which was entirely due to variance from the too much range. Study 2 replicated these findings using a more fine-grained, 9-point Likert scale, ruling out differences in scale coarseness as an explanation for why the TLTM scale was better at detecting curvilinear effects.
... We then determined the format for measurement by adapting the Too Little/Too Much scale (TLTM scale; Kaiser & Kaplan, 2005) for the DLCS-SR. Items on this measure are scored bidirectionally from −4 (much too little) to +4 (much too much), with a score of 0 (the right amount) as a midpoint in the scale rating leadership behavior. ...
... Thus, lower absolute scores (closer to 0) reflected using a leadership behavior closer to the right amount, whereas higher absolute scores (farther from 0) reflected more lopsidedness (e.g., over- or underutilizing a behavior). We selected this response format because traditional Likert scales on leadership measures (e.g., 1-5 scales) may contain blind spots in assessing the extent to which a leader over- or underutilizes a given leadership approach (Kaiser & Kaplan, 2005). Hollenbeck, McCall, and Silzer (2006) noted that even leadership strengths can turn into weaknesses if used too often or not enough. ...
The authors developed the Dynamic Leadership in Counseling Scale–Self-Report (DLCS-SR) and tested for evidence of validity and internal consistency with a sample of 218 participants. They found evidence for a single-factor model of global leadership behaviors among counselors in the current sample as well as evidence for convergent validity and strong internal consistency. Implications for counseling leadership research and practice are discussed in light of the findings.
... Many studies have stated that there is little evidence to support the definitive number of appropriate points on the scale. For example, Kaiser and Kaplan [16] put forth a curvilinear scale but noted that contextual influences would make comparisons difficult. Jacoby and Matell [17] noted that the three-point Likert scale was acceptable. ...
The drive toward implementing an industrialized building system (IBS) in Malaysia is in line with Malaysia’s Construction Industry Transformation Plan 2016–2020, which seeks to more than double the construction industry’s productivity. IBS is able to accelerate the construction timeline, provide a safer working environment on site, produce a higher quality of construction, and save costs. Although the introduction of IBS in Malaysia is not new, its acceptance has not been extensive, and IBS implementation is still slow. Thus, to support the successful implementation of IBS, it is vital to determine the factors that influence the achievement of this aspiration. Therefore, this study aims to identify and evaluate the critical success factors (CSFs) that contribute to the smooth implementation of the IBS dimensions within the context of the Malaysian construction industry. By doing so, the uptake of IBS can be accelerated. In order to consolidate the set of candidate success factors, these CSFs were identified from the literature review and confirmed through a self-administered survey questionnaire. Then, the value of importance of each CSF was calculated in a second survey. Based on the factor analysis, 15 CSFs were identified and grouped into five major elements: strategy, sources of funding, process, people, and enabler, with each factor comprising its own set of components. The findings indicate that the CSFs in IBS implementation have different priorities and weights.
... According to Kaiser et al. (2015), although the concept of strengths overused is acknowledged, it is seldom applied in the measurement of leader behaviour as the standard method relies on Likert-type rating scales where higher scores indicate more frequent or more effective behaviour. According to Kaiser et al. (2015), this method confounds doing a lot with doing too much; it also blurs the distinction between deficiency and excess as two distinct sources of ineffectiveness (Kaiser & Kaplan, 2005). Thus, Kaiser et al. (2015) proposed that this may be one reason why leadership research on dark side traits has produced inconsistent findings. ...
The new concept of “interpersonal pollution” is investigated, together with its antecedents and its effects on organizational members’ health and well-being and on organizational outcomes. Building upon this work, this presentation proposes a model and tentative definition of a broader construct, “organizational pollution”, identifies its potential antecedents, and explores its impact on humans’ health and well-being and on organizational outcomes. In particular, our model explores the roles played by leaders’ and members’ dark personalities and lack of environmental concern, by unethical leadership, and by the characteristics of both the community and the organization, including the latter’s physical and ethical environment, and finally their link to organizational pollution. The implications of this new model for organizational and environmental psychology are discussed.
Recent advances in personality theory and research have led to the introduction of the “too-much-of-a-good-thing effect” in the relationship between conscientiousness and desirable outcomes, challenging the “more is better” idea that has long dominated research on this trait. Thus, the question arises as to how people evaluate their conscientiousness levels themselves, more specifically, whether they regard their trait levels as “too little”, “the right amount”, or “too much”. The current study describes how an existing personality inventory can be adjusted to explore such evaluations of conscientiousness levels by incorporating a too little/too much response format. The structural characteristics of this new assessment approach are examined and compared against responses that are collected using a traditional Likert rating format asking people to describe themselves. Results show that, in this sample (N = 367), about 11% of participants evaluated their conscientiousness as adequate, whereas the majority (75%) indicated it to be too high. Further, the “right amount” of conscientiousness was most frequently associated with a 7 on a 9-point Likert scale, while very high Likert-scale ratings of 9 were regarded as “too much” in over three-fourths of the ratings. Implications and directions for future research are discussed.
What are your strengths and weaknesses? You may know this question from job interviews that you have conducted or taken part in. It comes up in almost every interview, and everyone knows it. A truly reflective and personal answer, however, is rare. In most cases, preparation for this question is limited to working out certain strengths and weaknesses, matched against the desired requirements profile for the position.
The fundamental assumption underlying the use of 360-degree assessments is that ratings from different sources provide unique and meaningful information about the target manager’s performance. Extant research appears to support this assumption by demonstrating low correlations between rating sources. This article reexamines the support of this assumption, suggesting that past research has been distorted by a statistical artifact—restriction of variance in job performance. This artifact reduces the amount of between-target variance in ratings and attenuates traditional correlation-based estimates of rating similarity. Results obtained from a Monte Carlo simulation and two field studies support this restriction of variance hypothesis. Noncorrelation-based methods of assessing interrater agreement indicated that agreement between sources was about as high as agreement within sources. Thus, different sources did not appear to be furnishing substantially unique information. The authors conclude by questioning common practices in 360-degree assessments and offering suggestions for future research and application.
The authors present a new way of construing the classic distinction between self-assertive, task-oriented leadership and empowering, people-oriented leadership. These twin pillars--what they call forceful and enabling, respectively--are portrayed as a duality, a pair of seemingly contradictory yet in fact complementary leadership "virtues." The authors also describe a new approach to measuring this duality. Data collected in this way reflect the clear tendency for managers to be lopsided--to overdo one side and to underdo the other. There is also a strong statistical association between lopsidedness--or, stated positively, versatility--and overall effectiveness. This linked way of formulating and measuring leadership in terms of dualities is very useful in giving feedback to executives and in guiding their development. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Part 1: When talent isn't enough. Of astronauts and executives; the derailment conspiracy. Part 2: Developing executive talent. Experience as teacher; linking business strategy and executive development; assessing potential - is talent what is, or what could be?; who gets what job - the heart of development; catalyst for development. Part 3: Taking action. Making executive development a strategic advantage; taking charge of your own development.
This volume describes 24 publicly available multiple-perspective management-assessment instruments that relate self-view to the views of others on multiple management and leadership domains. Each instrument also includes an assessment-for-development focus that scales managers along a continuum of psychometric properties, and "best practices" for management development. The instruments reviewed are: (1) "Benchmarks"; (2) "Campbell Leadership Index" (CLI); (3) "COMPASS: The Managerial Practices Survey"; (4) "Executive Success Profile" (ESP); (5) "Survey of Executive Leadership" (EXEC); (6) "Leader Behavior Analysis II" (LBAII); (7) "The Visionary Leader: Leader Behavior Questionnaire" (LBQ); (8) "Leadership Effectiveness Analysis" (LEA); (9) "Acumen Leadership Skills" (LEADERSHIP SKILLS); (10) "Leadership/Impact" (L/I); (11) "Leadership Practices Inventory" (LPI); (12) "Life Styles Inventory" (LSI); (13) "MANAGER VIEW/360"; (14) "Matrix: The Influence Behavior Questionnaire" (MATRIX); (15) "Management Effectiveness Profile System" (MEPS); (16) "Multifactor Leadership Questionnaire" (MLQ); (17) "The PROFILER"; (18) "PROSPECTOR"; (19) "Survey of Leadership Practices" (SLP); (20) "The Survey of Management Practices" (SMP); (21) "System for the Multiple Level Observation of Groups" (SYMLOG); (22) "Types of Work Index" (TWI); (23) "VOICES"; and (24) "Acumen Leadership Work Styles" (WORKSTYLES). Three aspects are described for each instrument: (1) descriptive: author, vendor, copyright date, purpose, target audience, cost, scoring and certification procedures, duration, format, and raters; (2) research: origins, scales, scale definitions, samples, cautionary statement, and instrument reports; and (3) training: sample instrument, sample feedback report, and training materials. (RIB)
A step-by-step account is given of significant findings in a series of statistical studies made at Lockheed Aircraft Corporation to determine the degree to which certain weaknesses inherent in the ratings obtained on the existing merit-rating scale could be reduced or overcome by designing a new scale and by training supervisors in the principles and techniques of rating. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
"Rating Scales and Checklists" is [a] guide to constructing, scoring, validating, and applying these . . . investigative and diagnostic tools. It provides . . . insights into the theoretical/psychometric aspects of measurement and scaling, as well as . . . guidelines for test construction and administration in a wide range of research and applied situations. In addition, the enclosed DOS-formatted computer diskette contains several dozen programs concerned with the construction, analysis, and applications of checklists, rating scales, attitude scales, and other psychometric instruments accompanying the text. [This book is intended] for practitioners in the behavioral and social sciences as well as for market research professionals, attitude and product researchers, and political pollsters. It is also [a] supplemental text for upper level courses in psychology, education, sociology, political science, and other related disciplines. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Meta-analyses of validity coefficients from tests of clerical abilities for five criteria—supervisor ratings, supervisor rankings, work samples, production quantity, and production quality—were conducted, and the resulting expected true validities were compared. Ratings, rankings, work samples, and production quantity all resulted in high test validities. Validities resulting from ratings and quantity-of-production criteria were highly similar across tests. Validities resulting from rankings and work samples were on the average higher than those from ratings and quantity of production. The fifth criterion, quality of production, had low predictability and did not generalize across situations.
Measures of intelligence, supervisory ability, initiative, self-assurance, and perceived occupational level were obtained on eleven groups of individuals in various jobs ranging from line to upper management positions. It was found that the higher the level of the job the higher the score on the five tests and the higher the validity of the tests. It was concluded that apparently these traits identify the individuals who seek or are placed in higher positions, and that the higher the position the more critical these traits are in determining job success.
The project conceived in 1929 by Gardner Murphy and the writer aimed first to present a wide array of problems having to do with five major "attitude areas"--international relations, race relations, economic conflict, political conflict, and religion. The kind of questionnaire material falls into four classes: yes-no, multiple choice, propositions to be responded to by degrees of approval, and a series of brief newspaper narratives to be approved or disapproved in various degrees. The monograph aims to describe a technique rather than to give results. The appendix, covering ten pages, shows the method of constructing an attitude scale. A bibliography is also given.