
Gender composition and group confidence judgment: The perils of all-male groups

Running Head: GENDER COMPOSITION AND CONFIDENCE CALIBRATION 1
Steffen Keck
University of Vienna
Wenjie Tang
IE Business School
November 23, 2015
Author Note
Correspondence concerning this article should be addressed to: Steffen Keck, Department of
Business Administration, Strategic Management Subject Area, University of Vienna, Oskar-Morgenstern
Platz 1, 1090 Vienna, Austria; Phone: +43 1 42 77 37970; Email: Steffen.Keck@univie.ac.at
Wenjie Tang, Department of Operations and Technology, IE Business School, Calle de Maria de
Molina, 12, piso 5, 28006 Madrid, Spain. Tel: +34 915 68 96 00, email: wenjie.tang@ie.edu
Acknowledgments: We thank Linda Babcock, David Budescu, Natalia Karelaia, and Ilia Tsetlin for
helpful comments and the IE Business School Foundation for financial support.
Abstract
We explore the joint effects of group decision making and a group’s gender composition on the calibration
of confidence judgments. Participants in a laboratory experiment, individually and in groups of three,
stated confidence intervals for answers to general-knowledge questions and in two types of forecasting
tasks. Our results reveal that groups with at least one female member are significantly better calibrated
than all-male groups. This effect is mediated by the extent to which group members share opinions and
information during the group discussion. Moreover, we found that compared to a statistical aggregation of
individual confidence intervals, group discussions have a mostly positive effect on judgment calibration
for groups with at least one female group member, but actually harm calibration for all-male groups.
Overall, our findings demonstrate that even the inclusion of just one female member into a group can have
strong effects on the quality of group judgments.
Keywords: Overconfidence, group judgments, gender diversity, group deliberation
1. Introduction
Decision makers in organizations frequently need to cope with severe uncertainty. In situations like these,
adequate levels of confidence may be just as important for organizational performance as the quality of
the decisions themselves (e.g., Sniezek, 1992; Sniezek & Henry, 1989). However, a large number of
studies have shown that individuals’ levels of confidence are in fact not well calibrated; instead most
people are systematically overconfident, that is, they hold an excessive certainty concerning the
correctness and precision of their beliefs, judgments, and forecasts (e.g., Alpert & Raiffa, 1982; Lichtenstein
& Fischhoff, 1977; Russo & Schoemaker, 1992; Soll & Klayman, 2004). Such miscalibration has been
shown to have an important effect on decision making in organizations. For example, overconfident
investors take too many risks, earn lower average returns and under-diversify their portfolios (Barber &
Odean, 2000, 2002; Deaves, Lüders, & Luo, 2008; Goetzmann & Kumar, 2008). Moreover, results by
Ben-David, Graham, & Harvey (2013) showed that firms with overconfident Chief Financial Officers
pursue more aggressive corporate policies such as larger investments and debt levels, which exposes their
companies to potentially excessive risk. Confidence judgments are also widely used in decision analysis,
and therefore overconfidence in managers' judgments can have serious consequences for the quality of
decisions that are based on this methodology (Clemen, 1996).
In the last several decades, following an ongoing shift from organizing work around individual
jobs to team-based work structures, a large number of important judgments and decisions in organizations
are now made by groups rather than by individuals (e.g., Ilgen, 1999; Kozlowski & Bell, 2003).
Furthermore, due to the increasing diversity in the workforce as a whole, teams in organizations are not
only becoming more important, but have also become more diverse in terms of demographic categories
such as gender, age, and ethnicity (e.g., Jackson, 1992; Triandis, Kurowski, & Gelfand, 1994). In
particular, while all-male groups are still a ubiquitous phenomenon in many areas such as upper
management, boards of directors, the financial sector, or in certain areas of academia, mixed gender groups
have become more and more common in the workplace (e.g., Heilman, 2012).
Even though a substantial number of studies have explored the effects of gender diversity on group
performance in a variety of settings, such as small work teams in organizations (e.g., Jehn, Northcraft, &
Neale, 1999; Wegge, Roth, Neubach, Schmidt, & Kanfer, 2008), top management teams (e.g., Dezsö &
Ross, 2012; Krishnan & Park, 2005), boards of directors (e.g., Adams & Ferreira, 2009), or student
competitions (Apesteguia, Azmat & Iriberri, 2012; Hoogendoorn, Oosterbeek, & van Praag, 2013), the
potential effect of gender diversity on susceptibility to cognitive biases such as overconfidence has
remained unexplored. In this paper we aim to address this question. Importantly, in contrast to
previous research on overconfidence in group judgments, which focused solely on a direct comparison
between individuals and groups (Plous, 1995; Russo & Schoemaker, 1992; Sniezek & Henry, 1989), we
focus on comparing groups with different gender compositions, that is, different relative proportions of
male and female members in the group. In particular, we explore the important link between (a) gender
composition, (b) opinion and information sharing between group members during group deliberations, and
(c) confidence calibration. Moreover, our study aims to provide insights into another question that has
remained mostly unclear in prior research (see for example Sniezek, 1992; Plous, 1995): under what
circumstances are group deliberations a remedy against overconfidence and when might they be
ineffective or even exacerbate the problem.
2. Literature Review and Hypotheses Development
Confidence calibration in individual judgments
A widely used method to assess confidence calibration is to ask participants for subjective confidence
intervals for a number of unknown values. Miscalibration is then defined as the difference between the
confidence level and the ratio of the number of times that the true value falls inside the confidence interval
over the total number of questions, where a ratio lower (higher) than the confidence level indicates
overconfidence (underconfidence). The most common finding in this paradigm is overconfidence. Even
though the degree of observed overconfidence varies depending on the precise nature of the task at hand
and the level of confidence which individuals are asked to state (for example 90% versus 50% or 70%),
overconfidence has been demonstrated for estimates in a variety of domains such as answers to general
knowledge questions (e.g., Alpert & Raiffa, 1982; Klayman, Soll, Gonzalez-Vallejo, & Barlas, 1999;
Soll & Klayman, 2004), forecasts of stock prices (Budescu & Du, 2007), or outcomes of sport games
(Tsai, Klayman, & Hastie, 2008). Moreover, overconfidence is not limited to estimates made by students
in laboratory experiments, but has also been found frequently in the judgments of professionals such as
financial traders (Glaser, Langer, & Weber, 2013), stock market analysts (Deaves, Lüders, & Schröder,
2010; Jain, Mukherjee, Bearden, & Gaba, 2013), general managers (Russo & Schoemaker, 1992), and
Chief Financial Officers (Ben-David et al., 2013). In contrast, systematic underconfidence has only very
rarely been observed (Moore & Healy, 2008).[1]
Calibration can be affected by both accuracy and interval widths. For example, decision makers
might make relatively accurate judgments, but the estimates could still be badly calibrated if the
confidence intervals are set to be very narrow. Conversely, a decision maker could be very inaccurate but
still achieve good calibration by setting wide enough confidence intervals.
Group deliberation and confidence calibration
The general finding from the comparison between individuals and groups with respect to confidence
calibration is that groups are better calibrated than individuals (Plous, 1995; Russo & Schoemaker, 1992;
Sniezek & Henry, 1989). However, it appears that such improved calibration in groups is mostly driven
by higher accuracy and not by groups' greater appreciation of their own limited knowledge (Sniezek,
1992; Plous, 1995). In particular, whereas group judgments in previous studies tended to be more accurate
than those of individuals, the confidence intervals set by groups were not wider or in many cases even
narrower than those set by individuals.
[1] For other forms of miscalibration, which we will not consider here, such as misestimation of absolute or relative
performance, underconfidence is also often found (see, e.g., Moore & Healy, 2008).
In principle, groups have access to a larger and more diverse pool of information than individuals
do (e.g., Hinsz, Tindale, & Vollrath, 1997; Levine & Smith, 2013), and group members can exchange
arguments in favor of or against a certain position, which should help to better assess the degree of
uncertainty in their judgments (Sniezek & Henry, 1989; Sniezek, 1992). In particular, disagreements
between group members should make group members less confident about their judgments; on the other
hand, strong agreement between the members should indicate that there is indeed good reason to be
confident about an answer. Thus if all agreements and disagreements between group members are shared
openly and in an unbiased manner, groups should be able to provide better calibrated confidence
statements. This is also consistent with the explanation for overconfidence in individual judgments by
Tversky and Kahneman (1974) who attribute overconfidence to a process of anchoring on an initial
judgment and not adjusting sufficiently in setting the limits of the confidence interval. Following their
theory, if group members openly share their opinions and available information, a group will have several
judgments provided by group members to serve as anchors and thus confidence intervals should be less
susceptible to insufficient adjustment.
However, there are several important reasons why groups might not be able to take advantage of
the diverse opinions and information present in the group and thus fail to improve their confidence
calibration. First of all, group members are subject to the desire for social acceptance and being liked
(Deutsch, 1949; Schachter, 1959); they therefore tend not to voice dissenting opinions and judgments so as to
avoid conflict (e.g., Asch, 1952; Nemeth, 1986; Schachter & Singer, 1962). Similarly, research on
groupthink (Esser, 1998; Janis, 1982) suggests that the desire to preserve harmony within a group can
override the motivation to freely share information, especially when such information contradicts the
opinions of other group members. Finally, even in the absence of a desire for social acceptance and group
harmony, a group member might simply fail to contribute his or her private information because groups
tend to strongly focus their discussion on information that is available to all group members before the
discussion started (see for example Stasser, 1992; Wittenbaum & Stasser, 1996).
As a consequence of these processes, group judgments are frequently based on only a small subset
of all the opinions and information that could theoretically be shared by group members. This
might be particularly harmful for confidence calibration because the influence of a particular group
member on group judgment is often strongly linked to his or her individual level of confidence (e.g.,
Zarnoth & Sniezek, 1997), and thus overconfident members are likely to have the most influence on group
judgments. For example, Anderson, Brion, Moore, and Kennedy (2012) demonstrated that overconfident
individuals were perceived by other group members as more competent, and in turn were awarded higher
status and larger influence in the group. This effect even held when group members learned about the
overconfident individuals' true competence (Kennedy, Anderson, & Moore, 2013). Therefore, when
groups rely only on information and opinions provided by a small subset of group members, those
members with the highest degree of individual overconfidence are likely to have the most influence on
group judgments and thus drive group confidence to an unwarranted level.
Whereas sharing of opinions and information is likely to cause groups to make better calibrated
confidence judgments, it is less clear if it will also lead to judgments that are more accurate. In general,
group deliberations have the greatest positive effect on judgment accuracy for tasks with solutions that can
be easily demonstrated to be correct to others once all information is available such as mathematical
problems (Laughlin & Ellis, 1986). In this case, a group member who knows the correct answer can
persuade others and the group will usually perform at the level of its best member or even above (e.g.,
Laughlin & Ellis, 1986; Laughlin, Bonner, & Miner, 2002). However, for tasks involving the estimation of
unknown values (like those in our study) in which the correct solution cannot be easily demonstrated
to others, even if all available information is shared during a group discussion, group members might not
be able to take advantage of this information and improve accuracy beyond what would be expected from
a simple statistical aggregation of individual judgments. Consistent with this logic, for these types of
estimation tasks, studies found that group judgments with deliberations tend to be more accurate than
individual judgments, but only similarly accurate as a statistical combination of group members’
individual judgments (see for example, Sniezek, 1990; Tindale & Larson, 1992; Gigone & Hastie, 1997).
In summary, in our study that focuses only on estimation tasks, we do not necessarily expect a significant
effect of information and opinion sharing during group deliberations on judgment accuracy.
The effects of gender composition on group deliberations
Existing research found mixed results concerning the effect of gender on miscalibration among individual
decision makers: Whereas Soll and Klayman (2004) reported that women provided wider confidence
intervals than men and were better calibrated, other studies did not find an effect of gender on interval
widths or calibration (Biais, Hilton, & Mazurier, 2005; Jonsson & Allwood, 2003). In general, it is
possible that due to individual differences, groups with a higher proportion of female members display
lower levels of overconfidence. Importantly, we suggest that independently of such a possible effect, the
presence of female group members will strongly affect to what extent group members share opinions and
knowledge with each other and consequently their confidence calibration.
There is strong evidence that the presence of female group members has a significant impact on
the way group members interact with each other compared to groups composed of only men (for an
overview see for example Bear & Woolley, 2011). In general, women exhibit higher levels of
interpersonal sensitivity, i.e., they pay more attention and show more respect to other people's feelings
and thoughts (cf. Hall, 1978; Fletcher, 1998; McClure, 2000). As a consequence, women are for example
less likely than men to obtrusively interrupt others during group discussions (e.g., Anderson & Leaper,
1998; Smith-Lovin & Brody, 1989). Compared to all-male groups, groups with female members also
display more egalitarian behaviors, such as equal amounts of communication among group members and
shared leadership (Berdahl & Anderson, 2005; Mast, 2001). Similarly, Woolley, Chabris, Pentland,
Hashmi, and Malone (2010) found that a higher share of female group members caused group discussions
to become less centered on only a few dominant group members, which enabled all group members to
participate more equally in the group discussion. Their results also confirmed that this effect was indeed
strongly linked to female group members' higher level of interpersonal sensitivity.
In addition to having a direct impact on group interactions due to their own behavior, the presence
of female group members also affects the way male members behave and interact with other group
members. In a study on boards of directors in large US companies, Adams and Ferreira (2009) found that
female directors attended board meetings more often than male directors, and importantly also that the
presence of female directors on a board improved the attendance of male directors compared to boards
consisting of only male members. Thus, the presence of female group members appeared to have a
positive impact on the motivation and behavior of male members and to have shifted generally accepted
group norms concerning participation in board meetings away from those in all-male groups. In addition
to increasing job engagement, the presence of female group members also causes men to be more attuned
to other group members during group interactions and to show higher levels of interpersonal sensitivity
themselves (Williams & Polman, 2014). Importantly, this increased sensitivity of male group members is
not limited to interactions with female group members, but extends to interactions with other male
members. This observed shift towards more group-oriented norms is also in line with prior experimental
findings on the general tendency of men to behave in a more group-oriented manner in the presence of women. For
example, Van Vugt and Iredale (2013) reported that men contributed more to the overall welfare of their
groups in a public goods game in the presence of a female observer. Similarly, Boschini, Muren, and
Persson (2011) demonstrated that members of all-male teams were less willing to engage in costly efforts
to uphold cooperation within the group than were male members of gender-diverse groups.
Overall these findings suggest that the presence of women within a group causes a mental shift in
all group members toward more group-oriented norms. Such a positive group-oriented and
psychologically safe atmosphere in groups is a crucial factor that promotes the sharing of knowledge and
the expression of opinions especially when group members are in disagreement with each other
(Edmondson, 1999; Hackman, 1987; McLeod, Baron, Marti, & Yoon, 1997). In particular, group
members in such an environment will be less concerned that voicing disagreements or bringing up new
pieces of information might damage social harmony or might cause them to be evaluated negatively by
others, and thus focus more on sharing information and listening to the opinions of others.
We suggest that as a consequence of these processes, the presence of female group members, due
to their own behavior as well as their impact on the behavior of male group members, will cause group
members to share more opinions and information with each other during the group discussion.
Hypothesis 1: Group members in groups with at least one female member are more willing to
share opinions and information than those in all-male groups.
Note that we did not specify the exact relationship between the degree to which group members
share opinions during the group interaction and the proportion of female members in a group. One
conclusion one might draw from the stream of work mentioned above is that this relationship is strictly
positive, and consequently all-female groups would perform the best in this regard. However, as our
previous discussion has pointed out, due to its influence on male group members' behavior, the presence
of even only one female group member might already be enough to move the norms of the group
discussion towards higher inclusiveness of all group members' opinions. It is thus unclear whether this
effect can still get stronger if there is more than one female member in the group. We therefore refrain
from making a direct prediction concerning this factor.
Because, as discussed above, sharing of opinions and information during the group discussion
should have a direct effect on groups' confidence calibration, we make the following predictions
concerning the effect of the presence of at least one female group member on confidence calibration:
Hypothesis 2a: Groups with at least one female group member will make better calibrated
confidence judgments than those consisting of only male members.
Hypothesis 2b: Better calibration in groups with at least one female member will be mediated by
group members' stronger willingness to share opinions and information.
In the following we present the design and the results of an experimental study in which
participants made judgments in the domain of general-knowledge questions and two different types of
forecasts. The paper concludes with a general discussion of our key findings and their implications.
3. Experiment Design
3.1. Methodology
We recruited 352 English-speaking participants (180 male, 172 female; mean age = 23 years) from a major
European university via an online sign-up system. We conducted a total of 14 experimental sessions with
approximately 25 participants in each session. Participants were paid a fixed fee of 10 for their
participation. The study had a between-subject design with four conditions in which participants were
assigned to groups of three and in which we varied the groups' gender composition in the following way:
all male (n = 26), male majority (n = 25), female majority (n = 23), and all female (n = 25). Participants in
these conditions made their judgments after an unstructured face-to-face discussion. All verbal
interactions between the three group members were audio-taped with the explicit knowledge of the
participants. As an additional control, we also included an individual-judgment condition (with 28 male
and 27 female participants), in which participants made their judgments alone without any interaction with
other participants.
3.2. Procedure
Participants were welcomed to the lab and then assigned to a computer. In the individual condition, each
participant was randomly seated in front of one computer. In the group condition, all three group members
were seated in front of the same computer. In each group, one of the group members was chosen to enter
the group judgments into the computer based on whose birthday was closest to the date of the experiment.[2]
Assignment to groups was random with the exception that we made sure that across all sessions there were
approximately the same number of groups for each of the four types of gender composition and that
members of any group did not know each other before the study. After participants were seated they were
asked to read the instructions on their screen. In addition, participants were also provided with paper-
based versions of the instructions.
[2] We did not find any effects arising from the gender of the person who entered the decisions in any of the
conditions.
We assigned two sets of items to the participants: ten general-knowledge questions (e.g., Biais et
al., 2005; Cesarini, Sandewall, & Johannesson, 2006; Juslin, Winman, & Hansson, 2007; Klayman et al.,
1999; Russo & Schoemaker, 1992; Soll & Klayman, 2004), and three questions on forecasts. Four of the
general-knowledge questions asked about distances between cities (Berlin to Vienna, Cairo to Cape Town,
Los Angeles to Tokyo, and Paris to Moscow), three about weights (a baby elephant
born recently in the Vienna Zoo, an empty Airbus A380, and an empty Opel Astra limousine), and
three about the prices of products (an Apple laptop with a number of specific features, an economy-class
air ticket from Vienna to New York booked in the next week, and a Mercedes S-Class bought in Austria
in the most basic version). Questions were presented in random order. Participants were asked to state
point estimates (their best guesses) as well as upper and lower bounds of 50%, 70%, and 90% confidence
interval estimates.
In the first two forecasting questions, participants were asked to provide 90% confidence intervals
for the price of the Dow Jones Index and of Microsoft shares in one month, six months, and one year
respectively. In order not to overburden participants we only asked for 90% intervals for this task. To help
with their forecasts we informed participants of the value of the Dow Jones Index and Microsoft shares at
the date of the experiment.
The third forecasting task was a random walk task that we adapted from Jain et al. (2013). We
provided participants with a description of a random variable over time and asked them to provide 50%,
70%, and 90% confidence intervals for the value of this variable after 100 periods. The initial value of the
variable was zero and in each period there was an equal chance of either a one-unit increase or a one-unit
decrease. Hence the expected value of this variable is zero and its variance is simply the total number of
periods, $T = 100$. In addition, for a large number of periods, the distribution of the variable can be
approximated by a normal distribution with mean 0 and variance $T$. Therefore, the theoretical 50%, 70%, and
90% confidence intervals for this variable are, respectively, $(-0.674\sqrt{T}, 0.674\sqrt{T})$,
$(-1.036\sqrt{T}, 1.036\sqrt{T})$, and $(-1.645\sqrt{T}, 1.645\sqrt{T})$, that is, $(-6.74, 6.74)$,
$(-10.36, 10.36)$, and $(-16.45, 16.45)$.
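The benchmark intervals for the random walk task can be checked numerically. The following is a minimal sketch, assuming the normal approximation described above with a standard deviation equal to the square root of the number of periods (10 for 100 periods); the function name `random_walk_interval` is illustrative and not part of the study materials. The multipliers 0.674, 1.036, and 1.645 are the standard normal quantiles for central 50%, 70%, and 90% intervals.

```python
import math
from statistics import NormalDist


def random_walk_interval(confidence, periods=100):
    """Central interval for a symmetric +1/-1 random walk (normal approximation)."""
    sigma = math.sqrt(periods)  # variance of the walk equals the number of periods
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided standard normal quantile
    return (-z * sigma, z * sigma)


for level in (0.50, 0.70, 0.90):
    low, high = random_walk_interval(level)
    print(f"{level:.0%}: ({low:.2f}, {high:.2f})")
```

A participant whose stated interval is narrower than the corresponding benchmark would count as overconfident on this task.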
After finishing the ten general-knowledge and the three forecasting questions, all participants
were asked to move to individual desks and individually fill in a final paper-based questionnaire with
demographic information. In addition, participants in the group conditions answered questions that aimed
to assess group members' satisfaction with their groups and to what extent group members shared
opinions and information during the discussion. In all conditions, at the very end of the questionnaire
participants were asked what they believed to be the purpose of the experiment. Only four participants
correctly guessed that gender was a factor in our study. Removing these participants from the sample did
not change any of our results.
3.3. Measures
Based on participants' point and interval estimates, we constructed several measures that capture
the quality of their judgments. We first define the notation. Let $X_i$ denote the quantity
that question $i$ asks for (e.g., the weight of an empty A380 in the general-knowledge questions or the price
of Microsoft shares in 6 months). Since the participants do not know the answer for sure, $X_i$ is a random
variable to them, and its realization, denoted as $x_i$, stands for either the correct answer to a
general-knowledge question or the actual value of a forecast item. We let $\hat{x}_{ij}$ denote the point
estimate of decision maker $j$ (either an individual or a group) for question $i$. Moreover, $l_{ijk}$ and
$u_{ijk}$ denote, respectively, the lower bound and upper bound of decision maker $j$'s confidence interval
estimate at confidence level $k$ for question $i$.
Judgment calibration. For the general-knowledge questions, we used both hit rate and calibration error
to measure the calibration of participants' estimates. The hit rate for a decision maker $j$ at confidence level
$k$ was computed as the proportion of the ten questions for which the true value was within the confidence
interval:
$$H_{jk} = \frac{1}{10} \sum_{i=1}^{10} \mathbb{1}(l_{ijk} \le x_i \le u_{ijk}),$$
and calibration error was calculated as the absolute difference between the hit rate and the required
confidence level:
$$CE_{jk} = |H_{jk} - k|;$$
here $i = 1, \ldots, 10$, $j = 1, \ldots, N$ ($N$ being the total number of decision makers in each condition),
$k \in \{50\%, 70\%, 90\%\}$, and $\mathbb{1}(\cdot)$ is an
indicator function that equals 1 if the condition is satisfied, and 0 otherwise. A decision maker is
considered to be perfectly calibrated, underconfident, or overconfident when the hit rate $H_{jk}$ for a given
confidence level equals, is greater than, or is less than the corresponding confidence level $k$,
respectively, and the calibration error captures the decision maker's degree of miscalibration: the larger
the calibration error, the less calibrated is the decision maker's judgment.
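The hit rate and calibration error can be illustrated with a short computation. This is a minimal sketch under stated assumptions: the function names `hit_rate` and `calibration_error` and all numbers are hypothetical and chosen only for illustration; they are not data from the experiment. The hit rate is computed as the share of questions whose true value falls inside the stated interval, and the calibration error as its absolute distance from the required confidence level.

```python
def hit_rate(lower_bounds, upper_bounds, true_values):
    """Share of questions whose true value lies inside the stated interval."""
    hits = sum(
        1
        for low, high, truth in zip(lower_bounds, upper_bounds, true_values)
        if low <= truth <= high
    )
    return hits / len(true_values)


def calibration_error(rate, confidence_level):
    """Absolute gap between the hit rate and the required confidence level."""
    return abs(rate - confidence_level)


# Hypothetical 90% intervals for four questions (illustrative numbers only).
true_values = [100, 200, 300, 400]
lower_bounds = [90, 150, 310, 380]
upper_bounds = [120, 190, 350, 420]

rate = hit_rate(lower_bounds, upper_bounds, true_values)
error = calibration_error(rate, 0.90)
print(rate, round(error, 2))  # a hit rate below 0.90 indicates overconfidence
```

In this made-up example only two of the four intervals contain the true value, so the decision maker would be classified as overconfident at the 90% level.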
For the two financial forecast tasks, following the prior literature on stock market forecasts (e.g.,
Ben-David et al., 2013; Glaser et al., 2013), we derived return volatility estimates from decision makers'
confidence interval estimates and used the mean historical return volatility of the Dow Jones and
Microsoft share prices as a normative benchmark. To do so, for each item $i$ (either the stock price or the
index value), we first transformed decision maker $j$'s stated 90% confidence interval, with lower bound
$l_{ij}$ and upper bound $u_{ij}$, into an interval for returns by dividing the interval's upper and lower
bounds by the corresponding item's value on the day of the experiment, $p_i$:
$r^{l}_{ij} = l_{ij} / p_i$ and $r^{u}_{ij} = u_{ij} / p_i$. Since we only have data for two forecasts it is
not meaningful to use hit rate as a normative benchmark; instead we deduced decision maker $j$'s implicit
volatility estimate using the following approximation (Pearson & Tukey, 1965):
$$\hat{\sigma}_{ij} = \frac{r^{u}_{ij} - r^{l}_{ij}}{3.25}.$$[3]
A decision maker is considered to be
perfectly calibrated, underconfident, or overconfident when the estimated return volatility is equal to,
greater than, or less than the mean historical return volatility,[4] which we calculated using stock price
data obtained from the Center for Research in Security Prices (CRSP) ranging from 1995 to 2015.
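The conversion from a stated 90% price interval to an implied volatility can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes that the divisor of the Pearson and Tukey approximation for a range between the 5th and 95th percentiles is 3.25, and the function names, prices, and the benchmark value of 0.15 are hypothetical.

```python
PEARSON_TUKEY_DIVISOR = 3.25  # assumed constant for a 5th-to-95th percentile range


def implied_volatility(lower, upper, current_value, divisor=PEARSON_TUKEY_DIVISOR):
    """Implied return volatility from the bounds of a 90% price interval."""
    return_low = lower / current_value    # lower bound expressed as a gross return
    return_high = upper / current_value   # upper bound expressed as a gross return
    return (return_high - return_low) / divisor


def classify(volatility, benchmark, tolerance=1e-9):
    """Compare an implied volatility with a historical benchmark."""
    if abs(volatility - benchmark) <= tolerance:
        return "perfectly calibrated"
    return "underconfident" if volatility > benchmark else "overconfident"


# Hypothetical forecast: current price 100, stated 90% interval from 90 to 110.
estimate = implied_volatility(lower=90.0, upper=110.0, current_value=100.0)
print(round(estimate, 4), classify(estimate, benchmark=0.15))
```

An implied volatility below the historical benchmark corresponds to an interval that is too narrow, which is why it is read as overconfidence.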
Judgment accuracy. Our accuracy measure for a general-knowledge question $i$ and a decision maker $j$ is
the absolute percentage error, computed by taking the absolute difference between the point estimate
$\hat{x}_{ij}$ and the true value $x_i$, divided by the true value (see e.g., Mannes, Soll, & Larrick, 2014;
Minson & Mueller, 2012; Davis-Stober, Budescu, Dana, & Broomell, 2014):
$$APE_{ij} = \frac{|\hat{x}_{ij} - x_i|}{x_i}.$$[5]
Confidence interval width. We computed a measure of confidence interval widths to capture to what
extent participants appreciated the degree of uncertainty around their point estimates. The percentage
interval width for question $i$ by decision maker $j$ at confidence level $k$ was calculated as the distance
between the upper bound $u_{ijk}$ and the lower bound $l_{ijk}$ of the interval, expressed as a percentage of
the point estimate $\hat{x}_{ij}$:
$$W_{ijk} = \frac{u_{ijk} - l_{ijk}}{\hat{x}_{ij}}.$$
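The accuracy and interval-width measures are simple ratios, which the following sketch illustrates. The function names and all numbers are hypothetical and serve only as an example; the sketch assumes positive true values and point estimates, as is the case for distances, weights, and prices.

```python
def absolute_percentage_error(point_estimate, true_value):
    """Absolute difference between estimate and truth, relative to the truth."""
    return abs(point_estimate - true_value) / true_value


def percentage_interval_width(lower, upper, point_estimate):
    """Width of the confidence interval relative to the point estimate."""
    return (upper - lower) / point_estimate


# Hypothetical answer to one question: true value 500, best guess 600,
# stated 90% interval from 500 to 800 (illustrative numbers only).
ape = absolute_percentage_error(point_estimate=600, true_value=500)
width = percentage_interval_width(lower=500, upper=800, point_estimate=600)
print(ape, width)
```

Keeping the two measures separate makes it possible to tell whether a change in calibration comes from more accurate point estimates or from wider intervals.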
Opinion and information sharing. To measure the level of opinion and information sharing during the
group deliberations, we asked the participants to rate four items adapted from Phillips and Loyd (2006) on
a scale from 1 = "not at all" to 7 = "very much": (i) "Group members listened to each other's point of view," (ii)
"Group members encouraged each other to share their opinions," (iii) "Group members were interested in
what the others had to say," and (iv) "Group members shared a lot of information with each other."
[3] Keefer and Bodily (1983) show that given information about the 5th and 95th percentiles, this simple
approximation is the preferred method for estimating the standard deviation of the probability distribution of a
random variable.
[4] This is consistent with the definition of overconfidence in the literature as an overestimation of signal precision (e.g.,
Kyle & Wang, 1997; Odean, 1998; Hackbarth, 2008) and has been employed in empirical work by, for example,
Ben-David et al. (2013) and Glaser et al. (2013).
[5] As an alternative we also considered the squared error. Since this measure gave similar results, in the
following we will only report the results for the absolute error.
Group satisfaction. We measured group member satisfaction with three items adapted from Jehn et al.
(2010) on the same scale as for information and opinion sharing: (a) I was very satisfied working in this
group during this exercise, (b) I would like to work with this group again, and (c) I was happy
working in this group during this exercise.
4. Results
4.1. Results from general-knowledge questions
For all of our measures, we initially also tested for differences between judgment categories (distances,
weight, and price), but did not find any significant main effects or interactions with our main dependent
measures of interest. Therefore, we dropped this variable from the analysis. We also tested for the effect
of diversity with respect to age and ethnicity, which have been demonstrated to be the most important
dimensions of demographic diversity in small groups (e.g., Mannix & Neale, 2005; Van Knippenberg &
Schippers, 2007). Since 93% of our participants were White and 90% were between the ages of 21 and
27 (total range: 18–32), groups were in general very homogeneous with respect to these two factors. Our
analysis of age diversity or the presence of non-White group members showed no significant effect on any
of our dependent measures. Hence in order to focus on our main results we do not discuss these two
factors further.
Judgment calibration
Table 1 presents hit rates and calibration errors for 50%, 70%, and 90% confidence intervals aggregated
over all ten questions across the six types of decision makers.
Table 1: Hit rates and calibration errors across decision maker types

                        Hit rate (%)                                   Calibration error (%)
                        50%            70%            90%            50%            70%            90%
Decision maker type     M      SD      M      SD      M      SD      M      SD      M      SD      M      SD
Gender
  Female                37.41  13.47   45.93  13.38   54.07  14.48   15.56   9.74   24.81  11.89   35.93  14.48
  Male                  42.50  14.04   50.36  15.98   59.64  15.75   13.21   8.63   21.79  12.78   30.36  15.75
Gender composition
  All female            48.00  17.08   60.00  14.72   69.20  10.77   14.80   8.23   14.00  10.8    20.8   10.77
  Female majority       50.53  14.30   62.61  14.84   70.87  13.11   10.87   9.00   11.74  11.5    19.1   13.11
  Male majority         48.80  15.63   61.20  17.40   72.80  16.46   12.40   9.26   13.60  13.8    18.0   15.55
  All male              40.38  13.99   51.54  14.34   61.15  12.11   13.46  10.18   18.46  14.3    28.8   12.11
Results regarding the hit rates show that both individuals and groups were on average
overconfident for 70% and 90% confidence levels, whereas for the 50% confidence level overconfidence
is considerably lower. To analyze the degree of miscalibration we conducted a 6 (decision maker type) X
3 (confidence level) mixed Anova of calibration errors. The results showed a significant main effect of
confidence levels, F(17,444) = 40.46, p < .01, a significant effect of decision maker type, F(17,444) =
10.75, p < .01, and a significant interaction effect, F(17,444) = 1.91, p = .04. A planned contrast across all
three confidence levels revealed that all-male groups were calibrated worse in comparison to groups of
other gender compositions, F(3,444) = 5.30, p < .01, d = 0.42. This effect was significant for 90%,
F(3,444) = 12.04, p < .01, and 70%, F(3,444) = 3.79, p = .05, intervals, but not so for 50% intervals,
F(3,444) = 0.08, p = .78. Furthermore, pairwise comparisons showed that within groups having at least
one female member, there was no significant difference in calibration error among those with a majority of
male members, a majority of female members, and only female members, ps > .59.
At the individual level, there was no significant difference in calibration error between individual
judgments by men and by women across all three confidence levels, F(3,444) = 1.45, p = .23, and this
result also held when we tested for each confidence level separately, ps > .1. Furthermore, our results
showed that, across all three confidence levels, groups on average were significantly better calibrated than
individuals, F(3,444) = 17.22, p < .01, d = 0.71; however, this difference was no longer significant when we
compared only all-male groups and individuals, F(3,444) = 1.74, p = .15.
We next compared group judgments with judgments that result from a simple statistical
aggregation procedure (taking the mean) of three individual judgments. To construct the calibration errors that
would be expected from averaging three individual judgments, we randomly selected three individuals
from the individual judgment condition. For each question we computed aggregated confidence intervals
by taking the mean of the three lower and upper bounds stated by these individuals and then computed
calibration errors. We repeated this process 1000 times (sampling with replacement) and then averaged the
aggregated calibration errors (see for example Gaba, Tsetlin, & Winkler, 2014; Hora, 2004; Park &
Budescu, 2015 for a similar methodology to compute aggregation results from individual judgments). To
ensure that our results are comparable with the group estimates, we conducted this process for each gender
composition separately. In particular, the gender of the three selected individuals was consistent with the
corresponding gender composition of the group, i.e., we aggregated intervals estimated by only female
participants, by only male participants, by two male and one female participants, or by two female and one
male participants, to compare with those estimated by all-female, all-male, male-majority, or female-majority
groups, respectively. Figure 1 shows the calibration errors from groups and those from the
statistical aggregation procedure.
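The aggregation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original analysis code: calibration error is taken to be the absolute difference between the hit rate and the nominal confidence level, the three individuals within one draw are distinct while the pool is reused across draws, and all names and data are hypothetical.

```python
import random
from statistics import mean


def hit_rate(intervals, truths):
    """Percentage of (lower, upper) intervals that contain the true value."""
    hits = sum(lo <= t <= hi for (lo, hi), t in zip(intervals, truths))
    return 100.0 * hits / len(truths)


def average_intervals(members):
    """Average lower and upper bounds across members, question by question."""
    return [
        (mean(lo for lo, _ in per_question), mean(hi for _, hi in per_question))
        for per_question in zip(*members)
    ]


def expected_calibration_error(females, males, n_female, truths, level,
                               n_draws=1000, seed=0):
    """Mean calibration error of averaged intervals over random triads.

    Assumption: calibration error = |hit rate - confidence level| (in
    percentage points). The pool is reused in every draw, so the same
    individual can appear in several triads.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n_draws):
        triad = rng.sample(females, n_female) + rng.sample(males, 3 - n_female)
        aggregated = average_intervals(triad)
        errors.append(abs(hit_rate(aggregated, truths) - level))
    return mean(errors)


# Hypothetical data: each person states one (lower, upper) interval per question.
truths = [100, 50, 200]
females = [[(80, 120), (45, 70), (100, 150)],
           [(95, 130), (20, 40), (150, 260)],
           [(60, 90), (40, 55), (180, 220)]]
males = [[(90, 105), (30, 45), (210, 250)],
         [(70, 95), (48, 52), (120, 190)],
         [(99, 140), (55, 80), (190, 230)]]
print(expected_calibration_error(females, males, n_female=2,
                                 truths=truths, level=90))
```

Setting `n_female` to 3, 2, 1, or 0 yields the benchmark for all-female, female-majority, male-majority, and all-male groups, respectively.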
Figure 1: Calibration errors from group estimates and aggregated individual estimates
As Figure 1 shows, on average, groups that have at least one female group member are better
calibrated than what would be expected from a simple averaging of confidence intervals. Averaged across
the three confidence levels, this effect was significant for all-female groups, t(24) = -2.86, p = .01, and
groups with a majority of female members, t(22) = -2.34, p = .03. For groups with a male majority there
was no significant difference between the statistical aggregation outcome and actual group judgments,
t(24) = -0.42, p = .68. Interestingly, for all-male groups the calibration of the group judgments was
actually significantly worse than the outcome of the statistical aggregation, t(25) = 3.81, p < .01.
Judgment accuracy and confidence interval widths
There are two potential factors that might affect the differences in calibration errors across
decision makers: accuracy and interval widths. For a fixed interval width, if a decision maker makes more
accurate judgments, i.e., point estimates that are closer to the true value, then confidence intervals around
this point estimate would contain the true value more often, which would result in a smaller calibration
error and a reduction in overconfidence. On the other hand, for a given accuracy, overconfidence and
calibration error could be reduced, if a decision maker sets wider confidence intervals that are more likely
to contain the true value. In the following, we test to what extent these two factors can account for our
observed difference in calibration error between all-male groups and other groups.
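The two routes to a higher hit rate can be made concrete with a small, purely hypothetical example. The sketch assumes symmetric intervals whose half-width is a fixed percentage of the point estimate; this is a simplification for illustration and not a description of how participants stated their intervals.

```python
def hit_rate_symmetric(point_estimates, truths, half_width_pct):
    """Share (%) of symmetric intervals around each estimate that contain the truth.

    half_width_pct is the half-width expressed as a percentage of the estimate.
    """
    hits = 0
    for estimate, truth in zip(point_estimates, truths):
        half = abs(estimate) * half_width_pct / 100.0
        hits += estimate - half <= truth <= estimate + half
    return 100.0 * hits / len(truths)


truths = [100, 100, 100, 100]
rough = [60, 80, 130, 150]   # less accurate point estimates (hypothetical)
sharp = [90, 95, 110, 120]   # more accurate point estimates (hypothetical)

print(hit_rate_symmetric(rough, truths, 20))  # 0.0: narrow intervals, poor accuracy
print(hit_rate_symmetric(sharp, truths, 20))  # 100.0: same width, better accuracy
print(hit_rate_symmetric(rough, truths, 50))  # 75.0: same accuracy, wider intervals
```

Either better accuracy at a fixed width or wider intervals at a fixed accuracy raises the hit rate, which is why both factors are tested separately.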
Table 2 presents the absolute percentage errors and 50%, 70%, and 90% percentage interval
widths aggregated over all ten questions across the six judgment types.
Table 2: Absolute percentage errors and percentage interval widths across decision maker types

                        Absolute percentage    Percentage interval width (%)
                        error (%)              50%              70%               90%
Decision maker type     M       SD             M      SD        M       SD        M       SD
Individual: Gender
  Female                 89.80  57.3           71.65  15.34     107.25  23.98     155.42   90.11
  Male                  100.60  67.8           74.28  16.18     106.85  36.87     155.47  113.50
Group: Gender composition
  All female             68.64  30.88          73.35  13.98     107.87  24.02     148.17   37.39
  Female majority        59.60  42.28          72.02  14.09      98.24  14.78     122.88   20.79
  Male majority          53.89  22.35          72.93  10.64     106.13  19.95     146.46   41.05
  All male               59.80  29.71          67.22  12.33      93.79  27.96     110.35   18.08
A one-way Manova of absolute percentage errors averaged across the ten questions showed a
significant effect of the decision maker type, F(50.0, 637.3) = 1.48, p = .02. Planned contrasts did not
show a significant difference in accuracy between all-male groups and the other three group types,
F(10,139) = 1.21, p = .29, nor between individual judgments made by men and women, F(10,139) = 1.52,
p = .14. In contrast, our results show that groups made significantly more accurate judgments than
individuals, F(10,139) = 3.76, p < .01, d = 0.77.
A 6 (Decision maker type) X 3 (Confidence level) mixed Anova of percentage interval widths
averaged across the ten questions revealed a significant main effect of decision maker type, F(17,444) =
3.28, p < .01, and of confidence level, F(17,444) = 102.65, p < .01, but no significant interaction effect
between the two factors, F(17,444) = 1.12, p = .34. Planned contrasts across all three confidence levels
revealed that the percentage interval widths of all-male groups were significantly smaller than those of
other groups, F(3,444) = 3.13, p = .03, d = 0.40, and those of individuals, F(3,444) = 6.73, p < .01, d =
0.33. For individual judgments, there was no significant difference between judgments by women and
men, F(3,444) = 0.11, p = .95. Finally, we found that the widths of intervals provided by groups were not
significantly different from those provided by individuals, F(3,444) = 1.49, p = .21.
Mediation and group discussion
We averaged the four items measuring the degree of opinion and information sharing during the group
discussion into one composite measure (α = .84).⁶ Inter-rater reliability across the three group members
(ICC(1) = 0.57) was significantly different from zero, F(98,198) = 5.04, p < .01, suggesting that group
members' ratings were strongly interdependent.⁷ We then further aggregated the three composite measures
of the group members into one group measure. A one-way Anova across the four gender compositions
showed a significant effect of gender composition on the degree of opinion and information sharing in
groups, F(3,95) = 3.75, p = .01. In particular, as suggested by Hypothesis 1, members of all-male groups
engaged significantly less in the exchange of opinions and information than members of other groups,
F(1,95) = 10.53, p < .01, d = 0.74. Pairwise comparisons revealed no significant difference in the
exchange of opinions and information between male majority, female majority and all-female groups, ps >
.40.
⁶ We also tested for possible individual differences between male and female group members within each group with
respect to their perceptions of opinion and information sharing, but did not find a significant difference. We therefore
do not discuss this factor further.
⁷ Aggregation of individual measures to the group level is usually considered justified if the ICC(1) exceeds 0.2 (e.g.,
Kozlowski & Klein, 2000).
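The reported ICC(1) can be related to the reported F statistic with a short calculation. The sketch below assumes the one-way random-effects definition of ICC(1) and equal group sizes of three; the function name is illustrative.

```python
def icc1_from_f(f_value, k):
    """ICC(1) from a one-way ANOVA F ratio with k raters per group.

    Uses ICC(1) = (MSB - MSW) / (MSB + (k - 1) * MSW) with F = MSB / MSW,
    which simplifies to (F - 1) / (F + k - 1).
    """
    return (f_value - 1.0) / (f_value + k - 1.0)


# Reported values: F(98,198) = 5.04 with three members per group.
print(round(icc1_from_f(5.04, k=3), 2))  # 0.57
```

Under these assumptions the F value of 5.04 reproduces the reported ICC(1) of 0.57.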
We next tested whether the degree of opinion and information sharing during the group discussion
mediated the difference in calibration between all-male groups and other group types, as suggested in
Hypothesis 2b. Figure 2 presents the results of a mediation analysis (Baron & Kenny, 1986) with opinion
and information sharing as the mediator between group gender composition (all-male groups vs. groups
with at least one female member) and calibration error averaged across all three confidence levels.
Figure 2: Results of mediation analysis (Baron & Kenny, 1986)
Our results show that sharing of opinions and information was significantly lower for all-male
groups (path a) than for groups with at least one female member. Furthermore, the negative effect of all-male
gender composition on calibration (path c) was reduced, and actually became non-significant, when
we controlled for opinion and information sharing in the regression (paths b and c′). We used a bootstrap
procedure (Shrout & Bolger, 2002) with 5000 samples to construct a 95% confidence interval for the
indirect effect of gender composition on calibration error. The confidence interval (0.09, 0.64) excluded
zero, which confirmed that our measure of opinion and information sharing was a significant mediator.
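A percentile bootstrap of the indirect effect can be sketched as follows. This is a minimal illustration on synthetic data, not the original analysis: the data are simulated using the reported path coefficients as generating values, the group counts (26 all-male, 73 other) are inferred from the reported degrees of freedom, and all function names are hypothetical. As a plausibility check, the product of the reported paths a and b is (-0.74)(-0.49) ≈ 0.36, which lies inside the reported interval (0.09, 0.64).

```python
import random


def _cov(u, v):
    """Population covariance of two equally long sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def indirect_effect(x, m, y):
    """Product a*b: a from m ~ x, b from y ~ x + m (ordinary least squares)."""
    sxx, smm, sxm = _cov(x, x), _cov(m, m), _cov(x, m)
    sxy, smy = _cov(x, y), _cov(m, y)
    a = sxm / sxx
    b = (smy * sxx - sxy * sxm) / (sxx * smm - sxm ** 2)
    return a * b


def bootstrap_ci(x, m, y, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:  # skip degenerate resamples without variation in x
            continue
        stats.append(indirect_effect(xs, [m[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


# Synthetic data only: x = 1 codes an all-male group (hypothetical coding).
rng = random.Random(1)
x = [1] * 26 + [0] * 73
m = [5.0 - 0.74 * xi + rng.gauss(0, 1) for xi in x]
y = [3.0 - 0.49 * mi + 0.32 * xi + rng.gauss(0, 1) for xi, mi in zip(x, m)]
print(indirect_effect(x, m, y), bootstrap_ci(x, m, y, n_boot=2000))
```

Mediation is supported in this framework when the bootstrap interval for the product a*b excludes zero.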
[Figure 2 path diagram: path a, gender composition (all-male vs. other) to sharing of opinions and information: -0.74** (0.23); path b, sharing of opinions and information to calibration error: -0.49** (0.09); path c / c′, gender composition to calibration error: 0.69** (0.21) / 0.32 (0.19).]
Summary of results for general knowledge questions
Our findings provide clear evidence for Hypothesis 2a suggesting that groups with at least one female
group member are better calibrated than those composed of men only. We also found that for groups with
at least one female group member, there is no significant difference in calibration among those with only
female members, a female majority, or a male majority. Moreover, we found that group members in groups
with at least one female member were more willing to exchange opinions and information during the
group discussion as proposed in Hypothesis 1 and that this effect mediated their better confidence
calibration as suggested in Hypothesis 2b.
We found no significant difference in calibration between male and female individuals. Moreover,
our comparison of calibration errors of group judgments with those of a statistical aggregation of
individual judgments stated by a corresponding number of male and female participants revealed that
judgments from groups with at least one female member were better calibrated than those resulting from
the aggregation procedure; in contrast, confidence intervals from all-male groups were actually worse
calibrated than those resulting from averaging confidence intervals of three male individuals. Together
with the results of our mediation analysis these two results provide clear evidence that the improvement in
calibration of groups with at least one female member over all-male groups is not driven by individual
characteristics of group members, but by factors pertaining to the group discussion.
In general, the presence of female group members can have an impact on two potential factors that
drive better calibrated judgment: more accurate point estimates, or wider confidence intervals. Strongly
supporting the latter explanation, we found a significant difference between all-male groups and groups of
other gender compositions in interval widths, but not in accuracy.
4.2. Results from financial and random walk forecasts
Financial forecasts
Table 3 shows mean historical return volatilities of the Dow Jones Index and Microsoft shares and
corresponding return volatility estimates derived from participants' confidence intervals.
Table 3: Historical and estimated return volatilities

                                   Dow Jones Industrial Index (%)                  Microsoft shares (%)
                                   1 month        6 months       12 months      1 month        6 months       12 months
Mean historical return volatility  4.23           12.16          13.07          9.53           22.06          36.19
Decision maker type                M      SD      M      SD      M      SD      M      SD      M      SD      M      SD
Individual: Gender
  Female                           3.06   2.78    4.90   3.78     8.39  7.48    18.80  17.61   30.65  25.65   43.03  29.54
  Male                             2.76   2.47    4.66   4.05     6.22  5.48    19.72  16.12   26.89  22.32   38.40  29.62
Group: Gender composition
  All female                       3.36   2.92    6.17   4.27    10.74  7.09    12.31   7.60   21.06  14.54   30.91  23.75
  Female majority                  2.83   2.53    4.25   3.48     6.06  4.74    12.04   6.39   19.25  10.64   30.03  25.52
  Male majority                    3.40   2.99    6.44   5.16     9.51  7.44    12.03   9.64   21.88  14.98   32.62  18.31
  All male                         2.40   1.96    4.28   3.31     5.86  4.40     7.59   5.65   13.74   9.74   21.76  26.48
A comparison of return volatility estimates for the Dow Jones Index and mean historical return
volatilities across all three time horizons revealed that both groups, t(296) = 14.50, p < .01, and
individuals, t(164) = 11.58, p < .01, displayed overconfidence and significantly underestimated return
volatilities. For Microsoft shares we found that individuals actually overestimated return volatility, t(164)
= -3.54, p < .01, whereas groups tended to underestimate volatility, t(296) = 2.04, p = .04.
We next conducted two separate 6 (Decision maker type) X 3 (Time horizon) mixed Anovas of
return volatility estimates for the Dow Jones Index and for Microsoft shares, respectively. Our analysis for
the Dow Jones Index revealed a significant effect of time horizon, F(17,444) = 43.05, p < .01 and of
decision maker type, F(17,444) = 4.07, p < .01, but the interaction between the two was not significant,
F(17,444) = 0.87, p = .56. Planned contrasts, across all time horizons, showed that return volatility
estimates by all-male groups were significantly lower than those from groups with at least one female
member, F(3,444) = 3.32, p = .02, d = 0.34. There was no significant difference between return volatility
estimates by individual men and women, F(3,444) = 0.23, p = .88, or between individual and group
estimates, F(3,444) = 0.39, p = .76.
Similarly, for estimated return volatilities of Microsoft shares the results showed a significant
effect of time horizon, F(17,444) = 40.49, p < .01, and of decision maker type, F(3,444) = 7.08, p < .01,
but there was no significant interaction effect, F(3,444) = 0.26, p = .98. Planned contrasts, across all time
horizons, revealed that return volatility estimates by all-male groups were significantly lower than those
by other groups, F(3,444) = 3.99, p = .01, d = 0.47, again indicating stronger overconfidence for all-male
groups than for groups of other gender compositions. There was no significant difference between return
volatility estimates by individual men and women, F(3,444) = 0.39, p = .76. Moreover, the results showed
that return volatility estimates by groups were significantly lower than those by individuals, F(3,444) =
6.82, p < .01, d = 0.40.
Random walk forecasts
Table 4 summarizes the widths of decision makers' estimated confidence intervals in the random walk
task and the corresponding theoretical benchmarks.
Table 4: Confidence interval widths in the random walk task

                        Interval width
                        50%             70%             90%
Random walk model       13.48           20.72           32.9
Decision maker type     M      SD       M      SD       M      SD
Individual: Gender
  Female                 8.36   9.66    13.21  11.47    20.95  17.89
  Male                   8.63   9.54    10.54   9.44    18.50  20.91
Group: Gender composition
  All female            18.22  20.22    30.68  26.95    54.00  46.64
  Female majority       10.48   7.41    20.13  13.38    34.35  26.28
  Male majority         10.00   9.84    19.76  19.31    34.72  28.99
  All male              14.27  13.05    23.88  16.76    38.92  23.99
Similar to the financial forecasts, it is not meaningful to use hit rates as a normative benchmark
since we only have data for one forecast; furthermore, unlike the Dow Jones Index or the Microsoft
shares, there is no meaningful and observable realization of the random variable. On the other hand, the
random walk model allows us to directly compute confidence interval estimates that can be used as a
theoretical benchmark. Hence, we compare normative confidence interval widths with those
stated by decision makers to assess the extent to which a decision maker is prone to miscalibration.⁸
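A benchmark of this kind can be sketched with a normal approximation of the final position. The standard deviation of 10 used below is an assumption chosen because it approximately reproduces the benchmark widths reported in Table 4 (13.48, 20.72, and 32.9); the specification of the random walk itself is not restated here, and the function name is illustrative.

```python
from statistics import NormalDist


def normative_width(confidence, sigma):
    """Width of the central interval of a zero-mean normal distribution."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return 2.0 * z * sigma


# sigma = 10 is an assumption, not a value taken from the task description.
for level in (0.50, 0.70, 0.90):
    print(level, round(normative_width(level, sigma=10.0), 2))
```

Stated intervals narrower than these widths indicate overconfidence, and wider intervals indicate underconfidence.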
A 6 (Decision maker type) X 3 (Confidence level) mixed Anova of confidence interval widths
revealed a significant main effect of confidence level, F(17,444) = 47.03, p < .01, and decision maker
type, F(17,444) = 10.32, p < .01. There was no significant interaction between the two factors, F(17,444)
= 1.24, p = .26. A planned contrast revealed no significant difference in confidence interval widths
between all-male groups and other group types, F(3,444) = 0.10, p = .96, nor between individual judgments
by men and women, F(3,444) = 1.10, p = .35. In contrast, confidence intervals stated by groups were significantly
wider than those by individuals, F(3,444) = 12.64, p < .01, d = -.58.
Summary of results for financial and random walk forecasts
Our data showed that estimated return volatilities of both the Dow Jones Index and Microsoft shares were
significantly lower for all-male groups than for groups with at least one female member; in both cases,
estimates by groups and especially all-male groups were also lower than the corresponding mean
historical return volatilities. In contrast to our findings for financial forecasts and general-knowledge
questions, the analysis of random walk forecasts did not show a significant difference in interval width
between all-male groups and groups with at least one female member. One plausible explanation for this is
that the random walk forecast requires mostly mathematical intuition rather than real-world knowledge
(as in the other tasks) and that the former is less likely to be improved through more sharing of opinions
and information in the group discussion.
⁸ Note that we do not use percentage interval width here, since the theoretical expected value of the position equals
zero, which makes percentage widths inappropriate for comparison.
4.3. Analysis of audio-tapes and reported satisfaction
We now turn to our analysis of the audio recordings of group discussions and group members' self-reported
satisfaction. On average, group discussions lasted for 32 minutes (SD = 5.64). A one-way Anova
indicated a significant difference in the amount of discussion time across groups of different gender
compositions, F(3,95) = 3.16, p = .03, and a direct comparison showed that discussions in all-male groups
(M = 29.04, SD = 6.74) were significantly shorter than those in other groups (M = 32.57, SD = 4.91),
F(1,95) = 8.14, p = .01, d = 0.65. To analyze to what extent discussions in all-male groups were
dominated by only one or two group members, we measured the proportional amount of time each group
member was speaking during the discussion and used this measure as a proxy for each group member's
participation intensity (e.g., Phillips & Loyd, 2006; Woolley et al., 2010; Tost, Gino, & Larrick,
2013). We then computed the variance of group member participation intensity across the three group
members (e.g., Woolley et al., 2010). Participation variance would equal zero when all group members
participated equally in the discussion, and reach its maximum when only one group member spoke and the
remaining two members remained completely silent. A one-way Anova revealed a significant difference
in the participation variance across groups of different gender composition, F(3,95) = 2.87, p = .04. In
addition, our analysis showed that the variance in group members' participation intensity during the
discussion was significantly larger in all-male groups (M = 0.046, SD = 0.033) compared to all other
groups (M = 0.029, SD = 0.031), F(1,95) = 7.50, p = .01, d = -0.63.⁹
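The participation measure can be sketched as the variance of the three members' shares of total speaking time. Whether the population or the sample variance was used is not stated; the sketch assumes the population variance, and the function name and speaking times are hypothetical.

```python
from statistics import pvariance


def participation_variance(speaking_times):
    """Population variance of each member's share of total speaking time."""
    total = sum(speaking_times)
    if total <= 0:
        raise ValueError("total speaking time must be positive")
    shares = [t / total for t in speaking_times]
    return pvariance(shares)


print(participation_variance([10, 10, 10]))  # equal participation: 0.0
print(participation_variance([30, 0, 0]))    # one member speaks throughout
print(participation_variance([18, 8, 4]))    # moderately uneven discussion
```

Under this assumption the measure ranges from 0 (equal participation) to 2/9 (a single speaker) for groups of three.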
To analyze participants' satisfaction with their group, we aggregated the three self-reported
measures of group member satisfaction into one composite variable (α = .81). A one-way Anova indicated
a significant difference across gender compositions, F(3, 95) = 4.73, p < .01. Moreover, planned contrasts
revealed that group members in all-male groups were less satisfied than those in groups with at least one
female member, F(1,95) = 9.26, p < .01, d = 0.69.
⁹ We also tested for potential differences in participation intensity between male and female group
members within each group, but did not find a significant effect of group member gender.
5. General discussion
The results of our experiment revealed that confidence judgments by groups with at least one
female member were significantly better calibrated than those by all-male groups. This effect was
mediated by a higher willingness to share opinions and information in groups with one or more female
members. Consistently, our analysis of the audio recordings of the group discussions established that in groups with
at least one female member, group members participated more evenly in the group discussion than in all-male
groups, where discussions were more likely to be dominated by only a single member and also ended
more quickly. We did not find a significant difference in confidence calibration between individual
judgments made by men and women. This result is in line with some prior research (Biais et al., 2005;
Jonsson & Allwood, 2003), but differs from other findings (Soll & Klayman, 2004). Moreover, confidence
calibration in groups was generally better than what would be expected from a simple aggregation of
individual judgments, but this effect did not hold for groups consisting of three male members, whose
calibration was actually worse than what would be expected from averaging three individual judgments.
Thus, while group deliberation was mostly beneficial for groups with at least one female member, it was
actually detrimental for all-male groups. We suggest that this latter effect might be due to the more
frequent monopolization of the group discussion by one or two members in all-male groups. Due to this
process, all-male groups might be performing closer to the level of individual decision makers with
respect to their confidence calibration, compared to groups of other gender compositions in which
discussion participation of the three group members was more evenly balanced. Consistent with the results
of prior research (Plous, 1995; Russo & Schoemaker, 1992; Sniezek & Henry, 1989), a direct comparison
of group and individual judgments showed that groups made significantly better calibrated judgments than
individuals. However, this advantage of group decision making was mostly lost in the case of all-male
groups.
In general, our results from the mediation analysis, the statistical aggregation model and the
comparison of individual judgments by men and women all strongly indicate that it is the group
deliberation process, rather than group members' individual differences, that drives the difference in
calibration between groups with at least one female member and all-male groups.
Both accuracy and interval widths are factors that might be affecting calibration. Whereas we did
not find a significant effect of groups' gender composition on judgment accuracy, confidence intervals set
by all-male groups were significantly narrower than those by other groups. Thus, whereas group gender
composition clearly did not affect a group's ability to correctly answer a question, our results indicate that
it did have a strong effect on group members' appreciation of their own lack of knowledge.
The results for forecasts of financial variables mirrored those we observed for general knowledge
questions, providing converging evidence for our hypotheses from a different type of task. All-male
groups provided confidence intervals that implied significantly lower return volatility in the stock market
than those provided by groups with at least one female member. Moreover, while return volatility
estimates by all groups were generally lower than those observed historically, volatility estimates by
groups with only male members were even further away from historical volatilities. These findings also
suggest that the detrimental effect resulting from the absence of female group members extends to tasks
that are similar to those carried out within the finance industry, an area with a relatively low proportion
of women and thus very likely a high proportion of all-male groups.
In contrast to our results from the general knowledge and financial forecast questions, we did not
find a significant difference between all-male groups and other group types for estimates in the random
walk task. This result provides an interesting boundary condition for the effects of gender composition. An
explanation for this result might be that, unlike the general knowledge questions and the financial
forecasts, this particular task requires mostly mathematical intuition and does not relate to real-world
phenomena. In general, even if group members strongly engage in the exchange of opinions and
information, their lack of skill in a task might prevent them from taking advantage of this increase in
available information (e.g., Woolley et al., 2010). Thus for forecasts in the random walk task, improved
information and opinion sharing during the group discussion might not have been as beneficial as in the
other tasks, because most group members might have lacked the necessary mathematical skills to
understand the properties of the random walk model and how it evolves over time.
Our work makes three important contributions. First of all, whereas prior research on
overconfidence in groups (Plous, 1995; Russo & Schoemaker, 1992; Sniezek & Henry, 1989) was limited
to a direct comparison of individual and group judgments, we focus on the comparison between all-male
groups and groups with at least one female member. Our study establishes gender composition as an
important moderating factor that determines to what extent group discussions can alleviate miscalibration
in confidence judgments. In particular, our findings reveal that group deliberations have a mostly positive
effect on calibration for groups with one or more female members, but actually harm calibration in all-male
groups. In addition, our study also extends prior work on group confidence calibration to the area of
financial forecasts. Thus our results might also have important practical implications. In particular,
organizations in the financial sector that rely on such forecasts could attempt to improve the quality of
their forecasts by adjusting their human resource practices to ensure that groups of analysts contain at least
one female member.
Secondly, our study contributes to the literature on group diversity. In particular, our findings
demonstrate that the benefits of group diversity could be more subtle than an increase in group
performance: they might instead arise from a lower susceptibility to typical perils of group decision
making such as overconfidence. In recent years considerable attention from academic research, public
media, and politics has been paid to the gender composition of top management teams and boards of
directors. Even though a higher share of women on boards and in top management teams is often regarded
as desirable for reasons of gender equality, there have been mixed findings on its actual impact on firms'
financial performance (e.g., Erhardt, Werbel, & Shrader, 2003; Wolfers, 2006; for a partial exception see
Adams & Ferreira, 2009). Our results suggest that one important advantage of a gender-diverse team
might be its ability to better deal with situations that involve substantial levels of uncertainty and to take
on more adequate levels of risk in such situations due to better confidence calibration. Such an advantage
might not be directly visible in firms' financial performance (which is also influenced by a large variety of
other factors), but is crucial in keeping firms from being exposed to excessive risk, and hence away from
the danger of bankruptcy (e.g., Ben-David et al., 2013).
Third, our work adds further evidence to the literature on the psychological processes that are
triggered by group diversity. In particular, our findings concerning the beneficial effects of gender
diversity on information and opinion sharing are in line with prior theoretical frameworks suggesting that
the effects of group diversity do not predominantly derive from additional knowledge or skills that
members add to the group, but rather from the effects of group diversity on within-group processes such
as information sharing (e.g., Van Knippenberg & Schippers, 2007; Van Knippenberg, De Dreu, & Homan,
2004). Importantly, our findings also establish an interesting discontinuity in the effects of gender
composition. All-male groups are significantly worse calibrated than groups with at least one female
member because their members are less willing to share their opinions during the group discussion.
By contrast, there is no difference in calibration or the quality of group discussions between groups
with all female members, a majority of female members, or a majority of male members. This is
consistent with the prior observation that female group members shape the nature of group discussions not
only through their own behavior but also by affecting the behavior of male group members (e.g., Adams
& Ferreira, 2009; Williams & Polman, 2014). Therefore, even the presence of just one woman in the
group appears to be sufficient to derive all potential benefits.
Finally, whereas some prior work suggests that a possible disadvantage of gender-diverse groups
might be lower group member satisfaction, since men and women prefer to work in gender-homogeneous
groups (e.g., Tsui, Egan, & O'Reilly III, 1992), our results showed that members of all-male groups
displayed the least satisfaction and the lowest willingness to work with the other group members again. It
would be an interesting topic for future research to identify the precise conditions under which gender
diversity leads to higher or lower group satisfaction. For example, it might be that men and women enjoy
working together more in relatively gender-neutral tasks, such as those in our experiment, but satisfaction
becomes lower for tasks that are perceived by group members as stereotypically male or female.
An important limitation of our work is that we focused on small groups of only three members. In
particular, our finding that even the presence of one female member was sufficient to improve the quality
of the group discussion and confidence calibration might not necessarily hold in larger groups. Prior work
has pointed out that female group members in groups that otherwise consist only of men might be prone to
being perceived as a token and thus to being marginalized and ignored (e.g., Kanter, 1977a, 1977b).
Therefore, in order for larger groups to benefit from the advantages of gender diversity there might need
to be a certain critical mass of female group members (e.g., Joecks, Pull, & Vetter, 2013; Torchia,
Calabro, & Huse, 2011). An important step for future research would be to further explore the exact
relationship between the proportion of women in a group and their effect on group processes such as
sharing of information and opinions in larger groups.
Moreover, our study has focused on only one particular type of overconfidence, namely calibration
in confidence interval judgments. An interesting direction for future research would be to explore the
effect of gender composition on other forms of overconfidence such as the tendency of individuals and
groups to underestimate the time required to finish a project (Buehler, Messervey, & Griffin, 2005; Staats,
Milkman, & Fox, 2012) or the better-than-average effect (Kruger & Dunning, 1999; Sniezek, 1989). In
addition, future studies could examine the effects of gender composition on common cognitive biases other
than overconfidence, such as escalation of commitment (Bazerman, Giuliano, & Appelman, 1984; Whyte,
1993) or the confirmation bias (Schulz-Hardt et al., 2000, 2002). In general, studies that have compared
individuals and groups with respect to cognitive biases have reported very mixed results (e.g.,
Kerr, MacCoun, & Kramer, 1996). Our finding that gender composition has an important effect on the
quality of group deliberations suggests that the gender composition of a group might be an important
moderating factor that could explain under what circumstances groups deal with cognitive biases better
than individuals.
Our research in this paper has mostly focused on how the judgment quality of groups can be
improved, but prior research has also pointed out that there can be strong knowledge transfers from group
discussions to subsequent individual judgment (e.g., Maciejovsky & Budescu, 2007; Maciejovsky, Sutter,
Budescu, & Bernau, 2013). It would therefore be interesting to explore to what extent such learning
effects exist in the context of confidence calibration and how knowledge transfers in general are affected
by group gender composition. For example, it might be the case that learning effects are smaller when
group discussions are strongly dominated by only one or two members, as was often the case in
all-male groups in our study.
References
Adams, R. B., & Ferreira, D. (2009). Women in the boardroom and their impact on governance and
performance. Journal of Financial Economics, 94(2), 291-309.
Alpert, M., & Raiffa, H. (1982). A progress report on the training of probability assessors. In
D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty: Heuristics
and biases (pp. 294-305). Cambridge University Press.
Anderson, K. J., & Leaper, C. (1998). Emotion talk between same- and mixed-gender friends: Form and
function. Journal of Language and Social Psychology, 17(4), 419-448.
Anderson, C., Brion, S., Moore, D. A., & Kennedy, J. A. (2012). A status-enhancement account of
overconfidence. Journal of Personality and Social Psychology, 103(4), 718-735.
Apesteguia, J., Azmat, G., & Iriberri, N. (2012). The impact of gender composition on team performance
and decision making: Evidence from the field. Management Science, 58(1), 78-93.
Asch, S. E. (1952). Group forces in the modification and distortion of judgments. In S. E. Asch,
Social psychology (pp. 450-501). Englewood Cliffs, NJ: Prentice-Hall.
Barber, B. M., & Odean, T. (2000). Trading is hazardous to your wealth: The common stock investment
performance of individual investors. Journal of Finance, 773-806.
Barber, B. M., & Odean, T. (2002). Online investors: Do the slow die first? Review of Financial
Studies, 15(2), 455-488.
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social
psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality
and Social Psychology, 51(6), 1173-1182.
Bazerman, M. H., Giuliano, T., & Appelman, A. (1984). Escalation of commitment in individual and
group decision making. Organizational Behavior and Human Performance, 33(2), 141-152.
Bear, J. B., & Woolley, A. W. (2011). The role of gender in team collaboration and performance.
Interdisciplinary Science Reviews, 36(2), 146-153.
Ben-David, I., Graham, J. R., & Harvey, C. R. (2013). Managerial miscalibration. The Quarterly Journal
of Economics, 1547-1584.
Berdahl, J. L., & Anderson, C. (2005). Men, women, and leadership centralization in groups over time.
Group Dynamics: Theory, Research, and Practice, 9(1), 45-57.
Biais, B., Hilton, D., Mazurier, K., & Pouget, S. (2005). Judgemental overconfidence, self-monitoring,
and trading performance in an experimental financial market. The Review of Economic Studies,
72(2), 287-312.
Boschini, A., Muren, A., & Persson, M. (2011). Men among men do not take norm enforcement seriously.
The Journal of Socio-Economics, 40(5), 523-529.
Budescu, D. V., & Du, N. (2007). Coherence and consistency of investors' probability judgments.
Management Science, 53(11), 1731-1744.
Buehler, R., Messervey, D., & Griffin, D. (2005). Collaborative planning and prediction: Does group
discussion affect optimistic biases in time estimation? Organizational Behavior and Human
Decision Processes, 97(1), 47-63.
Cesarini, D., Sandewall, Ö., & Johannesson, M. (2006). Confidence interval estimation tasks and the
economics of overconfidence. Journal of Economic Behavior & Organization, 61(3), 453-470.
Clemen, R. T. (1996). Making hard decisions: An introduction to decision analysis (2nd ed.). Boston:
PWS-Kent Publishing.
Davis-Stober, C. P., Budescu, D. V., Dana, J., & Broomell, S. B. (2014). When is a crowd wise?
Decision, 1(2), 79-101.
Deaves, R., Lüders, E., & Luo, G. Y. (2008). An experimental test of the impact of overconfidence and
gender on trading activity. Review of Finance, 13, 555-575.
Deaves, R., Lüders, E., & Schröder, M. (2010). The dynamics of overconfidence: Evidence from stock
market forecasters. Journal of Economic Behavior & Organization, 75(3), 402-412.
Deutsch, M. (1949). An experimental study of the effects of cooperation and competition upon group
process. Human Relations, 2(3), 199-231.
Dezsö, C. L., & Ross, D. G. (2012). Does female representation in top management improve firm
performance? A panel data investigation. Strategic Management Journal, 33(9), 1072-1089.
Edmondson, A. (1999). Psychological safety and learning behavior in work teams. Administrative Science
Quarterly, 44(2), 350-383.
Erhardt, N. L., Werbel, J. D., & Shrader, C. B. (2003). Board of director diversity and firm financial
performance. Corporate Governance: An International Review, 11(2), 102-111.
Fletcher, J. (1998). Relational practice: A feminist reconstruction of work. Journal of Management
Inquiry, 7(2), 163-186.
Gaba, A., Tsetlin, I., & Winkler, R. L. (2014). Combining interval forecasts. Working Paper.
Gigone, D., & Hastie, R. (1997). The impact of information on small group choice. Journal of Personality
and Social Psychology, 72(1), 132-140.
Glaser, M., Langer, T., & Weber, M. (2013). True overconfidence in interval estimates: Evidence based
on a new measure of miscalibration. Journal of Behavioral Decision Making, 26(5), 405-417.
Goetzmann, W. N., & Kumar, A. (2008). Equity portfolio diversification. Review of Finance, 12(3),
433-463.
Hackbarth, D. (2008). Managerial traits and capital structure decisions. Journal of Financial and
Quantitative Analysis, 43(4), 843-881.
Hackman, J. (1987). The design of work teams. In J. Lorsch (Ed.), Handbook of organizational behavior
(pp. 315-342).
Hall, J. A. (1978). Gender effects in decoding nonverbal cues. Psychological Bulletin, 85(4), 845-857.
Heilman, M. E. (2012). Gender stereotypes and workplace bias. Research in Organizational Behavior, 32,
113-135.
Heath, C., & Gonzalez, R. (1995). Interaction with others increases decision confidence but not decision
quality: Evidence against information collection views of interactive decision making.
Organizational Behavior and Human Decision Processes, 61(3), 305-326.
Hinsz, V. B., Tindale, R. S., & Vollrath, D. A. (1997). The emerging conceptualization of groups as
information processors. Psychological Bulletin, 121(1), 43-64.
Hogarth, R. M. (1978). A note on aggregating opinions. Organizational Behavior and Human
Performance, 21(1), 40-46.
Hoogendoorn, S., Oosterbeek, H., & Van Praag, M. (2013). The impact of gender diversity on the
performance of business teams: Evidence from a field experiment. Management Science, 59(7),
1514-1528.
Hora, S. C. (2004). Probability judgments for continuous quantities: Linear combinations and calibration.
Management Science, 50(5), 597-604.
Ilgen, D. R. (1999). Teams embedded in organizations: Some implications. American Psychologist, 54(2),
129-139.
Jackson, S. E. (1992). Consequences of group composition for the interpersonal dynamics of strategic
issue processing. Advances in Strategic Management, 8(3), 345-382.
Jain, K., Mukherjee, K., Bearden, J. N., & Gaba, A. (2013). Unpacking the future: A nudge toward wider
subjective confidence intervals. Management Science, 59(9), 1970-1987.
Jehn, K. A., Northcraft, G. B., & Neale, M. A. (1999). Why differences make a difference: A field study
of diversity, conflict and performance in workgroups. Administrative Science Quarterly, 44(4),
741-763.
Jehn, K. A., Rispens, S., & Thatcher, S. M. (2010). The effects of conflict asymmetry on work group and
individual outcomes. Academy of Management Journal, 53(3), 596-616.
Joecks, J., Pull, K., & Vetter, K. (2013). Gender diversity in the boardroom and firm performance: What
exactly constitutes a “critical mass?” Journal of Business Ethics, 118(1), 61-72.
Jonsson, A. C., & Allwood, C. M. (2003). Stability and variability in the realism of confidence judgments
over time, content domain, and gender. Personality and Individual Differences, 34(4), 559-574.
Juslin, P., Winman, A., & Hansson, P. (2007). The naive intuitive statistician: A naive sampling model of
intuitive confidence intervals. Psychological Review, 114(3), 678-703.
Kanter, R. M. (1977a). Men and women of the corporation (Vol. 5049). Basic Books.
Kanter, R. M. (1977b). Some effects of proportions on group life: Skewed sex ratios and responses to
token women. American Journal of Sociology, 82(5), 965-990.
Keefer, D. L., & Bodily, S. E. (1983). Three-point approximations for continuous random variables.
Management Science, 29(5), 595-609.
Kennedy, J. A., Anderson, C., & Moore, D. A. (2013). When overconfidence is revealed to others: Testing
the status-enhancement theory of overconfidence. Organizational Behavior and Human Decision
Processes, 122(2), 266-279.
Kerr, N. L., MacCoun, R. J., & Kramer, G. P. (1996). Bias in judgment: Comparing individuals and
groups. Psychological Review, 103(4), 687-719.
Klayman, J., Soll, J. B., Gonzalez-Vallejo, C., & Barlas, S. (1999). Overconfidence: It depends on how,
what, and whom you ask. Organizational Behavior and Human Decision Processes, 79(3),
216-247.
Kozlowski, S. W., & Bell, B. S. (2003). Work groups and teams in organizations. Handbook of
Psychology.
Krishnan, H. A., & Park, D. (2005). A few good women - on top management teams. Journal of Business
Research, 58(12), 1712-1720.
Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it: How difficulties in recognizing one's own
incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology,
77(6), 1121-1134.
Kyle, A. S., & Wang, F. A. (1997). Speculation duopoly with agreement to disagree: Can overconfidence
survive the market test? Journal of Finance, 52(5), 2073-2090.
Larrick, R. P., & Soll, J. B. (2006). Intuitions about combining opinions: Misappreciation of the averaging
principle. Management Science, 52(1), 111-127.
Laughlin, P. R., Bonner, B. L., & Miner, A. G. (2002). Groups perform better than the best individuals on
letters-to-numbers problems. Organizational Behavior and Human Decision Processes, 88(2),
605-620.
Laughlin, P. R., & Ellis, A. L. (1986). Demonstrability and social combination processes on mathematical
intellective tasks. Journal of Experimental Social Psychology, 22(3), 177-189.
Levine, J. M., & Smith, E. (2013). Group cognition: Collective information search and distribution. In D.
Carlston (Ed.), Oxford handbook of social cognition (pp. 616-633). New York: Oxford University
Press.
Lichtenstein, S., & Fischhoff, B. (1977). Do those who know more also know more about how much they
know? Organizational Behavior and Human Performance, 20(2), 159-183.
Mannes, A. E., Soll, J. B., & Larrick, R. P. (2014). The wisdom of select crowds. Journal of Personality
and Social Psychology, 107(2), 276-299.
Mannix, E., & Neale, M. A. (2005). What differences make a difference? The promise and reality of
diverse teams in organizations. Psychological Science in the Public Interest, 6(2), 31-55.
Maciejovsky, B., & Budescu, D. V. (2007). Collective induction without cooperation? Learning and
knowledge transfer in cooperative groups and competitive auctions. Journal of Personality and
Social Psychology, 92(5), 854-870.
Maciejovsky, B., Sutter, M., Budescu, D. V., & Bernau, P. (2013). Teams make you smarter: How
exposure to teams improves individual decisions in probability and reasoning tasks. Management
Science, 59(6), 1255-1270.
Mast, M. S. (2001). Gender differences and similarities in dominance hierarchies in same-gender groups
based on speaking time. Sex Roles, 44(9/10), 537-556.
McClure, E. B. (2000). A meta-analytic review of sex differences in facial expression processing and their
development in infants, children, and adolescents. Psychological Bulletin, 126(3), 424-453.
McKenzie, C. R., Liersch, M. J., & Yaniv, I. (2008). Overconfidence in interval estimates: What does
expertise buy you? Organizational Behavior and Human Decision Processes, 107(2), 179-191.
McLeod, P. L., Baron, R. S., Marti, M. W., & Yoon, K. (1997). The eyes have it: Minority influence in
face-to-face and computer-mediated group discussion. Journal of Applied Psychology, 82(5),
706-718.
Minson, J. A., & Mueller, J. S. (2012). The cost of collaboration: Why joint decision making exacerbates
rejection of outside information. Psychological Science, 23(3), 219-224.
Moore, D. A., & Healy, P. J. (2008). The trouble with overconfidence. Psychological Review, 115(2),
502-517.
Nemeth, C. J. (1986). Differential contributions of majority and minority influence. Psychological Review,
93(1), 23-32.
Odean, T. (1998). Volume, volatility, price, and profit when all traders are above average. The Journal of
Finance, 53(6), 1887-1934.
Önkal, D., Yates, J. F., Simga-Mugan, C., & Öztin, Ş. (2003). Professional vs. amateur judgment
accuracy: The case of foreign exchange rates. Organizational Behavior and Human Decision
Processes, 91(2), 169-185.
Park, S., & Budescu, D. V. (2015). Aggregating multiple probability intervals to improve calibration.
Judgment and Decision Making, 10(2), 130-143.
Pearson, E. S., & Tukey, J. W. (1965). Approximate means and standard deviations based on distances
between percentage points of frequency curves. Biometrika, 533-546.
Phillips, K. W., & Loyd, D. L. (2006). When surface and deep-level diversity collide: The effects on
dissenting group members. Organizational Behavior and Human Decision Processes, 99(2),
143-160.
Plous, S. (1995). A comparison of strategies for reducing interval overconfidence in group judgments.
Journal of Applied Psychology, 80(4), 443-454.
Russo, J. E., & Schoemaker, P. J. (1992). Managing overconfidence. Sloan Management Review, 33(2),
7-17.
Schachter, S. (1959). The psychology of affiliation. Stanford University Press.
Schachter, S., & Singer, J. (1962). Cognitive, social, and physiological determinants of emotional state.
Psychological Review, 69(5), 379-399.
Schulz-Hardt, S., Frey, D., Lüthgens, C., & Moscovici, S. (2000). Biased information search in group
decision making. Journal of Personality and Social Psychology, 78(4), 655-669.
Schulz-Hardt, S., Jochims, M., & Frey, D. (2002). Productive conflict in group decision making: Genuine
and contrived dissent as strategies to counteract biased information seeking. Organizational
Behavior and Human Decision Processes, 88(2), 563-586.
Shrout, P. E., & Bolger, N. (2002). Mediation in experimental and nonexperimental studies: New
procedures and recommendations. Psychological Methods, 7(4), 422-445.
Simmons, J. P., Nelson, L. D., Galak, J., & Frederick, S. (2011). Intuitive biases in choice versus
estimation: Implications for the wisdom of crowds. Journal of Consumer Research, 38(1), 1-15.
Smith-Lovin, L., & Brody, C. (1989). Interruptions in group discussions: The effects of gender and group
composition. American Sociological Review, 54(3), 424-435.
Sniezek, J. A., & Henry, R. A. (1989). Accuracy and confidence in group judgment. Organizational
Behavior and Human Decision Processes, 43(1), 1-28.
Sniezek, J. A. (1990). A comparison of techniques for judgmental forecasting by groups with common
information. Group & Organization Management, 15(1), 5-19.
Sniezek, J. A. (1992). Groups under uncertainty: An examination of confidence in group decision making.
Organizational Behavior and Human Decision Processes, 52(1), 124-155.
Soll, J. B., & Klayman, J. (2004). Overconfidence in interval estimates. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 30(2), 299-314.
Spetzler, C. S., & Stael von Holstein, C. A. S. (1975). Exceptional paper: Probability encoding in decision
analysis. Management Science, 22(3), 340-358.
Staats, B. R., Milkman, K. L., & Fox, C. R. (2012). The team scaling fallacy: Underestimating the
declining efficiency of larger teams. Organizational Behavior and Human Decision
Processes, 118(2), 132-142.
Tindale, R. S., & Larson, J. R. (1992). Assembly bonus effect or typical group performance? A comment
on Michaelsen, Watson, and Black (1989). Journal of Applied Psychology, 77(1), 102-105.
Torchia, M., Calabro, A., & Huse, M. (2011). Women directors on corporate boards: From tokenism to
critical mass. Journal of Business Ethics, 102(2), 299-317.
Tost, L. P., Gino, F., & Larrick, R. P. (2013). When power makes others speechless: The negative impact
of leader power on team performance. Academy of Management Journal, 56(5), 1465-1486.
Triandis, H. C., Kurowski, L. L., & Gelfand, M. J. (1994). Workplace diversity. In H. C. Triandis,
M. D. Dunnette, & L. M. Hough (Eds.), Handbook of industrial and organizational psychology
(Vol. 4, 2nd ed., pp. 769-827). Palo Alto, CA: Consulting Psychologists Press.
Tsai, C. I., Klayman, J., & Hastie, R. (2008). Effects of amount of information on judgment accuracy and
confidence. Organizational Behavior and Human Decision Processes, 107(2), 97-105.
Tsui, A. S., Egan, T. D., & O'Reilly III, C. A. (1992). Being different: Relational demography and
organizational attachment. Administrative Science Quarterly, 549-579.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science,
185(4157), 1124-1131.
Wegge, J., Roth, C., Neubach, B., Schmidt, K. H., & Kanfer, R. (2008). Age and gender diversity as
determinants of performance and health in a public organization: The role of task complexity and
group size. Journal of Applied Psychology, 93(6), 1301-1313.
Williams, M., & Polman, E. (2014). Is it me or her? How gender composition evokes interpersonally
sensitive behavior on collaborative cross-boundary projects. Organization Science, 26(2),
334-355.
Wolfers, J. (2006). Diagnosing discrimination: Stock returns and CEO gender. Journal of the European
Economic Association, 4(2/3), 531-541.
Woolley, A. W., Chabris, C. F., Pentland, A., Hashmi, N., & Malone, T. W. (2010). Evidence for a
collective intelligence factor in the performance of human groups. Science, 330(6004), 686-688.
Van Knippenberg, D., De Dreu, C. K., & Homan, A. C. (2004). Work group diversity and group
performance: An integrative model and research agenda. Journal of Applied Psychology, 89(6),
1008-1022.
Van Knippenberg, D., & Schippers, M. C. (2007). Work group diversity. Annual Review of Psychology,
58, 515-541.
Van Vugt, M., & Iredale, W. (2013). Men behaving nicely: Public goods as peacock tails. British Journal
of Psychology, 104(1), 3-13.
Whyte, G. (1993). Escalating commitment in individual and group decision making: A prospect theory
approach. Organizational Behavior and Human Decision Processes, 54(3), 430-455.
Zarnoth, P., & Sniezek, J. A. (1997). The social influence of confidence in group decision making.
Journal of Experimental Social Psychology, 33(4), 345-366.