When, and Why, Do Teams Beneﬁt from Self-Selection?
Mira Fischer Rainer Michael Rilke B. Burcin Yurtoglu
September 25, 2021
We investigate the eﬀect of team formation and task characteristics on performance in
high-stakes team tasks. In two natural ﬁeld experiments, we found that randomly assigned
teams performed signiﬁcantly better than self-selected teams in a task that allowed for an
unequal work distribution. If the task required the two team members to contribute more
equally, the eﬀect was reversed. Investigating mechanisms, we observe that teams become
more similar in terms of ability and cooperate better when team members can choose each
other. We show how diﬀerent levels of skill complementarity across tasks may explain our
results: If team performance largely depends on the abilities of one team member, random
team assignment may be preferred because it leads to a more equal distribution of skills across
teams. However, if both team members’ abilities play a signiﬁcant role in team production,
the advantage of random assignment is reduced, and the value of team cooperation increases.
Keywords: Team Performance, Self-selection, Field Experiment, Education
JEL Classiﬁcation: I21, M54, C93
Fischer: WZB Berlin Social Science Center, Reichpietschufer 50, 10115 Berlin, Germany, email:
mira.ﬁscher@wzb.eu; Rilke: WHU - Otto Beisheim School of Management, Economics Group, Burgplatz
2, 56176 Vallendar, Germany, email: email@example.com; Yurtoglu: WHU - Otto Beisheim School of Manage-
ment, Finance Group, Burgplatz 2, 56176 Vallendar, Germany, email: firstname.lastname@example.org. This paper
analyses two natural ﬁeld experiments. The ﬁeld experiments were pre-registered with the code AEARCTR-
0002757 and AEARCTR-0003646 under the title "Peer selection and performance - A ﬁeld experiment in higher
education". We thank Steﬀen Loev, Marek Becker, and Andrija Denic for their extremely helpful assistance with
the data. We also thank Bernard Black, Robert Dur, Ayse Karaevli, Simeon Schudy, Gari Walkowitz, participants
of the Advances with Field Experiments Conference in Boston, and seminar participants at the Higher School
of Economics in Moscow, Humboldt University of Berlin, University of Trier, University of Duisburg-Essen,
University of Mannheim, Burgundy School of Business in Dijon, University of Amsterdam, and WHU - Otto
Beisheim School of Management for their helpful comments and suggestions on earlier versions of this paper.
Financial support by Deutsche Forschungsgemeinschaft through CRC TRR 190 (project number 280092119) is gratefully acknowledged.
1 Introduction

In today’s highly complex economic environment, cooperation among individuals is crucial
for organizational success. As businesses become increasingly global and cross-functional, the
need for teamwork has been growing in all domains of work and life (O’Neill and Salas, 2018;
Cross et al., 2016). Indeed, ﬁrms and organizations create value by providing mechanisms
for people to work together, and to take advantage of complementarities in their skills and
interests (Lazear and Oyer, 2012). The nature and the eﬀectiveness of teamwork in a variety
of productive activities matter for outcomes in diverse settings, ranging from entrepreneurial
ventures (Reagans and Zuckerman, 2019) to the mutual fund industry (Patel and Sarkissian,
2017), and from medical practices (Geraghty and Paterson-Brown, 2018) to research projects
seeking to achieve scientiﬁc breakthroughs (Wuchty et al., 2007).
Economists and management scholars have extensively studied the influence of various forms of
team incentives (e.g., team bonuses or tournaments) on team performance, while recognizing
the importance of cooperation in teams. Although research has shown that team bonuses and
team piece rates tend to have a positive eﬀect on productivity (e.g., Englmaier et al., 2018;
Friebel et al., 2017; Hamilton et al., 2003; Erev et al., 1993), the evidence on the eﬀects of team
tournament incentives on performance has been inconclusive (e.g., Delfgaauw et al., 2019, 2018;
Bandiera et al., 2013). Moreover, because the underlying team tasks these previous studies
examined varied, the transferability of existing ﬁndings to diﬀerent types of tasks is limited. For
example, while some team tasks may require one person to be the team’s main driver, other
tasks may require all team members to pull in the same direction. Thus, as diﬀerent team
tasks require diﬀerent team compositions, which team assignment mechanism is used can have a
substantial impact on team performance.
Two potential mechanisms through which the team assignment process aﬀects team performance
are the composition and the motivation of teams. For example, when people are allowed to
choose their teammates, they match with people they like (e.g., Currarini et al., 2009; Leider
et al., 2009), but they also trade oﬀ both the pecuniary beneﬁts of better cooperation and
the non-pecuniary beneﬁts of working in teams with friends against the pecuniary beneﬁts of
working with higher-ability team members (Bandiera et al., 2013; Hamilton et al., 2003).1
Our study analyzes how self-selection and random assignment inﬂuence composition, cooperation,
and performance of teams on diﬀerent team tasks using two natural ﬁeld experiments. We argue
that the impact of the team formation process hinges on the degree of skill complementarity
among the team members and on the collaborative eﬀorts required to perform well on a particular
team task. When the team task requires the team members’ abilities to be substitutes, which
renders collaboration relatively unimportant, we expect to ﬁnd that randomly assigned teams
perform better. In such cases, self-selection is detrimental to average team performance, because
it leads to a concentration of skills in some teams. Conversely, when performing well requires high
levels of skill complementarity and collaboration, we hypothesize that self-selection is beneficial
for average team performance.
We embedded the experiments in a mandatory microeconomics course for ﬁrst-year undergraduate
students at a major German business school. The course consisted of two parallel study groups
who were receiving the same course content from the same instructor. In the winter quarters of
2017/18 and 2018/19, two cohorts of students were randomly assigned to those study groups. In
one class, students were allowed to choose a teammate during the ﬁrst week of class (treatment
Self ). In the other class, students were randomly assigned to a team of two during the ﬁrst week
of class (treatment Random).
The teams had to work on two types of high-stakes tasks that varied in the distribution of the
work required to achieve a high level of performance that counted towards students’ course
grades: either a written task that required the team members to submit a written team solution,
or a video task that required the team members to submit a videotaped team solution in which
each of the team members was equally visible. The teams’ scores depended solely on the accuracy
of their solutions. Since the written task required the team to submit a joint written solution,
the contributions of the individual team members could be unequal. By contrast, the video task
required the two team members to be equally visible.
We ﬁnd that compared to teams that were randomly assigned, teams that were self-selected
were more homogeneous in terms of their abilities, and had higher levels of perceived team
1 Laboratory experiments have examined the link between different group formation mechanisms and cooperation
in cooperation games. This literature has shown that contribution levels in endogenously formed groups are similar
to those in groups with exogenous matching (e.g., Gächter and Thöni, 2005; Guido et al., 2019; Chen, 2017).
cooperation. Furthermore, our results show that self-selected teams performed signiﬁcantly worse
than randomly assigned teams on the written task, but tended to be better on the video task.
These findings can be explained by a simple formal model that demonstrates that the benefits of
self-selection in terms of the homogeneity of the team members’ abilities and the motivation
of the team members come into play only if the contributions and the cooperation of both
team members are needed to complete the task. However, if a task can be solved by one main
contributor, so that the ability of the other team member and their cooperation are of little
importance, random assignment may lead to superior average team performance, as it
generally results in a more equal distribution of abilities across teams. In other words, if the
skills needed to perform a task are substitutable, the task is, on average, performed better by
randomly assigned teams; whereas if those skills are complementary, and the level of cooperation
required to complete the task is sufficiently high, the task is, on average, performed better by
self-selected teams.
This study adds to the small body of existing work on the consequences of team assignment
mechanisms in real-world settings. Chen and Gong (2018) found that university students who
self-selected their teammates performed better on a presentation task than students who were
randomly assigned to teams. Likewise, Dahlander et al. (2019) found that students who could
freely choose with whom they worked performed better when they were given an entrepreneurial
task than another group of students who were free to choose their entrepreneurial task.
While Chen and Gong (2018) showed that self-selection led to a process of team formation that was
based on the members’ social connections rather than on their skills, neither they nor Dahlander
et al. (2019) examined the mechanisms that underlie their findings. The question of whether
their results can be generalized to other settings and to other types of team tasks thus remains
open. Our setting, by contrast, allows us to shed light on several important mechanisms (task
characteristics, team composition, and cooperation) and to advance a straightforward explanation
for the existing ﬁndings.
Our study makes three contributions to the literature. First, using two randomized natural
ﬁeld experiments, we tested how the self-selection of teams aﬀected the composition of the team
members’ abilities, their cooperation levels, and their performance across diﬀerent tasks. Second,
2 In a laboratory experiment, Büyükboyaci and Robbett (2019) investigated the interaction of complementarity
of skills and specialization. They found that when specialization was not possible, self-selection had no effect
on performance; and that the option to specialize had a positive effect on performance, which was significantly
magnified when agents had a say in who joined their team.
we demonstrate that self-selection (compared to random selection) can have opposite effects on a
team’s performance depending on the task’s production function. Finally, our study combines
these insights to oﬀer an explanation for why self-selected teams may be expected to perform
better than randomly assigned teams on highly collaborative tasks, but not on other types of tasks.
The paper proceeds as follows: Section 2 presents a slightly formalized exposition of how random
team assignment versus self-selection may aﬀect team performance on diﬀerent tasks; Section 3
describes the ﬁeld experiment; Section 4 presents the results; and Section 5 concludes.
2 Team performance on diﬀerent tasks: Relative importance of
abilities and collaboration
Though our ﬁeld setting did not allow us to impose a speciﬁc production function for the team
tasks, and we do not intend to test a theoretical model of team performance, we use a short,
slightly formalized exposition that captures the key features of our experiment to facilitate
the development of our hypotheses. To illustrate how the composition of the team members’
abilities and the intensity of their collaboration may aﬀect the team’s performance depending on
the type of task they are engaged in, we assume a hypothetical setting that involves two team
tasks that vary in their production function: Two individuals, denoted as i and j, form a team.
Each teammate k ∈ {i, j} has a uni-dimensional cognitive ability level a_k, and the team can invest
collaborative effort q.
We assume that a team’s output, which determines their score, s, is given by:

s = max(a_i, a_j)^α · min(a_i, a_j)^β · q^γ.

The parameters α, β, and γ represent the elasticities of the score with respect to the ability of the more able
teammate, the ability of the less able teammate, and the collaborative effort, respectively. In other words,
these parameters measure the responsiveness of the team’s output to a change in the levels of
the team members’ abilities and of the collaborative effort. This exposition allows us to capture
the intuition that the division of labor and the level of collaboration that different tasks require may differ.
To illustrate this intuition, we discuss two extreme examples. If the structure
of a task requires that both team members implement a solution together, even if one team
member’s ability is more important in finding the solution (α > β), the abilities of both team
members, as well as the quality of their collaboration, matter for team performance; thus,
α, β, γ > 0. Therefore, the team’s score on this kind of task – i.e., a task in
which the team members’ abilities and collaborative efforts are complements – is determined by:

s_C = max(a_i, a_j)^α · min(a_i, a_j)^β · q^γ.
However, if the task is best done by the most able person alone, the ability of the ablest team
member may be of paramount importance for team performance; in such cases, the ability
of the other team member and team collaboration may not matter. Under these assumptions,
β = 0 and γ = 0. The team’s score on this kind of task – i.e., a task in which the team
members’ abilities are substitutes – is thus given by:

s_NC = max(a_i, a_j)^α.
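To make the two stylized production functions concrete, here is a minimal sketch (our illustration, not the authors’ code; the parameter values α = 0.5, β = 0.3, γ = 0.2 are hypothetical):

```python
# Illustrative sketch of the two stylized production functions from Section 2.
# The elasticity values below are hypothetical, not estimated from the data.

def score_complements(a_i, a_j, q, alpha=0.5, beta=0.3, gamma=0.2):
    """Complements task: s_C = max^alpha * min^beta * q^gamma."""
    return max(a_i, a_j) ** alpha * min(a_i, a_j) ** beta * q ** gamma

def score_substitutes(a_i, a_j, alpha=0.5):
    """Substitutes task: s_NC = max^alpha (i.e., beta = gamma = 0)."""
    return max(a_i, a_j) ** alpha

# On the substitutes task, the weaker member's ability is irrelevant:
print(score_substitutes(9, 2) == score_substitutes(9, 9))  # True
# On the complements task, the weaker member's ability and effort q both matter:
print(score_complements(9, 9, q=2) > score_complements(9, 2, q=1))  # True
```

The sketch shows why γ > 0 makes collaborative effort q a direct input into the score on the complements task, while it drops out entirely on the substitutes task.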
If the score of one individual depends positively on the productivity of their teammate, there is
an incentive for subjects to match with a high-ability teammate. If the matching is two-sided –
i.e., if all individuals can actively search for a teammate – the subjects will assortatively match
by ability. This tendency results in high-ability individuals forming teams with other high-ability
individuals, and low-ability individuals forming teams with other low-ability individuals. If the
productivity of one individual additionally depends on the team’s collaborative eﬀorts, there
is an incentive to choose teammates who are likely to put in considerable eﬀort. In line with
this theoretical result, the empirical literature has suggested that when subjects are allowed to
choose their teammates, they tend to choose teammates who have similar abilities, and with
whom they are acquainted (Leider et al., 2009; Ai et al., 2016; Chen and Gong, 2018). Based
on this reasoning, we would expect to ﬁnd that the maximum ability is, on average, lower in
self-selected teams than in randomly assigned teams, because high-ability individuals tend to
cluster in some of the teams. Nevertheless, we would expect the levels of collaborative eﬀort to
be higher in self-selected teams, as the team members may enjoy working together more, and
may thus work together more productively than the team members in randomly assigned teams.
Combining the above strands of reasoning, we expect to observe that the performance of
randomly assigned teams is, on average, better and more heterogeneous if they are performing
a task in which the team members’ abilities are substitutes, and collaboration is unimportant.
Furthermore, we expect to ﬁnd that the beneﬁt of randomly assigned teams over self-selected
teams is smaller when they are performing a task in which the team members’ abilities are
complements, and collaboration matters. Thus, we also expect to observe smaller diﬀerences in
the performance of the randomly assigned teams and the self-selected teams on the video task
than on the written task. If α is sufficiently small relative to β and γ (i.e., if the ability of the
lower-ability team member and the collaborative effort of both team members matter sufficiently
for the team to perform well), then self-selected teams may even outperform randomly assigned teams
on the latter task.
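The prediction of this section can be summarized in a minimal simulation (our illustration, not the authors’ model code; the uniform ability distribution, the pairing of adjacent ability ranks as a stand-in for assortative self-selection, and the collaboration premium q = 2 for self-selected teams are all assumptions):

```python
# Minimal sketch of Section 2's prediction under hypothetical parameters:
# random assignment wins on the substitutes task, while self-selection can win
# on the complements task once its collaboration advantage is large enough.
import random

random.seed(0)
N = 10_000  # individuals, i.e., 5,000 teams per assignment mechanism
abilities = [random.uniform(1, 10) for _ in range(N)]

def mean_scores(pairs, q, alpha=0.5, beta=0.3, gamma=0.2):
    subs = [max(a, b) ** alpha for a, b in pairs]                # s_NC
    comp = [max(a, b) ** alpha * min(a, b) ** beta * q ** gamma  # s_C
            for a, b in pairs]
    return sum(subs) / len(subs), sum(comp) / len(comp)

# Random assignment: shuffle and pair adjacent individuals; baseline q = 1.
shuffled = random.sample(abilities, N)
rand_pairs = list(zip(shuffled[::2], shuffled[1::2]))
rand_subs, rand_comp = mean_scores(rand_pairs, q=1)

# Self-selection: assortative matching by ability rank, better cooperation q = 2.
ranked = sorted(abilities)
self_pairs = list(zip(ranked[::2], ranked[1::2]))
self_subs, self_comp = mean_scores(self_pairs, q=2)

print(f"substitutes task:  random {rand_subs:.2f} vs. self {self_subs:.2f}")
print(f"complements task:  random {rand_comp:.2f} vs. self {self_comp:.2f}")
```

Under these assumed parameters, the simulation reproduces the qualitative pattern: randomly assigned teams score higher on the substitutes task, because their ablest members are spread across teams, while self-selected teams score higher on the complements task once their cooperation advantage comes into play.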
3.1 Context and background
The ﬁeld experiment was conducted with students of the BSc program at a well-known German
business school between October 2017 and April 2019. The business school oﬀers university
education in business administration, with degrees at the BSc, MSc, MBA, and PhD levels,
as well as executive education programs. The school has around 2,000 students. At the BSc
level, the school oﬀers the International Business Administration program. In academic year
2017/2018, a total of 672 students were enrolled in the program, 26% of whom were female.
Studying the impact of team formation mechanisms on team performance requires an environment
in which participants can choose teammates, in which the selection mechanism can be exogenously
varied, and in which team performance can be objectively measured. The environment of the
business school class we studied fulﬁlled all of these criteria, while allowing us to maintain a high
degree of control. Furthermore, to observe self-selection not only on demographic characteristics,
but also on ability, we needed a sample of participants who were already acquainted with each
other. This was assumed to be the case for our student subjects, given that at the point in time
when they were attending the class, they had already completed courses together, and had ample
opportunities to get to know each other through extracurricular activities (e.g., through student
societies and sports teams; and through involvement in music, drama, political campaigning, or
community work) that took place at the business school.
Figure 1: Sequence of events and data sources
[Timeline: Team Task 1 → Team Task 2 → Exam → Survey. Survey variables: team work quality, course quality, performance beliefs, social preferences, analytical test.]
Figure displays the variables and the sequence of events in the experiments. The sequence of events is
the same for both Experiment I and Experiment II.
3.2 Experimental timeline and treatments
The ﬁeld experiments took place in the Microeconomics I course, with two cohorts of ﬁrst-year
students in the BSc program in International Business Administration participating. In each
cohort, students were randomly assigned to two separate classes, both taught by the same
instructor (one in the morning and one in the afternoon of the same day). During the ﬁrst week,
students learned that to fulﬁl the course requirements, they had to complete two tasks in teams
of two, and to pass an exam at the end of the quarter. As the instructor did not announce any
task-speciﬁc details about the team tasks in the ﬁrst week, the students only knew that these
tasks were take-home assignments that they had to complete during study hours, and that they
would have to complete both tasks together with the same team member, because re-matching
was not permitted.
For each cohort, in one class – i.e., the Self treatment – the instructor told the students on the
ﬁrst day to form a team with a fellow student of their choice. The students had to write down
their team’s composition and submit it to the instructor before the second meeting. In the other
class – i.e., the Random treatment – the students were randomly assigned to a team of two, and
they were informed of their team’s composition by email before the second meeting.
The ﬁrst team task was assigned to the students in mid-November, and had to be completed by
early December. The second team task was assigned to the students in early December, and
had to be completed by the end of January. The ﬁnal exam took place in March. During the
course, the students received no feedback on their performance on the team tasks. After the
ﬁnal exam, the feedback consisted only of the students’ overall course grades. Upon request, the
students could also receive detailed information about both their team’s performance on the
different tasks, and their individual performance on the exam. Figure 1 displays the timeline of the experiments.
In the winter quarter of 2017/18 (Experiment I, n=190, 31% female) the students completed two
written team tasks. In the winter quarter of 2018/19 (Experiment II, n=192, 29% female), the
ﬁrst task was a written task, and the second task was a video task. Across the two experiments,
the ﬁrst task was identical, and the students were supposed to submit their solutions in written
form. By contrast, the second task diﬀered across the two experiments, although it had very
similar content. In Experiment I, the students were supposed to submit their solutions in
written form; whereas in Experiment II, students were required to videotape their solution. This
design allowed us to identify interaction eﬀects of the team formation mechanism with the task
characteristics, as well as heterogeneous trends in collaboration across the treatments.
Given that we expected the eﬀect of the team assignment mechanism on team performance to
hinge on the degree to which the abilities of both team members and their levels of collaboration
mattered for productivity, we aimed to design two types of tasks that required the same levels
of cognitive ability, but that diﬀered in the extent to which they required both inputs and
collaboration from both team members. We chose to use microeconomics exercise sets that
required very similar cognitive skills to complete, but for which the solutions were submitted in
diﬀerent forms: i.e., through a written or a videotaped presentation. The students’ submissions
for both types of tasks were evaluated based on whether they gave correct, concise, and coherent
answers to the microeconomics problems. However, the instructions for the video task contained
the additional requirement that both team members present part of the solution.
In the written
tasks, the students were required to reach an agreement about which of the teammates’ solutions
was best. However, the students could produce the correct solution by themselves. In contrast,
for the solution to the video task to be considered acceptable, the students had to jointly prepare
the presentation, and each student had to correctly present part of the solution, which required
a higher level of cooperation and inputs from both team members.
3 The exercise sets appear in the online appendix.
The written task consisted of problems for which students had to submit written solutions. These
problems called for the application of the theoretical knowledge that the students had acquired
during lectures, such as analyzing demand patterns, calculating market outcomes, or designing
pricing strategies. Providing a solution involved explaining the theoretical background, applying
a correct approach to the solution, and performing a series of calculations that possibly included
one or two graphs. In addition, the instructions for the written tasks speciﬁed that the students
had to present their written answers clearly. The answers could be either typed or handwritten,
but they had to be legible.
The video task consisted of questions for which students had to submit their solutions in a
ﬁve-minute video. The questions required a level of microeconomics skills very similar to that
required in the written task, and the solutions also consisted of explaining the theoretical
background, applying a correct approach to the solution, and performing a series of calculations.
The teams were allowed to use whiteboards, graphs, illustrations, and slides to make their videos
more eﬀective. In addition, the instructions speciﬁed that the video should be comprehensible;
i.e., that the presenters’ speech should be understandable. The instructions further stated that
the teams could use their smartphones to produce the video, and that the technical quality of the
video itself would not be graded. Finally, and crucially, the instructions stated that both team
members, along with their individual contributions, had to be visible in the video. The lecturer
explained that videos in which only one team member could be seen giving the presentation
were not acceptable. All students’ submissions met this criterion, and were thus evaluated for
their correctness, conciseness, and coherence.
Data for the study were gathered from three sources (see Figure 1). The pre-experiment data
contained the students’ high school performance (GPA) and their performance on the business
school’s admission tests. Both the GPA and the results of the admission tests were independent
measures of each student’s academic ability prior to the experiment, as they were not aﬀected
by their peers at the business school. Moreover, our endline data included information on each
student’s performance on the two team tasks and on the ﬁnal course exam at the end of the
quarter. The data also included information on each student’s perceptions of the cooperative
behavior within their team, their relationship with their team member, their evaluation of
the teacher’s performance in the course, and an incentivized measure of pro-sociality collected
through a post-experiment survey that was conducted after the ﬁnal exam and before the
students received feedback about their performance.
3.4.1 Pre-experiment ability measures
Our pre-experiment ability measures came from the business school’s student registry; speciﬁ-
cally, from its admissions data.
The business school’s program, which is known to be highly
competitive, uses a selective admissions procedure. In the ﬁrst step of the admissions process,
applicants to the BSc program provide basic demographic information and their high school
grade point average (GPA).
The admissions oﬃce ranks applicants by their GPA, and invites the
top 10% to an admissions day, where the applicants take a written test designed to measure their
analytical reasoning (quantitative) skills. They also take an oral test that has a presentation,
a group discussion, and an interview component, and is intended to measure the applicants’
communication, social, and problem-solving skills, and to assess whether they are a good ﬁt for
the program. The components of the oral test are each rated by two independent evaluators,
whose ratings are then averaged.
For our analysis, we will use the students’ GPA scores, the quantitative part of the admission
test (henceforth called the Analytical Test), and an aggregate measure of the oral part of the
admission test (henceforth called the Admission Test).6
As we did not have access to an IRB at the beginning of the project, we could not obtain formal IRB approval.
In the students’ contract with the school, they consent to the anonymous processing of their data. The agreement
stipulates that the university can use the administrative data for statistical and scientiﬁc purposes. Moreover, the
variation was implemented with the permission of the business school’s Dean of Studies and is within the normal
range of changes the private business school regularly implements to improve its teaching.
The German GPA (Abiturnote) ranges from 4.0 (sufficient) to 1.0 (excellent) and is the most important
criterion for university admission in Germany (e.g., Fischer and Kampkötter, 2017). Our entire sample had an
average GPA of 1.79 (SD=.504). For our analysis, we inverted the GPA so that higher values indicated better
high school performance.
6For each student, we averaged the scores over all components and standardized them.
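The construction described in footnote 6 – average each student’s component scores, then z-standardize across students – can be sketched with hypothetical data (the scores below are invented for illustration):

```python
# Sketch of the standardization in footnote 6 with hypothetical component scores.
scores = [[70, 80, 75], [60, 65, 55], [90, 85, 95]]  # components per student

# Step 1: average the components for each student.
means = [sum(s) / len(s) for s in scores]
# Step 2: z-standardize the averages across students.
mu = sum(means) / len(means)
sd = (sum((m - mu) ** 2 for m in means) / len(means)) ** 0.5
z = [(m - mu) / sd for m in means]

print([round(v, 2) for v in z])  # standardized scores, mean 0 and unit variance
```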
3.4.2 Team outcomes
Each student’s performance on both team tasks and the individual exam determined their ﬁnal
grade. Each team received a common grade for their performance per task, and each task had a
weight of 15% toward the individual ﬁnal grade. The exam was written at the end of the course,
and contributed 70% to the ﬁnal grade. A teaching assistant who had previous experience with
the course, but who was unaware that an experiment was taking place, graded the students’
performance on the written and the videotaped team tasks, and on the exam.7
3.4.3 Post-experiment survey
On the day following the final exam, we invited the students to take part in an online post-experiment
survey.8 This survey elicited the students’ perceptions of the quality of the collaboration
in their team, of their relationship with their team member, and of the teaching. To
incentivize participation, we used a raﬄe in which one survey participant was picked randomly
to receive a 200 EUR reward. For an incentivized measure of the students’ pro-sociality, we
asked the students what fraction of this amount they would like to donate to UNICEF if they won.

4 Results

We begin our analysis by establishing the internal validity of our experimental approach. We
show that the student sample did not diﬀer between the treatments on any observable variable
elicited before the experiments. We then present the experimental results together with an
analysis of the eﬀects of the two assignment mechanisms on team performance, our primary
outcome measure. Next, we show how the two assignment mechanisms aﬀected the team
formation process, while focusing on the eﬀects of these mechanisms on the composition and the
cooperation levels of the teams.
As we were concerned that the ratings of the video task might suﬀer from low reliability due to the video
format, we subjected them to a validation exercise. In this exercise, two additional independent raters rated the
videos based on the same instructions as those used by the teaching assistant who made the original assessments.
The additional ratings had correlations of 0.72 and 0.71 with the original rating, and a correlation of 0.81 between
each other. Thus, the reliability of the presentation ratings can be considered satisfactory. All results were found
to be robust to using these additional ratings.
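The validation exercise above can be illustrated with a small sketch (hypothetical ratings, not the study’s data): three independent noisy ratings of the same latent video quality produce pairwise correlations of roughly the reported magnitude.

```python
# Illustration of inter-rater reliability with simulated (hypothetical) ratings:
# each rater observes the same latent video quality plus independent noise.
import numpy as np

rng = np.random.default_rng(1)
true_quality = rng.normal(size=50)                        # latent quality of 50 videos
original = true_quality + rng.normal(scale=0.5, size=50)  # teaching assistant's rating
rater_a = true_quality + rng.normal(scale=0.5, size=50)   # additional rater A
rater_b = true_quality + rng.normal(scale=0.5, size=50)   # additional rater B

# Rows are raters; off-diagonal entries are the pairwise inter-rater correlations.
corr = np.corrcoef([original, rater_a, rater_b])
print(np.round(corr, 2))
```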
8The survey was accessible until the exam grades were published, which usually takes up to six weeks.
Table 1: Randomization checks
Experiment I Experiment II Experiment I + II
Variable Self Random p-value Self Random p-value Self Random p-value
GPA .057 -.053 .195 .074 -.084 .523 .066 -.068 .134
Analytical test .022 -.010 .889 -.056 .064 .428 -.019 .026 .666
Admission test -.039 .035 .479 .037 -.042 .591 .001 -.002 .864
% female .287 .323 .593 .356 .220 .038 .323 .273 .282
Note: Descriptive statistics of pre-experiment data, admission test scores and pro-sociality. GPA is inverted and
z-standardized, with a higher GPA indicating better school performance. The Analytical Test, and the Admission test
are z-standardized. The p-values are from a two-sided Mann-Whitney U (MWU) test comparing differences in
mean ranks between the two treatments. The p-values for the comparison of %female are from a
Unstandardized values (Table A.1) and a correlation matrix (Table A.3) appear in the appendix.
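The treatment comparisons in Table 1 rest on two-sided Mann-Whitney U tests. A sketch of such a check, with simulated (not the study’s actual) covariate data, using SciPy:

```python
# Sketch of a Table-1-style randomization check on simulated covariate data:
# a two-sided Mann-Whitney U test comparing a z-standardized pre-experiment
# covariate (here labeled "GPA") between the Self and Random treatments.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
gpa_self = rng.normal(0.06, 1.0, size=180)     # hypothetical Self group
gpa_random = rng.normal(-0.06, 1.0, size=180)  # hypothetical Random group

stat, p = mannwhitneyu(gpa_self, gpa_random, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")  # a large p is consistent with balance
```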
4.1 Randomization checks
Table 1 provides an overview of the properties of our sample in the treatments and the experiments.
We show separate summary statistics for Experiment I and Experiment II, and pooled statistics
for both experiments. The table shows that the randomization was successful in producing
highly similar groups based on observable characteristics, such as high school performance (GPA)
and performance on the admission tests. The only characteristic that diﬀered signiﬁcantly
between treatments in Experiment II was the percentage of female students (p = .038; see
Table 1). We therefore provide results from two regression specifications, both with and
without controlling for gender (and other observables).
4.2 Team performance
Our primary outcome measure, team performance, is the score that the teams received for their
work on two separate tasks during the quarter. We summarize our results in Figure 2, which
plots the standardized average team score for each task by treatment, and also shows individual
exam performance. The left panel shows the outcomes for Experiment I, while the right panel
shows the outcomes for Experiment II.
For Experiment I, in which the solutions to the ﬁrst and the second team tasks had to be
submitted in written form, the ﬁgure indicates that, on average, the teams in Random performed
better than the teams in Self. A non-parametric comparison of average team scores yielded
a significantly lower score for the teams in Self than for the teams in Random (p=.007,
9 Unless otherwise stated, all p-values are based on two-sided tests.
Figure 2: Team assignment, performance, and task characteristics
[Graph: standardized average team scores with 95% confidence intervals for Self and Random on the 1st team task, the 2nd team task, and the exam; left panel: Experiment I, right panel: Experiment II.]
Figure shows the average team performance (z-standardized) for the tasks in our experiments. The left panel
shows the results from Experiment I, while the right panel shows the results from Experiment II.
Mann-Whitney U test; hereafter, MWU test). The results of a non-parametric test for the
equality of variances between the treatments underlined this pattern, and showed that the
variance of team performance was significantly larger in the Self treatment (p=.002, Levene's test).
We also observed no change in performance over time. A comparison of the average
performance between the first and the second team tasks revealed no significant differences (Self:
p=.885, Random: p=.9291, MWU test).
First, for Experiment II, the figure indicates that the teams in Self performed worse on the
written task than those in Random, while the effect appears to have reversed when the teams
were working on the video task. Indeed, in the first team task of Experiment II, we replicated
the observed pattern of Experiment I. The teams in Self performed marginally significantly
worse than those in Random when the task was written (p=.064, MWU test), but this time the
variances were not significantly different (p=.194, Levene's test). Figure 2 appears to show that
the average team performance was higher in Self than in Random in the second team task (the
video task). However, the results of non-parametric tests comparing the mean and the variance
of average team performance between Self and Random did not reject the null hypothesis that
the performance in both treatments was equal (p=.156, MWU test; p=.381, Levene's test).
A separate analysis of the first and the second team tasks yielded similar significant differences in averages
(p=.068, MWU test), and (marginally) significant differences in variances
(p=.001, Levene's test). A detailed pairwise comparison appears in
Table A.2 in the appendix.
This time, however, we observed a large change in performance between the two tasks. The
performance of the self-selected teams was marginally significantly better on the video task than
on the written task (p=.0790, Wilcoxon Signed Rank test; hereafter, WSR test), while the
performance of the randomly assigned teams was marginally significantly worse on the video
task than on the written task (p=.0556, WSR test).
More evidence for this change in behavior across treatments was provided by a difference-in-differences
analysis. Calculating the difference between the first and the second team tasks for both experi-
ments and comparing it between treatments yielded a significant difference for Experiment
II (p=.0095, MWU test), but no significant difference for Experiment I (p=.9424).
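The two-step logic above — within-treatment change tested with a signed-rank test, then per-team gains compared across treatments — can be sketched as follows (illustrative Python on synthetic scores; team counts and effect sizes are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_teams = 60  # hypothetical number of teams per treatment

# Synthetic z-standardized team scores on the two tasks, mimicking the
# Experiment II pattern: Self improves on task 2, Random declines
self_t1, self_t2 = rng.normal(-0.2, 1, n_teams), rng.normal(0.1, 1, n_teams)
rand_t1, rand_t2 = rng.normal(0.2, 1, n_teams), rng.normal(-0.15, 1, n_teams)

# Within-treatment change: Wilcoxon signed-rank test on paired task scores
p_self = stats.wilcoxon(self_t2, self_t1).pvalue
p_rand = stats.wilcoxon(rand_t2, rand_t1).pvalue

# Difference-in-differences: compare per-team gains across treatments (MWU)
gain_self = self_t2 - self_t1
gain_rand = rand_t2 - rand_t1
p_did = stats.mannwhitneyu(gain_self, gain_rand, alternative="two-sided").pvalue
```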
Furthermore, the figure also shows that the exam performance was unaffected by the treatment.
Neither the average student performance nor the variance of the student performance on the
final exam differed significantly across treatments (Experiment I: p=.455, MWU test; p=.995,
Levene's test; Experiment II: p=.984, MWU test; p=.603, Levene's test). This finding indicates
that the team assignment mechanism did not have a spillover effect on exam performance. It
can also be seen as evidence that the effectiveness of teaching did not differ between the two
treatment groups, and, therefore, that the lecturer's behavior was unlikely to have influenced
the different levels of team performance.11
Second, we ran regressions controlling for pre-experiment observables to verify these observations.
To do so, we analyzed the teams' performance on the first (written) and the second (written
or video) team tasks. The first team task in both experiments was identical, with the students
submitting their work in writing. To test the influence of the task characteristics on the team
performance, we varied the second team task. In Experiment I, the teams had to submit their
solutions in written form, while in Experiment II, the teams had to submit video clips.
11 To check whether the students perceived the quality of the teaching differently between the two treatments,
we included four items in our post-experimental survey. We did not observe a significant difference between the
two experimental conditions for any of these questions. This finding supports our assumption that the teacher
had no influence on the study results. In Panel C of Table 6, we display the respective items and results.
Table 2: Predicting team performance
Performance on 1st team task Performance on 2nd team task
(Exp. I and II: written) (Exp. I: written, II: video)
Independent variables (1) (2) (3) (4) (5) (6)
1 if Self -0.415*** -0.473** -0.478** -0.107 -0.496** -0.521***
(0.141) (0.201) (0.201) (0.145) (0.201) (0.199)
1 if Experiment II -0.045 0.031 -0.392** -0.320*
(0.163) (0.149) (0.177) (0.163)
Self x Experiment II 0.114 0.029 0.775*** 0.728**
(0.283) (0.276) (0.288) (0.286)
GPA 0.117** 0.123**
1 if female -0.024 -0.250
Admission Test 0.006 -0.080
Constant 0.212*** 0.234** 0.240** 0.054 0.245*** 0.332***
(0.081) (0.114) (0.121) (0.089) (0.086) (0.107)
Observations 382 382 377 382 382 377
R-squared 0.043 0.044 0.067 0.003 0.041 0.067
Note: Columns (1) - (3) show OLS regressions of z-standardized team performance on the ﬁrst task.
In both experiments, the students had to submit a written solution to the task. Columns (4) - (6)
show OLS regressions of z-standardized team performance on the second task. In Experiment I, the
students had to submit a written solution to the task; while in Experiment II, the students had to
submit a video clip. The control variables are GPA, admission test scores, and gender. GPA and
Admission Test have been z-standardized. Standard errors clustered on teams are in parentheses.
Signiﬁcance indicators: ∗∗∗ p <.01, ∗∗ p <.05, ∗p <.1.
Table 2 shows the results of OLS regressions with standard errors clustered at the team level,
where the dependent variable is the team performance (z-standardized) on each of the two team
tasks separately. In Models (1)-(3), we predicted the team performance on the first team task.
Model (1) included only a dummy variable for the Self treatment (“1 if Self ”). The self-selected
teams performed, on average, .415 (p=.004; CI = [-.694; -.136]) standard deviations worse on
the first task than the randomly assigned teams. Model (2) included a dummy variable for
the experiment (“1 if Experiment II”) and an interaction term of the Self treatment and the
experiment (“Self x Experiment II”) to control for potential interactions. While both of
these control variables remained insignificant, the coefficient on the treatment dummy Self
remained significant and almost unchanged at -.473 (p=.02; CI = [-.870; -.077]), which indicates
that for the first task, the treatment effect was not significantly different across the experiments.
In Model (3), we included additional controls, and found that the treatment effect was not
affected by their inclusion (p=.018; CI = [-.875; -.082]). Interestingly, we found that
the students' GPAs, but not their admission test scores, predicted the team performance.
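Regressions with team-clustered standard errors, as in Table 2, can be sketched in a few lines. The following is an illustrative, minimal numpy implementation of OLS with CR1 cluster-robust standard errors on synthetic data — variable names, sample sizes, and the effect size are invented, and this is not the authors' estimation code:

```python
import numpy as np

def ols_cluster(X, y, groups):
    """OLS with CR1 cluster-robust standard errors, clustered on `groups`."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Meat of the sandwich: sum over clusters of (X_g' u_g)(X_g' u_g)'
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        v = X[groups == g].T @ resid[groups == g]
        meat += np.outer(v, v)
    n_clusters = len(np.unique(groups))
    dfc = (n_clusters / (n_clusters - 1)) * ((n - 1) / (n - k))  # CR1 correction
    cov = dfc * xtx_inv @ meat @ xtx_inv
    return beta, np.sqrt(np.diag(cov))

# Hypothetical example: regress scores on a Self dummy, two members per team
rng = np.random.default_rng(2)
teams = np.repeat(np.arange(100), 2)                      # 100 teams x 2 members
self_dummy = np.repeat(rng.integers(0, 2, 100), 2).astype(float)
score = -0.4 * self_dummy + rng.normal(size=200)          # invented treatment effect
X = np.column_stack([np.ones(200), self_dummy])
beta, se = ols_cluster(X, score, teams)
```

Clustering on teams matters here because both members of a team share the same team score, so observations within a team are not independent.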
Next, we studied the second team task. The regression results appear in Models (4)-(6). In
Model (4), we pooled observations from both experiments (ignoring the type of task), and
included only a treatment dummy. Consistent with the results of the non-parametric analysis,
we found no significant effect of self-selection, which suggests that a meaningful investigation of
the effects of the team assignment process on performance should take into account the task
characteristics. After we controlled for the experiment and its interaction with the treatment, we
found that the teams in the Self treatment in Experiment I performed .496 (p<.05; CI = [-.892;
-.100]) standard deviations worse on the second task than the teams in the Random treatment
(Model 5). We thus found very similar treatment effects for the first and the second tasks in
Experiment I, which suggests that there was no heterogeneous learning across treatments, and
that the ordering of the tasks did not matter. In addition, after adding up the first and the third
coefficients in Model (5), we found that in Experiment II, the teams in Self tended to perform
.279 standard deviations better than the teams in Random on the second task (a video). In line
with the non-parametric analysis, a joint F-test showed that this difference was not significant
(p=.1774). Adding additional control variables did not significantly change the coefficients.
Interestingly, we again found that GPA positively predicted the performance on the second team task.
4.2.1 Heterogeneity analysis
When we split the sample at the median high school GPA, we found (in Table 3) that the Self
treatment had significant negative effects on the team performance on the first written task
(Model 1) for both low-ability (p=.040; CI=[-.979; -.026]) and high-ability students (p=.001;
CI=[-.669; -.169]). These effects were not significantly different from each other.
Furthermore, the Self treatment had a significantly negative effect on the team performance of
both high-ability (p=.067; CI=[-.749; .05]) and low-ability (p<.01; CI=[-1.366;
-.198]) students on the second written task (Experiment I only). The negative effect of the
Self treatment was significantly larger for low-ability students than for high-ability students.
When we looked at the video task (only in Experiment II), we found that high-ability students
tended to perform better in the Self treatment than in the Random treatment; however, this
difference was not significant (p=.174; CI=[-.141; .73]). We did not find that the team
performance on the video task of low-ability students was significantly different between the
two treatments (p=.984; CI=[-.564; .573]). Overall, the results of this heterogeneity
analysis suggested that allowing for self-selection into groups harmed the performance of both
low- and high-ability students on the written task, but that low-ability students tended to suﬀer
more. However, self-selection did not harm the performance of low-ability students on the video
task, while it tended to beneﬁt the performance of high-ability students on this task.
Table 3: Heterogeneity analysis
Dependent variable: Performance on ...
... 1st team task ... 2nd team task
(Exp. I + II: written) (Exp. I: written) (Exp. II: video)
Independent variables (1) (2) (3)
1 if Self -0.491** -0.782*** -0.006
(0.238) (0.296) (0.293)
Self x GPA >Median 0.046 0.411 0.303
(0.237) (0.268) (0.313)
1 if GPA >Median 0.156 0.037 0.037
(0.163) (0.283) (0.265)
Constant 0.160 0.348** -0.048
(0.142) (0.152) (0.207)
Self + (Self x GPA >Median) -0.445*** -0.371* 0.297
(0.133) (0.200) (0.217)
Observations 377 189 188
R-squared 0.070 0.130 0.043
This table displays the results of an OLS regression analysis (robust standard errors clustered at the team level in
parentheses). All speciﬁcations include GPA, Admission Test, and female as control variables. All scores have been
z-standardized. Signiﬁcance indicators: ∗∗∗ p <.01, ∗∗ p <.05, ∗p <.1.
4.3 Team formation
4.3.1 Ability composition
In this subsection, we investigate how allowing team members to self-select aﬀected the team
composition, and how the team composition aﬀected the team performance. We begin by
looking at how students (in the Self treatment) formed teams. To do so, we used pre-experiment
Table 4: Self selection and composition of teams
Variable Self Random Simulation Random
GPA .978 1.093 ∗∗ 1.118∗∗∗
% female .204 .348 ∗∗ .409 ∗∗
Analytical Test 1.012 1.150 ∗∗ 1.167∗∗∗
Admission Test 1.237 1.072 1.096
Note: The table displays the average absolute differences between teammates
on the pre-experiment observables. Simulation Random denotes the average
absolute difference for the respective variable from a simulation in which we
pairwise matched all students within a treatment within an experiment. The
stars indicate the two-sided significance level of a WSR test comparing
the observed score from Self with the simulated value from Simulation Random,
or of a MWU test comparing the observed score from Self
with the observed score from Random. Significance indicators: ∗∗∗ p<.01, ∗∗ p<.05, ∗ p<.1.
registry data on each student’s ability (measured as their performance on the various tasks in
the admission test and their GPA), gender, and an incentivized measure of pro-sociality from
the post-experiment survey. For each team and measure m, we calculated the absolute difference
|m_i − m_j| between both teammates, where i and j are teammates.
Thus, lower absolute differences indicate that the teammates were more similar, and higher values
indicate that they were less similar. If the students in the Self treatment were matched on certain
measures, we would observe a higher degree of similarity; i.e., a lower average absolute difference.
Moreover, as a reference point, we calculated the average absolute diﬀerence after simulating
the matching of each student with all potential teammates from the respective treatment. This
simulation provided us with information about what a hypothetical within-sample random team
composition might look like.
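The similarity measure and the within-sample random-matching benchmark can be sketched as follows (illustrative Python; the GPA vector and the observed pairings are synthetic placeholders, not our registry data):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)

# Hypothetical z-standardized GPAs for 40 students forming 20 teams
gpa = rng.normal(size=40)
teams = [(2 * k, 2 * k + 1) for k in range(20)]  # observed pairings (i, j)

# Observed similarity: average |m_i - m_j| across the actual teams
obs = np.mean([abs(gpa[i] - gpa[j]) for i, j in teams])

# Benchmark: average absolute difference over ALL potential pairings,
# i.e. what a hypothetical within-sample random matching would yield
sim = np.mean([abs(gpa[i] - gpa[j]) for i, j in combinations(range(40), 2)])
```

Assortative self-selection on a measure would show up as `obs` falling below `sim`; comparing the two is what the WSR test in Table 4 formalizes.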
The results appear in Table 4. The first column shows the absolute differences for all measures in
Self, while the second and third columns show the absolute differences for all measures in Random and the
simulated Random “treatment.” A comparison of the values suggests that the students sorted
themselves into teams with students of similar levels of ability and of the same gender. More
specifically, we observed that the self-selected teams were more similar in terms of their GPAs,
their scores on the analytical test, and their gender. These differences were significant
(GPA: p=.0111; Analytical test: p=.0155, MWU test). Interestingly, we
did not find significant differences between the Self and the Random treatment for the Admission Test.
Up to now, we have made two main observations: First, we have established that the teams in
Random performed better on the written task, for which less skill complementarity was needed;
and that the team performance on the video task, for which more skill complementarity was
needed, did not diﬀer between Random and Self. Second, we showed that the skill composition
of the teams diﬀered between Random and Self ; i.e., that in the latter treatment, the students
tended to choose a partner with similar skills.
To better understand the role of individual skills in team tasks, we will now focus on the
relationship between skills and team performance in Random.
We operationalized each
student’s ability with their exam score, as this measure appeared to be a reliable measure of
ability in the context of this course.14
In Table 5, we predicted team performance. As independent variables, we used the maximum
ability and the minimum ability of the team members. For the written tasks (Models 1 and 2),
we observed that – if anything – the maximum ability tended to positively influence the team
performance (β = [.102, .164]). For the same models, the coefficients for the minimum ability were
very small and negative (β = [-.0719, -.0132]). For the video task (Model 3), we observed that
both coefficients tended to positively influence the performance (β = [.359; .108]).15
While the signs of the coefficients on max(ai; aj) and min(ai; aj) were consistent with our line of
reasoning, both variables lacked statistical significance, and the predictive power of the model
was low. Given the small sample size, we cautiously interpret these results as being mildly
The Admission Test score also did not correlate with the performance on the diﬀerent team tasks (see Table
A.3). In principle, it is possible that we were lucky in the team composition in Random. For this reason, we show
the results of the simulation in the third column of the table. Comparing Self with the results of our simulation
yielded similar results.
We concentrated our analysis on this treatment, since we can be sure that in Random, the composition of
the team members' abilities was exogenous and was not confounded by other factors of the team member selection
process, unlike in Self.
The students' GPAs or scores on the analytical test contained more noise, but yielded qualitatively similar
results. In Figure A.2, we display the linear relationships between the team performance and individual abilities. The
black line shows the relationship for the team member with the highest ability (max(ai; aj)), and the gray line
shows the relationship for the team member with the lowest ability (min(ai; aj)). In line with our reasoning above,
we observed a positive relationship between the team performance and the maximum ability for the task that
required low levels of skill complementarity. It appears that the minimum ability had no impact on the teams'
outcomes. For the video task, in which higher levels of skill complementarity were required, both the maximum
and the minimum ability had a positive impact on the team performance.
Table 5: Team performance and individual abilities
Dependent variable: Log Performance on ...
...1st team task ...2nd team task
(written) (written) (video)
Independent variables (1) (2) (3)
max(ai; aj) 0.164 0.102 0.359
(0.88) (0.62) (1.22)
min(ai; aj) -0.0719 -0.0132 0.108
(-1.03) (-0.24) (0.79)
Constant 2.100∗∗∗ 2.116∗∗∗ 1.205∗∗∗
(4.87) (5.72) (1.84)
Observations 92 48 44
Experiment (I+II) I II
R2 0.020 0.008 0.084
Note: The table displays regression coefficients (standard errors in parentheses) of OLS
regressions. Column (1) predicts the log-transformed team performance on the first task across
both experiments in Random. In both experiments, the students had to submit a written
solution to the task. This model includes a dummy variable for the experiments, which was
insignificant (β=-.0000188) and is not displayed. Columns (2) and (3) show OLS regressions
for the second task. Significance indicators: ∗∗∗ p<.01, ∗∗ p<.05, ∗ p<.1.
suggestive of diﬀerences in the relationship between the composition of the team members’
abilities and the team performance.
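The max/min-ability specification of Table 5 can be sketched as follows (illustrative Python using numpy least squares on simulated abilities; the two production functions and all coefficients are invented to mirror the low- versus high-complementarity intuition, not estimated from our data):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 90  # hypothetical number of randomly formed teams

# Simulated member abilities (z-standardized exam scores)
a_i, a_j = rng.normal(size=n), rng.normal(size=n)
amax, amin = np.maximum(a_i, a_j), np.minimum(a_i, a_j)

# Low-complementarity task: output driven mainly by the ablest member
y_written = 0.15 * amax + rng.normal(scale=0.5, size=n)
# High-complementarity task: both members' abilities enter production
y_video = 0.2 * amax + 0.2 * amin + rng.normal(scale=0.5, size=n)

# OLS of team output on max and min ability, as in Table 5
X = np.column_stack([np.ones(n), amax, amin])
b_written, *_ = np.linalg.lstsq(X, y_written, rcond=None)
b_video, *_ = np.linalg.lstsq(X, y_video, rcond=None)
```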
4.3.2 Perceived cooperation
A second mechanism that we hypothesized to be aﬀected by the treatment variation and to
inﬂuence the team performance was the quality of cooperation. In our post-experiment survey,
we asked the students to evaluate their collaboration experience in their team during the course
(see Table 6 for an overview of all of the questions).
We asked the students to agree or disagree (on a 7-point Likert scale) with several statements
aimed at capturing various aspects of team collaboration and organization. More specifically, we
also asked questions about the perceived quality of the cooperation and the pleasure of working together.
Table 6 displays the results from the post-experiment survey, pooled for Experiments I and
II, and for each experiment separately. When asked about their experience during the task,
the students in Self reported that they communicated more (“We communicated a lot”; p<
.0001, MWU test) and that they cooperated better (“We helped each other a lot”; p<.05)
than the students in Random. Moreover, they indicated that the teammates' contributions were
more equally distributed (“Both team members contributed equally”; p=.021), and that both
teammates exerted effort (“Both team members exerted effort”; p=.002). These comparisons
clearly show that the teams in Random used a different approach to solving the problem sets
than the teams in Self, likely by assigning the task to the more able teammate.
Furthermore, we found that the students' moods (“The mood in our team was good”; p>.1),
levels of stress (“Our team was very stressed.”; p=.134), and motivation levels (“Our team
was very motivated”; p=.151) for the teams in Self did not differ from those for the teams in
Random. Although the students in Self were more likely to report being friends (“My team member is a
We also asked a battery of questions about the perceived teaching quality, which might have influenced
performance. However, we found no significant differences between the treatments and experiments, which
indicates that the lecturer's teaching was of the same quality in both classes and experiments.
As Table 6 shows, these diﬀerences between the treatments were mostly driven by the reports of students from
Experiment I, in which the students were not explicitly required to collaborate. As the video clip in Experiment
II required each teammate to cooperate equally and to appear in the video to present the results, the students
might have tried to fulﬁl this expectation. Therefore, a desirability bias might explain why we did not ﬁnd as
strong a diﬀerence in self-reported cooperation in Experiment II as we did in Experiment I. Our ﬁnding that the
average ratings of cooperation also tended to be higher in Self in Experiment II than in Experiment I points in
the same direction.
Table 6: Overview of survey items and survey results
Experiment I Experiment II Experiment I + II
Survey item Random Self Random Self Random Self
A. Perceived quality of cooperation (1=Not agree, 7=Completely agree)
We communicated a lot. 5.16 <∗6.08 5.58 <∗∗ 6.14 5.35 <∗∗∗ 6.11
We helped each other a lot. 5.41 <∗∗ 5.95 5.82 <6.07 5.60 <∗∗ 6.01
Both team members exerted eﬀort. 5.46 <∗∗ 6.07 5.93 <∗6.34 5.67 <∗∗∗ 6.20
Both team members contributed equally. 5.12 <∗5.59 5.44 <5.87 5.26 <∗∗ 5.73
Our individual skills complemented very well. 4.99 <∗∗ 5.56 5.26 <5.52 5.11 <∗∗ 5.54
Our team was very stressed. 2.87 <∗3.30 2.53 <2.69 2.71 <3.00
Our team was very motivated. 5.53 <5.78 5.70 <5.85 5.61 <5.81
The mood in our team was good. 5.79 <∗6.10 6.19 <6.27 5.98 <6.18
The coordination of our team was very good. 5.03 <∗5.38 5.56 <5.59 5.27 <5.49
I was dominant in leading the team. 4.49 >4.22 4.30 >4.21 4.40 >4.22
One person was dominant in leading the team. 4.10 >3.84 4.14 >3.85 4.12 >3.84
B. Attitude towards the other (1=Not agree, 7=Completely agree)
My team member is a friend. 3.82 <∗∗∗ 6.25 4.33 <∗∗∗ 6.06 4.06 <∗∗∗ 6.15
I knew the team member very well before the course. 2.60 <∗∗∗ 6.19 2.93 <∗∗∗ 5.66 2.75 <∗∗∗ 5.93
C. Perceived teaching quality (1=Not agree, 7=Completely agree)
I learned a lot from the professor to complete the exercises. 5.57 <5.66 5.77 >5.34 5.66 >5.50
The professor asked questions to test our understanding. 5.72 <5.74 5.88 >5.55 5.79 >5.65
Professor was too fast in explaining the contents. 2.41 <2.62 2.58 <2.86 2.49 <2.74
The lecturer spent too much time on simple things. 3.28 <3.30 3.16 <3.38 3.22 <3.34
The professor gave too complicated answers. 2.24 <2.23 2.18 <2.62 2.21 <2.42
Observations 68 73 57 71 125 144
Table reports descriptive statistics of student responses in the post-experimental survey. P-values stem from a two-sided Mann-Whitney U test for a comparison of
averages between Self and Random. Signiﬁcance indicators: ∗∗∗ p <.01, ∗∗ p <.05, ∗p <.1.
p<.0001) or having been acquainted with their teammate before the course (“I knew
the team member very well before the course”; p<.0001), it was not the overall pleasure of
working together, but rather the higher level of cooperation, that differed between the
teams in the two treatments.
These ﬁndings highlight a potentially important channel through which random assignment
may have increased performance on the written task, while tending to decrease performance
on the video task. For the written task, which required less collaborative effort, letting the
ablest student perform the task was most efficient; whereas for the video task, which required
more collaborative effort, both the communication and the coordination worked better in the self-selected teams.
5 Conclusion
This paper has provided evidence from natural field experiments that studied how team formation
processes inﬂuenced team performance. We used data on students’ individual characteristics
and behavior at a business school to examine the eﬀects on team performance of varying both
the team formation process and the skill complementarity needed to perform well on diﬀerent
tasks. The results of our randomized ﬁeld experiments add a new dimension to the debate
on the eﬀects of the team formation process on team performance. Previous experiments did
not use objective ability measures to capture team formation patterns, and they did not oﬀer
an explanation for the observed eﬀects of the team formation process on the team members’
performance on diﬀerent tasks. By contrast, we used data on student ability generated prior to
the experiments to study how the team formation process aﬀected the teams’ abilities and social
composition, which, in turn, aﬀected the teams’ cooperation and performance on team tasks
with diﬀerent skill complementarities.
We found that the team formation mechanism chosen for assigning subjects to teams was a
useful tool for strategically influencing performance.
70% (Experiment I: 74%, Experiment II: 67%) of the students responded to our request to participate in the
survey. We tested and found no significant difference in the fraction of participating students between the Random
and the Self treatment (Experiment I: p=.282, Experiment II: p>.1, χ2 test). Furthermore, participation in
the survey was balanced in terms of GPA (p=.466, MWU test), the analytical test scores (
p=.334, MWU test), and gender (p=.730, p=.822, χ2 test).
Importantly, this relationship hinged on
the speciﬁc requirements of the underlying task. When the subjects were allowed to choose
their teammate, the team assignment mechanism substantially inﬂuenced their performance on
the team tasks through assortative selection patterns. These selection patterns proved to be
performance-enhancing when the underlying task required a high degree of skill complementarity.
In contrast, the random assignment of teammates led to better team performance when the
task required little or no skill complementarity. After the students completed the team tasks,
we measured the individual performance of the subjects, and found no diﬀerences between the
team formation mechanisms, which indicates that the eﬀect observed at the team level did not
translate into individual performance diﬀerences.
Our study oﬀers insights for managers and team leaders; i.e., for individuals who decide how
teams are put together in ﬁrms and other organizations. If managers want to maximize team
performance, they ﬁrst need to consider the type of task involved before deciding whether
employees should be able to self-select their teammates. Given that randomly assigned teams
can produce superior outcomes for tasks that are characterized by a low level of collaboration
intensity, our ﬁndings also reveal a weakness in the trends towards more “agile work practices”
(e.g., Mamoli and Mole, 2015), which give employees the freedom to choose their working groups
regardless of the circumstances.
Moreover, our results also provide insights into the trade-oﬀ between diversity and ability. When
managers want to create a more inclusive work environment by forming more diverse teams or
teams with similar average skill levels, random team assignment might prove more beneﬁcial.
Our ﬁeld experiment showed that students are more likely to match with teammates of the same
gender when they are allowed to self-select. This ﬁnding suggests that self-selection might create
not just inequalities in abilities across teams, but also less gender-diverse teams.
In this study, we focused on the contrast between self-selection and random assignment. An alternative
approach would be to assign subjects based on algorithms that maximize team performance (e.g., Wei et al.,
2020). For tasks with low collaboration intensity, this could be an algorithm that maximizes the diﬀerences in the
team members’ abilities.
References
Ai, W., R. Chen, Y. Chen, Q. Mei, and W. Phillips (2016). Recommending teams promotes proso-
cial lending in online microﬁnance. Proceedings of the National Academy of Sciences 113(52),
Bandiera, O., I. Barankay, and I. Rasul (2013). Team incentives: Evidence from a ﬁrm level
experiment. Journal of the European Economic Association 11(5), 1079–1114.
Büyükboyaci, M. and A. Robbett (2019). Team formation with complementary skills. Journal
of Economics & Management Strategy 28 (4), 713–733.
Chen, R. (2017). Coordination with endogenous groups. Journal of Economic Behavior &
Organization 141 (5), 177–187.
Chen, R. and J. Gong (2018). Can self selection create high-performing teams? Journal of
Economic Behavior & Organization 148, 20–33.
Cross, R., R. Rebele, and A. Grant (2016). Collaborative overload. Harvard Business Review.
Currarini, S., M. O. Jackson, and P. Pin (2009). An economic model of friendship: Homophily,
minorities, and segregation. Econometrica 77 (4), 1003–1045.
Dahlander, L., V. Boss, C. Ihl, and R. Jayaraman (2019). The eﬀect of choosing teams and
ideas on entrepreneurial performance: Evidence from a ﬁeld experiment. Mimeo.
Delfgaauw, J., R. Dur, O. A. Onemu, and J. Sol (2019). Team incentives, social cohesion, and
performance: A natural ﬁeld experiment. Tinbergen Institute Discussion Paper .
Delfgaauw, J., R. Dur, and M. Souverijn (2018). Team incentives, task assignment, and
performance: A ﬁeld experiment. The Leadership Quarterly.
Englmaier, F., S. Grimm, D. Schindler, and S. Schudy (2018). The eﬀect of incentives in
non-routine analytical team tasks - Evidence from a ﬁeld experiment. CESifo Working Paper
Erev, I., G. Bornstein, and R. Galili (1993). Constructive intergroup competition as a solution to
the free rider problem: A ﬁeld experiment. Journal of Experimental Social Psychology 29(6),
Fischer, M. and P. Kampkötter (2017). Eﬀects of German universities’ excellence initiative on
ability sorting of students and perceptions of educational quality. Journal of Institutional and
Theoretical Economics 173 (4), 662.
Friebel, G., M. Heinz, M. Krüger, and N. Zubanov (2017). Team incentives and performance:
Evidence from a retail chain. American Economic Review 107 (8), 2168–2203.
Gächter, S. and C. Thöni (2005). Social learning and voluntary cooperation among like-minded
people. Journal of the European Economic Association 3 (2), 303–314.
Geraghty, A. and S. Paterson-Brown (2018). Leadership and working in teams. Surgery
(Oxford) 36 (9), 503–508.
Guido, A., A. Robbett, and R. Romaniuc (2019). Group formation and cooperation in so-
cial dilemmas: A survey and meta-analytic evidence. Journal of Economic Behavior &
Organization 159, 192 – 209.
Hamilton, B. H., J. A. Nickerson, and H. Owan (2003). Team incentives and worker heterogeneity:
An empirical analysis of the impact of teams on productivity and participation. Journal of
Political Economy 111 (3), 465–497.
Lazear, E. P. and P. Oyer (2012). Chapter 12: Personnel Economics. The Handbook of
Organizational Economics (Ed.) Robert Gibbons and John Roberts, 479–519.
Leider, S., M. M. Möbius, T. Rosenblat, and Q.-A. Do (2009). Directed altruism and enforced
reciprocity in social networks. Quarterly Journal of Economics 124 (4), 1815–1851.
Mamoli, S. and D. Mole (2015). Creating Great Teams: How Self-selection Lets People Excel.
O’Neill, T. A. and E. Salas (2018). Creating high performance teamwork in organizations.
Human Resource Management Review 28 (4), 325–331.
Patel, S. and S. Sarkissian (2017). To group or not to group? Evidence from mutual fund
databases. Journal of Financial and Quantitative Analysis 52(5), 1989–2021.
Reagans, R. and E. W. Zuckerman (2019). Networks, diversity, and productivity: The social
capital of corporate R&D teams. Organization Science 12 (4), 502–517.
Wei, A., Y. Chen, Q. Mei, J. Ye, and L. Zhang (2020). Putting teams into the gig economy: A
field experiment at a ride-sharing platform. Working Paper.
Wuchty, S., B. F. Jones, and B. Uzzi (2007). The increasing dominance of teams in production
of knowledge. Science 316 (5827), 1036–1039.
Table A.1: Randomization checks (unstandardized)

                  Experiment I               Experiment II              Experiment I + II
Variable          Self     Random   p-value  Self     Random   p-value  Self     Random   p-value
GPA               5.260    5.208    .196     5.214    5.128    .524     5.236    5.170    .167
                  [.520]   [.406]            [.488]   [.595]            [.503]   [.506]
Analytical Test   5.324    5.267    .889     5.207    5.438    .429     5.263    5.349    .649
                  [1.720]  [1.865]           [1.794]  [2.060]           [1.755]  [1.958]
Admission Test    6.139    6.225    .480     6.391    6.295    .592     6.269    6.259    .957
                  [1.195]  [1.124]           [1.260]  [1.127]           [1.232]  [1.123]
Presentation      6.351    6.174    .550     7.085    6.727    .125     6.729    6.440    .078
                  [1.860]  [1.710]           [1.474]  [1.652]           [1.708]  [1.701]
Interview         6.548    6.837    .128     6.085    6.091    .879     6.309    6.478    .282
                  [1.623]  [1.750]           [1.770]  [1.523]           [1.712]  [1.682]
Discussion        5.516    5.649    .689     5.959    6.018    .932     5.742    5.823    .776
                  [1.709]  [1.647]           [1.723]  [1.520]           [1.726]  [1.595]

Descriptive statistics (unstandardized) of pre-experiment data. The p-values are from a Mann-Whitney U test (two-sided)
comparing the differences in the mean ranks of the two treatments. Standard deviations are in brackets.
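The treatment comparisons in the tables rest on the two-sided Mann-Whitney U test. As a reading aid only, a minimal sketch of the underlying U statistic follows; the sample values are illustrative placeholders, not the study's data (in practice `scipy.stats.mannwhitneyu` supplies the statistic together with its p-value).

```python
import numpy as np

def mann_whitney_u(x, y):
    """U statistic: number of (x_i, y_j) pairs with x_i > y_j,
    counting ties as 1/2. This is the rank-based quantity whose
    null distribution yields the reported p-values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    greater = np.sum(x[:, None] > y[None, :])   # pairs where x wins
    ties = np.sum(x[:, None] == y[None, :])     # tied pairs count half
    return greater + 0.5 * ties

# Illustrative GPA samples (not the study's data)
self_teams = [5.1, 5.3, 5.4, 5.2, 5.6]
random_teams = [5.0, 5.2, 5.3, 5.1, 5.5]
print(mann_whitney_u(self_teams, random_teams))  # 16.5
```

Under the null of identical distributions, U is centered at n_x * n_y / 2 (here 12.5), and the two-sided p-value measures how far the observed U deviates from that center.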
Table A.2: Average and standard deviation of performance (z-standardized)

                  Experiment I                   Experiment II
Variable          Self        Random   p-value   Self      Random    p-value
Total team task   -.303***    .297     .007      .049      -.089     .451
                  [1.214]***  [.621]   .002      [.967]    [1.046]   .534
1st team task     -.239**     .234     .011      -.196*    .183      .064
                  [1.140]     [.791]   .104      [1.150]   [.791]    .193
2nd team task     -.251*      .245     .068      .114      -.153     .156
                  [1.248]***  [.599]   .001      [.956]    [1.047]   .381
Exam              -.022       .028     .455      .004      -.005     .984
                  [1.004]     [1.002]  .995      [.989]    [1.018]   .603

Descriptive statistics (z-scores) of the students' performance in the experiment. Average [standard
deviation] of the team performance for the team tasks at the team level and for the exam at
the individual student level. The p-values stem from a two-sided Mann-Whitney U test for a
comparison of averages between Self and Random. Levene's p-values are the results of a comparison
of variances between the two treatments. Significance indicators: ∗∗∗ p <.01, ∗∗ p <.05, ∗ p <.1.
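The variance comparison reported alongside the means is Levene's test. A minimal sketch of the mean-centered Levene statistic, using illustrative numbers rather than the study's data (`scipy.stats.levene` provides the full test with p-values):

```python
import numpy as np

def levene_w(*groups):
    """Levene's W (mean-centered variant): a one-way ANOVA F statistic
    computed on the absolute deviations z_ij = |x_ij - mean(group i)|.
    Large W indicates unequal spread across groups."""
    z = [np.abs(np.asarray(g, dtype=float) - np.mean(g)) for g in groups]
    k = len(z)                          # number of groups
    n = sum(len(g) for g in z)          # total observations
    grand = np.concatenate(z).mean()    # grand mean of the deviations
    between = sum(len(g) * (g.mean() - grand) ** 2 for g in z) / (k - 1)
    within = sum(((g - g.mean()) ** 2).sum() for g in z) / (n - k)
    return between / within

# Illustrative samples (not the study's data)
print(levene_w([1.0, 3.0, 5.0], [2.0, 6.0]))  # approximately 0.6
```

Because it operates on deviations rather than raw values, the test is comparatively robust to non-normal performance scores, which suits the rank-based framing of the rest of the table.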
Table A.3: Pairwise correlations of variables
(1) (2) (3) (4) (5) (6) (7) (8)
(1) GPA 1.000
(2) Female 0.135*** 1.000
(3) Analytical Test 0.205*** -0.310*** 1.000
(4) Admission Test 0.176*** 0.033 -0.112** 1.000
(5) Total team task 0.126** -0.056 0.056 -0.020 1.000
(6) 1st team task 0.101* 0.000 -0.011 0.027 0.566*** 1.000
(7) 2nd team task 0.092* -0.073 0.073 -0.057 0.919*** 0.220*** 1.000
(8) Exam score 0.320*** -0.090* 0.291*** -0.009 0.122** 0.127** 0.078 1.000
The table displays pairwise correlation coefficients. All scores are z-standardized. The table includes data
from all treatments and experiments. Significance indicators: ∗∗∗ p <.01, ∗∗ p <.05, ∗ p <.1.
Figure A.1: Distribution of performance for (a) the 1st team task in Experiment I and Experiment
II (both written) and (b) the 2nd team task in Experiment I (written) and Experiment II (video)
[Histograms omitted. Panel (a): performance on the written task (unstandardized, scale 0–15);
panel (b): performance on the video task (unstandardized, scale 0–15). Each panel shows the four
cells Experiment I, Random; Experiment I, Self; Experiment II, Random; Experiment II, Self.]
Figure A.2: Team performance and individual abilities
(a) 1st team task (Experiment I + II, written); (b) 2nd team task (Experiment II, written);
(c) 2nd team task (Experiment II, video)
[Scatter plots omitted. Each panel plots log team performance against log individual ability,
with a linear fit and a 90% confidence interval.]
Note: The figure shows the relationship between team performance and the individual exam
performance (as a measure of ability) of team members. The lines show linear fits; all variables
are log-transformed.
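The log-log linear fits described in the note amount to an ordinary least-squares regression on the logged series, whose slope reads as an elasticity. A sketch with made-up scores (not the study's data), using `numpy.polyfit`:

```python
import numpy as np

# Illustrative scores (not the study's data)
ability = np.array([8.0, 9.5, 11.0, 12.5, 14.0])    # individual exam score
team_perf = np.array([7.5, 9.0, 10.5, 12.0, 13.5])  # team task score

# Linear fit of log team performance on log individual ability;
# the slope estimates the percentage change in team performance
# associated with a one-percent change in ability.
slope, intercept = np.polyfit(np.log(ability), np.log(team_perf), 1)
print(round(slope, 3))
```

A slope near one would mean team output scales roughly proportionally with member ability; the confidence bands in the figure indicate how precisely each panel pins that slope down.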