Sequential Collaboration: About the Accuracy of Dependent, Incremental Judgments

Sequential Collaboration: About the Accuracy of Dependent, Incremental Judgments
Maren Mayer 1,2 & Daniel W. Heck 3

1 University of Mannheim
2 Heidelberg Academy of Sciences and Humanities
3 University of Marburg
Author Note
Maren Mayer, Department of Psychology, School of Social Science, University of Mannheim, Germany.
Daniel W. Heck, Department of Psychology, University of Marburg, Germany.
Data and R scripts for the analyses are available at the Open Science Framework.
We thank Mark Steyvers, Jack Soll, and two anonymous reviewers for helpful feedback on earlier versions of the manuscript.
This work was presented at the 62nd Conference of Experimental Psychologists (Virtual TeaP, 2021). The present version of the manuscript (May 16, 2022) has not yet been peer reviewed. A preprint was uploaded to PsyArXiv and ResearchGate for timely dissemination.
This work was funded by the WIN programme of the Heidelberg Academy of Sciences and Humanities, financed by the Ministry of Science, Research and the Arts of the State of Baden-Württemberg, and also supported by the Research Training Group “Statistical Modeling in Psychology” funded by the German Research Foundation (DFG grant GRK 2277).
The authors made the following contributions. Maren Mayer: Conceptualization, Investigation, Methodology, Writing - Original Draft, Writing - Review & Editing; Daniel W. Heck: Conceptualization, Methodology, Writing - Review & Editing.
Correspondence concerning this article should be addressed to Maren Mayer, B6, 30-32, 68169 Mannheim. E-mail: maren.mayer@students.uni-mannheim.de
Abstract

Online collaborative projects in which users contribute to extensive knowledge bases such as Wikipedia or OpenStreetMap have become increasingly popular while yielding highly accurate information. Collaboration in such projects is organized sequentially with one contributor creating an entry and the following contributors deciding whether to adjust or to maintain the presented information. We refer to this process as sequential collaboration since individual judgments directly depend on the previous judgment. As sequential collaboration has not yet been examined systematically, we investigate whether dependent, sequential judgments become increasingly more accurate. Moreover, we test whether final sequential judgments are more accurate than the unweighted average of independent judgments from equally large groups. We conducted three studies with groups of four to six contributors who either answered general knowledge questions (Experiments 1 and 2) or located cities on maps (Experiment 3). As expected, individual judgments became more accurate across the course of sequential chains and final estimates were similarly accurate as unweighted averaging of independent judgments. These results show that sequential collaboration profits from dependent, incremental judgments, thereby shedding light on the contribution process underlying large-scale online collaborative projects.
Keywords: wisdom of crowds, teamwork, mass collaboration, group decision making
Sequential Collaboration: About the Accuracy of Dependent, Incremental Judgments
Collaborative online projects have become a popular source for gathering information over the last twenty years. The most prominent example is Wikipedia, an online encyclopedia that allows users to contribute semantic information on various topics in the form of structured articles (Wikipedia Contributors, 2021). Giles (2005) showed that information on Wikipedia is, in general, very accurate. Moreover, certain topics such as information on cancer or certain drugs are similarly accurate as official health information or textbooks (Kräenbring et al., 2014; Leithner et al., 2010). Another example of online collaboration is OpenStreetMap, a collaborative project that aims at generating a comprehensive, open, and free-to-use map of the world (OpenStreetMap Contributors, 2021). OpenStreetMap comprises not only numeric geographical information such as the coordinates of objects but also semantic information such as names of streets, areas, and buildings, and other useful information (e.g., addresses or websites of shops and restaurants). Comparisons of OpenStreetMap with commercial map providers and governmental sources likewise revealed comparable accuracy (e.g., Girres & Touya, 2010; Zielstra & Zipf, 2010).
The high accuracy of Wikipedia and other online collaborative projects has often been attributed to the wisdom of crowds (e.g., Arazy et al., 2006; Kittur & Kraut, 2008; Niederer & Dijck, 2010), which refers to the aggregation of judgments from different informants (Galton, 1907; Surowiecki, 2004). The term “wisdom of crowds” is a broad concept encompassing various methods of eliciting and aggregating judgments. As a measure of crowd wisdom, prior work often examined the unweighted mean or median of independent individual judgments (e.g., Budescu & Chen, 2014; Davis-Stober et al., 2014; Galton, 1907; Hueffer et al., 2013; Larrick & Soll, 2006; Merkle et al., 2020). The high accuracy of such aggregates is due to the central limit theorem, which ensures that errors in independent, individual judgments cancel out (Hogarth, 1978), and has been demonstrated for various tasks and in various contexts (e.g., Hueffer et al., 2013; Steyvers et al., 2009). The accuracy of unweighted averaging of independent individual judgments increases when judgments bracket the true answer (Larrick & Soll, 2006; Simmons et al., 2011) and are negatively correlated and unbiased (Davis-Stober et al., 2014; Keck & Tang, 2020).
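The error-cancellation argument can be illustrated with a minimal simulation. All numbers below are arbitrary, and Python is used here for illustration even though the paper's own analyses were run in R:

```python
import numpy as np

rng = np.random.default_rng(42)

true_value = 300.0          # e.g., the height of the Eiffel Tower in meters
n_groups, n_judges = 2000, 6

# Unbiased individual judgments with independent errors (assumed SD of 60)
judgments = true_value + rng.normal(0.0, 60.0, size=(n_groups, n_judges))

# Mean absolute error of single judges vs. of the unweighted group average
individual_error = np.abs(judgments - true_value).mean()
group_error = np.abs(judgments.mean(axis=1) - true_value).mean()

print(group_error < individual_error)  # True: independent errors cancel out
```

Because the judgments are unbiased and independent, the standard deviation of the group mean shrinks by a factor of the square root of the group size, which is the central-limit-theorem benefit described above.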
The collection and aggregation of judgments in online collaborative projects can be regarded as a specific type of wisdom of crowds. However, in online collaboration, judgments are usually not collected independently and then aggregated mechanically, but rather elicited in a dependent and sequential manner. Instead of providing independent judgments, contributors encounter already existing entries and decide whether to change the presented information, which reflects the latest version of an entry, or whether to maintain the presented version. We refer to this way of collaborating as sequential collaboration. Because unweighted averaging is known to result in highly accurate estimates for various tasks and contexts, we use it as a benchmark for assessing the accuracy of sequential collaboration.
In the following, we first define sequential collaboration and distinguish it from other forms of collaboration and of aggregating judgments. Next, we discuss prior research on dependent judgments, which has shown both positive and detrimental effects of dependency. Our main goal is to compare sequential collaboration to unweighted averaging. We investigate why and under which conditions the elicitation of incremental, dependent judgments can benefit accuracy compared to taking the unweighted average of independent individual judgments. In three studies, we used general knowledge questions and maps on which cities had to be positioned to test whether sequential collaboration within small groups of four to six contributors yields improved judgments. Moreover, we tested whether the final judgments of a sequential chain are more accurate than estimates obtained by aggregating independent individual judgments. In line with our hypotheses, we found that judgment accuracy increased over the course of a sequential chain and that sequential collaboration yielded similarly accurate results as unweighted averaging.
Sequential collaboration
As outlined above, collaboration in online projects is often organized sequentially by making incremental changes to the latest available information. Sequential collaboration starts with one contributor creating an initial, independent entry. The next contributors encountering this entry then decide whether to adjust or maintain the presented information. Whenever the entry is changed, the information is updated such that only the latest version of the entry is presented to the next contributor. For example, the first contributor might respond to the question “How tall is the Eiffel Tower?” with 420 meters. The second contributor encountering this judgment could simply maintain it, while the third contributor might adjust the height to 290 meters. After several contributors have adjusted and maintained the judgment, the correct height of 300 meters may be entered. The sequence of decisions whether to maintain or adjust entries made by previous contributors forms a sequential chain. Figure 1 displays how group estimates are generated with unweighted averaging and in sequential collaboration. In the former, the aggregated estimate is obtained by averaging independent individual judgments; in the latter, the estimate is the last judgment in a sequential chain generated by adjusting and maintaining previous judgments.
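The two aggregation rules can be contrasted in a few lines of code. The judgment values below are hypothetical, chosen to mirror the Eiffel Tower example:

```python
import statistics

# Unweighted averaging: aggregate independent individual judgments (in meters)
independent = [420, 350, 260, 310]            # hypothetical judgments
averaged_estimate = statistics.mean(independent)

# Sequential collaboration: each value is the latest state of the chain;
# maintaining repeats the previous value, and the group estimate is
# simply the final judgment in the chain
chain = [420, 420, 290, 300]
sequential_estimate = chain[-1]

print(averaged_estimate, sequential_estimate)  # 335 300
```

The key structural difference is that averaging uses all judgments symmetrically, whereas the sequential estimate depends only on the final state of the chain.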
Even though sequential collaboration is performed by a group of individuals and shares some aspects with other forms of group decision making, it also has some unique features distinguishing it from other forms of collaboration. In research on group decision making, group work usually takes place simultaneously (Kerr & Tindale, 2004; Lu et al., 2012; Stasser & Titus, 1985), even though interactions do not necessarily take place in person (Dennis, 1996; Dennis et al., 1998; Lu et al., 2012). In a paradigm organized like this, all members of the group have the opportunity to listen to all judgments and opinions, to ask other group members questions, and to share justifications and other information. In sequential collaboration, however, information is shared only by adding to or correcting the judgment of a previous contributor, which implies that the dependency between judgments is limited to the displayed information.
Figure 1
Illustration of group estimates in unweighted averaging and sequential collaboration.
Furthermore, direct interactions with other contributors are neither necessary nor possible in sequential collaboration, and additional information such as the number of adjustments already made to an entry or the reasons why information was adjusted is initially not available.
A form of collaboration similar to sequential collaboration is the Delphi method (Dalkey & Helmer, 1963; Geist, 2010; Jeste et al., 2010). The Delphi method was designed to obtain judgments on a given topic from a group of experts who do not interact directly. First, experts provide independent judgments and justifications for these judgments, which are then combined in a report by a moderator. This report is sent to all experts, who can then revise their judgments based on the judgments and information included in the report. When the experts have reached a sufficient consensus, the individual judgments are aggregated to a final result. The Delphi method is similar to sequential collaboration in that individuals do not directly interact with each other. However, in sequential collaboration, contributors receive neither the judgments of several other contributors nor justifications of these judgments. Moreover, contributors are not necessarily required to provide a judgment, and even if they do, they may not notice if their judgment is in turn adjusted by others. Finally, the Delphi method focuses on eliciting judgments from a group of experts, whereas in sequential collaboration, neither the specific contributors nor the number of contributors has to be predefined.
Possible issues and benefits of sequential collaboration
Even though sequential collaboration seems to be a successful way of integrating judgments of various individuals, the process of sequentially deciding whether to adjust or maintain a previous judgment has not yet been systematically examined. Nonetheless, research on related phenomena allows us to derive testable predictions. Possible issues for the accuracy of sequential collaboration may arise from the anchoring effect (Tversky & Kahneman, 1974). Anchoring describes the robust phenomenon that a presented numerical value influences subsequent, often unrelated numerical judgments (Mussweiler et al., 2004). This effect may undermine the accuracy of sequential collaboration such that adjustments made to a previous judgment are systematically biased toward that judgment. Especially when the previous judgment heavily over- or underestimates the correct value, anchoring might affect later judgments such that arriving at accurate, unbiased estimates is prolonged or hindered.
The conditions under which information provided by others is considered when forming a judgment have been extensively studied in the advice-taking literature (Bonaccio & Dalal, 2006). Egocentric discounting describes the phenomenon that advice is generally underweighted relative to one’s own initial judgment, in turn resulting in less accurate judgments compared to weighing the advice and one’s own judgment equally (Yaniv & Kleinberger, 2000). In sequential collaboration, egocentric discounting could lead contributors to adjust the presented previous judgment mainly according to their prior beliefs, which in turn could be detrimental to accuracy as the chain may not converge to the correct answer. However, advice taking improves when no initial individual judgment is formed before receiving advice (Koehler & Beauregard, 2006). This resembles the situation in sequential collaboration more closely since contributors are directly confronted with the previous judgment and do not have to form an initial, independent judgment. Hence, compared to standard advice-taking paradigms, contributors in sequential collaboration may be more likely to accept a presented judgment.
Prior research also provides preliminary evidence in favor of the accuracy of sequential collaboration. Providing participants with a frame of reference improves subsequent judgments, especially because it prevents extreme judgments (Bonner et al., 2007; Laughlin et al., 1999). Previous judgments in a sequential chain may serve as a frame of reference that prevents extreme judgments and helps reach an accurate estimate earlier. However, especially at the beginning of a sequential chain, judgments by previous contributors may not provide an accurate frame of reference.
Providing the judgments of other individuals can also improve the accuracy of aggregation methods based on unweighted averaging. Imitating successful individuals leads to more accurate judgments (King et al., 2012), and discussions in dyads also improve judgments, but only when initial independent judgments are formed (Minson et al., 2017). Moreover, Becker et al. (2017) showed that information about others’ judgments is beneficial when this information weighs all other judgments equally (as opposed to overweighing the judgment of a single, highly influential individual). Given that individual judgments can be improved by providing the judgments of others, sequential collaboration may lead to more accurate judgments. Especially the finding that imitating successful individuals improves accuracy (King et al., 2012) is relevant for sequential collaboration, as contributors may often be presented with the currently best judgment in the sequential chain, which can easily be imitated by not making a change. However, while King et al. (2012) selected the currently most accurate judgment from a large pool of independent judgments, the judgments presented in a sequential chain are not necessarily very accurate, especially if only a few contributors have encountered and edited them.
Sequential collaboration may also benefit from the fact that, in group work, not all group members contribute equally to a given task and some do not contribute at all (free-rider effect; Bray et al., 1978), and that group members often contribute less the more they feel that their contribution is dispensable (Kerr & Bruun, 1983). Such effects may also be observed in sequential collaboration since contributors can maintain a previous judgment when thinking that they cannot substantially improve it. This opt-out mechanism could in turn improve accuracy, since giving respondents the possibility to select which questions to answer improves the accuracy of unweighted averaged judgments (Bennett et al., 2018). The fact that contributors can self-select which judgments to adjust may thus lead to a higher accuracy of the resulting judgments. However, this requires that contributors can accurately distinguish which judgments to maintain (assuming they cannot substantially contribute) and which judgments to adjust (assuming they can improve the present state of an entry).
Miller and Steyvers (2011) performed a study closely resembling sequential collaboration in a rank-ordering task. Participants were presented either with judgments of previous participants (resulting in a sequential chain of judgments) or with randomly generated rank orders (resulting in independent judgments). They could opt out of answering by accepting the presented rank order, which was the order made by the previous participant. Miller and Steyvers (2011) found that both the group aggregate and the average subject’s performance increased for sequential compared to independent judgments. Moreover, the last judgment in such a sequential chain was more accurate than the group aggregate of independent judgments. Forming sequential chains may thus lead to improved judgment accuracy and even outperform the aggregation of independent judgments. Moreover, these results suggest that providing an opportunity to opt out can be beneficial both for individual and for group accuracy.
Based on prior research, we expect that sequential collaboration is an effective method of eliciting and aggregating individual judgments. Our first two hypotheses concern basic assumptions about sequential collaboration:

Hypothesis 1: Over the course of a sequential chain, (1a) the probability of changing a judgment and (1b) the magnitude of change decrease.

Hypothesis 2: Over the course of a sequential chain, the accuracy of the most recent judgment increases.
Given its high accuracy, unweighted averaging can be used as a benchmark for other forms of collaboration. As discussed above, sequential collaboration may profit from the fact that contributors are not required to adjust the presented information (Bennett et al., 2018), but rather can decide to maintain the presented judgment if they perceive their own judgments to be dispensable (Kerr & Bruun, 1983). Since providing information about the judgments of others can improve accuracy (Becker et al., 2017; King et al., 2012; Minson et al., 2017), especially when such information is organized in a sequential chain (Miller & Steyvers, 2011), the accuracy of sequential collaboration may exceed that of unweighted averaging:

Hypothesis 3: Sequential collaboration yields more accurate group estimates than unweighted averaging.
Since unweighted averaging is known to yield highly accurate estimates, it is also plausible that sequential collaboration does not yield better but merely similarly accurate estimates. This would already be an important and relevant finding because sequential collaboration may not profit from the central limit theorem for means of independent random variables in the same way as unweighted averaging does (Hogarth, 1978). Sequential collaboration does not involve the computation of a mean of individual judgments, and thus, the central limit theorem is not directly applicable (Zhang et al., 2022). However, contributors may integrate the presented information when forming their own judgment (Bonaccio & Dalal, 2006). Such an implicit averaging of the presented and the internally generated judgment could result in increased accuracy due to error cancellation.
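The implicit-averaging idea can be sketched as a weighted combination of the presented judgment and a contributor's internally generated judgment. The weight and the error distributions below are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
true_value, sd, n = 300.0, 60.0, 4000

presented = true_value + rng.normal(0, sd, n)  # previous judgment in the chain
internal = true_value + rng.normal(0, sd, n)   # contributor's own judgment

w = 0.5  # assumed weight given to the presented judgment
combined = w * presented + (1 - w) * internal

# Independent errors partially cancel, so the implicitly averaged judgment
# is on average closer to the truth than the purely internal judgment
print(np.abs(combined - true_value).mean()
      < np.abs(internal - true_value).mean())  # True
```

Under these assumptions the combined judgment's error standard deviation shrinks by a factor of about 1/sqrt(2), which is the error-cancellation mechanism the paragraph describes.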
To test our hypotheses, we conducted three online experiments (two of which were preregistered) using chains of four to six contributors in sequential collaboration and corresponding group sizes for unweighted averaging.1 The materials comprised general knowledge questions with numerical judgments in the first two experiments and geographic maps on which participants had to position cities in the third experiment.
Experiment 1
Materials. We presented 65 difficult general knowledge questions such as “How tall is the Eiffel Tower?” or “When was Leonardo da Vinci born?” to the participants. The questions were taken from an item pool of general knowledge questions (Pohl, 1998) and updated with contemporary information whenever necessary. The median percentage of correctly answered questions was 0.53% (MAD = 0.78%), indicating that the questions were indeed difficult for participants to answer correctly. All items, their correct numerical answers, and the unit in which the answer had to be given are provided in Table A1 in the Appendix.
Participants. For this online study, 310 German college students participated via a German panel provider. The compensation ranged between 0.60 EUR and 1 EUR depending on the time of participation. To control data quality already during data collection, participants who changed their browser window or switched to other programs more than five times were excluded during participation. Based on the results of the pilot study, we suspected that participants looked up answers when more than 10% of the questions were answered correctly. This was the case for three participants, who were not considered for building sequences and whose data were excluded from the analysis. Two participants were excluded due to irregular answer patterns in more than 10% of the questions (i.e., answering with number series such as “23456” or answering “0”). Lastly, two participants were excluded since the same position in a sequential chain was assigned to two participants due to a technical issue. We kept the data of the participant whose data was used throughout the rest of this sequential chain. Our final sample comprised 303 participants, of whom 76.6% were female, 22.8% were male, and 0.7% identified as diverse. The mean age of the sample was 23.8 years (SD = 3.2).

1 Prior to conducting the three experiments reported in the present manuscript, we also conducted a pilot study to pretest and improve the experimental paradigm.
Design and procedure. Participants were randomly assigned to the independent-judgments questionnaire (192 participants) or the sequential-collaboration questionnaire (111 participants). After consenting to the study, they were introduced to the corresponding tasks illustrated in Figure 2. In the independent-judgments questionnaire, one general knowledge question per trial was presented to participants, who had to type their judgment into a text box before proceeding to the next question. In the sequential-collaboration questionnaire, participants also saw one general knowledge question per trial, but additionally, the answer of a previous participant was shown below the question. Participants decided whether to adjust or maintain the presented judgment. Only in the former case did a text box appear in which the new judgment could be entered before proceeding to the next question. Participants were informed that the presented judgments were from one or more of the previous participants, and they did not know their position in the sequential chain.
General knowledge questions were presented in random order, and the unit in which the judgment had to be given was provided directly after the text box. To prevent looking up answers, we implemented a time limit of 30 seconds for entering a judgment in both conditions. Additionally, we implemented a minimum waiting time of two seconds in the sequential-collaboration condition to prevent clicking through the study. After answering all questions, participants provided demographic information, were thanked for participation, and were debriefed.
In the unweighted-averaging condition, 155 participants completed the independent-judgments questionnaire and their judgments were then averaged. In the sequential-collaboration condition, independent initial judgments were required to start sequential chains. Hence, we initialized sequential chains with 37 participants who answered the independent-judgments questionnaire. We used a sequence length of four, meaning that each sequential chain consisted of one participant completing the independent-judgments questionnaire followed by three participants completing the sequential-collaboration questionnaire consecutively. Each participant was presented with only the latest judgment in the sequential chain. This procedure resulted in a sample size of 148 participants in the sequential-collaboration condition.

Figure 2
Questionnaires used in Experiments 1 and 2.
Before analyzing the data, we excluded 314 judgments that timed out after 30 seconds. Since sequential chains containing such a judgment were excluded completely, this resulted in the exclusion of 809 judgments in total. Hence, 18,886 judgments remained for analysis.
Since judgments were given on vastly different scales (e.g., from single digits for the length of a soccer goal up to millions for the number of students enrolled at German universities), a standardization of raw judgments was necessary. To this end, we subtracted the correct answer for each question from the raw judgments before dividing the result by the standard deviation of all judgments obtained with the independent-judgments questionnaire. The resulting standardized errors are equal to zero for correct judgments while being negative and positive in case of under- and overestimation, respectively. Moreover, we used absolute standardized errors for testing hypotheses concerning the accuracy of judgments and estimates.
After standardizing the judgments, we removed the 1% most extreme values from the data, as these judgments may distort the results. Excluding outliers across conditions is also recommended by André (2021), who demonstrated that excluding outliers separately within each condition can increase false-positive rates. We identified 189 extreme judgments with this procedure. Again, we excluded both these judgments and the corresponding sequences, resulting in a final sample of 18,626 judgments.
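The standardization and trimming steps can be sketched as follows. The judgment values, the assumed standard deviation, and the function name are hypothetical; the actual R scripts are available with the study materials:

```python
import numpy as np

def standardized_errors(judgments, correct, sd_independent):
    """Standardize raw judgments: 0 = correct; negative/positive values
    indicate under-/overestimation relative to the correct answer."""
    return (np.asarray(judgments, dtype=float) - correct) / sd_independent

# Hypothetical judgments for "How tall is the Eiffel Tower?" (correct: 300 m);
# sd_independent is the SD of all independent-questionnaire judgments (assumed)
raw = [420, 350, 260, 310, 5000]
errors = standardized_errors(raw, correct=300.0, sd_independent=80.0)

# Remove the 1% most extreme standardized errors across all conditions
cutoff = np.quantile(np.abs(errors), 0.99)
kept = errors[np.abs(errors) <= cutoff]

# Absolute standardized errors are used for testing accuracy hypotheses
accuracy = np.abs(kept)
```

In this toy example, the implausible judgment of 5000 m falls above the 99% quantile of absolute errors and is removed, while the remaining four judgments are kept.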
Confirmatory analyses.
Hypothesis 1a states that change probability decreases over the course of a sequential chain. To test this prediction, we only considered data of participants completing the sequential-collaboration questionnaire who could decide whether to change or maintain a presented judgment (position 2, 3, or 4 in a sequential chain). We modeled the decision whether to adjust or maintain a judgment as a function of the chain position in the sequence by fitting a generalized linear mixed model with the R packages lme4 (Bates et al., 2015) and lmerTest (Kuznetsova et al., 2017). Since the dichotomous dependent variable can only be 1 (adjust) or 0 (maintain), we used a logit link function. Since every participant answered the same 65 items, we added random intercepts for items and participants to account for the nested structure of our data (Pinheiro & Bates, 2000). We added crossed random effects of items and participants to all hierarchical models reported, since all reported studies have a data structure in which individual judgments are nested within items and participants. Lastly, we set polynomial contrasts to test for a decline in change probability with increasing chain position.2
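The structure of this analysis can be illustrated with simulated data. The actual analysis fits a logistic mixed model in lme4 with crossed random effects for items and participants; the sketch below only reproduces the descriptive part (change probability per chain position) with made-up probabilities:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated adjust (1) vs. maintain (0) decisions for chain positions 2-4;
# the decreasing change probabilities are assumed values for illustration
positions = np.repeat([2, 3, 4], 500)
p_change = np.array([0.7, 0.5, 0.3])       # assumed P(change) at positions 2-4
changed = rng.binomial(1, p_change[positions - 2])

# Observed change probability per chain position, the quantity plotted in
# Figure 3; the GLMM additionally models item and participant heterogeneity
observed = [changed[positions == k].mean() for k in (2, 3, 4)]
print(observed[0] > observed[1] > observed[2])  # True: probability declines
```

With large simulated samples, the observed proportions track the assumed decline, which corresponds to the negative linear trend tested via the polynomial contrast.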
Figure 3 shows the mean change probability for each chain position with corresponding between-subjects error bars and violin and box plots indicating the distribution of change probabilities for participants (aggregated across items). Even though descriptively in line with Hypothesis 1a, the linear trend of the effect of chain position on change probability was not significant (β = −0.311, CI = [−0.685, 0.063]).
Hypothesis 1b states that change magnitude decreases over the course of a sequential chain. Again, we only used judgments from the sequential-collaboration questionnaire (chain positions 2, 3, and 4) and excluded all trials in which participants did not change the presented judgment of a previous participant. Change magnitude was computed as the absolute difference between the standardized error of a judgment and the standardized error of the previous judgment. We fitted a linear mixed model with change magnitude as continuous dependent variable and chain position as independent variable, and included polynomial contrasts for the factor chain position. In line with Hypothesis 1b, the model revealed a significant negative linear trend of chain position (β = −0.042, CI = [−0.074, −0.011], t(82.093) = −2.633, p = .010). Figure 4 shows the empirical means of change magnitude for each chain position with between-subjects error bars as well as violin and box plots of the distribution of change magnitudes for participants (aggregated across items).
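Change magnitude for a single chain follows directly from the standardized errors of its successive judgments. The error values below are hypothetical:

```python
import numpy as np

# Standardized errors of the successive judgments in one sequential chain
chain_errors = np.array([1.5, 1.5, -0.125, 0.0])

# Change magnitude: absolute difference between each judgment's standardized
# error and that of the previous judgment; maintained trials (difference 0)
# would be excluded from the Hypothesis 1b analysis
magnitudes = np.abs(np.diff(chain_errors))
print(magnitudes)  # 0.0 (maintained), 1.625, 0.125
```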
Hypothesis 2 states that judgments become more accurate over the course of a374
2Whenever testing polynomial contrasts, we only report results for the linear trend since we are
interested in a decrease in change probability (Hypothesis 1a), in change magnitude (Hypothesis 1b),
and in absolute error (Hypothesis 2). Other trends are only reported when being statistically
Figure 3
Change probability within a sequential chain.
[Figure: panels for Experiment 1 (chain positions 2-4) and Experiment 2 (chain positions 2-6); x-axis: chain position; y-axis: change probability.]
Note. Points display the empirical means for each chain position, error bars show the corresponding 95% between-subjects confidence intervals. Violin and box plots illustrate the distribution of change probabilities for the participants (aggregated across items).
sequential chain. To test this prediction, we only considered data from the sequential-collaboration condition. We fitted a linear mixed model with absolute standardized errors as dependent variable and chain position as independent variable. Since fixed-effect coefficients in linear mixed models have been shown to be robust when residuals are not normally distributed (LeBeau et al., 2018; Schielzeth et al., 2020), we did not transform the dependent variable. Furthermore, we specified polynomial contrasts for the factor chain position.
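The polynomial contrast coding for the chain-position factor can be illustrated with a small sketch. This is not the authors' code but a hypothetical reconstruction that mirrors R's contr.poly: orthonormal polynomial contrasts obtained by a QR decomposition of a Vandermonde matrix.

```python
import numpy as np

def poly_contrasts(k):
    """Orthogonal polynomial contrasts for a k-level factor (analogous to R's contr.poly)."""
    x = np.arange(1, k + 1, dtype=float)
    # Columns 1, x, x^2, ... of a Vandermonde matrix, orthonormalized via QR
    Q, _ = np.linalg.qr(np.vander(x, k, increasing=True))
    C = Q[:, 1:]                  # drop the constant (intercept) column
    return C * np.sign(C[-1, :])  # sign convention: the last level loads positively

# Contrasts for three chain positions (e.g., positions 2, 3, and 4):
# first column = linear trend, second column = quadratic trend
C = poly_contrasts(3)
```

Fitting the mixed model on these contrast-coded predictors yields one coefficient per trend, which is what the reported linear-trend tests refer to.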
Figure 5 shows the mean absolute standardized errors for each chain position with corresponding violin and box plots. In line with Hypothesis 2, judgments became more accurate, with the distance to the correct answer declining over the course of a
Figure 4
Change magnitude within a sequential chain.
[Figure: panels for Experiment 1 (chain positions 2-4) and Experiment 2 (chain positions 2-6); x-axis: chain position; y-axis: change magnitude.]
Note. Points display the empirical means for each chain position, error bars show the corresponding 95% between-subjects confidence intervals. Violin and box plots illustrate the distribution of change magnitude for the participants (aggregated across items).
sequential chain. This pattern was also confirmed by the linear mixed model showing a significant negative linear trend (β = −0.024, CI = [−0.036, −0.011], t(143.55) = −3.800, p < .001).
Hypothesis 3 states that sequential-collaboration estimates are more accurate than estimates obtained with unweighted averaging. Before testing this hypothesis, we checked whether the randomization worked as intended, that is, whether participants completing the independent-judgments questionnaire had the same judgment accuracy irrespective of whether they were included in the unweighted-averaging condition or served as starters for sequential chains (position 1) in the sequential-collaboration condition. We only considered data obtained with the independent-judgments questionnaire and
Figure 5
Accuracy of judgments within a sequential chain.
[Figure: panels for Experiment 1 (chain positions 1-4) and Experiment 2 (chain positions 1-6); x-axis: chain position; y-axis: absolute standardized error.]
Note. Points display the empirical means for each chain position, error bars show the corresponding 95% between-subjects confidence intervals. Violin and box plots illustrate the distribution of errors for the participants (aggregated across items).
fitted a linear mixed model with the absolute standardized error as dependent and condition as independent variable. We did not find a significant effect of condition on the absolute standardized error (β = 0.005, CI = [−0.012, 0.023], t(198.37) = 0.598, p = .550), indicating that the randomization worked as intended.
Next, we computed the accuracy of the group estimates for each condition. For sequential collaboration, the estimate for each chain is the judgment at the last chain position (position 4). The accuracy of these estimates is defined as the absolute standardized error of the last judgment in a sequential chain. For unweighted averaging, the group estimate is the mean of judgments for groups of four participants. We computed the absolute standardized errors of these estimates by (1) randomly assigning participants to virtual groups of four, (2) averaging the four standardized errors for each question, and (3) computing the absolute value of the average.³ Since the number of participants in the unweighted-averaging condition (i.e., 155 participants) is not a multiple of four, we randomly selected one participant whose data were duplicated before generating the virtual groups. This procedure resulted in absolute standardized errors of estimates of the 65 items presented in the study for 39 groups of participants in the unweighted-averaging condition and 37 sequences of participants in the sequential-collaboration condition.
To test Hypothesis 3, we fitted a linear mixed model with absolute standardized errors as dependent variable and condition as independent variable. Figure 6 displays the mean absolute standardized error and corresponding 95% between-subjects confidence intervals with violin and box plots referring to the distribution of group accuracy (aggregated across items). Contrary to Hypothesis 3, estimates in the sequential-collaboration condition showed a slightly higher mean absolute error, meaning that these estimates were descriptively less accurate. However, the linear mixed model did not show a significant effect of condition (β = 0.008, CI = [−0.005, 0.022], t(72.15) = 1.237, p = .220).
To explore the robustness of our results, we conducted additional analyses examining the influence of outlier exclusions and transformations of judgments on the accuracy of unweighted averaging and sequential collaboration. We also performed non-parametric, descriptive assessments of the hypotheses using the common-language effect size and observation-oriented modeling. These additional analyses led to similar conclusions as those reported in the main text and can be found in the supplementary material.
³ To check the robustness of the grouping, we computed the mean difference in absolute standardized errors of estimates between conditions and the corresponding linear mixed model for 100 different random groupings in the unweighted-averaging condition. The mean difference in absolute estimates was 0.007 (SD = 0.001) for Experiment 1 and 0.01 (SD = 0.001) for Experiment 2. The results of the linear mixed model remained the same for all 100 comparisons in both experiments.
Figure 6
Accuracy of unweighted averaging and sequential collaboration.
[Figure: panels for Experiment 1 and Experiment 2; conditions: sequential collaboration and unweighted averaging; y-axis: absolute error of estimates.]
Note. Estimates for sequential collaboration pertain to the last judgment in a sequential chain. Black points display the empirical means for each condition, error bars show the corresponding 95% between-subjects confidence intervals. Violin and box plots illustrate the corresponding distribution of participants (aggregated across items).
Exploratory comparison of different aggregation methods.
We additionally assessed the effect of various aggregation methods for obtaining group estimates. For independent judgments, we computed the mean and the median of the four judgments within each group. For sequential collaboration, we not only considered the last value in a chain but also aggregated the four individual judgments in a chain using the mean and the median. Furthermore, we computed a weighted mean with the weights 1/10, 2/10, 3/10, and 4/10 for judgments at chain positions 1, 2, 3, and 4, respectively.
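The four aggregation rules for a sequential chain can be sketched in a few lines. The judgment data below are simulated placeholders (37 chains of four judgments on 65 items, matching the shapes in Experiment 1); the weights are those stated above.

```python
import numpy as np

# Hypothetical standardized judgments: 37 chains x 4 chain positions x 65 items
rng = np.random.default_rng(7)
chains = rng.normal(0.0, 1.0, size=(37, 4, 65))

last_value    = chains[:, -1, :]             # judgment at the last chain position
mean_agg      = chains.mean(axis=1)          # unweighted mean over positions
median_agg    = np.median(chains, axis=1)    # median over positions
weights       = np.array([1, 2, 3, 4]) / 10  # positions 1-4 weighted 1/10 .. 4/10
weighted_mean = np.average(chains, axis=1, weights=weights)
```

Each rule yields one estimate per chain and item, which can then be scored by absolute or squared error as in Table 1.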
Table 1
Comparison of different aggregation methods for independent judgments and sequential collaboration.

Experiment     Condition                  Aggregation     Absolute Error (SE)   Squared Error (SE)
Experiment 1   Sequential collaboration   Last Value      0.16 (0.01)           0.11 (0.01)
                                          Mean            0.16 (0.01)           0.11 (0.01)
                                          Median          0.17 (0.01)           0.12 (0.01)
                                          Weighted Mean   0.16 (0.01)           0.11 (0.01)
               Independent judgments      Mean            0.16 (< 0.01)         0.08 (< 0.01)
                                          Median          0.14 (< 0.01)         0.07 (< 0.01)
Experiment 2   Sequential collaboration   Last Value      0.12 (< 0.01)         0.07 (0.01)
                                          Mean            0.13 (< 0.01)         0.08 (0.01)
                                          Median          0.13 (< 0.01)         0.08 (0.01)
                                          Weighted Mean   0.13 (< 0.01)         0.08 (0.01)
               Independent judgments      Mean            0.13 (< 0.01)         0.07 (< 0.01)
                                          Median          0.13 (< 0.01)         0.08 (< 0.01)
Experiment 3   Sequential collaboration   Last Value      46.17 (1.01)          4120.77 (208.22)
                                          Mean            46.00 (0.86)          3615.98 (152.01)
                                          Median          44.90 (0.95)          3820.80 (184.06)
                                          Weighted Mean   43.33 (0.85)          3338.07 (151.41)
               Independent judgments      Mean            53.74 (0.71)          4205.06 (114.84)
                                          Median          47.95 (0.74)          3723.38 (131.24)
For all aggregation methods and both conditions, Table 1 shows the mean absolute error and the mean squared error of the group estimates. The most accurate estimates were obtained when using median aggregation for independent judgments. Similar to the confirmatory analysis of Hypothesis 3, estimates obtained with unweighted averaging were descriptively more accurate than those of sequential collaboration, irrespective of the aggregation method, for both absolute and squared errors. For sequential collaboration, all aggregation methods yielded similarly accurate results, which were overall less accurate than aggregates of independent individual judgments.
Overall, Experiment 1 yielded mixed results. The results indicate that a basic assumption about sequential collaboration holds, namely, that contributors can improve the accuracy of previous judgments by correcting each other sequentially. There was, however, no evidence for Hypothesis 3, as unweighted averaging and sequential collaboration yielded similarly accurate group estimates. Furthermore, the exploratory moderator analyses showed that the accuracy of sequential collaboration was not much influenced by the aggregation method, whereas independent judgments yielded more accurate estimates when taking the median rather than the mean. This may be due to the median reducing the effect of extreme judgments, which can distort the mean.
While Experiment 1 provides first insights into sequential collaboration, it has some limitations which restrict the generalizability of our results. First, the sample was restricted to college students, who typically have a similar age and educational background. The limited diversity in individuals' expertise might have reduced the chances of improving previous judgments in sequential collaboration since the distribution of knowledge might have been too homogeneous. Second, we implemented a rather short chain length of four individuals. Results might differ when using longer chains since additional contributors might improve the sequential-collaboration estimate further.
Experiment 2
To address the limitations of Experiment 1, we conducted a second experiment using the same general knowledge questions while increasing the chain length from four to six and collecting an adult sample with no restrictions in age or education. Thereby, we test the robustness of our findings, especially the improvement of judgments within a sequential chain, and apply the paradigm to a more diverse sample and a longer sequential chain. The design and confirmatory analyses were again preregistered.⁴ Note that we did not preregister Hypotheses 1a and 1b concerning change probability and change magnitude within a sequential chain. Moreover, we improved the exclusion criteria for extreme judgments as already applied in Experiment 1 and added an exploratory analysis comparing different aggregation methods for independent judgments and sequential collaboration. All analyses concerning the accuracy of estimates (Hypotheses 2 and 3) are again based on absolute standardized errors, as in Experiment 1.
Materials, design, and procedure. Experiment 2 used the same design and questions as Experiment 1 (see Figure 2 and Table A1, respectively) with some minor adjustments. Since the sample was not restricted in age, we extended the time limit for answering a question from 30 to 40 seconds. Moreover, we implemented a chain length of six, meaning that the first participant in a sequential chain answered the independent-judgments questionnaire and was then followed by five participants answering the sequential-collaboration questionnaire.
Participants. A German panel provider sampled 686 participants for this study. Participants were compensated between 0.75 EUR and 1 EUR depending on the time for study completion. During data collection, 21 participants were suspected of looking up answers as they entered more than 10% correct answers and were thus excluded both from building sequential chains and from the analysis. Moreover, 8 participants had irregular answer patterns and were excluded. One participant was excluded since the position in the sequential chain was allocated to two different participants. After excluding these participants and, if necessary, participants in the corresponding
sequential chains, the final sample comprised 646 participants. Half of the participants were female (49.9%); the mean age was 48.1 years (SD = 19.5). Most participants had a college degree (27.2%), followed by a high-school diploma (25.5%) and vocational education (23.1%), while 24.2% of the participants had a lower educational attainment.

⁴ The preregistration form is available at
As preregistered, we first excluded judgments (and the corresponding chains) that were timed out after 40 seconds. After this exclusion, 40,324 out of 41,990 judgments remained. The judgments were then standardized item-wise as in Experiment 1 by subtracting the correct answer and dividing by the standard deviation of all independent judgments. Finally, the 1% most extreme judgments (and the corresponding chains) were excluded from the data, resulting in 39,699 judgments for the analysis. We conducted the same confirmatory and exploratory analyses as for Experiment 1.
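The preprocessing steps, item-wise standardization and the 1% outlier cut, can be sketched like this. The raw judgments, correct answers, and the split into independent judgments are all made up for illustration; only the logic follows the text.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical raw judgments (rows: judgments, columns: items) and correct answers
judgments = rng.normal(100.0, 20.0, size=(600, 65))
correct = np.full(65, 100.0)
independent = judgments[:120]  # assumed subset of independent judgments

# Item-wise standardization: subtract the correct answer, then divide by the
# standard deviation of the independent judgments for that item
z = (judgments - correct) / independent.std(axis=0)

# Exclude the 1% most extreme standardized judgments
cutoff = np.quantile(np.abs(z), 0.99)
kept = z[np.abs(z) <= cutoff]
```

Standardizing item-wise puts questions with very different answer scales (e.g., distances vs. populations) on a common error metric before aggregation.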
Confirmatory analyses.
To test whether change probability decreases over the course of a sequential chain (Hypothesis 1a), we fitted a generalized linear mixed model with the decision whether to adjust or maintain a judgment as dependent variable. Figure 3 displays the distribution of change probabilities for each chain position. As hypothesized, the change probability decreased over the course of a sequential chain as indicated by a significant negative linear trend (β = −0.581, CI = [−0.982, −0.181], z = −2.843, p = .004).
Next, we fitted a linear mixed model to test whether change magnitude decreases over the course of a sequential chain (Hypothesis 1b). Figure 4 shows that change magnitude slightly decreased over the course of a sequential chain. However, we did not find a significant negative linear trend (β = −0.016, CI = [−0.040, 0.009], t(174.228) = −1.249, p = .214).
To test whether the accuracy of judgments increases over the course of a sequential chain (Hypothesis 2), we estimated a linear mixed model with absolute standardized errors as dependent variable. As predicted, the model revealed a significant negative linear trend of chain position (β = −0.045, CI = [−0.054, −0.036], t(290.96) = −9.528, p < .001). Accordingly, Figure 5 shows a decrease of absolute standardized errors over the course of a sequential chain.
Before testing whether sequential collaboration yielded more accurate estimates than unweighted averaging (Hypothesis 3), we checked whether the randomized assignment to conditions was successful. We analyzed only data obtained with the independent-judgments questionnaire and fitted a linear mixed model with the absolute standardized error of individual judgments as dependent variable and condition as independent variable. The model did not show a significant effect of condition on accuracy (β = 0.003, CI = [−0.011, 0.017], t(390.99) = 0.385, p = .701). This indicates that no condition had an a priori advantage in judgment accuracy.
The error of group estimates in both conditions was computed as already established in Experiment 1. Since sequential chains had six participants, virtual groups for unweighted averaging were also composed of six participants.⁵ We then fitted a linear mixed model with the absolute standardized error of the group estimate as dependent variable. Figure 6 shows that sequential-collaboration estimates were slightly more accurate than those obtained with unweighted averaging. This impression was confirmed by a linear mixed model showing a significant difference of the absolute standardized error across conditions (β = 0.014, CI = [0.005, 0.023], t(100.71) = 3.067).
Exploratory comparison of different aggregation methods.
Table 1 shows a comparison of different aggregation methods for sequential collaboration and independent judgments. In contrast to Experiment 1, for absolute errors, sequential collaboration provided the most accurate aggregation of individual judgments when using the last judgment in a sequence. However, when focusing on squared instead of absolute errors, taking the mean in unweighted averaging was similarly accurate as the last value of a sequential chain. Overall, the differences between conditions and aggregation methods are small, and there seems to be no clear
⁵ To check the robustness of the grouping, we computed the mean difference in absolute standardized errors of estimates between conditions and the corresponding linear mixed model for 100 different random groupings in the unweighted-averaging condition. The mean difference in absolute estimates was 0.007 (SD = 0.001) for Experiment 1 and 0.01 (SD = 0.001) for Experiment 2. The results of the linear mixed model remained the same for all 100 comparisons in both experiments.
advantage in the accuracy of estimates between unweighted averaging and sequential collaboration. Further robustness analyses can be found in the supplementary material.
Experiment 2 also showed that a basic assumption of sequential collaboration holds, namely, that the accuracy of judgments increases through incremental changes. Moreover, sequential-collaboration estimates were more accurate than estimates obtained with unweighted averaging in the confirmatory analysis. However, the comparison of different aggregation methods across conditions revealed only small differences in accuracy between sequential collaboration and unweighted averaging.
Experiments 1 and 2 investigated the accuracy of sequential collaboration when eliciting quantitative judgments. However, both experiments used general knowledge questions, which limits the generalizability of the results and may pose some issues. First, the questions are prone to extreme judgments. For instance, one participant answered 120,000,000,000,000,000 kilometers to the question "How long is the mean distance between Earth and Moon?", for which the correct answer is 384,400 kilometers. Extreme judgments in the data might especially hurt the performance of unweighted averaging. Furthermore, general knowledge questions occur rather seldom in online collaborative projects, thus limiting the ecological validity of the conclusions. We therefore conducted a conceptual replication using different materials that are less prone to extreme judgments and more closely resemble actual online collaboration projects.
Experiment 3
Experiment 3 is a conceptual replication of Experiment 1. Both studies used a similar design with some minor changes due to the different material. Instead of general knowledge questions, participants were presented with geographic maps on which they had to locate the positions of different cities. We thus focus on two-dimensional location judgments (i.e., x- and y-coordinates) rather than one-dimensional numerical judgments. In contrast to general knowledge questions, two-dimensional location judgments on geographic maps are naturally constrained by the size of the map (more precisely, by the maximum distance between the correct location and all possible judgments), which limits the maximum range of extreme judgments. We preregistered Experiment 3.⁶ Beyond the preregistration, we also tested whether change probability (Hypothesis 1a) and change magnitude (Hypothesis 1b) decrease over the course of a sequential chain. Furthermore, we adjusted the outlier analysis as described in Experiments 1 and 2, and added an exploratory analysis comparing different aggregation methods.
Participants. We recruited 417 adult participants via a commercial German panel provider which compensated participants according to the time for completing the study. Since participants were presented with maps, they were supposed to participate only using a computer. Due to issues in the recruitment of participants by the panel provider, 39 participants were nonetheless able to access and complete the study using mobile devices. We excluded all participants using mobile devices and all sequences including these participants, thus excluding 70 participants in total. Additionally, four participants were able to access and complete the study a second time. Therefore, we excluded the data collected at the second participation. Since two of these participants were assigned to the sequential-collaboration condition for their second participation and sequences were built based on their judgments, we excluded another 10 participants in total. We also checked whether participants looked up the correct answers or clicked at a similar position for all items and identified one participant who was suspected of looking up answers and was thus excluded.
The final sample comprised 333 participants, of whom 46.0% were female. The mean age was 45.5 years (SD = 15.2). Participants had a diverse educational background, with 35.4% holding a college degree, 24.9% having a high-school diploma, 24.0% having vocational education, and 18.3% having a lower educational attainment.

⁶ The preregistration form is available at
Materials. As stimulus material, we selected seven maps displaying different European countries: (1) Italy; (2) France; (3) Germany; (4) the United Kingdom and Ireland; (5) Austria and Switzerland; (6) Spain and Portugal; and (7) Poland, Czechia, Hungary, and Slovenia. All maps were on a scale of 1:5,000,000 with an image resolution of 800 x 500 pixels. Regarding the available geographic information, the maps only showed land mass, oceans, and country borders. The countries of interest were colored white while all other countries were colored gray; oceans were colored blue and country borders were represented as black lines. Overall, we selected 57 cities across all seven maps. For each map, we selected between 4 and 17 cities while considering the expected geographic knowledge of German participants. Table B1 provides a comprehensive overview of the materials, and all maps are also available in the supplementary material.
Design and procedure. We randomly assigned participants to either the sequential-collaboration questionnaire (112 participants) or the independent-judgments questionnaire (221 participants). As in Experiment 1, we formed sequences of four participants, meaning that one participant who answered the independent-judgments questionnaire started a sequential chain followed by three participants who completed the sequential-collaboration questionnaire. This resulted in 183 participants in the unweighted-averaging condition and 150 participants in the sequential-collaboration condition.
After being informed about the aim of the study and providing informed consent, participants were instructed about the task. In the independent-judgments questionnaire, participants had to indicate the position of the given cities on the presented map as accurately as possible. In the sequential-collaboration questionnaire, participants were provided with the location judgment of a city given by a previous participant. Subsequently, they could choose either to modify the given position by indicating a new position or to directly continue to the next city without changing the presented location judgment. The order in which the seven maps were presented was randomized, as was the order of the presented cities within each map. Furthermore, each trial asked about the position of only one city such that participants provided only a single location judgment before continuing to the next city. Participants were given 40 seconds to indicate the city's position or to decide not to change the presented position. Additionally, participants completing the sequential-collaboration questionnaire had a waiting period of 2 seconds before they could continue to the next city. Finally, participants provided demographic information, were debriefed, and thanked for their participation.
As dependent variable for Hypotheses 2 and 3, we computed the Euclidean distance of each judgment to the correct answer.⁷ Next, we excluded 225 judgments (and the corresponding sequential chains) that were timed out after 40 seconds, meaning that 18,433 out of 18,981 judgments remained for analysis. We again excluded the 1% most extreme judgments (i.e., 184 judgments) as defined by the distance to the correct answer. After the exclusion of sequential chains that contained outliers, 18,161 judgments remained for analysis.
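For a single location judgment, the accuracy measure reduces to the Euclidean distance in pixel coordinates. The example coordinates below are made up for illustration:

```python
import numpy as np

judgment = np.array([412.0, 237.0])  # clicked position (x, y) in pixels
truth = np.array([400.0, 221.0])     # true city position (x, y) in pixels

# Euclidean distance between the judgment and the true position:
# sqrt(12^2 + 16^2) = 20 pixels
error = np.linalg.norm(judgment - truth)
```

The same distance, scaled by the 1:5,000,000 map scale, underlies the kilometer values reported in Figures 8 to 10.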
Confirmatory analyses.
To test whether change probability decreases over the course of a sequential chain (Hypothesis 1a), we fitted a generalized linear mixed model with the decision whether a judgment was adjusted or maintained as dependent variable. Figure 7 displays the change probability for each chain position with error bars as well as violin and box plots illustrating the distribution of change probabilities for participants (aggregated across items). In line with Hypothesis 1a, the plot shows a decreasing change probability with increasing chain position. This visual impression was confirmed by the model, which revealed a significant negative linear trend of chain position (β = −0.937, CI = [−1.845, −0.028], z = −2.021, p = .043).
⁷ All hypotheses were also analyzed using the x- and y-coordinates separately as dependent variables. These analyses yielded the same results as the analysis using Euclidean distances as dependent variable.
Figure 7
Change probability within a sequential chain in Experiment 3.
[Figure: x-axis: chain position; y-axis: change probability.]
Note. Points display the empirical means for each chain position, error bars show the corresponding 95% between-subjects confidence intervals. Violin and box plots illustrate the distribution of change probabilities for the participants (aggregated across items).
To test whether the magnitude of changes decreases over the course of a sequential chain (Hypothesis 1b), we fitted a linear mixed model with change magnitude as dependent variable. Figure 8 shows the empirical mean and distribution of change magnitude across chain positions. In line with our hypothesis, we found a significant negative linear trend of chain position (β = −14.952, CI = [−24.139, −5.765], t(108.971) = −3.190, p = .002). Furthermore, we also found a significant positive quadratic trend (β = 9.507, CI = [0.279, 18.736], t(108.984) = 2.019, p = .046), which indicates a larger difference between positions 2 and 3 than between positions 3 and 4 (cf. Figure 8).
To test whether judgments become more accurate over the course of a sequential
Figure 8
Change magnitude within a sequential chain in Experiment 3.
[Figure: x-axis: chain position; y-axis: change magnitude (in kilometers).]
Note. Distances were converted from pixels to kilometers for this figure. Points display the empirical means for each chain position, error bars show the corresponding 95% between-subjects confidence intervals. Violin and box plots illustrate the distribution of change magnitude for the participants (aggregated across items).
chain (Hypothesis 2), we fitted a linear mixed model with chain position as independent variable and the Euclidean distance of each judgment to the true position of a city as dependent variable. The model revealed a significant negative linear trend between chain position and distance (β = −17.610, CI = [−24.801, −10.419], t(145.17) = −4.777, p < .001). Furthermore, the quadratic trend was also significant (β = 8.316, CI = [1.125, 15.507], t(145.17) = 2.256, p = .026). In combination with the negative linear trend, the positive quadratic trend indicates that accuracy improved more between positions 2 and 3 than between positions 3 and 4, a pattern also displayed in Figure 9. An overview of the judgments given by the participants for each city can be found in the supplementary material.
Figure 9
Accuracy of judgments within a sequential chain in Experiment 3.
[Figure: x-axis: chain position (1-4); y-axis: distance to correct answer (in kilometers).]
Note. Distances were converted from pixels to kilometers for this figure. Points display the empirical means for each chain position, error bars show the corresponding 95% between-subjects confidence intervals. Violin and box plots illustrate the distribution of distances to the correct answers for the participants (aggregated across items).
Before comparing the accuracy of sequential collaboration and unweighted averaging (Hypothesis 3), we again performed a randomization check. To compare judgment accuracy in the independent-judgments questionnaire across conditions, we estimated a linear mixed model with the Euclidean distance to the true position as dependent variable. We did not find a significant difference in accuracy between judgments in the unweighted-averaging condition and judgments that were used to start sequential chains in the sequential-collaboration condition (β = −0.768, CI = [−10.691, 9.160], t(220.25) = −0.152, p = .879), thus indicating that the randomization was successful.
Figure 10
Accuracy of estimates obtained with unweighted averaging and sequential collaboration in Experiment 3.
[Figure: conditions: unweighted averaging and sequential collaboration; y-axis: distance to correct answer (in kilometers).]
Note. Distances were converted from pixels to kilometers for this figure. Estimates for sequential collaboration pertain to the last judgment in a sequential chain. Points display the empirical means for each condition, error bars show the corresponding 95% between-subjects confidence intervals. Violin and box plots illustrate the distribution of distances to the correct answers for the participants (aggregated across items).
We computed group estimates for each condition similarly as in Experiments 1 and 2. Estimates obtained with independent judgments were based on random groups of four participants. We averaged the four location judgments separately for each coordinate, thereby computing the geometric center. For sequential collaboration, estimates pertain to the last judgment in each chain. As a dependent variable, we computed the Euclidean distance between the resulting estimate and the true position of each city.
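A minimal sketch of the unweighted-averaging estimate for one city, with hypothetical judgments: the geometric center is the coordinate-wise mean, and its error is the Euclidean distance to the true position.

```python
import numpy as np

# Hypothetical location judgments (x, y in pixels) of four independent participants
judgments = np.array([[390.0, 210.0],
                      [420.0, 240.0],
                      [400.0, 230.0],
                      [410.0, 200.0]])
truth = np.array([400.0, 221.0])

center = judgments.mean(axis=0)         # geometric center: coordinate-wise mean
error = np.linalg.norm(center - truth)  # distance of the estimate to the true position
```

Averaging per coordinate before taking the distance is what distinguishes the group-level error from the mean of the four individual errors.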
Figure 11
Estimated and actual locations of five cities on the map of Italy.
[Figure: x- and y-coordinates (in pixels); markers: correct location, sequential collaboration, unweighted averaging; cities: Florence, Milan, Naples, Rome, Venice.]
Figure 10 displays the mean distance to the true position with corresponding error bars and violin and box plots indicating the distribution of mean distances for the two experimental conditions (aggregated across items). In line with Hypothesis 3, sequential-collaboration estimates resulted in a smaller distance to the true position than estimates obtained with unweighted averaging. This impression was supported by a linear mixed model with the Euclidean distance as dependent variable, showing that sequential collaboration yielded more accurate estimates than unweighted averaging (β = −7.411, CI = [−14.532, −0.301], t(80.70) = −2.049, p = .044).
Figure 11 illustrates the high accuracy of sequential collaboration for five cities on the map of Italy. The figure shows the mean estimates of the two methods as well as the actual positions of the five cities. For Florence, Milan, Rome, and Venice, sequential collaboration yielded more accurate estimates than unweighted averaging, while both methods yielded similarly accurate estimates for Naples. Across all maps and cities, sequential collaboration resulted in location estimates that were 16.03 kilometers closer to the actual positions than those obtained via unweighted averaging (see Figure 10). In Figure 11, this value is comparable to the distance between the two mean estimates for the location of Rome (13.55 kilometers). Similar plots for the other six maps are available in the supplementary material.
Exploratory comparison of different aggregation methods.
As in Experiments 1 and 2, we compared different aggregation methods
for sequential collaboration and unweighted averaging. Table 1 shows that sequential
collaboration generally yielded more accurate estimates than unweighted averaging
when focusing on the Euclidean distance. However, sequential-collaboration estimates
were similarly accurate when using the mean, median, or the weighted mean.
When measuring accuracy in terms of squared Euclidean distance, median aggregation
of independent judgments was more accurate than the last judgment in sequential
collaboration, but all other aggregation methods for sequential collaboration remained more
accurate.
Discussion
Experiment 3 replicated the results of Experiments 1 and 2 using geographic
maps instead of general knowledge questions. Sequential collaboration yielded more
accurate estimates over the course of a sequential chain while the change probability and
change magnitude of judgments decreased. Additionally, sequential collaboration
yielded more accurate results than unweighted averaging in the confirmatory analysis.
However, the exploratory analysis revealed that various methods of aggregating
judgments into estimates for sequential collaboration yielded similarly accurate estimates.
General discussion
Sequential collaboration describes a collaboration method in which contributors
form a sequential chain of judgments by deciding whether to adjust or maintain the
latest judgment provided by a previous contributor. In three online studies using
general knowledge questions and geographic maps, we examined whether change
probability and change magnitude decrease over the course of a sequential chain
(Hypotheses 1a and 1b, respectively) whereas judgment accuracy increases (Hypothesis
2). As a benchmark, we compared the accuracy of estimates obtained with sequential
collaboration to estimates obtained with unweighted averaging (Hypothesis 3). All
three experiments provided evidence that accuracy increased within sequential chains of
judgments, whereas a decrease in change probability was observed only in Experiments
2 and 3, and a decrease in change magnitude occurred only in Experiments 1 and 3.
Sequential collaboration outperformed unweighted averaging only in Experiments 2 and
3 while showing a similar level of accuracy in Experiment 1. Although this pattern was
less clear in the exploratory analyses, which showed mixed results for the comparison
with different aggregation measures, it is noteworthy that sequential collaboration and
unweighted averaging overall performed similarly well in terms of accuracy.
The present work contributes to research on how judgments are influenced by
information about the judgments of others. Several studies have already hinted
at dependent judgments being beneficial for individual judgments in certain
situations (Becker et al., 2017; King et al., 2012; Koehler & Beauregard, 2006; Minson
et al., 2017), with Miller and Steyvers (2011) implementing a design quite similar to
sequential collaboration. We extend this line of research by showing that even a very
high level of dependency between judgments can yield accurate estimates. The observed
improvement in accuracy within chains of judgments indicates that sequential
collaboration was neither obstructed by anchoring effects (Mussweiler et al., 2004;
Tversky & Kahneman, 1974) nor by high rates of inaccurate changes, for instance, due
to egocentric discounting (Bonaccio & Dalal, 2006). Our results show that sequential
collaboration provides accurate estimates in controlled online studies involving
quantitative knowledge and location judgments. This is in line with prior research
showing that large-scale online collaboration projects which rely on this basic
mechanism provide high-quality information (e.g., Leithner et al., 2010; Zielstra & Zipf,
2010).
Moderators across and within experiments
The mixed results across experiments may be due to several factors relevant for
the performance of sequential collaboration and unweighted averaging. First, while
Experiment 1 used a homogeneous sample of university students, Experiments 2 and 3
used more diverse samples (i.e., German adults with a wide range in age and
educational background). Limited diversity not only reduces the performance of
unweighted averaging (Davis-Stober et al., 2014; Oliveira & Nisbett, 2018); sequential
collaboration may also generally benefit from diverse samples, since heterogeneity in
knowledge increases the chances that a few experts provide accurate corrections to the
judgments of the remaining, less knowledgeable contributors. Hence, participants may
have had fewer chances of correcting each other in Experiment 1 compared to the more
diverse samples in Experiments 2 and 3.
Second, the inconsistent pattern of results may be due to the difficulty of the
different tasks. Whereas the general knowledge questions in Experiments 1 and 2 were
very difficult, locating cities on geographical maps (Experiment 3) was easier because
participants likely had some basic geographic knowledge (e.g., from school, the media,
or by visiting some of the cities). The difficulty of the tasks can explain the substantial
difference in change probabilities (approximately 20% in Experiments 1 and 2 compared
to 60% in Experiment 3). Higher change probabilities indicate that participants in the
sequential-collaboration condition were more likely to improve judgments within a
sequential chain.
Third, the experiments differed in the length of the sequential chains (i.e., the
group size varied between 4 and 6). Sequential collaboration may require longer chains,
especially for more difficult items, because substantial improvements in judgments occur
less frequently. In Experiment 1, the high item difficulty combined with the shorter
sequential chains of only four contributors may have reduced the performance of
sequential collaboration.
Fourth, certain tasks and materials may be better suited either for unweighted
averaging or for sequential collaboration. Unweighted averaging has often been shown to
be highly accurate when aggregating numerical point judgments as in Experiments 1 and 2
(Galton, 1907; Hueffer et al., 2013; Surowiecki, 2004) and is also quite robust against
biases in such scenarios (Davis-Stober et al., 2014). In contrast, sequential collaboration
may be more suitable for more complex tasks, such as positioning cities on maps as in
Experiment 3 or generating rank orders (Miller & Steyvers, 2011; Steyvers et al., 2009).
In more complex tasks, it is easier for contributors to integrate partial knowledge into
the judgment. For instance, one may know that Berlin is located close to the Polish
border and, in turn, correct the longitude of a presented judgment to the East without
modifying any other aspect of the judgment (e.g., latitude). The contribution of
partial knowledge is also likely in online collaborative projects such as Wikipedia, where
contributors usually edit only small parts of an article while rarely (re-)writing complete
articles. In complex tasks, mechanical aggregation methods may even lead to
unreasonable estimates. For instance, unweighted averaging of two-dimensional location
judgments can result in estimates for a city that are located in a lake or an ocean. In
contrast, contributors in sequential collaboration usually do not provide such
unreasonable judgments.
Fifth, we also compared different aggregation methods. For unweighted
averaging, we computed the mean and the median. For sequential collaboration, we
used the last value of a sequential chain as well as the mean, the median, and the
weighted mean across chain positions. We found that all aggregation methods yielded
similar results for absolute errors in Experiment 1, and for absolute as well as squared
errors in Experiment 2. However, in Experiment 1, unweighted averaging resulted in
smaller squared errors than sequential collaboration. In Experiment 3, sequential
collaboration generally yielded smaller absolute errors than unweighted averaging for all
aggregation methods. However, with respect to squared errors, sequential collaboration
showed a similar accuracy as unweighted averaging. Overall, different aggregation
methods can affect the accuracy of estimates. This may be especially relevant in small
groups of only four or six contributors in which extreme judgments have a large
influence on the resulting estimates.
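The four ways of turning a sequential chain into an estimate can be sketched as follows (one-dimensional, hypothetical judgments; the position-based weights in `weighted_mean` are an assumption for illustration, as the exact weighting scheme is not restated here):

```python
import statistics

def last(chain):
    """Use the final judgment of the sequential chain as the estimate."""
    return chain[-1]

def mean(chain):
    """Unweighted mean across all chain positions."""
    return statistics.mean(chain)

def median(chain):
    """Median across all chain positions; robust to extreme judgments."""
    return statistics.median(chain)

def weighted_mean(chain):
    # Weight judgments by chain position (1, 2, ..., n) so that later,
    # presumably more refined judgments count more. This particular
    # weighting scheme is an illustrative assumption.
    weights = range(1, len(chain) + 1)
    return sum(w * j for w, j in zip(weights, chain)) / sum(weights)

chain = [300, 320, 310, 305]  # hypothetical judgments along one sequential chain
estimates = {f.__name__: f(chain) for f in (last, mean, median, weighted_mean)}
```

For this toy chain the methods disagree only slightly, but in small groups of four to six contributors a single extreme judgment can pull the mean and weighted mean noticeably while leaving the median and the last value unaffected.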
Possible mechanisms
Our results provide initial insights into sequential collaboration, but the three
experiments do not explain why sequential collaboration yields accurate
results. Prior research points to different mechanisms that could lead to improved
judgments over the course of a sequential chain. Sequential collaboration may yield
accurate results because individual judgments are implicitly weighted by expertise.
Both the weighting of judgments by expertise (Budescu & Chen, 2014; Merkle et al.,
2020) and the selection of experts based on prior performance (Mannes et al., 2014)
improve accuracy when aggregating independent judgments. However, sequential
collaboration implements neither an explicit weighting mechanism nor a selection of
individuals based on expertise. Instead, the task structure allows contributors to
maintain a judgment if they do not feel they can substantially contribute to the
presented judgment. In fact, subjective confidence can be a valid indicator of expertise
(Mannes et al., 2014), leading participants to not contribute to group work when they
feel their judgment is dispensable (Kerr & Bruun, 1983). Judgments in sequential
collaboration are implicitly weighted since experts are more likely to make adjustments,
thus adding a larger contribution to the final outcome. In contrast, contributors
without expertise are more likely to maintain the presented judgment. In an ideal case,
sequential collaboration may lead to improvements in judgments until a correct
judgment is no longer changed.
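This implicit-weighting account can be illustrated with a deliberately simplified toy model (all numbers, the confidence threshold, and the decision rule are hypothetical assumptions, not the experimental procedure):

```python
def sequential_chain(start, contributors):
    """Each contributor either maintains the presented judgment or replaces it
    with their own estimate. The (assumed) decision rule: a contributor adjusts
    only if their expertise exceeds a fixed threshold, mimicking opting out
    based on subjective confidence."""
    judgment = start
    for estimate, expertise in contributors:
        if expertise > 0.5:  # only confident contributors adjust
            judgment = estimate
    return judgment

# Hypothetical true value: 100. Tuples are (private estimate, expertise in [0, 1]).
contributors = [(150, 0.2), (96, 0.9), (40, 0.1), (101, 0.8)]

final = sequential_chain(120, contributors)         # last confident expert's estimate
mean_of_all = sum(e for e, _ in contributors) / 4   # unweighted mean of all estimates
```

Because the two low-expertise contributors (estimates 150 and 40) opt out, the final judgment (101) lands closer to the true value than the unweighted mean of all four estimates (96.75), which their extreme estimates distort. In this sense, opting out implicitly weights the chain toward expert judgments.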
The high accuracy of sequential collaboration may merely be due to contributors'
opportunity to opt out of answering rather than to the dependent nature of sequential
judgments. In fact, allowing participants to self-select the subset of
questions to be answered improves judgment accuracy and group estimates (Bennett et
al., 2018; Galton, 1907). However, typical designs building on unweighted averaging do
not allow individuals to select whether to answer a question but rather require
them to answer all questions irrespective of their expertise or meta-cognitive knowledge
(Larrick & Soll, 2006). In our three studies, we compared this typical paradigm to
sequential collaboration, meaning that the two experimental conditions differed not only
in the level of dependency but also in the possibility of opting out of providing a
judgment. Future research should further examine the role of the opt-out mechanism in
sequential collaboration.
In general, allowing participants to self-select whether to answer a question when
applying unweighted averaging to the resulting judgments may lead to widely varying
numbers of judgments per question. Bennett et al. (2018) observed that some of the
easiest questions were answered by almost all participants, while other, more difficult
questions were answered by only very few participants or none at all.
This can render independent individual judgments with an opt-out option rather
uneconomical and inefficient. In contrast, the opt-out mechanism in sequential
collaboration ensures a higher efficiency because the decision whether to opt out is made
in the context of the perceived quality of the presented judgment. Hence, opting out of
adjusting a judgment is informative with respect to the accuracy of the current estimate.
Limitations and future research
Our experiments on sequential collaboration have some limitations. First, we
only studied a very simple type of sequential collaboration in which contributors could
only change or maintain a previous judgment. However, Wikipedia and OpenStreetMap
offer several additional functions such as discussion pages, a board of moderators
checking on the contributors' activities, and a history of all changes ever made to an
entry. These additional functions are likely to influence contributors' behavior in online
collaborative projects even though they are less prominent than the information itself,
which is directly available in an article or on a map.
Lastly, our studies used small chains of only four or six contributors. While
estimates obtained with unweighted averaging improve for larger crowds, estimates
obtained with sequential collaboration may not profit from longer chains. On the
contrary, sequential chains may not converge to the correct judgment because this
requires that at least one contributor adjusts the presented judgment to be correct.
Whereas correct judgments become more likely as chains become longer, they can still
be modified by subsequent, less knowledgeable contributors. Moreover, anchoring due
to the presented (possibly incorrect) judgments may hinder contributors from providing
accurate judgments in the long run. Based on these considerations and our findings, we
expect that sequential collaboration has advantages over unweighted averaging in short
sequential chains. However, for longer chains of about twelve to fifteen contributors,
unweighted averaging would probably outperform sequential collaboration. To address
this open question, future research should assess the performance of longer sequential
chains.
Conclusion
Sequential collaboration is at the core of many large-scale online collaborative
projects such as Wikipedia or OpenStreetMap. Our studies show that contributors can
successfully collaborate by adjusting and maintaining previous judgments of other
contributors. More generally, sequential collaboration has high practical and
theoretical relevance and provides a fruitful paradigm for studying how individuals
perceive and use information provided by others in order to decide whether and how to
adjust previous judgments.
References
André, Q. (2021). Outlier exclusion procedures must be blind to the researcher's hypothesis. Journal of Experimental Psychology: General.
Arazy, O., Morgan, W., & Patterson, R. (2006). Wisdom of the crowds: Decentralized knowledge construction in Wikipedia. SSRN Electronic Journal.
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
Becker, J., Brackbill, D., & Centola, D. (2017). Network dynamics of social influence in the wisdom of crowds. Proceedings of the National Academy of Sciences, 114, E5070–E5076.
Bennett, S. T., Benjamin, A. S., Mistry, P. K., & Steyvers, M. (2018). Making a wiser crowd: Benefits of individual metacognitive control on crowd performance. Computational Brain & Behavior, 1, 90–99.
Bonaccio, S., & Dalal, R. S. (2006). Advice taking and decision-making: An integrative literature review, and implications for the organizational sciences. Organizational Behavior and Human Decision Processes, 101, 127–151.
Bonner, B. L., Sillito, S. D., & Baumann, M. R. (2007). Collective estimation: Accuracy, expertise, and extroversion as sources of intra-group influence. Organizational Behavior and Human Decision Processes, 103, 121–133.
Bray, R. M., Kerr, N. L., & Atkin, R. S. (1978). Effects of group size, problem difficulty, and sex on group performance and member reactions. Journal of Personality and Social Psychology, 36, 1224–1240.
Budescu, D. V., & Chen, E. (2014). Identifying expertise to extract the wisdom of crowds. Management Science, 61, 267–280.
Dalkey, N., & Helmer, O. (1963). An experimental application of the DELPHI method to the use of experts. Management Science, 9, 458–467.
Davis-Stober, C. P., Budescu, D. V., Dana, J., & Broomell, S. B. (2014). When is a crowd wise? Decision, 1(2), 79–101.
Dennis, A. R. (1996). Information exchange and use in small group decision making. Small Group Research, 27, 532–550.
Dennis, A. R., Hilmer, K. M., & Taylor, N. J. (1998). Information exchange and use in GSS and verbal group decision making: Effects of minority influence. Journal of Management Information Systems, 14(3), 61–88.
Galton, F. (1907). Vox populi. Nature, 75, 450–451.
Geist, M. R. (2010). Using the Delphi method to engage stakeholders: A comparison of two studies. Evaluation and Program Planning, 33, 147–154.
Giles, J. (2005). Internet encyclopaedias go head to head. Nature, 438, 900–901.
Girres, J.-F., & Touya, G. (2010). Quality assessment of the French OpenStreetMap dataset. Transactions in GIS, 14, 435–459.
Hogarth, R. M. (1978). A note on aggregating opinions. Organizational Behavior and Human Performance, 21, 40–46.
Hueffer, K., Fonseca, M. A., Leiserowitz, A., & Taylor, K. M. (2013). The wisdom of crowds: Predicting a weather and climate-related event. Judgment and Decision Making, 8, 16.
Jeste, D. V., Ardelt, M., Blazer, D., Kraemer, H. C., Vaillant, G., & Meeks, T. W. (2010). Expert consensus on characteristics of wisdom: A Delphi method study. The Gerontologist, 50, 668–680.
Keck, S., & Tang, W. (2020). Enhancing the wisdom of the crowd with cognitive-process diversity: The benefits of aggregating intuitive and analytical judgments. Psychological Science, 1272–1282.
Kerr, N. L., & Bruun, S. E. (1983). Dispensability of member effort and group motivation losses: Free-rider effects. Journal of Personality and Social Psychology, 44, 78–94.
Kerr, N. L., & Tindale, R. S. (2004). Group performance and decision making. Annual Review of Psychology, 55, 623–655.
King, A. J., Cheng, L., Starke, S. D., & Myatt, J. P. (2012). Is the true "wisdom of the crowd" to copy successful individuals? Biology Letters, 8, 197–200.
Kittur, A., & Kraut, R. E. (2008). Harnessing the wisdom of crowds in Wikipedia: Quality through coordination. Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work, 37–46.
Koehler, D. J., & Beauregard, T. A. (2006). Illusion of confirmation from exposure to another's hypothesis. Journal of Behavioral Decision Making, 19.
Kräenbring, J., Monzon Penza, T., Gutmann, J., Muehlich, S., Zolk, O., Wojnowski, L., Maas, R., Engelhardt, S., & Sarikas, A. (2014). Accuracy and completeness of drug information in Wikipedia: A comparison with standard textbooks of pharmacology. PLoS ONE, 9(9).
Kuznetsova, A., Brockhoff, P. B., & Christensen, R. H. B. (2017). lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software, 82(13), 1–26.
Larrick, R. P., & Soll, J. B. (2006). Intuitions about combining opinions: Misappreciation of the averaging principle. Management Science, 52.
Laughlin, P. R., Bonner, B. L., Miner, A. G., & Carnevale, P. J. (1999). Frames of reference in quantity estimations by groups and individuals. Organizational Behavior and Human Decision Processes, 80, 103–117.
LeBeau, B., Song, Y. A., & Liu, W. C. (2018). Model misspecification and assumption violations with the linear mixed model: A meta-analysis. SAGE.
Leithner, A., Maurer-Ertl, W., Glehr, M., Friesenbichler, J., Leithner, K., & Windhager, R. (2010). Wikipedia and osteosarcoma: A trustworthy patients' information? Journal of the American Medical Informatics Association: JAMIA, 17, 373–374.
Lu, L., Yuan, Y. C., & McLeod, P. L. (2012). Twenty-five years of hidden profiles in group decision making: A meta-analysis. Personality and Social Psychology Review, 16, 54–75.
Mannes, A. E., Soll, J. B., & Larrick, R. P. (2014). The wisdom of select crowds. Journal of Personality and Social Psychology, 107, 276–299.
Merkle, E. C., Saw, G., & Davis-Stober, C. (2020). Beating the average forecast: Regularization based on forecaster attributes. Journal of Mathematical Psychology, 98, 102419.
Miller, B. J., & Steyvers, M. (2011). The wisdom of crowds with communication. Proceedings of the Annual Meeting of the Cognitive Science Society, 33.
Minson, J. A., Mueller, J. S., & Larrick, R. P. (2017). The contingent wisdom of dyads: When discussion enhances vs. undermines the accuracy of collaborative judgments. Management Science, 64, 4177–4192.
Mussweiler, T., Englich, B., & Strack, F. (2004). Anchoring effect. In R. F. Pohl (Ed.), Cognitive illusions (1st ed., pp. 183–199). Psychology Press.
Niederer, S., & Dijck, J. van. (2010). Wisdom of the crowd or technicity of content? Wikipedia as a sociotechnical system. New Media & Society, 12.
Oliveira, S. de, & Nisbett, R. E. (2018). Demographically diverse crowds are typically not much wiser than homogeneous crowds. Proceedings of the National Academy of Sciences, 115, 2066–2071.
OpenStreetMap Contributors. (2021). OpenStreetMap.
Pinheiro, J. C., & Bates, D. M. (2000). Linear mixed-effects models: Basic concepts and examples. In Mixed-effects models in S and S-PLUS (pp. 3–56). Springer.
Pohl, R. F. (1998). The effects of feedback source and plausibility of hindsight bias. European Journal of Cognitive Psychology, 10, 191–212.
Schielzeth, H., Dingemanse, N. J., Nakagawa, S., Westneat, D. F., Allegue, H., Teplitsky, C., Réale, D., Dochtermann, N. A., Garamszegi, L. Z., & Araya-Ajoy, Y. G. (2020). Robustness of linear mixed-effects models to violations of distributional assumptions. Methods in Ecology and Evolution, 11, 1141–1152.
Simmons, J. P., Nelson, L. D., Galak, J., & Frederick, S. (2011). Intuitive biases in choice versus estimation: Implications for the wisdom of crowds. Journal of Consumer Research, 38, 1–15.
Stasser, G., & Titus, W. (1985). Pooling of unshared information in group decision making: Biased information sampling during discussion. Journal of Personality and Social Psychology, 48, 1467–1478.
Steyvers, M., Miller, B., Hemmer, P., & Lee, M. (2009). The wisdom of crowds in the recollection of order information. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, & A. Culotta (Eds.), Advances in neural information processing systems (Vol. 22, pp. 1785–1793). Curran Associates, Inc.
Surowiecki, J. (2004). The wisdom of crowds (1st ed.). Anchor Books.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124–1131.
Wikipedia Contributors. (2021). Wikipedia: About.
Yaniv, I., & Kleinberger, E. (2000). Advice taking in decision making: Egocentric discounting and reputation formation. Organizational Behavior and Human Decision Processes, 83, 260–281.
Zhang, X., Astivia, O. L. O., Kroc, E., & Zumbo, B. D. (2022). How to think clearly about the central limit theorem. Psychological Methods. In press.
Zielstra, D., & Zipf, A. (2010). Quantitative studies on the data quality of OpenStreetMap in Germany. AGILE 2010: The 13th AGILE International Conference on Geographic Information Science.
Appendix A
General Knowledge Questions
Table A1
Table of items for Experiment 1 and Experiment 2 using general knowledge questions.
Item Question Correct answer
1 How large is the Eiffel Tower? 300 meters
2 How many sovereign countries are located in Africa? 54 countries
3 How long is the Nile? 6650 kilometers
4 How old was Johann Wolfgang von Goethe? 82 years old
5 How many bones does a human have? 214 bones
6 What is earth’s mean radius? 6371 kilometers
7 How old was Martin Luther King Jr.? 39 years old
8 How tall is the Brandenburg Gate? 26 meters
9 How high was the highest temperature ever measured on earth? 57 °C
10 At what temperature does lead melt? 328 °C
11 In which year did the first manned space flight take place? 1961
12 How high is Mount Everest? 8848 meters
13 How much does a tennis ball weigh? 57 grams
14 How many keys does a typical piano have? 88 keys
15 How fast can a cheetah run? 112 kilometers per hour
16 How long can a blue whale become? 33 meters
17 How much do ten liters of oxygen weigh? 14 grams
18 When was UNICEF founded? 1946
19 How many prime numbers are in the interval between 1 and 1000? 168 prime numbers
20 How many star constellations are officially recognized? 88 constellations
21 How many kilocalories do ten gummy bears have (i.e., 30 grams)? 98 kilocalories
22 How long is a soccer goal? 7 meters
23 When was the last capital punishment enforced in France? 1977
24 How many plays of Shakespeare are preserved? 33 plays
25 How long is the kidney of a full-grown person? 12 centimeters
26 How many species of the Hawaiian honeycreeper exist? 21 species
27 When was the lightning rod invented? 1752
28 When did the first modern Olympic Games take place? 1896
29 How fast can a raindrop fall? 9 meters per second
30 When was Leonardo da Vinci born? 1452
31 What is the maximum time that a total solar eclipse can take? 7 minutes
32 How many strings does a concert harp have? 47 strings
33 What is mean life expectancy for women in Germany? 81 years
34 How wide is Lake Constance at its widest point? 14 kilometers
35 How long is the distance between earth and sun in million kilometers? 150 million kilometers
36 When was women's suffrage adopted in Switzerland? 1971
37 How many chapters does the Quran have? 114 chapters
38 How many times larger is the diameter of Jupiter compared to the diameter of Earth? 11 times
39 How large is the island of Borkum? 31 square-kilometers
40 How many singles did the Beatles officially release? 22 singles
41 How old was Alexander the Great when he waged his first campaign? 18 years old
42 How many species of insects live in Antarctica? 52 species
43 How many federal states does Austria have? 9 federal states
44 When was the first human heart transplant performed? 1967
45 How many marriages were there in Germany in 2018? 449,466 marriages
46 How many students were enrolled in German universities in the winter term of 2019/2020? 2,897,336 students
47 How many floors does Burj Khalifa have? 163 floors
48 How far is Frankfurt (Main) from Berlin (linear distance)? 424 kilometers
49 When was the first color film available in Germany? 1936
50 When was the numerus clausus first applied in German universities? 1968
51 How far is Paris from London (linear distance)? 343 kilometers
52 How far is Dortmund from Hamburg (linear distance)? 284 kilometers
53 How far is Munich from Athens (linear distance)? 1,496 kilometers
54 How tall is the Statue of Liberty including its pedestal? 93 meters
55 When was slavery officially ended in the United States? 1865
56 When was the first Autobahn inaugurated? 1921
57 When did Albert Schweitzer receive the Nobel Peace Prize? 1952
58 How long is the mean distance between Earth and Moon? 384,400 kilometers
59 In which year was Uranus discovered by William Herschel? 1781
60 How many letters does the Arabic script have? 28 letters
61 How deep is the Pacific at the deepest point? 10,094 meters
62 When was Astrid Lindgren born? 1907
63 How much does the heart of a full-grown person weigh? 300 grams
64 How long can a Green Anaconda become? 8 meters
65 After how many days has a person’s top layer of skin completely renewed? 28 days
Appendix B
Cities selected for different maps
Table B1
Table of items for Experiment 3 using map material.
Item Map Cities
1 Austria and Switzerland Zurich, Geneva, Basel, Bern, Vienna, Graz, Linz, Salzburg
2 France Paris, Marseille, Lyon, Toulouse, Nice
3 Italy Rome, Milan, Naples, Florence, Venice
4 Spain and Portugal Madrid, Barcelona, Seville, Lisbon, Porto
5 United Kingdom and Ireland London, Birmingham, Glasgow, Liverpool, Dublin
6 Poland, Czechia, Hungary, and Slovakia Warsaw, Prague, Bratislava, Budapest
7 Germany Berlin, Hamburg, Cologne, Frankfurt, Stuttgart, Düsseldorf,
Leipzig, Dortmund, Essen, Bremen, Dresden, Hannover,
Nuremberg, Duisburg, Wuppertal, Bielefeld, Bonn, Münster,
Karlsruhe, Mannheim, Augsburg, Wiesbaden, Braunschweig,
Kiel, Munich
... give more weight to their own judgment compared to the presented judgment. 168 The fact that judgments become increasingly more accurate over the course of a 169 sequential chain (Mayer & Heck, 2021) indicates that contributors decide whether to 170 opt in or opt out of revising the presented judgments according to their expertise. Such 171 a process requires individuals to rely on task-related metacognitive knowledge about 172 their expertise. ...
... It is neither 179 necessary to assign expert roles before group discussions, nor is it required to estimate In Experiment 1, we measured expertise in a city-location task before letting 251 participants decide whether to change or maintain location judgments with varying 252 distances to the correct answer. To this end, we draw on an established paradigm 253 already used by Mayer and Heck (2021) to investigate sequential collaboration. In the 254 original study, participants positioned 57 European cities on maps. ...
... Participants first provided independent location judgments for 17 cities which 280 served as a measure of expertise. We ensured a wide range in item difficulty by selecting 281 cities based on the accuracy of independently provided judgments in a previous study 282 (Mayer & Heck, 2021). Next, in the sequential phase, the remaining 40 cities were 283 presented together with a preselected location judgment which was framed as a 284 response of a previous participant. ...
Full-text available
Many collaborative online projects such as Wikipedia and OpenStreetMap organize collaboration among their contributors sequentially. In sequential collaboration, one contributor creates an entry which is consecutively encountered by other contributors who then decide whether to adjust or maintain the presented entry. Sequential collaboration yields improved judgments over the course of a sequential chain and results in accurate final estimates. We hypothesize that these benefits emerge since contributors adjust entries according to their expertise, implying that judgments of experts have a larger impact compared to those of novices. In three preregistered studies, we measured and manipulated expertise to investigate whether expertise leads to higher change probabilities and larger improvements in judgment accuracy. Moreover, we tested whether expertise results in an increase in accuracy over the course of a sequential chain. As expected, experts adjusted entries more frequently, made larger improvements, and contributed more to the final estimates of sequential chains. Overall, our findings show that the high accuracy of sequential collaboration is due to an implict weighting of judgments by expertise.
... Moreover, we use simulations to examine under which conditions CCT's weighting of judgments by individuals' competence improves the accuracy of location estimates at the group level. Empirically, we apply the new model to reanalyze location judgments of European cities on maps (Mayer & Heck, 2021) and compare the accuracy of the aggregate location estimates to those obtained with unweighted averaging. Overall, the results of our simulation studies and the empirical reanalysis show that CCT's weighting of individual location judgments by informants' competence improves the estimation accuracy compared to weighting all judgments equally. ...
... As in all CCT models, the model requires that multiple informants provide judgments for a set of items from the same competence domain (Weller, 2007). For instance, as illustrated in Figure 1A, several informants could be asked to locate different European cities such as London on geographic maps (Mayer & Heck, 2021). Locations can be measured in different units depending on the application. ...
... In addition to the simulation study, we also apply the CCT-2D model to empirical data of participants who located various European cities on geographic maps (Mayer & Heck, 2021). Additionally, we compare the accuracy of aggregated location judgments of CCT-2D and unweighted averaging. ...
Cultural consensus theory is a model-based approach for analyzing responses of informants when correct answers are unknown. The model provides aggregate estimates of the latent consensus knowledge at the group level while accounting for heterogeneity both with respect to informants' competence and items' difficulty. We develop a specific version of cultural consensus theory for two-dimensional continuous judgments as obtained when asking informants to locate a set of unknown sites on a geographic map. The new model is fitted using hierarchical Bayesian modeling, with a simulation study indicating satisfactory parameter recovery. We also assess the accuracy of the aggregate location estimates by comparing the new model against simply computing the unweighted average of the informants' judgments. A simulation study shows that, due to weighting judgments by the inferred competence of the informants, cultural consensus theory provides more accurate location estimates than unweighted averaging. This result is also supported in an empirical study in which individuals judged the location of European cities on maps.
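The contrast between unweighted averaging and competence-weighted aggregation can be illustrated with a minimal sketch. The simple normalized-weight scheme below is a stand-in for the full hierarchical CCT-2D model, not the authors' implementation, and the judgments and competence scores are hypothetical:

```python
import numpy as np

# Hypothetical 2D location judgments (x, y map coordinates) for one city
# from four informants; informant 3 is assumed to be the most competent.
judgments = np.array([
    [10.0, 52.0],
    [14.0, 48.0],
    [12.1, 50.2],   # judgment of the most competent informant
    [20.0, 60.0],   # an inaccurate outlier
])
competence = np.array([0.5, 0.6, 0.9, 0.1])  # assumed, not model-estimated

# Unweighted average: every informant counts equally.
unweighted = judgments.mean(axis=0)

# Competence-weighted average: normalize competences to weights summing to 1.
weights = competence / competence.sum()
weighted = weights @ judgments

truth = np.array([12.0, 50.0])  # assumed true location
error_unweighted = np.linalg.norm(unweighted - truth)
error_weighted = np.linalg.norm(weighted - truth)
print(error_unweighted, error_weighted)
```

Because the outlier receives little weight, the weighted estimate lands closer to the assumed true location than the unweighted mean, mirroring the pattern reported in the simulation study.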
... The change history including the modifications of all OpenStreetMap elements can be traced back to its creation by using the contribution view for data extraction. This comprehensive information on the processes of adding and modifying elements can help to gain insights on how collaboration in online communities works and how information on OpenStreetMap develops (Mayer & Heck, 2021). ...
Big data are not yet commonly used in psychological research as they are often difficult to access and process. One source of behavioral data containing both spatial and thematic information is OpenStreetMap, a collaborative online project aiming to develop a comprehensive world map. Besides spatial and thematic information about buildings, streets, and other geographical features, the collected data also contain information about the contribution process itself. Even though such data can be potentially useful for studying individual judgments and group processes within a natural context, behavioral data generated in OpenStreetMap have not yet been easily accessible for scholars in psychology and the social sciences. To overcome this obstacle, we developed a software package which makes OpenStreetMap data more accessible and allows researchers to extract data sets from the OpenStreetMap database as CSV or JSON files. Furthermore, we show how to select relevant map sections in which contributor activity is high and how to model and predict the behavior of contributors in OpenStreetMap. Moreover, we discuss opportunities and possible limitations of using behavioral data from OpenStreetMap as a data source.
Much advice taking research investigates whether advice weighting accords with normative principles for maximizing decision accuracy. The present research complements this normative perspective with an interpersonal one, arguing that judges should also pay attention to how much their advisors want them to weight advice. In four experiments, we found that advisors do not always want their advice to be adopted fully. Instead, they often give advice about which they are uncertain and therefore want their advice to be averaged with judges' initial opinions or not used at all. Furthermore, advisors' desired advice weighting is often congruent with the judges' actual weighting, but moderators that affect advisor or judge confidence can cause desired and actual weighting to diverge (Experiments 1 and 2). When tasks were difficult, judges put more weight on the advice than advisors desired, because increasing the difficulty of the task led advisors to want their advice weighted less, whereas judges placed more weight on the advice. The reverse was true for easy tasks (Experiment 2). Importantly, both weighting more and less than advisors desired caused advisors to evaluate judges more negatively, which resulted in reduced willingness to give advice again in the future (Experiments 3 and 4), indicating that advisors want their advice used, but not too much.
This meta-analysis attempts to synthesize the Monte Carlo (MC) literature for the linear mixed model under a longitudinal framework. The meta-analysis aims to inform researchers about conditions that are important to consider when evaluating model assumptions and adequacy. In addition, the meta-analysis may be helpful to those wishing to design future MC simulations in identifying simulation conditions. The meta-analysis uses the empirical type I error rate as the effect size, and MC simulation conditions were coded to serve as moderator variables. The type I error rate for the fixed and random effects is explored as the primary dependent variable. Effect sizes were coded from 13 studies, resulting in a total of 4,002 and 621 effect sizes for fixed and random effects, respectively. Meta-regression and proportional odds models were used to explore variation in the empirical type I error rate effect sizes. Implications for applied researchers and researchers planning new MC studies are explored.
G-I transfer denotes an increase in individual performance due to group interaction, for example, because of acquiring certain skills or knowledge from the other group members. Whereas such G-I transfer has been successfully shown for problem-solving tasks, evidence for G-I transfer on quantitative estimation tasks is scarce. We address this research gap with a focus on how often a group has to interact in order to fully exploit the benefit of this learning effect. Results from two experiments support the idea that a single group interaction is sufficient to induce a stable G-I transfer, which reduces group members’ metric error. Smaller metric errors indicate that people improved their representation of the correct upper and lower boundaries, or what range of values is plausible. In contrast to nominal groups, both members of continuously interacting groups and members of groups with only one initial interaction exhibited stable G-I transfer, and the size of this transfer did not significantly differ between the latter two conditions. Furthermore, we found evidence for differential weighting of group members’ individual contributions that goes beyond sheer individual capability gains under certain circumstances, namely in tasks with a population bias.
A longstanding problem in the social, biological, and computational sciences is to determine how groups of distributed individuals can form intelligent collective judgments. Since Galton's discovery of the "wisdom of crowds" [Galton F (1907) Nature 75:450-451], theories of collective intelligence have suggested that the accuracy of group judgments requires individuals to be either independent, with uncorrelated beliefs, or diverse, with negatively correlated beliefs [Page S (2008) The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies]. Previous experimental studies have supported this view by arguing that social influence undermines the wisdom of crowds. These results showed that individuals' estimates became more similar when subjects observed each other's beliefs, thereby reducing diversity without a corresponding increase in group accuracy [Lorenz J, Rauhut H, Schweitzer F, Helbing D (2011) Proc Natl Acad Sci USA 108:9020-9025]. By contrast, we show general network conditions under which social influence improves the accuracy of group estimates, even as individual beliefs become more similar. We present theoretical predictions and experimental results showing that, in decentralized communication networks, group estimates become reliably more accurate as a result of information exchange. We further show that the dynamics of group accuracy change with network structure. In centralized networks, where the influence of central individuals dominates the collective estimation process, group estimates become more likely to increase in error.
In a variety of real-world forecasting contexts, researchers have demonstrated that the unweighted average forecast is reasonably accurate and difficult to improve upon with more complex, model-based aggregation methods. We investigate this phenomenon by systematically examining the relationship between individual forecaster characteristics (e.g., bias, consistency) and aspects of the criterion being forecast (e.g., “signal strength”). To this end, we develop a model inspired by Cultural Consensus Theory (Batchelder and Romney, 1988) that (i) allows us to jointly estimate both forecaster characteristics and environmental characteristics and (ii) contains the unweighted average as a special case. This allows us to use the model as a regularization method for forecast aggregation, where restrictions on forecaster parameters make the model similar to use of an unweighted average. Relatedly, the model allows us to apply existing results on optimal forecaster weighting to real data. We show how the model provides guidance for identifying prediction environments where the average forecast can potentially be beaten. We also conduct two simulation studies and illustrate the model’s practical application using forecasts of Australian Football League point spreads.
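The abstract's claim that the unweighted average is a special case of weighted forecast aggregation can be made concrete with a small sketch. The forecasts and weights below are hypothetical, and the aggregation function is a generic normalized weighted mean, not the paper's CCT-inspired model:

```python
import numpy as np

# Hypothetical forecasts of the same quantity from five forecasters.
forecasts = np.array([3.1, 2.8, 3.5, 2.9, 3.2])

def aggregate(forecasts, weights):
    """Weighted aggregate; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(w @ forecasts)

# With equal weights, the aggregate reduces exactly to the unweighted
# average, mirroring the special case the model contains.
equal = aggregate(forecasts, np.ones_like(forecasts))

# Unequal weights (e.g., reflecting estimated forecaster consistency)
# shift the aggregate toward the heavily weighted forecaster.
skilled = aggregate(forecasts, [0.1, 0.1, 0.6, 0.1, 0.1])
print(equal, skilled)
```

Restricting the weights toward equality thus acts as the regularizer described above: the closer the weights are to uniform, the closer the model's output is to the plain average.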
The wisdom of the crowd refers to the finding that judgments aggregated over individuals are typically more accurate than the average individual's judgment. Here, we examine the potential for improving crowd judgments by allowing individuals to choose which of a set of queries to respond to. If individuals' metacognitive assessments of what they know are accurate, allowing individuals to opt in to questions of interest or expertise has the potential to create a more informed knowledge base over which to aggregate. This prediction was confirmed: crowds composed of volunteered judgments were more accurate than crowds composed of forced judgments. Overall, allowing individuals to use private metacognitive knowledge holds much promise in enhancing judgments, including those of the crowd.
Averaging independent numerical judgments can be more accurate than the average individual judgment. This “wisdom of crowds” effect has been shown with large, diverse samples, but the layperson wishing to take advantage of this may only have access to the opinions of a small, more demographically homogeneous “convenience sample.” How wise are homogeneous crowds relative to diverse crowds? In simulations and survey studies, we demonstrate three necessary conditions under which small socially diverse crowds can outperform socially homogeneous crowds: Social identity must predict judgment, the effect of social identity on judgment must be at least moderate in size, and the average estimates of the social groups in question must “bracket” the truth being judged. Seven survey studies suggest that these conditions are rarely met in real judgment tasks. Comparisons between the performances of diverse and homogeneous crowds further confirm that social diversity can make crowds wiser but typically by a very small margin.
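The "bracketing" condition described above has a simple arithmetic core: when two group means fall on opposite sides of the truth, their errors partially cancel under averaging. A minimal numeric illustration with hypothetical values:

```python
# Hypothetical mean estimates from two social groups whose judgments
# fall on opposite sides of the true value, i.e., they bracket the truth.
truth = 100.0
group_a_mean = 90.0   # underestimates
group_b_mean = 106.0  # overestimates

# Diverse-crowd estimate: average of the two group means.
combined = (group_a_mean + group_b_mean) / 2  # 98.0

error_combined = abs(combined - truth)                  # errors cancel: 2.0
error_average_group = (abs(group_a_mean - truth)
                       + abs(group_b_mean - truth)) / 2  # typical group: 8.0
print(error_combined, error_average_group)
```

Without bracketing (both group means on the same side of the truth), no such cancellation occurs, which is why the three conditions named in the abstract must hold jointly for diverse crowds to gain an advantage.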
We evaluate the effect of discussion on the accuracy of collaborative judgments. In contrast to prior research, we show that discussion can either aid or impede accuracy relative to the averaging of collaborators' independent judgments, as a systematic function of task type and interaction process. For estimation tasks with a wide range of potential estimates, discussion aided accuracy by helping participants prevent and eliminate egregious errors. For estimation tasks with a naturally bounded range, discussion following independent estimates performed on par with averaging. Importantly, if participants did not first make independent estimates, discussion greatly harmed accuracy by limiting the range of considered estimates, independent of task type. Our research shows that discussion can be a powerful tool for error reduction, but only when appropriately structured: Decision makers should form independent judgments to consider a wide range of possible answers, and then use discussion to eliminate extremely large errors.
Psychological scientists draw inferences about populations based on samples—of people, situations, and stimuli—from those populations. Yet, few papers identify their target populations, and even fewer justify how or why the tested samples are representative of broader populations. A cumulative science depends on accurately characterizing the generality of findings, but current publishing standards do not require authors to constrain their inferences, leaving readers to assume the broadest possible generalizations. We propose that the discussion section of all primary research articles specify Constraints on Generality (i.e., a “COG” statement) that identify and justify target populations for the reported findings. Explicitly defining the target populations will help other researchers to sample from the same populations when conducting a direct replication, and it could encourage follow-up studies that test the boundary conditions of the original finding. Universal adoption of COG statements would change publishing incentives to favor a more cumulative science.