Expertise Determines the Frequency and Accuracy of Contributions in
Sequential Collaboration

Maren Mayer1,2, Marcel Broß3, and Daniel W. Heck3

1 University of Mannheim
2 Heidelberg Academy of Sciences and Humanities
3 University of Marburg

Author Note

Maren Mayer, Department of Psychology, School of Social Science, University of Mannheim, Germany. Daniel W. Heck, Department of Psychology, University of Marburg, Germany.

Data and R scripts for the analyses are available at the Open Science Framework.

This work was presented at the 63rd Conference of Experimental Psychologists (TeaP 2022 in Cologne). The present version of the manuscript (June 20, 2022) has not yet been peer reviewed. A preprint was uploaded to PsyArXiv and ResearchGate for timely dissemination.

This publication is part of the research group ‘Shared Data Sources’ in the subprogram ‘Collective Decision-Making’ of the WIN-Kolleg of the Heidelberg Academy of Sciences and Humanities, financed by the Ministry of Science, Research and the Arts of the State of Baden-Württemberg. This work was also supported by the Research Training Group “Statistical Modeling in Psychology” funded by the German Research Foundation (DFG grant GRK 2277).

The authors made the following contributions. Maren Mayer: Conceptualization, Investigation, Methodology, Writing - Original Draft, Writing - Review & Editing; Marcel Broß: Conceptualization, Investigation, Writing - Review & Editing; Daniel W. Heck: Conceptualization, Methodology, Writing - Review & Editing.

Correspondence concerning this article should be addressed to Maren Mayer, B6, 30-32, 68169 Mannheim. E-mail: maren.mayer@students.uni-mannheim.de
Many collaborative online projects such as Wikipedia and OpenStreetMap organize collaboration among their contributors sequentially. In sequential collaboration, one contributor creates an entry which is consecutively encountered by other contributors who then decide whether to adjust or maintain the presented entry. Sequential collaboration yields improved judgments over the course of a sequential chain and results in accurate final estimates. We hypothesize that these benefits emerge since contributors adjust entries according to their expertise, implying that judgments of experts have a larger impact compared to those of novices. In three preregistered studies, we measured and manipulated expertise to investigate whether expertise leads to higher change probabilities and larger improvements in judgment accuracy. Moreover, we tested whether expertise results in an increase in accuracy over the course of a sequential chain. As expected, experts adjusted entries more frequently, made larger improvements, and contributed more to the final estimates of sequential chains. Overall, our findings show that the high accuracy of sequential collaboration is due to an implicit weighting of judgments by expertise.

Keywords: wisdom of crowds, group decision making, mass collaboration, team

Word count: 7,726
Expertise Determines the Frequency and Accuracy of Contributions in
Sequential Collaboration

Online collaborative projects such as Wikipedia and OpenStreetMap have become increasingly important sources of information over the last two decades and are frequently used by many people. Prior research showed that Wikipedia yields highly accurate information both in general (Giles, 2005) and for specific topics (Kräenbring et al., 2014; Leithner et al., 2010). Moreover, OpenStreetMap provides similarly accurate geographic information as commercial map services and governmental data (Ciepłuch et al., 2010; Haklay, 2010; Zhang & Malczewski, 2017; Zielstra & Zipf, 2010). Both Wikipedia and OpenStreetMap build on a sequential process of information gathering referred to as sequential collaboration (Mayer & Heck, 2021). One contributor creates an entry, whereas the following contributors decide whether to adjust or maintain the presented entries. Thereby, only the latest version of an entry is shown.
Mayer and Heck (2021) showed that sequential collaboration represents a successful way of eliciting group judgments. In three online studies, participants either answered general-knowledge questions or located European cities on geographic maps. Participants were randomly assigned to sequential chains of four to six contributors. Each chain started with one independent judgment. Following contributors then encountered the latest version of the judgment and could decide whether to adjust or maintain it. For instance, one individual may start with locating Rome on a map of Italy. The second contributor may then maintain the location, whereas the third contributor may move the location more to the south. Participants were unaware of their position in the sequential chain, the change history of the presented judgment, and how often a judgment had already been adjusted.
While change probability and change magnitude were found to decrease over the course of a sequential chain, judgment accuracy improved (Mayer & Heck, 2021). These findings show that sequential collaboration is suitable for eliciting accurate judgments. Furthermore, the final judgments of sequential chains were similarly accurate as, and in some cases even more accurate than, unweighted averaging, that is, computing the mean of independent individual judgments for the same number of participants. This is an important finding given that unweighted averaging is known to yield highly accurate estimates in various contexts and tasks (Hueffer et al., 2013; Larrick & Soll, 2006; Steyvers et al., 2009; Surowiecki, 2004).
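The benchmark of unweighted averaging can be illustrated with a small numeric example. The judgments below are invented for illustration; they are not data from any of the studies discussed here:

```python
# Hypothetical illustration of unweighted averaging ("wisdom of crowds"):
# the mean of independent judgments often deviates less from the truth
# than the average individual judgment does.
truth = 50.0
judgments = [48.0, 55.0, 60.0, 52.0, 49.0]  # invented independent estimates

crowd_mean = sum(judgments) / len(judgments)
crowd_error = abs(crowd_mean - truth)
mean_individual_error = sum(abs(j - truth) for j in judgments) / len(judgments)

print(crowd_error, mean_individual_error)
```

With these invented numbers, the crowd mean misses the truth by 2.8, whereas the average individual judgment misses it by 4.0.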
However, the mechanisms contributing to the high accuracy of sequential collaboration are still unclear. In the present paper, we investigate whether individual differences in expertise are a relevant factor for the probability of changing presented judgments and for the accuracy of such changes. We hypothesize that individuals with higher expertise adjust presented judgments more frequently and more accurately since they better distinguish between presented judgments that they can improve and those they cannot improve (Mayer & Heck, 2021). Thereby, sequential collaboration would facilitate an implicit weighting of judgments by expertise, in turn leading to increasingly accurate judgments over the course of a sequential chain.
In the following, we first define expertise and discuss its relevance for judgment accuracy in various contexts. We then draw on the literature on the role of expertise for individual judgments to establish a theoretical framework of how expertise influences both the frequency and accuracy of changing presented judgments in sequential collaboration. In three experimental studies using a city-location task and a random-dots estimation task, we measured and manipulated expertise. Thereby, we examined whether expertise influences how frequently and how accurately presented judgments in sequential collaboration are changed. As expected, we found that contributors with higher expertise change presented judgments more frequently and more accurately. Furthermore, experts have a larger impact on sequential chains than novices, and this effect is more pronounced the later experts enter the chain.
Expertise in Judgment and Decision Making

Expertise is a multifaceted concept (Baumann & Bonner, 2013). It comprises general abilities such as logical reasoning (Kruger & Dunning, 1999); semantic knowledge (Schunn & Anderson, 1999) such as unique information received (Baumann & Bonner, 2013; Stewart & Stasser, 1995) or grammar rules learned (Kruger & Dunning, 1999); prior experience such as students making decisions on their curriculum (Dubrovsky et al., 1991) or forensic experts judging the frequency of handwriting features (Martire et al., 2018); and procedural skills such as techniques of designing experiments (Schunn & Anderson, 1999). All these aspects have in common that expertise is domain-specific. With respect to knowledge, experience, and skills, expertise can be acquired by formal training (Martire et al., 2018).
It has been shown that experts work on tasks in qualitatively different ways (Dubrovsky et al., 1991; Franz & Larson, 2002; Schunn & Anderson, 1999) and usually show better performance than novices (Budescu & Chen, 2014; Kruger & Dunning, 1999; Merkle et al., 2020; Wang et al., 2021). Various measures of expertise have been used in the past. Most obviously, an individual's expertise can be measured based on previous performance on similar tasks (Lin & Cheng, 2009) or even the same task (Budescu & Chen, 2014; Mayer & Heck, 2022; Merkle et al., 2020; Merkle & Steyvers, 2011). Moreover, one can take advantage of experts' ability to accurately predict judgments of other individuals. For instance, experts are able to predict how likely others are to provide the same judgment (surprisingly popular method; Lee et al., 2018; Prelec et al., 2017) and which judgments others are likely to provide, either when asking others in general (social projection; Grüning & Krueger, 2021) or when asking their peers (peer prediction; Wang et al., 2021).
Expertise has a positive effect on task performance in various contexts. In group decision making, the more individuals are aware of the expertise of other group members, the more accurate group decisions become (Baumann & Bonner, 2013). However, in such settings, it is crucial to explicitly communicate the expert status of group members before the discussion starts (Bonner et al., 2002). When eliciting independent judgments by a group of individuals, weighting these judgments by expertise improves the accuracy of the aggregated estimates (Budescu & Chen, 2014; Lin & Cheng, 2009; Mayer & Heck, 2022; Merkle et al., 2020; Merkle & Steyvers, 2011). In such cases, expertise needs to be estimated statistically.
Deviation of Presented Judgments from the Correct Answer

We focus on two factors that may affect whether contributors in sequential collaboration adjust presented judgments and how accurate such adjustments are, namely, the accuracy of presented judgments and the expertise of contributors. In sequential collaboration, two aspects determine how difficult it is to accurately change a presented judgment. First, from a standard test-theoretical perspective, items or stimuli generally differ in how readily the correct answer is available (Embretson & Reise, 2000; Lord et al., 1968). Second, in sequential collaboration, individuals encounter not only the item itself but also a judgment by a previous contributor which may deviate more or less from the correct answer. In our studies, we focus on the deviation of a presented judgment from the correct answer, referred to as presented deviation, as an important predictor of whether and how much presented judgments are changed.

We expect that as the presented deviation increases, judgments are adjusted more frequently since contributors can more easily detect whether they are able to provide a more accurate judgment. Furthermore, we hypothesize that with increasing deviation, contributors change the presented judgments to a larger degree as there is more opportunity for improvement:

Hypothesis 1: With increasing deviation of the presented judgment from the correct answer, participants (1a) change judgments more frequently and (1b) provide larger improvements.
The Role of Expertise

When relying on the aggregation of independent judgments, weighting these judgments by expertise results in an increase in accuracy (Budescu & Chen, 2014; Lin & Cheng, 2009; Mayer & Heck, 2022; Merkle et al., 2020; Merkle & Steyvers, 2011). We hypothesize that sequential collaboration provides accurate outcomes because it results in an implicit weighting of judgments by expertise. This follows when assuming that experts are able to distinguish between presented judgments they can improve and those they cannot improve. In sequential collaboration, an implicit weighting of judgments emerges due to the possibility to opt out of providing a judgment. When opting out and maintaining the presented judgment, participants assign more weight to the presented judgment. In contrast, when opting in and adjusting a presented judgment, participants give more weight to their own judgment compared to the presented judgment.
The fact that judgments become increasingly accurate over the course of a sequential chain (Mayer & Heck, 2021) indicates that contributors decide whether to opt in or opt out of revising the presented judgments according to their expertise. Such a process requires individuals to rely on task-related metacognitive knowledge about their expertise. Metacognition describes individuals' "cognition about cognitive phenomena" (Flavell, 1979) and is divided into metacognitive knowledge and metacognitive control (Lai, 2011). In the context of sequential collaboration, metacognitive knowledge about one's own expertise allows contributors to evaluate the accuracy of presented judgments and one's own capacity to provide improvements (Kruger & Dunning, 1999). Given that contributors decide whether to opt in or opt out, sequential collaboration does not require the identification of experts. It is neither necessary to assign expert roles before group discussions, nor is it required to estimate expertise statistically as when eliciting independent judgments. Instead, contributors determine the weighting of judgments within sequential chains implicitly based on their metacognitive assessment of their expertise. Achieving high accuracy only requires a sample with sufficient diversity in task-related expertise (Davis-Stober et al., 2014).
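The assumed opt-in/opt-out mechanism can be sketched in a toy simulation. The decision rule, tolerances, and judgments below are our own illustrative assumptions, not a model proposed in the literature reviewed here:

```python
# Toy sketch of implicit weighting in a sequential chain (illustrative
# assumptions only): a contributor replaces the entry with their own
# judgment when the presented entry deviates from it by more than their
# personal tolerance; experts have a smaller tolerance (better error
# detection), so their judgments end up carrying more weight.
truth = 100.0

# (own judgment, tolerance); a small tolerance stands for high expertise
chain = [(120.0, 30.0),  # novice creates the first entry
         (90.0, 30.0),   # novice: |120 - 90| = 30, not > 30 -> maintains
         (102.0, 5.0),   # expert: |120 - 102| = 18 > 5 -> adjusts
         (99.0, 5.0)]    # expert: |102 - 99| = 3, not > 5 -> maintains

entry = chain[0][0]  # the first contributor creates the entry
for own, tol in chain[1:]:
    if abs(entry - own) > tol:
        entry = own  # opt in: overwrite with own judgment
    # else: opt out and maintain the presented entry

print(entry)  # final estimate of the chain
```

Because the expert's smaller tolerance lets them overwrite the inaccurate entry, while the later contributors find nothing left to fix, the chain ends near the true value of 100 even though it started at 120.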
However, individuals may not have well-calibrated metacognitive knowledge about their own expertise. Specifically, the Dunning-Kruger effect, according to which individuals with low expertise overestimate their performance, may negatively affect judgment accuracy in sequential collaboration (Jansen et al., 2021; Kruger & Dunning, 1999). Overestimating one's knowledge may be a consequence of low expertise itself, since expertise is necessary both to perform well and to evaluate the accuracy of judgments (Kruger & Dunning, 1999). In sequential collaboration, expertise is especially relevant for assessing the performance of others. For instance, when seeing a location judgment of Rome on a map of Italy, contributors require geographic knowledge both to evaluate the presented judgment and, when deciding to make a change, to provide an improved judgment. If this evaluation fails due to a lack of expertise, contributors with lower expertise may decide to adjust already accurate judgments and provide worse judgments. If many individuals suffer from the Dunning-Kruger effect, biased judgments will negatively affect the accuracy of sequential collaboration.
Frequency of Adjustments to Presented Judgments. If contributors with higher expertise are better at detecting which judgments they can improve, the deviation of presented judgments from the correct answer will have a larger effect on change probability for contributors with higher expertise than for those with lower expertise. Imagine Rome being positioned either far away from the correct position (e.g., near Milan), closer to the correct position (e.g., near Naples), or at the correct position. Contributors with higher expertise should correctly adjust the incorrect judgments close to Milan and close to Naples, while realizing that they cannot improve the already correct judgment. In contrast, contributors with lower expertise may only know that Rome is roughly located in the center of Italy. Hence, they may recognize the incorrect position near Milan, but not the incorrect position close to Naples. Contributors with lower expertise may even adjust and worsen the already accurate position because they erroneously expect Rome to be closer to the eastern coast.

This line of argument directly implies an interaction of expertise and the deviation of the presented judgment from the correct answer. However, it is less clear whether to expect a main effect of expertise on change probability. On the one hand, contributors with higher expertise should detect incorrect judgments with higher probability, but on the other hand, they should also maintain already highly accurate judgments. In our studies, most of the presented judgments deviate from the correct answer to a considerable degree, and hence, we predict a main effect of expertise:

Hypothesis 2a: Participants with higher expertise change presented judgments more frequently compared to those with lower expertise.

Hypothesis 3a: Compared to participants with lower expertise, participants with higher expertise are better at distinguishing between judgments with larger versus smaller deviations from the correct answer, in turn leading to a larger effect of the presented deviation on change probability.
Accuracy of Revised Judgments. When deciding to adjust a presented judgment, we expect that contributors with higher expertise change judgments more accurately than contributors with lower expertise. Thus, we expect a main effect of contributors' expertise on the improvement of presented judgments:

Hypothesis 2b: Participants with higher expertise provide larger improvements to the presented judgments compared to participants with lower expertise.

According to Hypothesis 3a, contributors with higher expertise should be better at detecting those judgments they can improve. Similarly, contributors with higher expertise should make especially large improvements to highly inaccurate presented judgments, only minor improvements to moderately inaccurate judgments, and no adjustments to correct judgments. In contrast, contributors with lower expertise may not be able to make similarly large improvements to highly inaccurate presented judgments, because they may also suffer from anchoring on the presented judgments (Mussweiler et al., 2004; Tversky & Kahneman, 1974). While contributors with higher expertise reach a certain level of accuracy when adjusting presented judgments, contributors with lower expertise may be more strongly influenced by the presented judgments. Therefore, we predict an interaction of expertise and the deviation of presented judgments from the correct answer with respect to the improvement of presented judgments.

Hypothesis 3b: For participants with higher expertise, the effect of the deviation of presented judgments on the amount of improvement is larger compared to participants with lower expertise.
Experiment 1

In Experiment 1, we measured expertise in a city-location task before letting participants decide whether to change or maintain location judgments with varying distances to the correct answer. To this end, we drew on an established paradigm already used by Mayer and Heck (2021) to investigate sequential collaboration. In the original study, participants positioned 57 European cities on maps. We modified the paradigm such that some of these items served as a measure of expertise while the remaining items were used to examine how participants adjust judgments in terms of change probability and accuracy. Thereby, expertise was operationalized as knowledge acquired in the past (Schunn & Anderson, 1999). The study design, sample size, hypotheses, and planned analyses were preregistered at
Materials, analysis scripts, and data are available at
We recruited 290 participants for this study via a German panel provider; they were compensated with €0.75 for participation. However, we excluded one participant who provided judgments that were on average more accurate than the mean accuracy of judgments found in a small test sample in which we had instructed participants to look up the correct locations of each city before providing a judgment. Furthermore, we excluded 8 participants who positioned more than 10% of the cities outside the area of the countries of interest. After these exclusions, the final sample comprised 281 participants who were on average 46.49 years old (SD = 15.33), with 48.75% of participants being female. Concerning educational background, 15.71% had a college degree, 15% held a high school diploma, 31.07% had vocational education, and 38.21% had a lower educational attainment.
Materials and Procedure

Participants had to locate 57 European cities on 7 different European maps, namely 1) Austria and Switzerland, 2) France, 3) Italy, 4) Spain and Portugal, 5) United Kingdom and Ireland, 6) Germany, and 7) Poland, Czech Republic, Hungary, and Slovakia. All maps had 800 × 500 pixels and were scaled to 1:5,000,000. Appendix A1 provides a list of all cities and the phase in which they were presented.
Participants first provided independent location judgments for 17 cities which served as a measure of expertise. We ensured a wide range in item difficulty by selecting cities based on the accuracy of independently provided judgments in a previous study (Mayer & Heck, 2021). Next, in the sequential phase, the remaining 40 cities were presented together with a preselected location judgment which was framed as a response of a previous participant. Figure 1 displays the map of Italy with four location judgments for Rome with different distances from the correct location. Separately for each city, participants decided whether to adjust or maintain the presented judgment before continuing to the next trial. All seven maps and the corresponding cities were presented in random order. Finally, participants provided demographic information, indicated their subjective expertise concerning the location of large European cities, and were debriefed and thanked for their participation.
Unknown to the participants, the locations presented in the sequential phase were not provided by other participants but preselected to manipulate the presented deviation, that is, the Euclidean distance to the correct answer (0, 40, 80, or 120 pixels). For each of the 40 cities, one deviation was randomly selected such that each level of deviation was presented 10 times. Furthermore, we ensured that all levels of deviation occurred within each of the seven maps. As participants were deceived about the presented locations being judgments of other participants, the study was reviewed and approved by the ethics committee of the University of Mannheim, and participants were debriefed after the study.
Figure 1
Presented location judgments for Rome with different distances to the correct position.
[Map of Italy showing four preselected judgments at 0, 40, 80, and 120 pixels distance to the correct location of Rome; axes give the X- and Y-coordinates in pixels.]
Note. Each participant was only presented with one of the four preselected judgments.

To ensure that participants complied with the instructions, the online study was accessible only to participants using a computer (as opposed to mobile devices). We prevented looking up correct answers by implementing a time limit of 40 seconds for each response. Moreover, we excluded participants already during participation if they left the browser tab more than five times despite repeated warnings.
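In principle, such preselected locations can be generated by moving the correct coordinate a fixed Euclidean distance in some direction. The random-angle rule, the coordinates of Rome, and the function below are our illustrative assumptions; the study does not report how the locations were actually constructed:

```python
import math
import random

# Sketch: place a "presented judgment" at a fixed Euclidean distance
# (0, 40, 80, or 120 pixels) from the correct map position by moving in
# a random direction. The random-direction rule is an assumption.
def presented_location(correct_xy, distance, rng):
    angle = rng.uniform(0.0, 2.0 * math.pi)
    x = correct_xy[0] + distance * math.cos(angle)
    y = correct_xy[1] + distance * math.sin(angle)
    return (x, y)

rng = random.Random(1)
rome = (430.0, 260.0)  # hypothetical pixel coordinates of Rome
loc = presented_location(rome, 80.0, rng)

# the generated location sits at exactly the target deviation
dist = math.hypot(loc[0] - rome[0], loc[1] - rome[1])
print(round(dist, 6))
```

In practice, one would additionally ensure that the generated location stays within the 800 × 500 pixel map.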
Results and Discussion

We estimated participants' expertise based on the independently provided judgments for the first 17 cities. For each participant, we computed the mean of the Euclidean distances between the location judgments and the correct positions. To ensure that larger values indicate higher expertise, we use the inverted (i.e., negative) mean distance as a measure of expertise in the analyses below. We examined the validity of this expertise measure by correlating it with the self-reported expertise about the location of European cities. The large, positive correlation of r = 0.43 (t(279) = 7.91, p < .001) indicates satisfactory convergent validity.
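This expertise score can be computed as follows (a minimal sketch; the coordinates and the function name are invented for illustration):

```python
import math

# Sketch: expertise = negative mean Euclidean distance between a
# participant's independent judgments and the correct positions, so that
# larger (less negative) values indicate higher expertise.
def expertise_score(judgments, correct):
    dists = [math.hypot(jx - cx, jy - cy)
             for (jx, jy), (cx, cy) in zip(judgments, correct)]
    return -sum(dists) / len(dists)

# invented coordinates: one judgment 5 pixels off, one exactly correct
correct = [(100.0, 200.0), (400.0, 250.0)]
judgments = [(103.0, 204.0), (400.0, 250.0)]

print(expertise_score(judgments, correct))
```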
We tested Hypotheses 1a, 2a, and 3a concerning change probability using a generalized linear mixed model with a logistic link function. The model predicts the decision whether to adjust (= 1) or maintain (= 0) a presented judgment depending on expertise and presented deviation. We standardized our expertise measure for all analyses to address issues with model convergence. Moreover, we used a mean-centered linear contrast with values -1.5, -0.5, 0.5, and 1.5 for the four levels of deviation of the presented locations from the correct location. The model accounts for the nested data structure by including crossed random intercepts for items and participants (Pinheiro & Bates, 2000).1
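The predictor coding described above can be sketched as follows. This reproduces only the contrast coding and the standardization, not the mixed-model estimation itself; the lme4-style formula in the comment is our guess at the model specification, not taken from the paper:

```python
# Sketch of the predictor coding for the generalized linear mixed model
# (the model itself would be fit elsewhere, e.g. with an lme4-style call:
#  change ~ deviation_c * expertise_z + (1 | participant) + (1 | item)).
levels = [0, 40, 80, 120]  # presented deviation in pixels

# mean-centered linear contrast: -1.5, -0.5, 0.5, 1.5
contrast = {lvl: i - (len(levels) - 1) / 2 for i, lvl in enumerate(levels)}

def standardize(xs):
    """z-standardize the expertise measure (aids model convergence)."""
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5
    return [(x - m) / sd for x in xs]

print(contrast)  # {0: -1.5, 40: -0.5, 80: 0.5, 120: 1.5}
print(standardize([-50.0, -30.0, -10.0]))  # invented expertise scores
```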
Figure 2A displays the average change probability, whereas Table 1 shows the estimated regression coefficients. In line with Hypothesis 1a, the linear contrast for the presented deviation was positive and significant (β = 0.444, CI = [0.392, 0.495]). The model also indicated a significant positive relationship between expertise and change probability, thus supporting Hypothesis 2a (β = 0.622, CI = [0.202, 1.042]). Furthermore, we found a significant interaction between expertise and the linear contrast for presented deviation (β = 0.218, CI = [0.165, 0.272]). However, contrary to our predictions, Figure 2A shows that individuals with higher expertise changed correct judgments more frequently than individuals with lower expertise, which only partially supports Hypothesis 3a. The high change probability for accurate presented judgments can be explained by demand effects. Participants did not know that 25% of the presented judgments were already correct, and thus, they may not have expected that optimal behavior required maintaining a substantial proportion of the presented judgments.
Before assessing whether the improvement depends on the presented deviation (Hypothesis 1b), expertise (Hypothesis 2b), and their interaction (Hypothesis 3b), we

1 It is often recommended to include random slopes for within-person factors. However, our models failed to converge when adding random slopes for the presented deviation.
Table 1
Fixed-effects coefficients of the fitted (generalized) linear mixed models.

Independent Variable                               β        SE      95% CI             p

Dependent variable: Change probability
Experiment 1
  Presented deviation                              0.444    0.026   [0.392, 0.495]     < .001
  Expertise                                        0.622    0.214   [0.202, 1.042]     .004
  Presented deviation × expertise                  0.218    0.027   [0.165, 0.272]     < .001
Experiment 2
  Presented deviation (V-shaped contrast)          0.208    0.024   [0.160, 0.256]     < .001
  Presented deviation (linear contrast)            0.311    0.069   [0.176, 0.447]     < .001
  Expertise                                        0.566    0.213   [0.148, 0.984]     .008
  Presented deviation (V-shape) × expertise        0.067    0.031   [0.006, 0.128]     .030
  Presented deviation (linear) × expertise        -0.342    0.092   [-0.522, -0.163]   < .001
Experiment 3
  Presented deviation (V-shaped contrast)          0.141    0.013   [0.116, 0.167]     < .001
  Presented deviation (linear contrast)            0.367    0.038   [0.292, 0.441]     < .001
  Expertise                                        0.052    0.178   [-0.297, 0.401]    .771
  Presented deviation (V-shape) × expertise        0.075    0.018   [0.040, 0.111]     < .001
  Presented deviation (linear) × expertise         0.052    0.053   [-0.051, 0.156]    .322

Dependent variable: Improvement of presented judgments
Experiment 1
  Presented deviation                             32.289    0.455   [31.398, 33.181]   < .001
  Expertise                                       15.545    1.047   [13.492, 17.598]   < .001
  Presented deviation × expertise                  3.819    0.453   [2.930, 4.707]     < .001
Experiment 2
  Presented deviation (V-shaped contrast)          6.770    0.229   [6.320, 7.220]     < .001
  Presented deviation (linear contrast)           -0.591    0.566   [-1.700, 0.518]    .305
  Expertise                                        8.542    2.338   [3.959, 13.124]    < .001
  Presented deviation (V-shape) × expertise        0.666    0.241   [0.194, 1.138]     .006
  Presented deviation (linear) × expertise        -0.190    0.563   [-1.293, 0.912]    .735
Experiment 3
  Presented deviation (V-shaped contrast)          6.653    0.185   [6.290, 7.016]     < .001
  Presented deviation (linear contrast)            0.791    0.465   [-0.122, 1.703]    .089
  Expertise                                       16.518    2.968   [10.701, 22.336]   < .001
  Presented deviation (V-shape) × expertise       -0.024    0.257   [-0.527, 0.479]    .925
  Presented deviation (linear) × expertise        -1.516    0.627   [-2.745, -0.287]   .016

Note. All models included crossed random effects for participants and items. The models for change probability (0 = no adjustment, 1 = adjustment) assumed a logit link function.
assess the accuracy of the provided judgments. To this end, we only included trials in which participants actually adjusted the presented judgments. Accuracy was operationalized as the Euclidean distance between the adjusted and the correct location.2 Figure 2B displays the average distance to the correct position for the revised location judgments. Participants with higher expertise provided similarly accurate judgments for all levels of deviation. In contrast, for participants with lower expertise, the inaccuracy of the revised judgments increased for larger presented deviations. This indicates that participants with lower expertise are prone to an anchoring effect.
Next, we examined the improvement of presented judgments by computing the difference in accuracy between the revised and the presented judgment. The accuracy of presented judgments corresponds to the presented deviation (i.e., 0, 40, 80, or 120 pixels distance to the correct position). Positive (negative) values of the improvement measure imply that a revised judgment is more (less) accurate than the presented judgment. We used improvement as the dependent variable in a linear mixed model with (standardized) expertise and presented deviation (linear contrast) as independent variables. Figure 2C displays the average improvement in judgment accuracy, whereas Table 1 shows the regression coefficients. As expected, improvements increased for larger presented deviations (Hypothesis 1b: β = 32.289, CI = [31.398, 33.181]) and higher expertise (Hypothesis 2b: β = 15.545, CI = [13.492, 17.598]). In line with the plot, the model also showed a significant interaction such that more knowledgeable participants showed a steeper increase in improvement than less knowledgeable participants (β = 3.819, CI = [2.930, 4.707]).
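The improvement measure can be computed as follows (a minimal sketch; the coordinates and the function name are invented for illustration):

```python
import math

# Sketch: improvement = presented deviation minus the revised judgment's
# distance to the correct position. Positive values mean the revision
# made the entry more accurate; negative values mean it made it worse.
def improvement(presented_deviation, revised_xy, correct_xy):
    revised_dist = math.hypot(revised_xy[0] - correct_xy[0],
                              revised_xy[1] - correct_xy[1])
    return presented_deviation - revised_dist

correct = (430.0, 260.0)   # invented correct position
revised = (436.0, 268.0)   # revised judgment, 10 pixels off

print(improvement(80.0, revised, correct))
```

Here the presented judgment was 80 pixels off and the revised judgment 10 pixels off, so the revision improved the entry by 70 pixels.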
² The statistical analysis yielded similar results when including non-adjusted judgments in the analysis with improvement scores of zero (main effect of expertise: β = 12.200, CI = [10.523, 13.876], t(275.82) = 14.263, p < .001; main effect of deviation: β = 21.932, CI = [21.291, 22.574], t(10,861.76) = 66.979, p < .001; interaction of expertise and deviation: β = 5.726, CI = [5.085, 6.367], t(10,860.76) = 17.500, p < .001).
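The improvement measure can be sketched in a few lines of Python (an illustration, not the authors' analysis code; the coordinates and function names are hypothetical):

```python
import math

def euclidean_distance(p, q):
    """Distance in pixels between two (x, y) screen locations."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def improvement(presented, revised, correct):
    """Positive if the revised judgment is closer to the correct location
    than the presented one; negative if the revision made things worse."""
    return euclidean_distance(presented, correct) - euclidean_distance(revised, correct)

# A judgment presented 80 pixels off and revised to 30 pixels off
# improves accuracy by 50 pixels (coordinates are made up):
correct = (300, 300)
presented = (380, 300)
revised = (300, 330)
print(improvement(presented, revised, correct))  # 50.0
```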
Figure 2
Change probability, distance to the correct position, and improvement of presented judgments in Experiment 1.
[Figure: Three panels plot (A) change probability, (B) judgments' distance to the correct position in pixels, and (C) improvement of presented judgments in pixels, each as a function of the distance of the presented judgment to the correct position (0, 40, 80, or 120 pixels) and of contributors' expertise (low: < M − 1 SD; average; high: > M + 1 SD).]
Note. Points and vertical lines show the empirical means with the corresponding 99% between-subjects confidence intervals, respectively. Violin plots indicate the distribution of the dependent variable aggregated across items within each person.
Experiment 2
Experiment 1 allows only weak causal conclusions since expertise was merely measured rather than manipulated. As a remedy, we implemented a new study design in which expertise was operationalized as a skill or strategy (Kruger & Dunning, 1999; Schunn & Anderson, 1999). While acquiring knowledge is usually a time-consuming process, acquiring skills or strategies can often be achieved much more easily through learning and rehearsal (Anderson et al., 1997; Anderson & Fincham, 1994).
We manipulated the level of expertise in a random-dots estimation task (Honda et al., 2022) in which participants had to estimate the number of randomly positioned, colored dots. Participants in the experimental group learned a strategy to provide accurate estimates of the number of presented dots. This strategy can also be used to evaluate the accuracy of presented judgments. In contrast, participants in the control condition completed a control task and should thus have a disadvantage in providing and evaluating judgments. In a pilot study, we examined whether the manipulation of expertise was successful and whether participants in the control condition came up with any solution strategy themselves, which was not the case. The preliminary data were also used to calibrate the time limit per item and to define outliers. Hypotheses, study design, sample size, and planned analyses were preregistered; materials, data, and analysis scripts are available online.
We recruited 124 college students from the University of Marburg and a study exchange platform. Participants received course credit or the opportunity to take part in a gift-card lottery in exchange for participation. Of the 11 participants excluded from the analysis, one did not complete the study conscientiously, one vastly underestimated and one vastly overestimated the number of dots for most items, one almost always gave the correct number of dots, one did not answer attention-check questions about the instructions correctly, and six participants in the experimental condition indicated that they did not apply the learned strategy. The remaining 113 participants (69.03% female) had a mean age of 25.72 (SD = 10.14).
Participants were randomly assigned either to the expertise-manipulation condition (referred to as “experts”) or the control condition (“novices”). Experts were introduced to raster scanning, a strategy for estimating the number of objects in a presented image more accurately by mentally overlaying a 3 × 3 raster on top of the image. With the raster in mind, one can pick one of the nine areas containing an approximately average number of dots and count the number of dots within this box. Participants then simply had to multiply the result by nine to obtain an estimate of the total number of dots in the image. To make the multiplication easier, we advised participants to multiply the number of dots by ten and then subtract the original number. Participants in the control condition only read an essay about the importance of accurate judgments. Afterwards, both groups answered four attention-check questions concerning the instructions.
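The arithmetic behind raster scanning can be illustrated as follows (a sketch; the function names are ours, not part of the study materials):

```python
def raster_scan_estimate(dots_in_typical_cell, raster_size=3):
    """Estimate the total number of dots by counting one roughly
    average cell of the mental raster and extrapolating."""
    return dots_in_typical_cell * raster_size ** 2

def times_nine_shortcut(n):
    """The mental shortcut from the instructions: n * 9 = n * 10 - n."""
    return n * 10 - n

# Counting 42 dots in an average cell yields an estimate of 378 dots.
print(raster_scan_estimate(42))  # 378
print(times_nine_shortcut(42))   # 378
```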
First, all participants had to estimate the number of dots for five images. In the experimental condition only, these five images were overlaid with a visible 3 × 3 raster to train raster scanning. Next, participants were presented with five more images, now without a raster. The judgments in this phase served as a manipulation check. Participants then saw 30 images, each with an (alleged) judgment of a previous participant, and had to decide whether to adjust or maintain the presented number of dots. The images were shown in random order with a time limit of 60 seconds (including a warning after 40 seconds). As in Experiment 1, presented judgments were not actually provided by previous participants but rather preselected to manipulate the deviation from the correct answer. Lastly, participants provided demographic information and were asked whether they used raster scanning (experimental condition), whether they used any special strategy to estimate the number of dots (control condition), and whether they completed the study conscientiously.
We generated 30 images (600 × 600 pixels, see Figure 3) with a white background depicting between 100 and 599 randomly positioned, non-overlapping, colored dots using the R package ggplot2 (Wickham, 2016). Five of these images were used to train participants and five were used for the manipulation check. The remaining 20 images were shown jointly with an (alleged) judgment of the number of dots. These values were preselected and were either correct or deviated by ±35% or ±70% from the correct answer. Moreover, for motivational purposes, we also showed 10 additional images depicting only 10 to 59 dots, displayed with a judgment that was either correct or deviated by ±20% or ±35% from the correct answer. For these items, it was very easy for participants in both conditions to detect whether the presented judgment was correct since the time limit allowed them to simply count the small number of dots.
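A minimal sketch of how such preselected values can be derived from the correct dot count (illustrative only; the paper does not report its exact generation code):

```python
def preselected_judgments(correct, deviations=(-0.70, -0.35, 0.0, 0.35, 0.70)):
    """Presented judgments that are either correct (deviation 0)
    or deviate by fixed percentages from the correct answer."""
    return [round(correct * (1 + d)) for d in deviations]

# For an image with 379 dots, the five presented values would be:
print(preselected_judgments(379))  # [114, 246, 379, 512, 644]
```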
Figure 3
Example images in the random-dots estimation task.
Note. Both images show 379 dots. The left image was used in the training phase for the control condition. The right image displays the 3 × 3 raster overlaid during training in the expertise-manipulation condition. Images presented for the manipulation check and in the sequential phase resembled the left image.
Results and Discussion
To test whether the manipulation was successful, we examined whether experts provided more accurate independent judgments than novices. As a measure of accuracy, we computed the percentage error for each item, defined as the absolute difference between the judgment and the correct answer, divided by the correct answer and multiplied by 100. Using this measure allowed us to analyze average accuracy across items even though the number of dots varied from 100 to almost 600. Including only the independent judgments for the five items in the manipulation-check phase, we fitted a linear mixed model with condition as independent variable (dummy-coded with 1 = expertise, 0 = control). We found a significant negative effect of condition on the percentage error (β = −15.805, CI = [−23.593, −8.017], t(111.18) = −3.977, p < .001). Hence, our manipulation of expertise was successful, with novices showing a mean error of 35.81% in contrast to experts who showed a mean error of only 20.00%.
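The percentage-error measure can be expressed compactly (an illustrative sketch, not the authors' analysis code):

```python
def percentage_error(judgment, correct):
    """Absolute deviation from the correct answer as a percentage,
    which makes items with different dot counts comparable."""
    return abs(judgment - correct) / correct * 100

# Judgments of 300 and 480 dots for an image with 400 dots:
print(percentage_error(300, 400))  # 25.0
print(percentage_error(480, 400))  # 20.0
```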
We first focus on Hypotheses 1a, 2a, and 3a using a generalized linear mixed model for change probability. While expertise was coded with a dummy contrast (1 = experts, 0 = novices), we used two orthogonal, centered contrasts for presented deviation. Since the presented deviation includes both over- and underestimation of the correct answer, we used a centered, V-shaped contrast (values: 4, −1, −6, −1, 4) to test whether change probability is lowest for correct presented judgments and increases the more the presented judgment deviates from the correct judgment. The regression coefficient of this contrast is positive for a V-shape, negative for an inverse V-shape, and zero in the absence of such an effect. Participants may not adjust presented judgments equally often when these over- or underestimate the correct judgment. Hence, we also included a linear contrast testing whether the slope of the V-shaped contrast differs depending on whether the presented judgment over- or underestimates the correct answer. A positive coefficient indicates a steeper slope for underestimation, a negative coefficient indicates a steeper slope for overestimation, and a value of zero indicates a symmetric V-shape.
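For concreteness, the two deviation contrasts might be coded as follows. The V-shaped values (4, −1, −6, −1, 4) are taken from the text, while the linear values (−2 … 2) are a standard choice that we assume here:

```python
# Five levels of presented deviation: -70, -35, 0, +35, +70 percent.
v_contrast = [4, -1, -6, -1, 4]      # lowest at 0, rising with |deviation|
linear_contrast = [-2, -1, 0, 1, 2]  # asymmetry between the two limbs

# Both contrasts are centered (codes sum to zero) and mutually orthogonal
# (their dot product is zero), so the two effects can be estimated separately.
print(sum(v_contrast))                                          # 0
print(sum(linear_contrast))                                     # 0
print(sum(a * b for a, b in zip(v_contrast, linear_contrast)))  # 0
```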
Figure 4A illustrates the average change probability including 99% confidence intervals. Change probabilities followed the expected V-shape as a function of the presented deviation. Moreover, experts generally changed items more frequently than novices. Table 1 shows the fixed-effects coefficients of the logistic model. Supporting Hypothesis 1a, the model revealed a significant, positive V-shaped contrast for presented deviation (β = 0.208, CI = [0.160, 0.256]). The positive linear contrast was also significant, indicating a steeper slope of the V-shape for presented judgments underestimating the correct answer (β = 0.311, CI = [0.176, 0.447]). In line with Hypothesis 2a, we found a significant positive effect of condition (β = 0.566, CI = [0.148, 0.984]). As expected in Hypothesis 3a, the interaction between condition and the V-shape contrast of the presented deviation was positive, meaning that experts better distinguished between accurate and inaccurate judgments (β = 0.067, CI = [0.006, 0.128]). However, experts adjusted already correct presented judgments more frequently than novices (Figure 4), and thus, our results only partially support Hypothesis 3a. Besides demand effects, this could also be due to the raster-scanning strategy providing only an approximate estimate of the actual number of presented dots. While the approximation leads to improved judgments, it is still prone to errors. Hence, even for accurate presented judgments, participants may have adjusted the judgment although it was already correct. Lastly, we found a significant interaction between condition and the linear contrast of presented deviation, indicating that the V-shape is more symmetric (with respect to over- or underestimation of the correct answer) for experts than for novices (β = −0.342, CI = [−0.522, −0.163]).
Next, we assessed Hypotheses 1b, 2b, and 3b concerning the accuracy and amount of improvement of revised judgments. We thus considered only trials in which the presented judgment was adjusted.³ The percentage error of the revised judgments is displayed in Figure 4B. Judgments of experts were generally more accurate than those of novices. In both conditions, accuracy appeared to be similar for all levels of presented deviation, suggesting that there was no anchoring effect due to the presented judgments in either condition.
We statistically tested Hypotheses 1b, 2b, and 3b by focusing on the percentage improvement, defined as the difference between the percentage errors of the presented and the revised judgment. We used a linear mixed model to predict the improvement of presented judgments using the same contrasts for condition and presented deviation as
³ Similar results were obtained when analyzing all trials while assigning an improvement of zero to maintained judgments (condition: β = 8.575, CI = [4.543, 12.607], t(111.02) = 4.168, p < .001; V-shaped contrast for presented deviation: β = 4.833, CI = [4.484, 5.182], t(37.58) = 27.139, p < .001; interaction of condition and V-shaped contrast: β = 1.197, CI = [0.792, 1.601], t(2,117.18) = 5.800, p < .001; all other effects were not significant).
Figure 4
Change probability, percentage error, and percentage improvement of presented judgments for Experiment 2.
[Figure: Three panels plot (A) change probability, (B) percentage error of provided judgments, and (C) percentage improvement of presented judgments, each as a function of the percentage deviation of the presented judgment from the correct judgment (−70, −35, 0, 35, or 70), separately for the control and experimental conditions.]
Note. Points display empirical means with error bars showing the corresponding 99% between-subjects confidence intervals. Violin plots show the distribution of the dependent variable for participants aggregated over items.
in the model for change probability. Figure 4C displays the mean percentage improvement of presented judgments including 99% confidence intervals and violin plots, while Table 1 shows the estimated regression coefficients. Supporting Hypothesis 1b, presented deviation had a V-shaped effect such that presented judgments were improved more the larger their deviation from the correct judgment. This effect was significant in the model-based analysis (β = 6.770, CI = [6.320, 7.220]). Compared to novices, experts improved presented judgments more if there was room for improvement and worsened already correct judgments less (β = −0.591, CI = [−1.700, 0.518]). Furthermore, the model showed a positive interaction between condition and the V-shaped contrast for presented deviation (β = 0.666, CI = [0.194, 1.138]). This speaks for a larger anchoring effect for novices compared to experts, which provides evidence for Hypothesis 3b.
Experiment 3
While Experiments 1 and 2 showed that change probability and improvement of presented judgments depend on expertise, they implemented only a single incremental step in sequential collaboration using preselected values for the presented judgments. Importantly, the effects should still hold if individuals encounter actual judgments of previous individuals rather than preselected judgments. The benefits of expertise on the accuracy of sequential chains of judgments should especially manifest in the final estimates.
In the following, we derive additional hypotheses for sequential judgments made by groups of contributors. These hypotheses focus on the sequential-chain level rather than the individual level (as Hypotheses 1 to 3 do). Individuals with higher expertise should better distinguish between presented judgments provided by other experts and those by novices, which should in turn affect change probability and improvement of judgments. In contrast, novices are predicted to be worse at making this distinction, meaning that their change probability is affected less by the status of the previous contributor, and that only judgments of other novices can be improved. Moreover, the more experts are assigned to a sequential chain, the more accurate the final estimates are expected to be. The improvements made by experts are less likely to be changed (and possibly worsened) by others if experts enter the sequential chain after the novices.
Hypothesis 4: In sequential chains, experts change presented judgments of novices more frequently than those of other experts. In contrast, novices have similar change probabilities regardless of the expertise of the previous contributor.
Hypothesis 5: In sequential chains, accuracy improves most when experts adjust judgments of novices. Smaller improvements occur when experts correct experts or when novices correct novices. In contrast, novices worsen judgments of experts.
Hypothesis 6: The more experts are in a sequential chain, the better the final estimates. For sequential chains with the same number of experts and novices, final estimates are more accurate if experts are at the end of the chain than at the beginning.
To test Hypotheses 1 to 6, we again relied on the random-dots estimation task using the raster-scanning strategy as a manipulation of expertise. However, we now implemented a sequential-collaboration paradigm in which participants actually encountered judgments made by previous participants. The design allowed us to manipulate the number and position of experts and novices in a sequential chain. The hypotheses, study design, sample size, and planned analyses were preregistered; materials, data, and analysis scripts are available online.
Materials and Procedure
We used the same experimental paradigm as in Experiment 2 with some minor changes. In the expertise condition, we excluded participants already during participation if they did not answer at least three questions about the raster-scanning strategy correctly. Thus, it was not necessary to exclude data from other persons in the same sequential chain later during the analysis. We also generated five new images for the sequential-collaboration phase.
Participants were randomly assigned either to the expertise-manipulation or the control condition. We then built sequences of two participants, which differed with respect to the status and order of the contributors (i.e., novice-novice, expert-novice, novice-expert, and expert-expert). As in Experiment 2, the first participant in each chain saw preselected judgments which were either correct or deviated by ±35% or ±70% from the correct number of dots. The second participant in each chain then saw the revised judgments provided by the first participant. If the first participant maintained a presented judgment, the second participant encountered the same value.
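The propagation rule for a two-person chain can be sketched as follows (hypothetical function, for illustration only):

```python
def value_shown_to_second(preselected, first_adjusted, first_revision=None):
    """What the second contributor sees: the first contributor's revision
    if they adjusted, otherwise the unchanged preselected judgment."""
    return first_revision if first_adjusted else preselected

# First contributor revised a preselected value of 512 down to 420:
print(value_shown_to_second(512, True, 420))  # 420
# First contributor maintained the preselected value:
print(value_shown_to_second(512, False))      # 512
```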
We recruited 464 participants via a German panel provider who were compensated with €1. One participant was excluded because they answered “1” to all
items, which in turn required removing another participant assigned to the same chain. Moreover, five participants were excluded due to duplicate assignments to sequential chains. The final sample included 457 participants (46.83% female) with a mean age of 46.16 (SD = 14.36) and varied educational backgrounds (college degree: 34.79%; high-school diploma: 26.04%; vocational education: 24.07%; lesser educational attainment: 15.10%).
Results and Discussion
We computed the same dependent measures as in Experiment 2. As a manipulation check, we fitted a linear mixed model to test whether the independent judgments for the five items during the manipulation-check phase were more accurate for experts than for novices. As expected, the expertise manipulation led to a decrease of the percentage error (β = −28.898, CI = [−36.319, −21.477], t(111.18) = −3.977, p < .001), indicating that judgments of experts were twice as accurate as those of novices (mean error = 27.46% vs. 56.36%, respectively).
Change Probability and Improvement of Judgment Accuracy at the Individual Level
We first test Hypotheses 1 to 3, which refer to individual-level decisions and judgments. To analyze change probabilities in the sequential phase, we included only participants who saw the preselected judgments but not those who saw the judgments of other participants. As in Experiment 2, we used a generalized linear mixed model to predict whether a presented judgment was changed, using the same contrasts for presented deviation and condition. Figure 5A displays the average change probabilities in Experiment 3. As expected, the V-shaped effect of presented deviation emerged, while it was steeper for presented judgments that underestimated rather than overestimated the correct answer. Moreover, the plot does not indicate an effect of condition. This impression was supported by the model-based analysis (see Table 1). In line with Hypothesis 1a, the V-shaped contrast of presented deviation on change probability was significant (β = 0.141, CI = [0.116, 0.167]). The linear contrast of deviation was also significant, indicating a steeper slope for the left than the right limb of the V-shaped effect (β = 0.367, CI = [0.292, 0.441]). Contrary to Hypothesis 2a, the effect of experimental condition was not significant (β = 0.052, CI = [−0.297, 0.401]). The interaction between condition and the V-shaped contrast was significant (β = 0.075, CI = [0.040, 0.111]), indicating that the effect of presented deviation on change probability was slightly stronger for experts than for novices. As shown in Figure 5A, experts adjusted presented judgments less often than novices if judgments were already correct, but more often if judgments deviated by ±70% from the correct answer.
We tested Hypotheses 1b, 2b, and 3b concerning the improvement of presented judgments, including only participants at the first chain position.⁴ Figure 5B displays the percentage error of the revised judgments. As in Experiment 2, participants achieved a similar level of accuracy in both conditions independent of the presented deviation, while accuracy was generally higher for experts than for novices. Figure 5C displays the improvement of presented judgments, which followed a V-shaped pattern, with already correct presented judgments being slightly worsened. Fitting a linear mixed model for the percentage improvement provided similar results. We used the same contrasts for condition and presented deviation as above. In line with Hypothesis
⁴ Similar results were obtained when including maintained judgments as providing an improvement of zero (V-shaped contrast of deviation: β = 4.871, CI = [4.596, 5.145], t(5,599.39) = 34.822, p < .001; linear contrast of deviation: β = 1.143, CI = [0.417, 1.870], t(5,599.24) = 3.084, p = .002; condition: β = 12.784, CI = [7.998, 17.571], t(236.25) = 5.235, p < .001; all other terms were not significant).
Figure 5
Change probability, percentage error, and percentage improvement of presented judgments for Experiment 3.
[Figure: Three panels plot (A) change probability, (B) percentage error of provided judgments, and (C) percentage improvement of presented judgments, each as a function of the percentage deviation of the presented judgment from the correct judgment (−70, −35, 0, 35, or 70), separately for the control and experimental conditions.]
Note. Points display empirical means with error bars showing the corresponding 99% between-subjects confidence intervals. Violin plots show the distribution of the dependent variable for participants aggregated over items.
1b, the model showed a V-shaped effect of presented deviation (β = 6.653, CI = [6.290, 7.016]). Supporting Hypothesis 2b, the main effect of condition was significant, indicating more improvement of judgments for experts than for novices (β = 16.518, CI = [10.701, 22.336]). In contrast to Hypothesis 3b, the interaction of condition and presented deviation was not significant (β = −0.024, CI = [−0.527, 0.479]). Moreover, the interaction between the linear slope for presented deviation and expertise was significant, indicating a steeper slope for the left than the right limb of the V-shape for experts compared to novices (β = −1.516, CI = [−2.745, −0.287]).
As a robustness check, we also tested Hypotheses 1, 2, and 3 using judgments of all participants. The deviation of presented judgments thus becomes a continuous variable since participants at the second chain position may see revised judgments of participants at the first position. In the linear mixed models, we included the standardized deviation and the corresponding quadratic trend as predictors. For this analysis, we excluded 44 judgments provided by participants at the first chain position which had a percentage error of more than 200%, since these judgments serve as presented judgments for participants at the second chain position and could distort the analysis.
For change probability, the results were similar to those obtained when including only participants at the first chain position. The model showed a significant quadratic effect of presented deviation (β = 0.273, CI = [0.208, 0.338], z = 8.223, p < .001) and a significant interaction with condition (β = 0.362, CI = [0.262, 0.461], z = 7.098, p < .001), whereas the effect of condition on change probability was not significant (β = −0.232, CI = [−0.490, 0.025], z = −1.768, p = .077). Concerning the improvement of the presented judgments, results were again similar to analyzing only participants at the first chain position. We found a positive effect of the quadratic trend of deviation (β = 19.345, CI = [18.245, 20.445], t(7,197.89) = 34.466, p < .001) and a positive effect of condition (β = 15.575, CI = [10.649, 20.501], t(555.30) = 6.197, p < .001), while the interaction was not significant (β = 0.389, CI = [−1.102, 1.879], t(7,182.28) = 0.511).
Change Probability, Judgment Accuracy, and Improvement of Presented Judgments at the Chain Level
We tested the hypotheses at the chain level based on the data of participants at the second chain position. Concerning Hypothesis 4, we fitted a generalized linear mixed model to predict whether change probability differs between the four compositions of sequential chains (i.e., novice-novice, expert-novice, novice-expert, or expert-expert). For this purpose, we implemented two contrasts: one comparing novice-novice chains against expert-novice chains, and another comparing novice-expert chains against expert-expert chains. In line with Hypothesis 4, change probability was larger for novice-expert than for expert-expert chains (β = 0.326, CI = [0.063, 0.588], z = 2.432, p = .015), while novices changed the entries of experts and novices similarly frequently (β = 0.136, CI = [−0.098, 0.370], z = 1.140, p = .254). As illustrated in Figure 6A, novices showed similar change probabilities when encountering judgments of novices and experts, while experts were more likely to change judgments of novices compared to those of experts.
Figure 6
Change probability, accuracy, and amount of improvement for the four compositions of sequential chains in Experiment 3.
[Figure: Three panels plot (A) change probability, (B) percentage error of final estimates, and (C) percentage improvement of presented judgments for the four compositions of the sequential chain (novice-novice, expert-novice, novice-expert, and expert-expert).]
Note. Points display empirical means with error bars showing the corresponding 99% between-subjects confidence intervals. Violin plots illustrate the distribution of changes and judgments aggregated for each participant across items.
To test Hypothesis 5, we only considered judgments that were adjusted by participants at the second chain position⁵ and implemented a linear mixed model with percentage improvement as dependent variable and type of sequential chain as predictor. We used Helmert contrasts to test our hypothesis by contrasting the novice-expert chain with all other chains, the expert-novice chain with the novice-novice and expert-expert chains, and, lastly, the novice-novice and expert-expert chains against each other. Figure 6C displays the empirical means of percentage improvement for all compositions of sequential chains. In line with this pattern and Hypothesis 5, we found a significant contrast for the novice-expert sequential chain (β = 3.760, CI = [1.264, 6.256], t(215.08) = 2.952, p = .004). Furthermore, we found a significant contrast for the expert-novice chain (β = −3.852, CI = [−7.227, −0.477], t(221.47) = −2.237, p = .026). In fact, as Figure 6 displays, novices worsened judgments of experts. Lastly, we did not find a significant difference in improvement between expert-expert and novice-novice groups (β = −5.965, CI = [−12.137, 0.208], t(222.70) = −1.894, p = .060). These findings are in line with Hypothesis 5.
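One plausible coding of these Helmert-style contrasts over the four chain types (ordered novice-novice, expert-novice, novice-expert, expert-expert) is shown below; the exact numeric codes are our assumption and are not taken from the analysis scripts:

```python
# Chain compositions: novice-novice, expert-novice, novice-expert, expert-expert.
c1 = [-1, -1, 3, -1]  # novice-expert chains vs. all other chains
c2 = [-1, 2, 0, -1]   # expert-novice vs. novice-novice and expert-expert chains
c3 = [-1, 0, 0, 1]    # expert-expert vs. novice-novice chains

contrasts = [c1, c2, c3]

# Each contrast is centered (codes sum to zero) ...
for c in contrasts:
    assert sum(c) == 0
# ... and the set is mutually orthogonal (pairwise dot products are zero),
# so the three comparisons can be estimated independently.
for i in range(len(contrasts)):
    for j in range(i + 1, len(contrasts)):
        assert sum(a * b for a, b in zip(contrasts[i], contrasts[j])) == 0
print("centered and orthogonal")
```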
To test Hypothesis 6, we fitted a linear mixed model with the percentage error of the final judgment in a sequential chain as dependent variable and chain composition as predictor. Depending on whether participants adjusted the presented judgment, the final judgment could be the presented judgment, the judgment entered by the first participant, or the judgment entered by the second participant. We used a linear contrast to test for a decreasing percentage error and thus increasing accuracy across chain compositions.
In line with Hypothesis 6, we found a significant linear trend between chain composition and accuracy of the final estimates (β = 5.779, CI = [2.199, 9.359],
⁵ Similar results are obtained when maintained judgments are considered as not improved with a value of zero (β = 3.182, CI = [1.195, 5.169], t(214.61) = 3.139, p = .002 for comparing the relative improvement of judgments of novice-expert chains to all other types of sequential chains; β = −2.985, CI = [−5.633, −0.336], t(214.50) = −2.209, p = .028 for comparing expert-novice chains to novice-novice and expert-expert chains; and β = −3.404, CI = [−8.161, 1.352], t(214.60) = −1.403, p = .162 for comparing expert-expert and novice-novice chains).
t(216.79) = 3.164, p = .002). This pattern is displayed in Figure 6B, showing that the percentage error was largest for sequential chains with two novices and smallest for sequential chains with two experts. Regarding mixed sequential chains which included both an expert and a novice, the percentage error was smaller when chains ended rather than started with an expert.
General Discussion
Our three experiments show that the probability of changing a presented judgment depends on its deviation from the correct answer (Hypothesis 1a), on participants’ expertise (Hypothesis 2a), and on the corresponding interaction (Hypothesis 3a). However, Experiments 1 and 2 provided only partial evidence for the interaction, whereas Experiment 3 did not indicate an effect of expertise. Presented deviation, expertise, and their interaction also affected the amount of improvement made to presented judgments (Hypotheses 1b, 2b, and 3b), although the interaction was not supported in Experiment 3. Experiment 3 investigated sequential chains of contributors, showing that experts adjust judgments of novices more frequently than those of other experts (Hypothesis 4), that experts improve judgments of novices most (Hypothesis 5), and that final estimates become more accurate the more experts are in a sequential chain and the later they enter it (Hypothesis 6).
Overall, expertise is an important predictor of change probability and of the amount of improvement of judgments in sequential collaboration. This supports the theoretical assumption that contributors adjust and maintain judgments based on their expertise, which in turn results in an implicit weighting of judgments. Even though this weighting happens at the individual level within each sequential step, the increased accuracy due to overweighting judgments of experts can be observed at the chain level. Still, the number of experts and the position in which they enter a sequential chain affect the accuracy of group estimates. Accurate judgments of experts at the beginning of a sequential chain may be degraded by novices later, in turn resulting in reduced accuracy. In contrast, possibly inaccurate judgments by novices at the beginning can be corrected by experts later.
Our findings also add to the literature on the wisdom of crowds, supporting the notion that weighting judgments by expertise increases accuracy (Budescu & Chen, 2014; Mayer & Heck, 2022; Merkle et al., 2020). In contrast to other experimental designs and statistical techniques, sequential collaboration does not require researchers to identify experts before or after the judgment task. Instead, sequential collaboration results in an implicit weighting of judgments which is determined by the contributors’ meta-cognitive assessment of whether they can improve the presented judgments.
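The explicit weighting that such statistical techniques perform can be made concrete with a toy calculation; the judgment values and weights below are illustrative assumptions, not data from our studies. Giving larger weights to more expert judges pulls the aggregate toward the accurate judgments:

```python
def weighted_mean(judgments, weights):
    """Weighted average of judgments; weights encode assumed expertise."""
    return sum(j * w for j, w in zip(judgments, weights)) / sum(weights)

true_value = 100.0
judgments = [99.0, 101.0, 120.0, 70.0]  # two experts, two novices (illustrative)
weights = [3.0, 3.0, 1.0, 1.0]          # explicit expertise weights (illustrative)

unweighted = sum(judgments) / len(judgments)  # 97.5
weighted = weighted_mean(judgments, weights)  # 98.75
```

Here the expertise-weighted estimate (98.75) is closer to the true value than the unweighted average (97.5). Sequential collaboration achieves a similar effect implicitly, without anyone ever estimating the weights.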
Future Research Directions
Our three studies are limited in that they only examined the effect of expertise in short sequential chains with two contributors. We expect that the effects on change probability and improvement of judgments should similarly hold for longer sequential chains, given that participants were not aware of the number of contributors. However, this assumption needs to be tested in experiments with longer chains.
Other variables besides expertise may also affect the frequency and improvement of judgments in sequential collaboration. Specifically, individuals’ confidence will likely determine the decision whether to adjust a judgment. Domain expertise can be a source of high confidence if individuals do not hold erroneous beliefs (Koriat, 2008, 2011). However, confidence can also stem from miscalibrated meta-cognition (Kruger & Dunning, 1999) or from item-specific knowledge. Especially for general knowledge questions, contributors’ knowledge of specific facts becomes relevant. For instance, contributors may know the location of certain cities because they lived there or recently visited them. Future research should thus examine the role of confidence in sequential collaboration.
While our studies show that expertise predicts the change probability and the amount of improvement in sequential chains of judgments, it remains unclear whether the increased accuracy is due to the sequential judgment process itself or due to the possibility of opting out of answering. Providing the opportunity to opt out increases the accuracy of independent individual judgments, since individuals can use their metacognitive knowledge to select those tasks that best fit their individual expertise (Bennett et al., 2018). Future research should thus disentangle the effects of the sequential judgment-elicitation process and of the opportunity to opt out of providing a judgment.
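The opt-out benefit reported by Bennett et al. (2018) can likewise be sketched in a small simulation; the error levels and opt-in threshold below are assumptions for illustration only. When only members whose expected error is small contribute, the crowd average improves:

```python
import random

def crowd_estimate(true_value, sds, rng):
    """Unweighted average of one noisy judgment per crowd member."""
    judgments = [rng.gauss(true_value, sd) for sd in sds]
    return sum(judgments) / len(judgments)

def crowd_mae(sds, n=3000, true_value=50.0, seed=42):
    """Mean absolute error of the crowd estimate across simulations."""
    rng = random.Random(seed)
    return sum(abs(crowd_estimate(true_value, sds, rng) - true_value)
               for _ in range(n)) / n

all_sds = [2.0, 3.0, 15.0, 20.0, 25.0]        # mixed expertise (assumed)
opted_in = [sd for sd in all_sds if sd <= 5]  # members who know they are accurate

full_crowd_error = crowd_mae(all_sds)
opt_out_error = crowd_mae(opted_in)
```

In this sketch, the smaller self-selected crowd is more accurate than the full crowd, because opting out filters judgments by metacognitive self-knowledge, much as sequential adjustment decisions do.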
Conclusion
Sequential collaboration is a key mechanism in many large-scale online collaborative projects. Our studies show that expertise is an important predictor of whether individuals adjust or maintain presented entries, of how much they improve an entry, and of how accurate the final estimates are. Thereby, we provide evidence for an implicit weighting of judgments by expertise in sequential collaboration, which may contribute to the high accuracy of online collaborative projects.
References
Anderson, J. R., & Fincham, J. M. (1994). Acquisition of procedural skills from examples. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 1322–1340.
Anderson, J. R., Fincham, J. M., & Douglass, S. (1997). The role of examples and rules in the acquisition of a cognitive skill. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23, 932–945.
Baumann, M. R., & Bonner, B. L. (2013). Member awareness of expertise, information sharing, information weighting, and group decision making. Small Group Research, 44, 532–562.
Bennett, S. T., Benjamin, A. S., Mistry, P. K., & Steyvers, M. (2018). Making a wiser crowd: Benefits of individual metacognitive control on crowd performance. Computational Brain & Behavior, 1, 90–99.
Bonner, B. L., Baumann, M. R., & Dalal, R. S. (2002). The effects of member expertise on group decision-making and performance. Organizational Behavior and Human Decision Processes, 88, 719–736.
Budescu, D. V., & Chen, E. (2014). Identifying expertise to extract the wisdom of crowds. Management Science, 61, 267–280.
Ciepłuch, B., Jacob, R., Mooney, P., & Winstanley, A. C. (2010). Comparison of the accuracy of OpenStreetMap for Ireland with Google Maps and Bing Maps. Proceedings of the Ninth International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences, 20–23 July 2010, 337–340.
Davis-Stober, C. P., Budescu, D. V., Dana, J., & Broomell, S. B. (2014). When is a crowd wise? Decision, 1, 79–101.
Dubrovsky, V. J., Kiesler, S., & Sethna, B. N. (1991). The equalization phenomenon: Status effects in computer-mediated and face-to-face decision-making groups. Human–Computer Interaction, 6, 119–146.
Embretson, S. E., & Reise, S. P. (2000). Item response theory. Psychology Press.
Flavell, J. H. (1979). Metacognition and cognitive monitoring: A new area of cognitive–developmental inquiry. American Psychologist, 34, 906–911.
Franz, T. M., & Larson, J. R. (2002). The impact of experts on information sharing during group discussion. Small Group Research, 33, 383–411.
Giles, J. (2005). Internet encyclopaedias go head to head. Nature, 438, 900–901.
Grüning, D. J., & Krueger, J. (2021). Vox peritorum: Capitalizing on confidence and projection to characterize expertise.
Haklay, M. (2010). How good is volunteered geographical information? A comparative study of OpenStreetMap and Ordnance Survey datasets. Environment and Planning B: Planning and Design, 37, 682–703.
Honda, H., Kagawa, R., & Shirasuna, M. (2022). On the round number bias and wisdom of crowds in different response formats for numerical estimation. Scientific Reports, 12, 8167.
Hueffer, K., Fonseca, M. A., Leiserowitz, A., & Taylor, K. M. (2013). The wisdom of crowds: Predicting a weather and climate-related event. Judgment and Decision Making, 8, 91–105.
Jansen, R. A., Rafferty, A. N., & Griffiths, T. L. (2021). A rational model of the Dunning–Kruger effect supports insensitivity to evidence in low performers. Nature Human Behaviour, 5, 756–763.
Koriat, A. (2008). Subjective confidence in one’s answers: The consensuality principle. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34, 945–959.
Koriat, A. (2011). Subjective confidence in perceptual judgments: A test of the self-consistency model. Journal of Experimental Psychology: General, 140.
Kräenbring, J., Monzon Penza, T., Gutmann, J., Muehlich, S., Zolk, O., Wojnowski, L., Maas, R., Engelhardt, S., & Sarikas, A. (2014). Accuracy and completeness of drug information in Wikipedia: A comparison with standard textbooks of pharmacology. PLoS ONE, 9(9).
Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77, 1121–1134.
Lai, E. R. (2011). Metacognition: A literature review. Pearson Research Report.
Larrick, R. P., & Soll, J. B. (2006). Intuitions about combining opinions: Misappreciation of the averaging principle. Management Science, 52.
Lee, M. D., Danileiko, I., & Vi, J. (2018). Testing the ability of the surprisingly popular method to predict NFL games. Judgment and Decision Making, 13.
Leithner, A., Maurer-Ertl, W., Glehr, M., Friesenbichler, J., Leithner, K., & Windhager, R. (2010). Wikipedia and osteosarcoma: A trustworthy patients’ information? Journal of the American Medical Informatics Association, 17, 373–374.
Lin, S., & Cheng, C. (2009). The reliability of aggregated probability judgments obtained through Cooke’s classical model. Journal of Modelling in Management, 4, 149–161.
Lord, F. M., Novick, M. R., & Birnbaum, A. (1968). Statistical theories of mental test scores. Addison-Wesley.
Martire, K. A., Growns, B., & Navarro, D. J. (2018). What do the experts know? Calibration, precision, and the wisdom of crowds among forensic handwriting experts. Psychonomic Bulletin & Review, 25, 2346–2355.
Mayer, M., & Heck, D. W. (2021). Sequential collaboration: About the accuracy of dependent, incremental judgments.
Mayer, M., & Heck, D. W. (2022). Cultural consensus theory for two-dimensional data: Expertise-weighted aggregation of location judgments.
Merkle, E. C., Saw, G., & Davis-Stober, C. (2020). Beating the average forecast: Regularization based on forecaster attributes. Journal of Mathematical Psychology, 98, 102419.
Merkle, E. C., & Steyvers, M. (2011). A psychological model for aggregating judgments of magnitude. In J. Salerno, S. J. Yang, D. Nau, & S.-K. Chai (Eds.), Social computing, behavioral-cultural modeling and prediction (pp. 236–243). Springer.
Mussweiler, T., Englich, B., & Strack, F. (2004). Anchoring effect. In R. F. Pohl (Ed.), Cognitive illusions (1st ed., pp. 183–199). Psychology Press.
Pinheiro, J. C., & Bates, D. M. (2000). Linear mixed-effects models: Basic concepts and examples. In Mixed-effects models in S and S-PLUS (pp. 3–56). Springer.
Prelec, D., Seung, H. S., & McCoy, J. (2017). A solution to the single-question crowd wisdom problem. Nature, 541, 532–535.
Schunn, C. D., & Anderson, J. R. (1999). The generality/specificity of expertise in scientific reasoning. Cognitive Science, 23, 337–370.
Stewart, D. D., & Stasser, G. (1995). Expert role assignment and information sampling during collective recall and decision making. Journal of Personality and Social Psychology, 69, 619–628.
Steyvers, M., Miller, B., Hemmer, P., & Lee, M. (2009). The wisdom of crowds in the recollection of order information. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, & A. Culotta (Eds.), Advances in neural information processing systems (Vol. 22, pp. 1785–1793). Curran Associates, Inc.
Surowiecki, J. (2004). The wisdom of crowds (1st ed.). Anchor Books.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124–1131.
Wang, J., Liu, Y., & Chen, Y. (2021). Forecast aggregation via peer prediction. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 9, 131–142.
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org
Zhang, H., & Malczewski, J. (2017). Accuracy evaluation of the Canadian OpenStreetMap road networks. International Journal of Geospatial and Environmental Research, 5(2).
Zielstra, D., & Zipf, A. (2010). Quantitative studies on the data quality of OpenStreetMap in Germany. AGILE 2010: The 13th AGILE International Conference on Geographic Information Science.
Cities selected for different maps

Table A1
Items for Experiment 1 using map material.

Study phase               Map                                     Cities
Expertise measurement     Austria and Switzerland                 Zurich, Basel
Expertise measurement     France                                  Lyon, Nice
Expertise measurement     Italy                                   Venice
Expertise measurement     Spain and Portugal                      Seville, Lisbon
Expertise measurement     United Kingdom and Ireland              Glasgow
Expertise measurement     Poland, Czechia, Hungary, and Slovenia  Budapest
Expertise measurement     Germany                                 Berlin, Nuremberg, Bonn, Münster, Mannheim, Augsburg,
                                                                  Braunschweig, Munich
Sequential collaboration  Austria and Switzerland                 Geneva, Bern, Vienna, Graz, Linz, Salzburg
Sequential collaboration  France                                  Paris, Marseille, Toulouse
Sequential collaboration  Italy                                   Rome, Milan, Naples, Florence
Sequential collaboration  Spain and Portugal                      Madrid, Barcelona, Porto
Sequential collaboration  United Kingdom and Ireland              London, Birmingham, Liverpool, Dublin
Sequential collaboration  Poland, Czechia, Hungary, and Slovenia  Warsaw, Prague, Bratislava
Sequential collaboration  Germany                                 Hamburg, Cologne, Frankfurt, Stuttgart, Düsseldorf, Leipzig,
                                                                  Dortmund, Essen, Bremen, Dresden, Hannover, Duisburg,
                                                                  Wuppertal, Bielefeld, Karlsruhe, Wiesbaden, Kiel