
Same data, different analysts: variation in effect sizes due to analytical decisions in ecology and evolutionary biology

Preprints and early-stage research may not have been peer reviewed yet.

Abstract

Although variation in effect sizes and predicted values among studies of similar phenomena is inevitable, such variation far exceeds what might be produced by sampling error alone. One possible explanation for variation among results is differences among researchers in the decisions they make regarding statistical analyses. A growing array of studies has explored this analytical variability in different (mostly social science) fields, and has found substantial variability among results, despite analysts having the same data and research question. We implemented an analogous study in ecology and evolutionary biology, fields in which there has been no empirical exploration of the variation in effect sizes or model predictions generated by the analytical decisions of different researchers. We used two unpublished datasets, one from evolutionary ecology (blue tit, Cyanistes caeruleus, to compare sibling number and nestling growth) and one from conservation ecology (Eucalyptus, to compare grass cover and tree seedling recruitment), and the project leaders recruited 174 analyst teams, comprising 246 analysts, to investigate the answers to prespecified research questions. Analyses conducted by these teams yielded 141 usable effects for the blue tit dataset, and 85 usable effects for the Eucalyptus dataset. We found substantial heterogeneity among results for both datasets, although the patterns of variation differed between them. For the blue tit analyses, the average effect was convincingly negative, with less growth for nestlings living with more siblings, but there was near continuous variation in effect size from large negative effects to effects near zero, and even effects crossing the traditional threshold of statistical significance in the opposite direction.
In contrast, the average relationship between grass cover and Eucalyptus seedling number was only slightly negative and not convincingly different from zero, and most effects ranged from weakly negative to weakly positive, with about a third of effects crossing the traditional threshold of significance in one direction or the other. However, there were also several striking outliers in the Eucalyptus dataset, with effects far from zero. For both datasets, we found substantial variation in the variable selection and random effects structures among analyses, as well as in the ratings of the analytical methods by peer reviewers, but we found no strong relationship between any of these and deviation from the meta-analytic mean. In other words, analyses with results that were far from the mean were no more or less likely to have dissimilar variable sets, use random effects in their models, or receive poor peer reviews than those analyses that found results that were close to the mean. The existence of substantial variability among analysis outcomes raises important questions about how ecologists and evolutionary biologists should interpret published results, and how they should conduct analyses in the future.
We recommend viewing this manuscript in HTML format at https://egouldo.github.io/ManyAnalysts/
Same data, different analysts: variation in effect sizes due to analytical decisions in ecology and evolutionary biology.
Elliot Gould, School of Agriculture Food and Ecosystem Sciences, University of Melbourne, Australia
Hannah S. Fraser, School of Historical and Philosophical Studies, University of Melbourne, Australia
Timothy H. Parker, Department of Biology, Whitman College, USA. Author for Correspondence: parkerth@whitman.edu
Shinichi Nakagawa, School of Biological, Earth & Environmental Sciences, University of New South Wales, Australia
Simon C. Griffith, School of Natural Sciences, Macquarie University, Australia
Peter A. Vesk, School of Agriculture Food and Ecosystem Sciences, University of Melbourne, Australia
Fiona Fidler, School of Historical and Philosophical Studies, University of Melbourne, Australia
Daniel G. Hamilton, School of Public Health and Preventive Medicine, Monash University, Australia
Robin N Abbey-Lee, Länsstyrelsen Östergötland, Sweden
Jessica K. Abbott, Biology Department, Lund University, Sweden
Luis A. Aguirre, Department of Biology, University of Massachusetts, USA
Carles Alcaraz, Marine and Continental Waters, IRTA, Spain
Irith Aloni, Department of Life Sciences, Ben Gurion University of the Negev, Israel
Drew Altschul, Department of Psychology, The University of Edinburgh, UK
Kunal Arekar, Centre for Ecological Sciences, Indian Institute of Science, India
Jeff W. Atkins, Southern Research Station, USDA Forest Service, USA
Joe Atkinson, Center for Ecological Dynamics in a Novel Biosphere (ECONOVO), Department of Biology, Aarhus University, Denmark
Christopher M. Baker, School of Mathematics and Statistics, University of Melbourne, Australia
Meghan Barrett, Biology, Indiana University Purdue University Indianapolis, USA
Kristian Bell, School of Life and Environmental Sciences, Deakin University, Australia
Suleiman Kehinde Bello, Department of Arid Land Agriculture, King Abdulaziz University, Kingdom of Saudi Arabia
Iván Beltrán, Department of Biological Sciences, Macquarie University, Australia
Bernd J. Berauer, Department of Plant Ecology, University of Hohenheim, Institute of Landscape and Plant Ecology, Germany
Michael Grant Bertram, Department of Wildlife, Fish, and Environmental Studies, Swedish University of Agricultural Sciences, Sweden
Peter D. Billman, Department of Ecology and Evolutionary Biology, University of Connecticut, USA
Charlie K. Blake, STEM Center, Southern Illinois University Edwardsville, USA
Shannon Blake, University of Guelph, Canada
Louis Bliard, Department of Evolutionary Biology and Environmental Studies, University of Zurich, Switzerland
Andrea Bonisoli-Alquati, Department of Biological Sciences, California State Polytechnic University, Pomona, USA
Timothée Bonnet, Centre d'Études Biologiques de Chizé, UMR 7372 Université de la Rochelle - Centre National de la Recherche Scientifique, France
Camille Nina Marion Bordes, Faculty of Life Sciences, Bar Ilan University, Israel
Aneesh P. H. Bose, Department of Wildlife, Fish, and Environmental Studies, Swedish University of Agricultural Sciences, Sweden
Thomas Botterill-James, School of Natural Sciences, University of Tasmania, Australia
Melissa Anna Boyd, Whitebark Institute, USA
Sarah A. Boyle, Department of Biology, Rhodes College, USA
Tom Bradfer-Lawrence, Centre for Conservation Science, RSPB, UK
Jennifer Bradham, Environmental Studies, Wofford College, USA
Jack A. Brand, Department of Wildlife, Fish and Environmental Studies, Swedish University of Agricultural Sciences, Sweden
Martin I. Brengdahl, IFM Biology, Linköping University, Sweden
Martin Bulla, Faculty of Environmental Sciences, Czech University of Life Sciences Prague, Czech Republic
Luc Bussière, Biological and Environmental Sciences & Gothenburg Global Biodiversity Centre, University of Gothenburg, Sweden
Ettore Camerlenghi, School of Biological Sciences, Monash University, Australia
Sara E. Campbell, Ecology and Evolutionary Biology, University of Tennessee Knoxville, USA
Leonardo L. F. Campos, Departamento de Ecologia e Zoologia, Universidade Federal de Santa Catarina, Brazil
Anthony Caravaggi, School of Biological and Forensic Sciences, University of South Wales, UK
Pedro Cardoso, Centre for Ecology, Evolution and Environmental Changes (cE3c) & CHANGE - Global Change and Sustainability Institute, Faculdade de Ciências, Universidade de Lisboa, Portugal
Charles J.W. Carroll, Forest and Rangeland Stewardship, Colorado State University, USA
Therese A. Catanach, Department of Ornithology, Academy of Natural Sciences of Drexel University, USA
Xuan Chen, Biology, Salisbury University, USA
Heung Ying Janet Chik, Groningen Institute for Evolutionary Life Sciences, University of Groningen, Netherlands
Emily Sarah Choy, Department of Biology, McMaster University, Canada
Alec Philip Christie, Department of Zoology, University of Cambridge, UK
Angela Chuang, Entomology and Nematology, University of Florida, USA
Amanda J. Chunco, Environmental Studies, Elon University, USA
Bethany L. Clark, BirdLife International, UK
Andrea Contina, School of Integrative Biological and Chemical Sciences, The University of Texas Rio Grande Valley, USA
Garth A. Covernton, Department of Ecology and Evolutionary Biology, University of Toronto, Canada
Murray P. Cox, Department of Statistics, University of Auckland, New Zealand
Kimberly A. Cressman, Catbird Stats, LLC, USA
Marco Crotti, School of Biodiversity, One Health & Veterinary Medicine, University of Glasgow, UK
Connor Davidson Crouch, School of Forestry, Northern Arizona University, USA
Pietro B. D'Amelio, Department of Behavioural Neurobiology, Max Planck Institute for Biological Intelligence, Germany
Alexandra Allison de Sousa, School of Sciences: Center for Health and Cognition, Bath Spa University, UK
Timm Fabian Döbert, Department of Biological Sciences, University of Alberta, Canada
Ralph Dobler, Applied Zoology, TU Dresden, Germany
Adam J. Dobson, School of Molecular Biosciences, College of Medical Veterinary & Life Sciences, University of Glasgow, UK
Tim S. Doherty, School of Life and Environmental Sciences, The University of Sydney, Australia
Szymon Marian Drobniak, Institute of Environmental Sciences, Jagiellonian University, Poland
Alexandra Grace Duffy, Biology Department, Brigham Young University, USA
Alison B. Duncan, Institute of Evolutionary Sciences Montpellier, University of Montpellier, CNRS, IRD., France
Robert P. Dunn, Baruch Marine Field Laboratory, University of South Carolina, USA
Jamie Dunning, Department of Life Sciences, Imperial College London, UK
Trishna Dutta, European Forest Institute, Germany
Luke Eberhart-Hertel, Department of Ornithology, Max Planck Institute for Biological Intelligence, Germany
Jared Alan Elmore, Forestry and Environmental Conservation, National Bobwhite and Grassland Initiative, Clemson University, USA
Mahmoud Medhat Elsherif, Department of Psychology and Vision Science, University of Birmingham, Baily Thomas Grant, UK
Holly M. English, School of Biology and Environmental Science, University College Dublin, Ireland
David C. Ensminger, Department of Biological Sciences, San José State University, USA
Ulrich Rainer Ernst, Apicultural State Institute, University of Hohenheim, Germany
Stephen M. Ferguson, Department of Biology, St. Norbert College, USA
Esteban Fernandez-Juricic, Department of Biological Sciences, Purdue University, USA
Thalita Ferreira-Arruda, Biodiversity, Macroecology & Biogeography, Faculty of Forest Sciences and Forest Ecology, University of Göttingen, Germany
John Fieberg, Department of Fisheries, Wildlife, and Conservation Biology, University of Minnesota, USA
Elizabeth A. Finch, CABI, UK
Evan A. Fiorenza, Department of Ecology and Evolutionary Biology, School of Biological Sciences, University of California, Irvine, USA
David N. Fisher, School of Biological Sciences, University of Aberdeen, UK
Amélie Fontaine, Department of Natural Resource Sciences, McGill University, Canada
Wolfgang Forstmeier, Department of Ornithology, Max Planck Institute for Biological Intelligence, Germany
Yoan Fourcade, Institute of Ecology and Environmental Sciences (iEES), Univ. Paris-Est Creteil, France
Graham S. Frank, Department of Forest Ecosystems and Society, Oregon State University, USA
Cathryn A. Freund, Wake Forest University, USA
Eduardo Fuentes-Lillo, Laboratorio de Invasiones Biológicas (LIB), Instituto de Ecología y Biodiversidad, Chile
Sara L. Gandy, Institute for Biodiversity, Animal Health and Comparative Medicine, University of Glasgow, UK
Dustin G. Gannon, Department of Forest Ecosystems and Society, College of Forestry, Oregon State University, USA
Ana I. García-Cervigón, Biodiversity and Conservation Area, Rey Juan Carlos University, Spain
Alexis C. Garretson, Graduate School of Biomedical Sciences, Tufts University, USA
Xuezhen Ge, Department of Integrative Biology, University of Guelph, Canada
William L. Geary, School of Life and Environmental Sciences (Burwood Campus), Deakin University, Australia
Charly Géron, CNRS, University of Rennes, France
Marc Gilles, Department of Behavioural Ecology, Bielefeld University, Germany
Antje Girndt, Fakultät für Biologie, Arbeitsgruppe Evolutionsbiologie, Universität Bielefeld, Germany
Daniel Gliksman, Chair of Meteorology, Institute for Hydrology and Meteorology, Faculty of Environmental Sciences, Technische Universität Dresden, Germany
Harrison B. Goldspiel, Department of Wildlife, Fisheries, and Conservation Biology, University of Maine, USA
Dylan G. E. Gomes, Department of Biological Sciences, Boise State University, USA
Megan Kate Good, School of Agriculture, Food and Ecosystem Sciences, The University of Melbourne, Australia
Sarah C. Goslee, Pastures Systems and Watershed Management Research Unit, USDA Agricultural Research Service, USA
J. Stephen Gosnell, Department of Natural Sciences, Baruch College, City University of New York, USA
Eliza M. Grames, Department of Biological Sciences, Binghamton University, USA
Paolo Gratton, Dipartimento di Biologia, Università di Roma "Tor Vergata", Italy
Nicholas M. Grebe, Department of Anthropology, University of Michigan, USA
Skye M. Greenler, College of Forestry, Oregon State University, USA
Maaike Griffioen, University of Antwerp, Belgium
Daniel M. Griffith, Earth & Environmental Sciences, Wesleyan University, USA
Frances J. Griffith, Yale School of Medicine, Department of Psychiatry, Yale University, USA
Jake J. Grossman, Biology Department and Environmental Studies Department, St. Olaf College, USA
Ali Güncan, Department of Plant Protection, Faculty of Agriculture, Ordu University, Turkey
Stef Haesen, Department of Earth and Environmental Sciences, KU Leuven, Belgium
James G. Hagan, Department of Marine Sciences, University of Gothenburg, Sweden
Heather A. Hager, Department of Biology, Wilfrid Laurier University, Canada
Jonathan Philo Harris, Natural Resource Ecology and Management, Iowa State University, USA
Natasha Dean Harrison, School of Biological Sciences, University of Western Australia, Australia
Sarah Syedia Hasnain, Department of Biological Sciences, Middle East Technical University, Turkey
Justin Chase Havird, Dept. of Integrative Biology, University of Texas at Austin, USA
Andrew J. Heaton, Grand Bay National Estuarine Research Reserve, USA
María Laura Herrera-Chaustre, Universidad de los Andes, Colombia
Tanner J. Howard
Bin-Yan Hsu, Department of Biology, University of Turku, Finland
Fabiola Iannarilli, Dept of Fisheries, Wildlife and Conservation Biology, University of Minnesota, USA
Esperanza C. Iranzo, Instituto de Ciencia Animal. Facultad de Ciencias Veterinarias, Universidad Austral de Chile, Chile
Erik N. K. Iverson, Department of Integrative Biology, The University of Texas at Austin, USA
Saheed Olaide Jimoh, Department of Botany, University of Wyoming, USA
Douglas H. Johnson, Department of Fisheries, Wildlife, and Conservation Biology, University of Minnesota, USA
Martin Johnsson, Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Sweden
Jesse Jorna, Department of Biology, Brigham Young University, Brigham Young University, USA
Tommaso Jucker, School of Biological Sciences, University of Bristol, UK
Martin Jung, International Institute for Applied Systems Analysis (IIASA), Austria
Ineta Kačergytė, Department of Ecology, Swedish University of Agricultural Sciences, Sweden
Oliver Kaltz, Université de Montpellier, France
Alison Ke, Department of Wildlife, Fish, and Conservation Biology, University of California, Davis, USA
Clint D. Kelly, Département des Sciences biologiques, Université du Québec à Montréal, Canada
Katharine Keogan, Institute of Evolutionary Biology, University of Edinburgh, UK
Friedrich Wolfgang Keppeler, Center for Limnology, Center for Limnology, University of Wisconsin - Madison, USA
Alexander K. Killion, Center for Biodiversity and Global Change, Yale University, USA
Dongmin Kim, Department of Ecology, Evolution, and Behavior, University of Minnesota, St. Paul, USA
David P. Kochan, Institute of Environment and Department of Biological Sciences, Florida International University, USA
Peter Korsten, Department of Life Sciences, Aberystwyth University, UK
Shan Kothari, Institut de recherche en biologie végétale, Université de Montréal, Canada
Jonas Kuppler, Institute of Evolutionary Ecology and Conservation Genomics, Ulm University, Germany
Jillian M. Kusch, Department of Biology, Memorial University of Newfoundland, Canada
Malgorzata Lagisz, Evolution & Ecology Research Centre and School of Biological, Earth & Environmental Sciences, University of New South Wales, Australia
Kristen Marianne Lalla, Department of Natural Resource Sciences, McGill University, Canada
Daniel J. Larkin, Department of Fisheries, Wildlife and Conservation Biology, University of Minnesota-Twin Cities, USA
Courtney L. Larson, The Nature Conservancy, USA
Katherine S. Lauck, Department of Wildlife, Fish, and Conservation Biology, University of California, Davis, USA
M. Elise Lauterbur, Ecology and Evolutionary Biology, University of Arizona, USA
Alan Law, Biological and Environmental Sciences, University of Stirling, UK
Don-Jean Léandri-Breton, Department of Natural Resource Sciences, McGill University, Canada
Jonas J. Lembrechts, Department of Biology, University of Antwerp, Belgium
Kiara L'Herpiniere, Natural sciences, Macquarie University, Australia
Eva J. P. Lievens, Aquatic Ecology and Evolution Group, Limnological Institute, University of Konstanz, Germany
Daniela Oliveira de Lima, Campus Cerro Largo, Universidade Federal da Fronteira Sul, Brazil
Shane Lindsay, School of Psychology and Social Work, University of Hull, UK
Martin Luquet, UMR 1224 ECOBIOP, Université de Pau et des Pays de l'Adour, France
Ross MacLeod, School of Biological & Environmental Sciences, Liverpool John Moores University, UK
Kirsty H. Macphie, Institute of Ecology and Evolution, University of Edinburgh, UK
Kit Magellan, Cambodia
Magdalena M. Mair, Statistical Ecotoxicology, Bayreuth Center of Ecology and Environmental Research (BayCEER), University of Bayreuth, Germany
Lisa E. Malm, Ecology and Environmental Science, Umeå University, Sweden
Stefano Mammola, Molecular Ecology Group (MEG), Water Research Institute (IRSA), National Research Council of Italy (CNR), Italy
Caitlin P. Mandeville, Department of Natural History, Norwegian University of Science and Technology, Norway
Michael Manhart, Center for Advanced Biotechnology and Medicine, Rutgers University Robert Wood Johnson Medical School, USA
Laura Milena Manrique-Garzon, Departamento de Ciencias Biológicas, Universidad de los Andes, Colombia
Elina Mäntylä, Department of Biology, University of Turku, Finland
Philippe Marchand, Institut de recherche sur les forêts, Université du Québec en Abitibi-Témiscamingue, Canada
Benjamin Michael Marshall, Biological and Environmental Sciences, University of Stirling, UK
Charles A. Martin, Université du Québec à Trois-Rivières, Canada
Dominic Andreas Martin, Institute of Plant Sciences, University of Bern, Switzerland
Jake Mitchell Martin, Department of Wildlife, Fish, and Environmental Studies, Swedish University of Agricultural Sciences, Sweden
April Robin Martinig, School of Biological, Earth and Environmental Sciences, University of New South Wales, Australia
Erin S. McCallum, Department of Wildlife, Fish and Environmental Studies, Swedish University of Agricultural Sciences, Sweden
Mark McCauley, Whitney Laboratory for Marine Bioscience, University of Florida, USA
Sabrina M. McNew, Ecology and Evolutionary Biology, University of Arizona, USA
Scott J. Meiners, Biological Sciences, Eastern Illinois University, USA
Thomas Merkling, Centre d'Investigations Clinique Plurithématique - Institut Lorrain du Coeur et des Vaisseaux, Université de Lorraine, Inserm1433 CIC-P CHRU de Nancy, France
Marcus Michelangeli, Department of Wildlife, Fish and Environmental Studies, Swedish University of Agricultural Sciences, Sweden
Maria Moiron, Evolutionary biology department, Bielefeld University, Germany
Bruno Moreira, Department of Ecology and global change, Centro de Investigaciones sobre Desertificación, Consejo Superior de Investigaciones Científicas (CIDE-CSIC/UV/GV), Spain
Jennifer Mortensen, Department of Biological Sciences, University of Arkansas, USA
Benjamin Mos, School of the Environment, Faculty of Science, The University of Queensland, Australia
Taofeek Olatunbosun Muraina, Department of Animal Health and Production, Oyo State College of Agriculture and Technology, Nigeria
Penelope Wrenn Murphy, Department of Forest & Wildlife Ecology, University of Wisconsin-Madison, USA
Luca Nelli, School of Biodiversity, One Health and Veterinary Medicine, University of Glasgow, UK
Petri Niemelä, Organismal and Evolutionary Biology Research Programme, Faculty of Biological and Environmental Sciences, University of Helsinki, Finland
Josh Nightingale, South Iceland Research Centre, University of Iceland, Iceland
Gustav Nilsonne, Department of Clinical Neuroscience, Karolinska Institutet, Sweden
Sergio Nolazco, School of Biological Sciences, Monash University, Australia
Sabine S. Nooten, Animal Ecology and Tropical Biology, University of Würzburg, Germany
Jessie Lanterman Novotny, Biology, Hiram College, USA
Agnes Birgitta Olin, Department of Aquatic Resources, Swedish University of Agricultural Sciences, Sweden
Chris L. Organ, Department of Earth Sciences, Montana State University, USA
Kate L. Ostevik, Department of Evolution, Ecology, and Organismal Biology, University of California, Riverside, USA
Facundo Xavier Palacio, Sección Ornitología, Universidad Nacional de La Plata, Argentina
Matthieu Paquet, Department of Ecology, Swedish University of Agricultural Sciences, Sweden
Darren James Parker, Bangor University, UK
David J. Pascall, MRC Biostatistics Unit, University of Cambridge, UK
Valerie J. Pasquarella, Harvard Forest, Harvard University, USA
John Harold Paterson, Biological and Environmental Sciences, University of Stirling, Scotland
Ana Payo-Payo, Departamento de Biodiversidad, Ecología y Evolución., Universidad Complutense de Madrid, Spain
Karen Marie Pedersen, Biology Department, Technische Universität Darmstadt, Germany
Grégoire Perez, UMR 1309 ASTRE, CIRAD, France
Kayla I. Perry, Department of Entomology, The Ohio State University, USA
Patrice Pottier, Evolution & Ecology Research Centre, School of Biological, Earth and Environmental Sciences, The University of New South Wales, Australia
Michael J. Proulx, Department of Psychology, University of Bath, UK
Raphaël Proulx, Chaire de recherche en intégrité écologique, Université du Québec à Trois-Rivières, Canada
Jessica L Pruett, Mississippi Based RESTORE Act Center of Excellence, University of Southern Mississippi, USA
Veronarindra Ramananjato, Department of Integrative Biology, University of California, Berkeley, USA
Finaritra Tolotra Randimbiarison, Mention Zoologie et Biodiversité Animale, Université d'Antananarivo, Madagascar
Onja H. Razafindratsima, Department of Integrative Biology, University of California, Berkeley, USA
Diana J. Rennison, Department of Ecology, Behavior and Evolution, University of California, San Diego, USA
Federico Riva, Institute for Environmental Sciences, VU Amsterdam, The Netherlands
Sepand Riyahi, Department of Evolutionary Anthropology, University of Vienna, Austria
Michael James Roast, Konrad Lorenz Institute for Ethology, University of Veterinary Medicine, Austria
Felipe Pereira Rocha, School of Biological Sciences, The University of Hong Kong, China
Dominique G. Roche, Institut de biologie, Université de Neuchâtel, Switzerland
Cristian Román-Palacios, School of Information, University of Arizona, USA
Michael S. Rosenberg, Center for Biological Data Science, Virginia Commonwealth University, USA
Jessica Ross, University of Wisconsin, USA
Freya E. Rowland, School of the Environment, Yale University, USA
Deusdedith Rugemalila, Institute of the Environment, Florida International University, USA
Avery L. Russell, Department of Biology, Missouri State University, USA
Suvi Ruuskanen, Department of Biological and Environmental Science, University of Jyväskylä, Finland
Patrick Saccone, Institute for Interdisciplinary Mountain Research, OeAW (Austrian Academy of Sciences), Austria
Asaf Sadeh, Department of Natural Resources, Newe Ya'ar Research Center, Agricultural Research Organization (Volcani Institute), Israel
Stephen M. Salazar, Department of Animal Behaviour, Bielefeld University, Germany
Kris Sales, Office for National Statistics, UK
Pablo Salmón, Institute of Avian Research "Vogelwarte Helgoland", Germany
Alfredo Sánchez-Tójar, Department of Evolutionary Biology, Bielefeld University, Germany
Leticia Pereira Santos, Ecology Department, Universidade Federal de Goiás, Brazil
Francesca Santostefano, University of Exeter, University of Exeter, UK
Hayden T. Schilling, New South Wales Department of Primary Industries Fisheries, Australia
Marcus Schmidt, Research Data Management, Leibniz Centre for Agricultural Landscape Research (ZALF), Germany
Tim Schmoll, Evolutionary Biology, Bielefeld University, Germany
Adam C. Schneider, Biology Department, University of Wisconsin-La Crosse, USA
Allie E. Schrock, Department of Evolutionary Anthropology, Duke University, USA
Julia Schroeder, Department of Life Sciences, Imperial College London, UK
Nicolas Schtickzelle, Earth and Life Institute, Ecology and Biodiversity, UCLouvain, Belgium
Nick L. Schultz, Future Regions Research Centre, Federation University Australia, Australia
Drew A. Scott, United States Department of Agriculture- Agricultural Research Service-, USA
Michael Peter Scroggie, Arthur Rylah Institute for Environmental Research, Australia
Julie Teresa Shapiro, Epidemiology and Surveillance Support Unit, University of Lyon - French Agency for Food, Environmental and Occupational Health and Safety (ANSES), France
Nitika Sharma, UCLA Anderson Center for Impact, University of California, Los Angeles, USA
Caroline L. Shearer, Department of Evolutionary Anthropology, Duke University, USA
Diego Simón, Facultad de Ciencias, Universidad de la República, Uruguay
Michael I. Sitvarin, Independent researcher, USA
Fabrício Luiz Skupien, Programa de Pós-Graduação em Ecologia, Instituto de Biologia, Centro de Ciências da Saúde, Universidade Federal do Rio de Janeiro, Brazil
Heather Lea Slinn, Vive Crop Protection, Canada
Grania Polly Smith, University of Cambridge, UK
Jeremy A. Smith, British Trust for Ornithology, UK
Rahel Sollmann, Department of Wildlife, Fish, and Conservation Biology, University of California, Davis, USA
Kaitlin Stack Whitney, Science, Technology & Society Department, Rochester Institute of Technology, USA
Shannon Michael Still, Nomad Ecology, USA
Erica F. Stuber, Wildland Resources Department, Utah State University, USA
Guy F. Sutton, Center for Biological Control, Department of Zoology and Entomology, Rhodes University, South Africa
Ben Swallow, School of Mathematics and Statistics and Centre for Research in Ecological and Environmental Modelling, University of St Andrews, UK
Conor Claverie Taff, Department of Ecology and Evolutionary Biology, Cornell University, USA
Elina Takola, Department of Computational Landscape Ecology, Helmholtz Centre for Environmental Research – UFZ, Germany
Andrew J. Tanentzap, Ecosystems and Global Change Group, School of the Environment, Trent University, Canada
Rocío Tarjuelo, Instituto Universitario de Investigación en Gestión Forestal Sostenible (iuFOR), Universidad de Valladolid, Spain
Richard J. Telford, Department of Biological Sciences, University of Bergen, Norway
Christopher J. Thawley, Department of Biological Science, University of Rhode Island, USA
Hugo Thierry, Department of Geography, McGill University, Canada
Jacqueline Thomson, Integrative Biology, University of Guelph, Canada
Svenja Tidau, School of Biological and Marine Sciences, University of Plymouth, UK
Emily M. Tompkins, Biology Department, Wake Forest University, USA
Claire Marie Tortorelli, Plant Sciences, University of California, Davis, USA
Andrew Trlica, College of Natural Resources, North Carolina State University, USA
Biz R. Turnell, Institute of Zoology, Technische Universität Dresden, Germany
Lara Urban, Helmholtz AI, Helmholtz Zentrum Muenchen, Germany
Stijn Van de Vondel, Department of Biology, University of Antwerp, Belgium
Jessica Eva Megan van der Wal, FitzPatrick Institute of African Ornithology, University of Cape Town, South Africa
Jens Van Eeckhoven, Department of Cell & Developmental Biology, Division of Biosciences, University College London, UK
Francis van Oordt, Natural Resource Sciences, McGill University, Canada
K. Michelle Vanderwel, Biology, University of Saskatchewan, Canada
Mark C. Vanderwel, Department of Biology, University of Regina, Canada
Karen J. Vanderwolf, Biology, University of Waterloo, Canada
Juliana Vélez, Department of Fisheries, Wildlife and Conservation Biology, University of Minnesota, USA
Diana Carolina Vergara-Florez, Department of Ecology & Evolutionary Biology, University of Michigan, USA
Brian C. Verrelli, Center for Biological Data Science, Virginia Commonwealth University, USA
Marcus Vinícius Vieira, Dept. Ecologia, Instituto de Biologia, Universidade Federal do Rio de Janeiro, Brazil
Nora Villamil, Lothian Analytical Services, Public Health Scotland, UK
Valerio Vitali, Institute for Evolution and Biodiversity, University of Muenster, Germany
Julien Vollering, Department of Environmental Sciences, Western Norway University of Applied Sciences, Norway
Jeffrey Walker, Department of Biological Sciences, University of Southern Maine, USA
Xanthe J. Walker, Center for Ecosystem Science and Society, Northern Arizona University, USA
Jonathan A. Walter, Center for Watershed Sciences, University of California, Davis, USA
Pawel Waryszak, School of Agriculture and Environmental Science, University of Southern Queensland, Australia
Ryan J. Weaver, Department of Ecology, Evolution, and Organismal Biology, Iowa State University, USA
Ronja E. M. Wedegärtner, Fram Project AS, Norway
Daniel L. Weller, Department of Food Science & Technology, Virginia Polytechnic Institute and State University, USA
Shannon Whelan, Department of Natural Resource Sciences, McGill University, Canada
Rachel Louise White, School of Applied Sciences, University of Brighton, UK
David William Wolfson, Department of Fisheries, Wildlife and Conservation Biology, University of Minnesota, USA
Andrew Wood, Department of Biology, University of Oxford, UK
Scott W. Yanco, Department of Integrative Biology, University of Colorado, Denver, USA
Jian D. L. Yen, Arthur Rylah Institute for Environmental Research, Australia
Casey Youngflesh, Ecology, Evolution, and Behavior Program, Michigan State University, USA
Giacomo Zilio, ISEM, University of Montpellier, CNRS, France
Cédric Zimmer, Laboratoire d’Ethologie Expérimentale et Comparée, LEEC, UR4443, Université Sorbonne Paris Nord, USA
Gregory Mark Zimmerman, Department of Science and Environment, Lake Superior State University, USA
Rachel A. Zitomer, Department of Forest Ecosystems and Society, Oregon State University, USA
Abstract
Although variation in effect sizes and predicted values among studies of similar phenomena is
inevitable, such variation far exceeds what might be produced by sampling error alone. One possible
explanation for variation among results is differences among researchers in the decisions they make
regarding statistical analyses. A growing array of studies has explored this analytical variability in
different (mostly social science) fields, and has found substantial variability among results, despite
analysts having the same data and research question. We implemented an analogous study in
ecology and evolutionary biology, fields in which there has been no empirical exploration of the
variation in effect sizes or model predictions generated by the analytical decisions of different
researchers. We used two unpublished datasets, one from evolutionary ecology (blue tit, Cyanistes
caeruleus, to compare sibling number and nestling growth) and one from conservation ecology
(Eucalyptus, to compare grass cover and tree seedling recruitment), and the project leaders recruited
174 analyst teams, comprising 246 analysts, to investigate the answers to prespecified research
questions. Analyses conducted by these teams yielded 141 usable effects for the blue tit dataset, and
85 usable effects for the Eucalyptus dataset. We found substantial heterogeneity among results for
both datasets, although the patterns of variation differed between them. For the blue tit analyses,
the average effect was convincingly negative, with less growth for nestlings living with more siblings,
but there was near continuous variation in effect size from large negative effects to effects near zero,
and even effects crossing the traditional threshold of statistical significance in the opposite direction.
In contrast, the average relationship between grass cover and Eucalyptus seedling number was only
slightly negative and not convincingly different from zero, and most effects ranged from weakly
negative to weakly positive, with about a third of effects crossing the traditional threshold of
significance in one direction or the other. However, there were also several striking outliers in
the Eucalyptus dataset, with effects far from zero. For both datasets, we found substantial variation
in the variable selection and random effects structures among analyses, as well as in the ratings of
the analytical methods by peer reviewers, but we found no strong relationship between any of these
and deviation from the meta-analytic mean. In other words, analyses with results that were far from
the mean were no more or less likely to have dissimilar variable sets, use random effects in their
models, or receive poor peer reviews than those analyses that found results that were close to the
mean. The existence of substantial variability among analysis outcomes raises important questions
about how ecologists and evolutionary biologists should interpret published results, and how they
should conduct analyses in the future.
Key Words
credibility revolution, heterogeneity, meta-analysis, metascience, replicability, reproducibility
Introduction
One value of science derives from its production of replicable, and thus reliable, results. When we
repeat a study using the original methods we should be able to expect a similar result. However,
perfect replicability is not a reasonable goal. Effect sizes will vary, and even reverse in sign, by
chance alone [1]. Observed patterns can differ for other reasons as well. It could be that we do not
sufficiently understand the conditions that led to the original result so when we seek to replicate it,
the conditions differ due to some ‘hidden moderator’. This hidden moderator hypothesis is
described by meta-analysts in ecology and evolutionary biology as ‘true biological heterogeneity’ [2].
This idea of true heterogeneity is popular in ecology and evolutionary biology, and there are good
reasons to expect it in the complex systems in which we work [3]. However, despite similar
expectations in psychology, recent evidence in that discipline contradicts the hypothesis that
moderators are common obstacles to replicability, as variability in results in a large ‘many labs’
collaboration was mostly unrelated to commonly hypothesized moderators such as the conditions
under which the studies were administered [4]. Another possible explanation for variation in effect
sizes is that researchers often present biased samples of results, thus reducing the likelihood that
later studies will produce similar effect sizes [5–9]. It also may be that although researchers did
successfully replicate the conditions, the experiment, and measured variables, analytical decisions
differed sufficiently among studies to create divergent results [10, 11].
Analytical decisions vary among studies because researchers have many options. Researchers need
to decide how to exclude possibly anomalous or unreliable data, how to construct variables, which
variables to include in their models, and which statistical methods to use. Depending on the dataset,
this short list of choices could encompass thousands or millions of possible alternative
specifications [10]. However, researchers making these decisions presumably do so with the goal of
doing the best possible analysis, or at least the best analysis within their current skill set. Thus, it
seems likely that some specification options are more probable than others, possibly because they
have previously been shown (or claimed) to be better, or because they are better known. Of
course, some of these different analyses (maybe many of them) may be equally valid alternatives.
Regardless, on probably any topic in ecology and evolutionary biology, we can encounter differences
in choices of data analysis. The extent of these differences in analyses and the degree to which these
differences influence the outcomes of analyses and therefore studies’ conclusions are important
empirical questions. These questions are especially important given that many papers draw
conclusions after applying a single method, or even a single statistical model, to analyze a dataset.
The possibility that different analytical choices could lead to different outcomes has long been
recognized [12], and various efforts to address this possibility have been pursued in the literature.
For instance, one common method in ecology and evolutionary biology involves creating a set of
candidate models, each consisting of a different (though often similar) set of predictor variables, and
then, for the predictor variable of interest, averaging the slope across all models (i.e. model
averaging) [13, 14]. This method reduces the chance that a conclusion is contingent upon a single
model specification, though use and interpretation of this method is not without challenges [14].
Further, the models compared to each other typically differ only in the inclusion or exclusion of
certain predictor variables and not in other important ways, such as methods of parameter
estimation. More explicit examination of outcomes of differences in model structure, model type,
data exclusion, or other analytical choices can be implemented through sensitivity
analyses [e.g., 15]. Sensitivity analyses, however, are typically rather narrow in scope, and are
designed to assess the sensitivity of analytical outcomes to a particular analytical choice rather than
to a large universe of choices. Recently, however, analysts in the social sciences have proposed
extremely thorough forms of sensitivity analysis, including ‘multiverse analysis’ [16] and the ‘specification
curve’ [10], as a means of increasing the reliability of results. With these methods, researchers
identify relevant decision points encountered during analysis and conduct the analysis many times
to incorporate many plausible decisions made at each of these points. The study’s conclusions are
then based on a broad set of the possible analyses and so allow the analyst to distinguish between
robust conclusions and those that are highly contingent on particular model specifications. These are
useful outcomes, but specifying a universe of possible modelling decisions is not a trivial
undertaking. Further, the analyst’s knowledge and biases will influence decisions about the
boundaries of that universe, and so there will always be room for disagreement among analysts
about what to include. Including more specifications is not necessarily better. Some analytical
decisions are better justified than others, and including biologically implausible specifications may
undermine this process. Regardless, these powerful methods have yet to be adopted, and even
more limited forms of sensitivity analyses are not particularly widespread. Most studies publish a
small set of analyses and so the existing literature does not provide much insight into the degree to
which published results are contingent on analytical decisions.
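
As a rough illustration of how a multiverse or specification-curve analysis defines its universe of analyses, the Python sketch below enumerates every combination of options across a few decision points. The decision points and option labels are hypothetical placeholders chosen for illustration, not choices taken from any analysis in this study, and a real application would fit a model for each enumerated specification.

```python
from itertools import product

# Hypothetical decision points (assumed for illustration only); each maps to
# the options an analyst might consider defensible at that point.
DECISION_POINTS = {
    "outlier_rule": ["keep_all", "drop_extreme"],
    "response_transform": ["none", "log"],
    "covariates": ["minimal", "full"],
    "random_effects": ["none", "site_intercept"],
}


def enumerate_specifications(decision_points):
    """Return one dict per specification, choosing one option per decision point."""
    names = list(decision_points)
    option_lists = [decision_points[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*option_lists)]


specifications = enumerate_specifications(DECISION_POINTS)
print(f"{len(specifications)} specifications in this toy multiverse")
```

Even four binary decisions yield 16 specifications, which suggests how quickly the universe grows once realistic numbers of decision points and options are included.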
Despite the potential major impacts of analytical decisions on variance in results, the outcomes of
different individuals’ data analysis choices have received limited empirical attention. The only formal
explorations of this that we were aware of when we submitted our Stage 1 manuscript were (1) an
analysis in social science that asked whether male professional football (soccer) players with darker
skin tone were more likely to be issued red cards (ejection from the game for rule violation) than
players with lighter skin tone [11] and (2) an analysis in neuroimaging which evaluated nine separate
hypotheses involving the neurological responses detected with fMRI in 108 participants divided
between two treatments in a decision-making task [17]. Several others have been published
since [e.g., 18, 19–21]. In the red card study, twenty-nine teams designed and implemented analyses
of a dataset provided by the study coordinators [11]. Analyses were peer reviewed (results blind) by
at least two other participating analysts, a level of scrutiny consistent with standard pre-publication
peer review. Among the final 29 analyses, odds ratios varied from 0.89 to 2.93, meaning point
estimates varied from having players with lighter skin tones receive more red cards (odds ratio < 1)
to a strong effect of players with darker skin tones receiving more red cards (odds ratio > 1). Twenty
of the 29 teams found a statistically significant effect in the predicted direction of players with
darker skin tones being issued more red cards. This degree of variation in peer-reviewed analyses
from identical data is striking, but the generality of this finding has only just begun to be formally
investigated.
In the neuroimaging study, 70 teams evaluated each of the nine different hypotheses with the
available fMRI data [17]. These 70 teams followed a divergent set of workflows that produced a wide
range of results. The rate of reporting of statistically significant support for the nine hypotheses
ranged from 21% to 84%, and for each hypothesis on average, 20% of research teams observed
effects that differed substantially from the majority of other teams. Some of the variability in results
among studies could be explained by analytical decisions such as choice of software package,
smoothing function, and parametric versus non-parametric corrections for multiple comparisons.
However, substantial variability among analyses remained unexplained, and presumably emerged
from the many different decisions each analyst made in their long workflows. Such variability in
results among analyses from this dataset and from the very different red-card dataset suggests that
sensitivity of analytical outcome to analytical choices may characterize many distinct fields, as
several more recent many-analyst studies also suggest [18–20].
To further develop the empirical understanding of the effects of analytical decisions on study
outcomes, we chose to estimate the extent to which researchers’ data analysis choices drive
differences in effect sizes, model predictions, and qualitative conclusions in ecology and evolutionary
biology. This is an important extension of the meta-research agenda of evaluating factors influencing
replicability in ecology, evolutionary biology, and beyond [22]. To examine the effects of analytical
decisions, we used two different datasets and recruited researchers to analyze one or the other of
these datasets to answer a question we defined. The first question was “To what extent is the
growth of nestling blue tits (Cyanistes caeruleus) influenced by competition with siblings?” To
answer this question, we provided a dataset that includes brood size manipulations from 332 broods
conducted over three years at Wytham Wood, UK. The second question was “How does grass cover
influence Eucalyptus spp. seedling recruitment?” For this question, analysts used a dataset that
includes, among other variables, number of seedlings in different size classes, percentage cover of
different life forms, tree canopy cover, and distance from canopy edge from 351 quadrats spread
among 18 sites in Victoria, Australia.
We explored the impacts of data analysts’ choices with descriptive statistics and with a series of tests
to attempt to explain the variation among effect sizes and predicted values of the dependent variable
produced by the different analysis teams for both datasets separately. To describe the variability, we
present forest plots of the standardized effect sizes and predicted values produced by each of the
analysis teams, estimate heterogeneity (both absolute, τ², and proportional, I²) in effect size and
predicted values among the results produced by these different teams, and calculate a similarity
index that quantifies variability among the predictor variables selected for the different statistical
models constructed by the different analysis teams. These descriptive statistics provide the first
estimates of the extent to which explanatory statistical models and their outcomes in ecology and
evolutionary biology vary based on the decisions of different data analysts. We then quantified the
degree to which the variability in effect size and predicted values could be explained by (1) variation
in the quality of analyses as rated by peer reviewers and (2) the similarity of the choices of predictor
variables between individual analyses.
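
To make the two heterogeneity measures named above concrete, the following Python sketch computes τ² and I² from a set of effect sizes and their sampling variances. It uses the DerSimonian-Laird moment estimator and the Q-based definition of I² purely as a familiar stand-in; this section does not state which estimator the study used, and the function name and input values are assumptions made for illustration.

```python
def dersimonian_laird(effects, variances):
    """Return (tau2, i2) for effect sizes with known sampling variances.

    tau2 is the DerSimonian-Laird estimate of between-analysis variance and
    i2 is the proportion of total variation attributed to heterogeneity.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two effects with matching variances")
    weights = [1.0 / v for v in variances]          # inverse-variance weights
    sum_w = sum(weights)
    mean_fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(w * (y - mean_fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return tau2, i2


# Toy inputs (not data from this study): two effects with equal variances.
tau2, i2 = dersimonian_laird([0.0, 1.0], [0.1, 0.1])
print(f"tau2 = {tau2:.2f}, I2 = {i2:.0%}")
```

Here τ² is on the squared scale of the effect sizes, while I² is a unitless proportion, which is why the two are reported together as absolute and proportional heterogeneity.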
Methods
This project involved a series of steps (1-6) that began with identifying datasets for analyses and
continued through recruiting independent groups of scientists to analyze the data, allowing the
scientists to analyze the data as they saw fit, generating peer review ratings of the analyses (based
on methods, not results), evaluating the variation in effects among the different analyses, and
producing the final manuscript.
Step 1: Select Datasets
We used two previously unpublished datasets, one from evolutionary ecology and the other from
ecology and conservation.
Evolutionary Ecology
Our evolutionary ecology dataset is relevant to a sub-discipline of life-history research which focuses
on identifying costs and trade-offs associated with different phenotypic conditions.
These data were derived from a brood-size manipulation experiment imposed on wild birds nesting
in boxes provided by researchers in an intensively studied population.
Understanding how the growth of nestlings is influenced by the numbers of siblings in the nest can
give researchers insights into factors such as the evolution of clutch size, determination of
provisioning rates by parents, and optimal levels of sibling competition (Vander Werf 1992; DeKogel
1997; Royle et al. 1999; Verhulst, Holveck, and Riebel 2006; Nicolaus et al. 2009). Data analysts were
provided this dataset and instructed to answer the following question: “To what extent is the growth
of nestling blue tits (Cyanistes caeruleus) influenced by competition with siblings?”

Researchers conducted brood size manipulations and population monitoring of blue tits at Wytham
Wood, a 380 ha woodland in Oxfordshire, U.K. (1° 20’W, 51° 47’N). Researchers regularly checked
approximately 1100 artificial nest boxes at the site and monitored the 330 to 450 blue tit pairs
occupying those boxes in 2001-2003 during the experiment. Nearly all birds made only one breeding
attempt during the April to June study period in a given year. At each blue tit nest, researchers
recorded the date the first egg appeared, clutch size, and hatching date. For all chicks alive at age 14
days, researchers measured mass and tarsus length and fitted a uniquely numbered, British Trust for
Ornithology (BTO) aluminium leg ring. Researchers attempted to capture all adults at their nests
between day 6 and day 14 of the chick-rearing period. For these captured adults, researchers
measured mass, tarsus length, and wing length and fitted a uniquely numbered BTO leg ring. During
the 2001-2003 breeding seasons, researchers manipulated brood sizes using cross fostering. They
matched broods for hatching date and brood size and moved chicks between these paired nests one
or two days after hatching. They sought to either enlarge or reduce all manipulated broods by
approximately one fourth. To control for effects of being moved, each reduced brood had a portion
of its brood replaced by chicks from the paired increased brood, and vice versa. Net manipulations
varied from plus or minus four chicks in broods of 12 to 16 to plus or minus one chick in broods of 4
or 5. Researchers left approximately one third of all broods unmanipulated. These unmanipulated
broods were not selected systematically to match manipulated broods in clutch size or laying date.
We have mass and tarsus length data from 3720 individual chicks divided among 167 experimentally
enlarged broods, 165 experimentally reduced broods, and 120 unmanipulated broods. The full list of
variables included in the dataset is publicly available (https://osf.io/hdv8m), along with the data
(https://osf.io/qjzby).

Additional explanation:
Shortly after beginning to recruit analysts, several analysts noted a small set of related errors in
the blue tit dataset. We corrected the errors, replaced the dataset on our OSF site, and emailed
the analysts on 19 April 2020 to instruct them to use the revised data. The email to analysts is
available here (https://osf.io/4h53z). The errors are explained in that email.
Ecology and Conservation
Our ecology and conservation dataset is relevant to a sub-discipline of conservation research which
focuses on investigating how best to revegetate private land in agricultural landscapes. These data
were collected on private land under the Bush Returns program, an incentive system where
participants entered into a contract with the Goulburn Broken Catchment Management Authority
and received annual payments if they executed predetermined restoration activities. This particular
dataset is based on a passive regeneration initiative, where livestock grazing was removed from the
property in the hopes that the Eucalyptus spp. overstorey would regenerate without active (and
expensive) planting. Analyses of some related data have been published (Miles 2008; Vesk et al.
2016) but those analyses do not address the question analysts answered in our study. Data analysts
were provided this dataset and instructed to answer the following question: “How does grass cover
influence Eucalyptus spp. seedling recruitment?”
Researchers conducted three rounds of surveys at 18 sites across the Goulburn Broken catchment in
northern Victoria, Australia in winter and spring 2006 and autumn 2007. In each survey period, a
different set of 15 x 15 m quadrats was randomly allocated across each site within 60 m of existing
tree canopies. The number of quadrats at each site depended on the size of the site, ranging from
four at smaller sites to 11 at larger sites. The total number of quadrats surveyed across all sites and
seasons was 351. The number of Eucalyptus spp. seedlings was recorded in each quadrat along with
information on the GPS location, aspect, tree canopy cover, distance to tree canopy, and position in
the landscape. Ground layer plant species composition was recorded in three 0.5 x 0.5 m
sub-quadrats within each quadrat. Subjective cover estimates of each species as well as bare ground,
litter, rock and moss/lichen/soil crusts were recorded. Subsequently, this was augmented with
information about the precipitation and solar radiation at each GPS location. The full list of variables
included in the dataset is publicly available (https://osf.io/r5gbn), along with the data
(https://osf.io/qz5cu).

Step 2: Recruitment and initial survey of analysts
The lead team (TP, HF, SN, EG, SG, PV, DH, FF) created a publicly available document providing a
general description of the project (https://osf.io/mn5aj/). The project was advertised at conferences,
via Twitter, using mailing lists for ecological societies (including Ecolog, Evoldir, and lists for the
Environmental Decisions Group, and Transparency in Ecology and Evolution), and via word of mouth.
The target population was active ecology, conservation, or evolutionary biology researchers with a
graduate degree (or currently studying for a graduate degree) in a relevant discipline. Researchers
could choose to work independently or in a small team. For the sake of simplicity, we refer to these
as ‘analysis teams’ though some comprised one individual. We aimed for a minimum of 12 analysis
teams independently evaluating each dataset (see sample size justification below). We
simultaneously recruited volunteers to peer review the analyses conducted by the other volunteers
through the same channels. Our goal was to recruit a similar number of peer reviewers and analysts,
and to ask each peer reviewer to review a minimum of four analyses. If we were unable to recruit at
least half as many reviewers as analysis teams, we planned to ask analysts to serve also as
reviewers (after they had completed their analyses), but this was unnecessary. All analysts and
reviewers were offered the opportunity to share co-authorship on this manuscript and we planned to
invite them to participate in the collaborative process of producing the final manuscript. All analysts
signed [digitally] a consent (ethics) document (https://osf.io/xyp68/) approved by the Whitman
College Institutional Review Board prior to being allowed to participate.

Preregistration Deviation:
Due to the large number of recruited analysts and reviewers and the anticipated challenges of
receiving and integrating feedback from so many authors, we limited analyst and reviewer
participation in the production of the final manuscript to an invitation to call attention to serious
problems with the manuscript draft.
We identified our minimum number of analysts per dataset by considering the number of effects
needed in a meta-analysis to generate an estimate of heterogeneity (τ²) with a 95% confidence
interval that does not encompass zero. This minimum sample size is invariant regardless of τ². This is
because the same t-statistic value will be obtained by the same sample size regardless of variance
(τ²). We see this by first examining the formula for the standard error (SE) of the variance τ², SE(τ²),
assuming normality in an underlying distribution of effect sizes [30]:
$$SE(\tau^2) = \sqrt{\frac{2\tau^4}{n - 1}}$$
and then rearranging the above formula to show how the t-statistic is independent of τ², as seen
below.
$$t = \frac{\tau^2}{SE(\tau^2)} = \sqrt{\frac{n - 1}{2}}$$
We then find a minimum n = 12 according to this formula.
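
The independence of the t-statistic from τ² can be checked numerically. The Python sketch below evaluates the standard error formula above and the resulting ratio for several arbitrary values of τ² at n = 12. The function names and τ² values are assumptions made for illustration, and because this section does not state the critical value used to arrive at n = 12, the sketch stops at computing t.

```python
import math


def se_tau2(tau2, n):
    """Standard error of tau^2 under normality: sqrt(2 * tau^4 / (n - 1))."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return math.sqrt(2.0 * tau2 ** 2 / (n - 1))


def t_statistic(tau2, n):
    """Ratio tau^2 / SE(tau^2), which simplifies to sqrt((n - 1) / 2)."""
    return tau2 / se_tau2(tau2, n)


# Arbitrary illustrative values of tau^2; t should be the same for each.
for tau2 in (0.01, 0.5, 20.0):
    print(f"tau2 = {tau2:>5}: t = {t_statistic(tau2, 12):.3f}")
```

Each τ² gives the same t at n = 12, namely the square root of 11/2, so the required number of effects depends only on n and the chosen threshold.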
Step 3: Primary Data Analysis
Analysis teams registered and answered a demographic and expertise survey (https://osf.io/seqzy/).
We then provided them with the dataset of their choice and requested that they answer a specific
research question. For the evolutionary ecology dataset that question was “To what extent is the
growth of nestling blue tits (Cyanistes caeruleus) influenced by competition with siblings?” and for
the conservation ecology dataset it was “How does grass cover influence Eucalyptus spp. seedling
recruitment?” Once their analysis was complete, they answered a structured survey
(https://osf.io/neyc7/), providing analysis technique, explanations of their analytical choices,
quantitative results, and a statement describing their conclusions. They also were asked to upload
their analysis files (including the dataset as they formatted it for analysis and their analysis code [if
applicable]) and a detailed journal-ready statistical methods section.

Step 4: Peer Review of Analysis
At minimum, each analysis was evaluated by four different reviewers, and each volunteer peer
reviewer was randomly assigned methods sections from at least four analyst teams (the exact
number varied). Each peer reviewer registered and answered a demographic and expertise survey
identical to that asked of the analysts, except we did not ask about ‘team name’ since reviewers did
not work in teams. Reviewers evaluated the methods of each of their assigned analyses one at a time
in a sequence determined by the project leaders. We systematically assigned the sequence so that, if
possible, each analysis was allocated to each position in the sequence for at least one reviewer. For
instance, if each reviewer were assigned four analyses to review, then each analysis would be the
first analysis assigned to at least one reviewer, the second analysis assigned to another reviewer, the
third analysis assigned to yet another reviewer, and the fourth analysis assigned to a fourth reviewer.
Balancing the order in which reviewers saw the analyses controls for order effects, e.g. a reviewer
might be less critical of the first methods section they read than the last.
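
One simple way to obtain the balanced ordering described above is a cyclic rotation, in which each reviewer starts one position further along the same list. The Python sketch below is a minimal illustration under the assumption that a block of k analyses is read by the same k reviewers; the function name and analysis labels are hypothetical, and the actual assignment in this study also involved random allocation and a varying number of analyses per reviewer.

```python
def balanced_review_sequences(analyses):
    """Return one reading order per reviewer using a cyclic Latin square.

    Reviewer r reads the analyses rotated by r positions, so every analysis
    occupies every position in the sequence exactly once across reviewers.
    """
    k = len(analyses)
    return [[analyses[(r + p) % k] for p in range(k)] for r in range(k)]


# Hypothetical labels for a block of four analyses.
sequences = balanced_review_sequences(["A", "B", "C", "D"])
for reviewer, sequence in enumerate(sequences, start=1):
    print(f"reviewer {reviewer}: {sequence}")
```

With four analyses, analysis A is read first by one reviewer, fourth by another, third by another, and second by the last, so any tendency to rate early or late submissions differently is spread evenly across analyses.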
Preregistration Deviation:
We originally planned to have analysts complete a single survey (https://osf.io/neyc7/), but after
we evaluated the results of that survey, we realized we would need a second survey
(https://osf.io/8w3v5/) to adequately collect the information we needed to evaluate
heterogeneity of results (step 5). We provided a set of detailed instructions with the follow-up
survey, and these instructions are publicly available and can be found within the following files
(blue tit: https://osf.io/kr2g9, Eucalyptus: https://osf.io/dfvym).

The process for a single reviewer was as follows. First, the reviewer received a description of the
methods of a single analysis. This included the narrative methods section, the analysis team’s
answers to our survey questions regarding their methods, including analysis code, and the dataset.
The reviewer was then asked, in an online survey (https://osf.io/4t36u/), to rate that analysis on a
scale of 0-100 based on this prompt: “Rate the overall appropriateness of this analysis to answer the
research question (one of the two research questions inserted here) with the available data. To help
you calibrate your rating, please consider the following guidelines:

100: A perfect analysis with no conceivable improvements from the reviewer
75: An imperfect analysis but the needed changes are unlikely to dramatically alter outcomes
50: A flawed analysis likely to produce either an unreliable estimate of the relationship or an over-precise estimate of uncertainty
25: A flawed analysis likely to produce an unreliable estimate of the relationship and an over-precise estimate of uncertainty
0: A dangerously misleading analysis, certain to produce both an estimate that is wrong and a substantially over-precise estimate of uncertainty that places undue confidence in the incorrect estimate.
*Please note that these values are meant to calibrate your ratings. We welcome ratings of any number between 0 and 100.”

After providing this rating, the reviewer was presented with this prompt, in multiple-choice format:
“Would the analytical methods presented produce an analysis that is (a) publishable as is, (b)
publishable with minor revision, (c) publishable with major revision, (d) deeply flawed and
unpublishable?” The reviewer was then provided with a series of text boxes and the following
prompts: “Please explain your ratings of this analysis. Please evaluate the choice of statistical analysis
type. Please evaluate the process of choosing variables for and structuring the statistical model.
Please evaluate the suitability of the variables included in (or excluded from) the statistical model.
Please evaluate the suitability of the structure of the statistical model. Please evaluate choices to
exclude or not exclude subsets of the data. Please evaluate any choices to transform data (or, if there
were no transformations, but you think there should have been, please discuss that choice).” After
we recommend viewing this manuscript in html format at hps://egouldo.github.io/ManyAnalysts/
27
subming this review, a methods secon from a second analysis was then made available to the 716
reviewer. This same sequence was followed unl all analyses allocated to a given reviewer were 717
provided and reviewed. Aer providing the nal review, the reviewer was simultaneously provided 718
with all four (or more) methods secons the reviewer had just completed reviewing, the opon to 719
revise their original rangs, and a text box to provide an explanaon. The invitaon to revise the 720
original rangs was as follows: “If, now that you have seen all the analyses you are reviewing, you 721
wish to revise your rangs of any of these analyses, you may do so now.” The text box was prefaced 722
with this prompt: “Please explain your choice to revise (or not to revise) your rangs.” 723
724
Additional Explanation:
To determine how consistent peer reviewers were in their ratings, we assessed inter-rater reliability among reviewers for both the categorical and quantitative ratings combining blue tit and Eucalyptus data using Krippendorff’s alpha for ordinal and continuous data respectively. This provides a value that is between -1 (total disagreement between reviewers) and 1 (total agreement between reviewers).

Step 5: Evaluate Variation
The lead team conducted the analyses outlined in this section. We described the variation in model specification in several ways. We calculated summary statistics describing variation among analyses, including mean, SD, and range of number of variables per model included as fixed effects, the number of interaction terms, the number of random effects, and the mean, SD, and range of sample sizes. We also present the number of analyses in which each variable was included. We summarized the variability in standardized effect sizes and predicted values of dependent variables among the individual analyses using standard random effects meta-analytic techniques. First, we derived standardized effect sizes from each individual analysis. We did this for all linear models or generalized linear models by converting the t value and the degrees of freedom (df) associated with regression coefficients (e.g. the effect of the number of siblings [predictor] on growth [response] or the effect of grass cover [predictor] on seedling recruitment [response]) to the correlation coefficient (r), using the following:
$$ r = \frac{t}{\sqrt{t^{2} + df}} $$
This formula can only be applied if t and df values originate from linear or generalized linear models [GLMs; 31]. If, instead, linear mixed-effects models (LMMs) or generalized linear mixed-effects models (GLMMs) were used by a given analysis, the exact df cannot be estimated. However, adjusted df can be estimated, for example, using the Satterthwaite approximation of df, df_S [note that SAS uses this approximation to obtain df for LMMs and GLMMs; 32]. For analyses using either LMMs or GLMMs that do not produce df_S, we planned to obtain df_S by rerunning the same (G)LMMs using the lmer() or glmer() function in the lmerTest package in R [33, 34].
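
The conversion from a t value and its degrees of freedom to r can be illustrated with a minimal Python sketch. The function name `t_to_r` is hypothetical and is not taken from the authors' archived code; the sketch only restates the formula above and assumes df is positive.

```python
import math


def t_to_r(t, df):
    """Convert a t statistic and its degrees of freedom to a correlation r.

    Implements r = t / sqrt(t^2 + df). The sign of t is preserved, so a
    negative slope yields a negative r. Assumes df > 0.
    """
    return t / math.sqrt(t * t + df)


# Illustrative values only (not taken from the study)
r_example = t_to_r(-2.0, 96.0)
```

For a mixed model, the same function would be called with the Satterthwaite-approximated df in place of the exact df.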

Preregistration Deviation
Rather than re-run these analyses ourselves, we sent a follow-up survey (referenced above under “Primary data analyses”) to analysts and asked them to follow our instructions for producing this information. The instructions are publicly available and can be found within the following files (blue tit: https://osf.io/kr2g9, Eucalyptus: https://osf.io/dfvym).

We then used the t values and df_S from the models to obtain r as per the formula above. All r and accompanying df (or df_S) were converted to Zr and its sampling variance, 1/(n-3), where n = df + 1. Any analyses from which we could not derive a signed Zr, for instance one with a quadratic function in which the slope changed sign, were excluded from the analyses of Fisher’s Zr. We expected such analyses would be rare. In fact, most submitted analyses excluded from our meta-analysis of Zr were excluded because of a lack of sufficient information provided by the analyst team rather than due to the use of effects that could not be converted to Zr. Regardless, as we describe below, we generated a second set of standardized effects (predicted values) that could (in principle) be derived from any explanatory model produced by these data.
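
As a rough illustration of the step from r to Zr and its sampling variance, the following Python sketch uses Fisher's z-transformation (the inverse hyperbolic tangent) and the variance 1/(n-3) with n = df + 1, as stated above. The function names are hypothetical.

```python
import math


def r_to_zr(r):
    """Fisher's z-transformation of a correlation coefficient (-1 < r < 1)."""
    return math.atanh(r)


def zr_sampling_variance(df):
    """Sampling variance of Zr, 1 / (n - 3), with n = df + 1 as in the text.

    Requires df > 2 so that the variance is positive and finite.
    """
    n = df + 1
    return 1.0 / (n - 3)


# Illustrative values only (not taken from the study)
zr_example = r_to_zr(-0.2)
var_example = zr_sampling_variance(96.0)
```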
Besides Zr, which describes the strength of a relationship based on the amount of variation in a dependent variable explained by variation in an independent variable, we also examined differences in the shape of the relationship between the independent and dependent variables. To accomplish this, we derived a point estimate (out-of-sample predicted value) for the dependent variable of interest for each of three values of our primary independent variable. We originally described these three values as associated with the 25th percentile, median, and 75th percentile of the independent variable and any covariates.

Preregistration Deviation
The original description of the out-of-sample specifications did not account for the facts that (a) some variables are not distributed in a way that allowed division into percentiles and that (b) variables could be either positively or negatively correlated with the dependent variable. We provide a more thorough description here: We derived three point-estimates (out-of-sample predicted values) for the dependent variable of interest; one for each of three values of our primary independent variable that we specified. We also specified values for all other variables that could have been included as independent variables in analysts’ models so that we could derive the predicted values from a fully specified version of any model produced by analysts. For all potential independent variables, we selected three values or categories. Of the three we selected, one was associated with small, one with intermediate, and one with large values of one typical dependent variable (day 14 chick weight for the blue tit data and total number of seedlings for the Eucalyptus data; analysts could select other variables as their dependent variable, but the others typically correlated with the two identified here). For continuous variables, this means we identified the 25th percentile, median, and 75th percentile and, if the slope of the linear relationship between this variable and the typical dependent variable was positive, we left the quartiles ordered as is. If, instead, the slope was negative, we reversed the order of the independent variable quartiles so that the ‘lower’ quartile value was the one associated with the lower value for the dependent variable. In the case of categorical variables, we identified categories associated with the 25th percentile, median, and 75th percentile values of the typical dependent variable after averaging the values for each category. However, for some continuous and categorical predictors, we also made selections based on the principle of internal consistency between certain related variables, and we fixed a few categorical variables as identical across all three levels where doing so would simplify the modelling process (specification tables available: blue tit: https://osf.io/86akx; Eucalyptus: https://osf.io/jh7g5).
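
The ordering rule for continuous predictors described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used to build the specification tables: the function name is hypothetical, the "inclusive" quantile method is an assumed choice of percentile definition, and the sign of the slope is taken from the sign of the sample cross-product between predictor and response.

```python
import statistics


def ordered_quartiles(x, y):
    """Return the (small, intermediate, large) scenario values of predictor x.

    The quartiles of x are kept in ascending order when x and y are
    positively related and reversed when the relationship is negative, so
    the first value is always the one associated with lower values of y.
    """
    q1, q2, q3 = statistics.quantiles(x, n=4, method="inclusive")
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    # The sign of this cross-product equals the sign of the least-squares slope
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (q1, q2, q3) if cross >= 0 else (q3, q2, q1)


# Hypothetical predictor that is negatively related to the response
scenario_values = ordered_quartiles([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
```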
We used the 25th and 75th percentiles rather than minimum and maximum values to reduce the chance of occupying unrealistic parameter space. We planned to derive these predicted values from the model information provided by the individual analysts. All values (predictions) were first transformed to the original scale along with their standard errors (SE); we used the delta method (Ver Hoef 2012) for the transformation of SE. We used the square of the SE associated with predicted values as the sampling variance in the meta-analyses described below, and we planned to analyze these predicted values in exactly the same ways as we analyzed Zr in the following analyses.

Preregistration Deviation
Because analysts of blue tit data chose different dependent variables on different scales, after transforming out-of-sample values to the original scales, we standardized all values as z scores (‘standard scores’) to put all dependent variables on the same scale and make them comparable. This involved taking each relevant value on the original scale (whether a predicted point estimate or a SE associated with that estimate) and subtracting the value in question from the mean value of that dependent variable derived from the full dataset and then dividing this difference by the standard deviation, SD, corresponding to the mean from the full dataset. Thus, all our out-of-sample prediction values from the blue tit data are from a distribution with the mean of 0 and SD of 1. We did not add this step for the Eucalyptus data because (a) all responses were on the same scale (counts of Eucalyptus stems) and were thus comparable and (b) these data, with many zeros and high skew, are poorly suited for z scores.

We plotted individual effect size estimates (Zr) and predicted values of the dependent variable (yi) and their corresponding 95% confidence / credible intervals in forest plots to allow visualization of the range and precision of effect size and predicted values. Further, we included these estimates in random effects meta-analyses [36, 37] using the metafor package in R [34, 38]:
Zr ~ 1 + 1|analysisId
yi ~ 1 + 1|analysisId
where yi is the predicted value for the dependent variable at the 25th percentile, median, or 75th percentile of the independent variables. The individual Zr effect sizes were weighted with the inverse of sampling variance for Zr. The individual predicted values for the dependent variable (yi) were weighted by the inverse of the associated SE² (original registration omitted “inverse of the” in error). These analyses provided an average Zr score or an average yi with corresponding 95% confidence interval and allowed us to estimate two heterogeneity indices, τ² and I². The former, τ², is the absolute measure of heterogeneity or the between-study variance (in our case, between-effect variance) whereas I² is a relative measure of heterogeneity. We obtained the estimate of relative heterogeneity (I²) by dividing the between-effect variance by the sum of between-effect and within-effect variance (sampling error variance). I² is thus, in a standard meta-analysis, the proportion of variance that is due to heterogeneity as opposed to sampling error. When calculating I², within-study variance is amalgamated across studies to create a “typical” within-study variance which serves as the sampling error variance [36, 37]. Our goal here was to visualize and quantify the degree of variation among analyses in effect size estimates [31]. We did not test for statistical significance.
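
The heterogeneity quantities described above can be sketched in Python. This is not the authors' metafor code: for simplicity it swaps in the DerSimonian-Laird moment estimator of τ², whereas metafor defaults to REML, so numerical results would differ somewhat. I² is computed as τ² divided by the sum of τ² and a "typical" within-effect variance. The function name and the example inputs are hypothetical.

```python
import math


def random_effects_meta(effects, variances):
    """Random-effects summary with DerSimonian-Laird tau^2 and I^2.

    effects   : effect estimates (e.g. Zr values), one per analysis
    variances : their sampling variances (e.g. 1 / (n - 3) for Zr)
    Requires at least two effects.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance weights
    sum_w = sum(w)
    sum_w2 = sum(wi * wi for wi in w)

    # Fixed-effect mean and Cochran's Q
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))

    # Between-effect variance, truncated at zero
    c = sum_w - sum_w2 / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # "Typical" within-effect (sampling error) variance
    typical_v = (k - 1) * sum_w / (sum_w ** 2 - sum_w2)
    i2 = tau2 / (tau2 + typical_v)

    # Random-effects weighted mean and its standard error
    w_re = [1.0 / (v + tau2) for v in variances]
    mean_re = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se_re = math.sqrt(1.0 / sum(w_re))
    return {"mean": mean_re, "se": se_re, "tau2": tau2, "I2": i2}


# Three hypothetical effects with equal sampling variance
result = random_effects_meta([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])
```

With these made-up inputs the between-effect variance is three times the typical sampling variance, so three quarters of the total variance is attributed to heterogeneity.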

Additional Explanation
Our use of I² to quantify heterogeneity violates an important assumption, but this violation does not invalidate our use of I² as a metric of how much heterogeneity can derive from analytical decisions. In standard meta-analysis, the statistic I² quantifies the proportion of variance that is greater than we would expect if differences among estimates were due to sampling error alone [39]. However, it is clear that this interpretation does not apply to our value of I² because I² assumes that each estimate is based on an independent sample (although these analyses can account for non-independence via hierarchical modelling), whereas all our effects were derived from largely or entirely overlapping subsets of the same dataset. Despite this, we believe that I² remains a useful statistic for our purposes. This is because, in calculating I², we are still setting a benchmark of expected variation due to sampling error based on the variance associated with each separate effect size estimate, and we are assessing how much (if at all) the variability among our effect sizes exceeds what would be expected had our effect sizes been based on independent data. In other words, our estimates can tell us how much proportional heterogeneity is possible from analytical decisions alone when sample sizes (and therefore meta-analytic within-estimate variance) are similar to the ones in our analyses. Among other implications, our violation of the independent sample assumption means that we (dramatically) over-estimate the variance expected due to sampling error, and because I² is a proportional estimate, we thus underestimate the actual proportion of variance due to differences among analyses other than sampling error. However, correcting this underestimation would create a trivial value since we designed the study so that much of the variance would derive from analytic decisions as opposed to differences in sampled data. Instead, retaining the I² value as typically calculated provides a useful comparison to I² values from typical meta-analyses.
Interpretation of τ² also differs somewhat from traditional meta-analysis, and we discuss this further in the Results.

Finally, we assessed the extent to which deviations from the meta-analytic mean by individual effect sizes (Zr) or the predicted values of the dependent variable (yi) were explained by the peer rating of each analysis team’s method section, by a measurement of the distinctiveness of the set of predictor variables included in each analysis, and by the choice of whether or not to include random effects in the model. The deviation score, which served as the dependent variable in these analyses, is the absolute value of the difference between the meta-analytic mean Zr (or yi) and the individual Zr (or yi) estimate for each analysis. We used the Box-Cox transformation on the absolute values of deviation scores to achieve an approximately normal distribution [c.f. 40, 41]. We described variation in this dependent variable with both a series of univariate analyses and a multivariate analysis. All these analyses were general linear (mixed) models. These analyses were secondary to our estimation of variation in effect sizes described above. We wished to quantify relationships among variables, but we had no a priori expectation of effect size and made no dichotomous decisions about statistical significance.
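
A minimal Python sketch of the deviation score follows. It assumes the Box-Cox parameter λ is supplied by the caller, whereas the R function the authors report using selects λ from the data; the function names and the example numbers are hypothetical.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def deviation_scores(estimates, meta_mean, lam):
    """Box-Cox transformed absolute deviations from the meta-analytic mean.

    lam is an assumed, user-supplied value in this sketch. An estimate that
    equals the mean exactly has zero deviation and cannot be transformed.
    """
    return [box_cox(abs(e - meta_mean), lam) for e in estimates]


# Hypothetical Zr estimates around a hypothetical meta-analytic mean of -0.3
scores = deviation_scores([-0.5, 0.1], -0.3, 0.5)
```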
When examining the extent to which reviewer ratings (on a scale from 0 to 100) explained deviation from the average effect (or predicted value), each analysis had been rated by multiple peer reviewers, so for each reviewer score to be included, we include each deviation score in the analysis multiple times. To account for the non-independence of multiple ratings of the same analysis, we planned to include analysis identity as a random effect in our general linear mixed model in the lme4 package in R [34, 42]. To account for potential differences among reviewers in their scoring of analyses, we also planned to include reviewer identity as a random effect:
DeviationScore_j = BoxCox(abs(DeviationFromMean_j))
DeviationScore_ij ~ Rating_ij + ReviewerID_i + AnalysisID_j
ReviewerID_i ~ N(0, σ²)
AnalysisID_j ~ N(0, σ²)
where DeviationFromMean_j is the deviation from the meta-analytic mean for the jth analysis, ReviewerID_i is the random intercept assigned to each reviewer i, and AnalysisID_j is the random intercept assigned to each analysis j, both of which are assumed to be normally distributed with a mean of 0 and a variance of σ². Absolute deviation scores were Box-Cox transformed using the step_box_cox() function from the timetk package in R [34, 43].
We conducted a similar analysis with the four categories of reviewer ratings ((1) deeply flawed and unpublishable, (2) publishable with major revision, (3) publishable with minor revision, (4) publishable as is) set as ordinal predictors numbered as shown here. As with the analyses above, we planned for these analyses to also include random effects of analysis identity and reviewer identity. Both of these analyses (1: 0-100 ratings as the fixed effect, 2: categorical ratings as the fixed effects) were planned to be conducted eight times for each dataset. Each of the four responses (Zr, y25, y50, y75) was to be compared once to the initial ratings provided by the peer reviewers, and again based on the revised ratings provided by the peer reviewers.

Preregistration Deviation
1. We planned to include random effects of both analysis identity and reviewer identity in these models comparing reviewer ratings with deviation scores. However, after we received the analyses, we discovered that a subset of analyst teams had conducted multiple analyses and/or identified multiple effects per analysis as answering the target question. We therefore faced an even more complex potential set of random effects. We decided that including team ID, analysis ID, and effect ID along with reviewer ID as random effects in the same model would almost certainly lead to model fit problems, and so we started with simpler models including just effect ID and reviewer ID. However, even with this simpler structure, our dataset was sparse, with reviewers rating a small number of analyses, resulting in models with singular fit (Section C.2). Removing one of the random effects was necessary for the models to converge. The models that included the categorical quality rating converged when including reviewer ID, and the models that included the continuous quality rating converged when including effect ID.
2. We conducted analyses only with the final peer ratings after the opportunity for revision, not with the initial ratings. This was because when we recorded the final ratings, they over-wrote the initial ratings, and so we did not have access to those initial values.

The next set of univariate analyses sought to explain deviations from the mean effects based on a measure of the distinctiveness of the set of variables included in each analysis. As a ‘distinctiveness’ score, we used Sorensen’s Similarity Index (an index typically used to compare species composition across sites), treating variables as species and individual analyses as sites. To generate an individual Sorensen’s value for each analysis required calculating the pairwise Sorensen’s value for all pairs of analyses (of the same dataset), and then taking the average across these Sorensen’s values for each analysis. We calculated the Sorensen’s index values using the betapart package [44] in R:
$$ \beta_{Sorensen} = \frac{b + c}{2a + b + c} $$
where a is the number of variables common to both analyses, b is the number of variables that occur in the first analysis but not in the second, and c is the number of variables that occur in the second analysis but not in the first. We then used the per-model average Sorensen’s index value as an independent variable to predict the deviation score in a general linear model, and included no random effect since each analysis is included only once, in R [34]:
DeviationScore ~ βSorensen
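
The per-analysis averaging of pairwise Sorensen values can be sketched in Python with plain sets, as an illustration of the formula rather than a reimplementation of the betapart package. The analysis labels and variable names below are hypothetical. Note that, as the formula is written, identical variable sets give a value of 0 and disjoint sets give 1.

```python
from itertools import combinations


def sorensen(vars_first, vars_second):
    """Pairwise Sorensen value (b + c) / (2a + b + c) for two sets of variables.

    Undefined (division by zero) if both sets are empty.
    """
    a = len(vars_first & vars_second)   # variables shared by both analyses
    b = len(vars_first - vars_second)   # only in the first analysis
    c = len(vars_second - vars_first)   # only in the second analysis
    return (b + c) / (2 * a + b + c)


def mean_sorensen(analyses):
    """Average pairwise Sorensen value for each analysis.

    analyses maps an analysis label to the set of variables in its model.
    Requires at least two analyses.
    """
    totals = {label: 0.0 for label in analyses}
    for first, second in combinations(analyses, 2):
        value = sorensen(analyses[first], analyses[second])
        totals[first] += value
        totals[second] += value
    n_others = len(analyses) - 1
    return {label: total / n_others for label, total in totals.items()}


# Hypothetical analyses and predictor names
example = {
    "analysis_A": {"brood_size", "sex"},
    "analysis_B": {"brood_size", "year"},
    "analysis_C": {"brood_size", "sex"},
}
averages = mean_sorensen(example)
```

In this toy example analysis_B receives the largest average because it is the only model that includes year and omits sex.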

Additional Explanation
When we planned this analysis, we anticipated that analysts would identify a single primary effect from each model, so that each model would appear in the analysis only once. Our expectation was incorrect because some analysts identified >1 effect per analysis, but we still chose to specify our model as registered and not use a random effect. This is because most models produced only one effect and so we expected that specifying a random effect to account for the few cases where >1 effect was included for a given model would prevent model convergence.
Note that this analysis contrasts with the analyses in which we used reviewer ratings as predictors because in the analyses with reviewer ratings, each effect appeared in the analysis approximately four times due to multiple reviews of each analysis, and so it was much more important to account for that variance through a random effect.

Finally, we conducted a multivariate analysis with the five predictors described above (peer ratings 0-100 and peer ratings of publishability 1-4; both original and revised and Sorensen’s index, plus a sixth, presence/absence of random effects) with random effects of analysis identity and reviewer identity in the lme4 package in R [34, 42]. We had stated here in the text that we would use only the revised (final) peer ratings in this analysis, so the absence of the initial ratings is not a deviation from our plan:
DeviationScore_j ~ RatingsContinuous_ij + RatingsCategorical_ij + βSorensen_j + AnalysisID_j + ReviewerID_i
ReviewerID_i ~ N(0, σ²)
AnalysisID_j ~ N(0, σ²)
We conducted all the analyses described above eight times: once for each of the four responses (Zr, y25, y50, y75) for each of the two datasets.
We have publicly archived all relevant data, code, and materials on the Open Science Framework (https://osf.io/mn5aj/). Archived data includes the original datasets distributed to all analysts, any edited versions of the data analyzed by individual groups, and the data we analyzed with our meta-analyses, which include the effect sizes derived from separate analyses, the statistics describing variation in model structure among analyst groups, and the anonymized answers to our surveys of analysts and peer reviewers. Similarly, we have archived both the analysis code used for each individual analysis (where available) and the code from our meta-analyses. We have also archived copies of our survey instruments from analysts and peer reviewers.
Our rules for excluding data from our study were as follows. We excluded from our synthesis any individual analysis submitted after we had completed peer review or those unaccompanied by analysis files that allow us to understand what the analysts did. We also excluded any individual analysis that did not produce an outcome that could be interpreted as an answer to our primary question (as posed above) for the respective dataset. For instance, this means that in the case of the data on blue tit chick growth, we excluded any analysis that did not include something that can be interpreted as growth or size as a dependent (response) variable, and in the case of the Eucalyptus establishment data, we excluded any analysis that did not include a measure of grass cover among the independent (predictor) variables. Also, as described above, any analysis that could not produce an effect that could be converted to a signed Zr was excluded from analyses of Zr.

Preregistration Deviation
Some analysts had difficulty implementing our instructions to derive the out-of-sample predictions, and in some cases (especially for the Eucalyptus data), they submitted predictions with implausibly extreme values. We believed these values were incorrect and thus made the conservative decision to exclude out-of-sample predictions where the estimates were > 3 standard deviations from the mean value from the full dataset.
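
The exclusion rule above amounts to a simple filter, sketched here in Python. The function name, the default threshold argument, and the example numbers are hypothetical; the mean and SD are assumed to come from the full dataset rather than from the submitted predictions.

```python
def within_three_sd(predictions, full_mean, full_sd, threshold=3.0):
    """Keep predictions that lie no more than `threshold` SDs from the mean.

    full_mean and full_sd describe the dependent variable in the full
    dataset. Predictions farther away than threshold * full_sd are dropped.
    """
    limit = threshold * full_sd
    return [p for p in predictions if abs(p - full_mean) <= limit]


# Made-up predictions checked against a made-up mean of 10 and SD of 10
kept = within_three_sd([10, 25, -100, 40.1], full_mean=10, full_sd=10)
```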

Additional Explanation
1. Evaluating model fit.
We evaluated all fitted models using the performance() function from the performance package [45] and the glance() function from the broom.mixed package [46]. For all models, we calculated the square root of the residual variance (Sigma) and the root mean squared error (RMSE). For GLMMs, performance() calculates the marginal and conditional R² values as well as the contribution of random effects (ICC), based on Nakagawa et al. [47]. The conditional R² accounts for both the fixed and random effects, while the marginal R² considers only the variance of the fixed effects. The contribution of random effects is obtained by subtracting the marginal R² from the conditional R².
2. Exploring outliers and analysis quality.
After seeing the forest plots of Zr values and noticing the existence of a small number of extreme outliers, especially from the Eucalyptus analyses, we wanted to understand the degree to which our heterogeneity estimates were influenced by these outliers. To explore this question, we removed the highest two and lowest two values of Zr in each dataset and re-calculated our heterogeneity estimates.
To help understand the possible role of the quality of analyses in driving the heterogeneity we observed among estimates of Zr, we recalculated our heterogeneity estimates after removing all effects from analysis teams that had received at least one rating of “deeply flawed and unpublishable” and then again after removing all effects from analysis teams with at least one rating of either “deeply flawed and unpublishable” or “publishable with major revisions”. We also used self-identified levels of statistical expertise to examine heterogeneity when we retained analyses only from analysis teams that contained at least one member who rated themselves as “highly proficient” or “expert” (rather than “novice” or “moderately proficient”) in conducting statistical analyses in their research area in our intake survey.
3. Exploring possible impacts of lower quality estimates of degrees of freedom.
Our meta-analyses of variation in Zr required variance estimates derived from estimates of the degrees of freedom in original analyses from which Zr estimates were derived. While processing the estimates of degrees of freedom submitted by analysts, we identified a subset of these estimates in which we had lower confidence because two or more effects from the same analysis were submitted with identical degrees of freedom. We therefore conducted a second set of (more conservative) meta-analyses that excluded these Zr estimates with identical estimates of degrees of freedom and we present these analyses in the supplement.

Step 6: Facilitated Discussion and Collaborative Write-Up of Manuscript
We planned for analysts and initiating authors to discuss the limitations, results, and implications of the study and collaborate on writing the final manuscript for review as a stage-2 Registered Report.

Preregistration Deviation
As described above, due to the large number of recruited analysts and reviewers and the anticipated challenges of receiving and integrating feedback from so many authors, we limited analyst and reviewer participation in the production of the final manuscript to an invitation to call attention to serious problems with the manuscript draft.

Results
Summary Statistics
In total, 173 analyst teams, comprising 246 analysts, contributed 182 usable analyses of the two datasets examined in this study which yielded 215 effects. Analysts produced 135 distinct effects that met our criteria for inclusion in at least one of our meta-analyses for the blue tit dataset. Analysts produced 81 distinct effects meeting our criteria for inclusion for the Eucalyptus dataset. Excluded analyses and effects either did not answer our specified biological questions, were submitted with insufficient information for inclusion in our meta-analyses, or were incompatible with production of our effect size(s). We expected this final scenario (incompatible analyses); for instance, we cannot extract a Zr from random forest models, which is why we analyzed two distinct types of effects, Zr and out-of-sample predictions (yi). Effects included in only a subset of our meta-analyses provided sufficient information for inclusion in only that subset (see Table A.1). For both datasets, most submitted analyses incorporated mixed effects. Submitted analyses of the blue tit dataset typically specified normal error and analyses of the Eucalyptus dataset typically specified a non-normal error distribution (Supplementary Table A.1).

For both datasets, the composition of models varied substantially with regard to the number of fixed and random effects, interaction terms, and the number of data points used, and these patterns differed somewhat between the blue tit and Eucalyptus analyses (see Supplementary Table A.2).

Focusing on the models included in the Zr analyses (because this is the larger sample), blue tit models included a similar number of fixed effects on average (mean 5.2 ± 2.92 SD) as Eucalyptus models (mean 5.01 ± 3.83 SD), but the standard deviation in number of fixed effects was somewhat larger in the Eucalyptus models. The average number of interaction terms was much larger for the blue tit models (mean 0.44 ± 1.11 SD) than for the Eucalyptus models (mean 0.16 ± 0.65 SD), but still under 0.5 for both, indicating that most models did not contain interaction terms. Blue tit models also contained more random effects (mean 3.53 ± 2.08 SD) than Eucalyptus models (mean 1.41 ± 1.09 SD). The maximum possible sample size in the blue tit dataset (3720 nestlings) was an order of magnitude larger than the maximum possible in the Eucalyptus dataset (351 plots), and the means and standard deviations of the sample size used to derive the effects eligible for our study were also an order of magnitude greater for the blue tit dataset (mean 2622.07 ± 939.28 SD) relative to the Eucalyptus models (mean 298.43 ± 106.25 SD). However, the standard deviation in sample size from the Eucalyptus models was heavily influenced by a few cases of dramatic sub-setting (described below). Approximately three quarters of Eucalyptus models used sample sizes within 3% of the maximum. In contrast, fewer than 20% of blue tit models relied on sample sizes within 3% of the maximum, and approximately 50% of blue tit models relied on sample sizes 29% or more below the maximum.
Analysts provided qualitative descriptions of the conclusions of their analyses. Each analysis team provided one conclusion per dataset. These conclusions could take into account the results of any formal analyses completed by the team as well as exploratory and visual analyses of the data. Here we summarize all qualitative responses, regardless of whether we had sufficient information to use the corresponding model results in our quantitative analyses below. We classified these conclusions into the categories summarized below (Table 1):

Mixed: some evidence supporting a positive effect, some evidence supporting a negative effect
Conclusive negative: negative relationship described without caveat
Qualified negative: negative relationship but only in certain circumstances or where analysts express uncertainty in their result
Conclusive none: analysts interpret the results as conclusive of no effect
None qualified: analysts describe finding no evidence of a relationship but they describe the potential for an undetected effect
Qualified positive: positive relationship described but only in certain circumstances or where analysts express uncertainty in their result
Conclusive positive: positive relationship described without caveat

For the blue tit dataset, most analysts concluded that there was a negative relationship between
measures of sibling competition and nestling growth, though half the teams expressed qualifications
or described effects as mixed or absent. For the Eucalyptus dataset, there was a broader spread of
conclusions with at least one analyst team providing conclusions consistent with each conclusion
category. The most common conclusion for the Eucalyptus dataset was that there was no
relationship between grass cover and Eucalyptus recruitment (either conclusive or qualified
description of no relationship), but more than half the teams concluded that there were effects:
negative, positive, or mixed.
Table 1: Tallies of analysts’ qualitative answers to the research questions addressed by their analyses.

| Dataset | Mixed | Negative Conclusive | Negative Qualified | None Conclusive | None Qualified | Positive Qualified | Positive Conclusive |
|---|---|---|---|---|---|---|---|
| blue tit | 5 | 37 | 27 | 4 | 1 | 0 | 0 |
| Eucalyptus | 8 | 6 | 12 | 19 | 12 | 4 | 2 |
Distribution of Effects
Effect Size Zr
Although the majority (111 of 132) of the usable Zr effects from the blue tit dataset found nestling
growth decreased with sibling competition, and the meta-analytic mean Zr (Fisher’s transformation
of the correlation coefficient) was convincingly negative (-0.35 ± 0.06 95% CI), there was substantial
variability in the strength and the direction of this effect. Zr ranged approximately continuously from
-0.93 to 0.19 (Figure 1a and Table 4), and of the 111 effects with negative slopes, 92 had confidence
intervals excluding 0. Of the 20 with positive slopes indicating increased nestling growth in the
presence of more siblings, 3 had confidence intervals excluding zero (Figure 1a).
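As a minimal illustration of the quantities discussed above, the following Python sketch computes Fisher's Zr from a correlation coefficient and checks whether an approximate 95% confidence interval excludes zero. The sampling variance 1/(n - 3) is the textbook approximation and is assumed here for illustration only; the function names and the example values (r = -0.30, n = 103) are hypothetical and are not taken from the submitted analyses.

```python
import math


def fisher_zr(r: float) -> float:
    """Fisher's z-transformation of a correlation coefficient r, with |r| < 1."""
    return math.atanh(r)


def zr_confidence_interval(zr: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate 95% CI for Zr.

    Assumes the textbook sampling variance 1 / (n - 3); this is an
    illustrative assumption, not necessarily the variance used in the study.
    """
    se = 1.0 / math.sqrt(n - 3)
    return zr - z_crit * se, zr + z_crit * se


def excludes_zero(ci: tuple[float, float]) -> bool:
    """True when the whole interval lies on one side of zero."""
    lower, upper = ci
    return lower > 0 or upper < 0


# Hypothetical example: r = -0.30 estimated from n = 103 observations.
zr = fisher_zr(-0.30)
ci = zr_confidence_interval(zr, n=103)
print(round(zr, 3), tuple(round(b, 3) for b in ci), excludes_zero(ci))
```

An effect counted above as having a confidence interval "excluding zero" corresponds to `excludes_zero` returning `True` for that effect's interval.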
Meta-analysis of the Eucalyptus dataset also showed substantial variability in the strength of effects
as measured by Zr, and unlike with the blue tits, a notable lack of consistency in the direction of
effects (Figure 1b, Table 4). Zr ranged from -4.47 (Supplementary Figure A.2), indicating a strong
tendency for reduced Eucalyptus seedling success as grass cover increased, to 0.39, indicating the
opposite. Although the range of reported effects skewed strongly negative, this was due to a small
number of substantial outliers. Most values of Zr were relatively small, with absolute values < 0.2, and the
meta-analytic mean effect size was close to zero (-0.09 ± 0.12 95% CI). Of the 79 effects, fifty-three
had confidence intervals overlapping zero, approximately a quarter (fifteen) crossed the traditional
threshold of statistical significance indicating a negative relationship between grass cover and
seedling success, and eleven crossed the significance threshold indicating a positive relationship
between grass cover and seedling success (Figure 1b).
Figure 1: Forest plots of meta-analytic estimated standardized effect sizes (Zr) and their 95%
confidence intervals for each effect size included in the meta-analysis model for a) blue tit and b)
Eucalyptus. The meta-analytic mean effect size is noted in black and as a dashed vertical line, with
error bars also representing the 95% confidence interval. The solid black vertical line demarcates
an effect size of 0, indicating no relationship between the test variable and the response variable. Note
that the Eucalyptus plot omits one extreme outlier with the value of -4.47 (Figure A.2) in order to
standardize the x-axes on these two panels.
Out-of-sample predictions (yi)
As with the effect size Zr, we observed substantial variability in the size of out-of-sample predictions
derived from the analysts’ models. Blue tit predictions (Figure 2a), which were z-score-standardised
to accommodate the use of different response variables, always ranged far in excess of one standard
deviation. In the y25 scenario, model predictions ranged from -1.85 to 0.42 (a range of 2.68 standard
deviations), in the y50 scenario, they ranged from -0.53 to 1.11 (a range of 1.63 standard deviations),
and in the y75 scenario they ranged from -0.03 to 1.58 (a range of 1.9 standard deviations). As should
be expected given the existence of both negative and positive Zr values, all three out-of-sample
scenarios produced both negative and positive predictions, although as with the Zr values, there is a
clear trend for scenarios with more siblings to be associated with smaller nestlings. This is supported
by the meta-analytic means of these three sets of predictions which were -0.66 (95% CI -0.82, -0.5)
for the y25, 0.34 (95% CI 0.2-0.48) for the y50, and 0.67 (95% CI 0.57-0.77) for the y75.
Eucalyptus out-of-sample predictions also varied substantially (Figure 2b), but because they were not
z-score-standardised and are instead on the original count scale, the types of interpretations we can
make differ. The predicted Eucalyptus seedling counts per 15 x 15 m plot for the y25 scenario ranged
from 0.04 to 33.66, for the y50 scenario ranged from 0.03 to 13.02, and for the y75 scenario they
ranged from 0.05 to 21.93. The meta-analytic mean predictions for these three scenarios were
similar: 0.58 (95% CI 0.21-1.37) for the y25, 0.92 (95% CI 0.36-1.65) for the y50, and 1.67 (95% CI 0.8-2.83)
for the y75 scenarios respectively.
Figure 2: Forest plots of meta-analytic estimated out-of-sample predictions, yi, for a) blue tit
(z-score standardized) and b) Eucalyptus. Triangles represent individual estimates, circles
represent the meta-analytic mean for each prediction scenario. Error bars are 95% confidence
intervals.
Quantifying Heterogeneity
Effect Size (Zr)
We quantified both absolute (τ²) and relative (I²) heterogeneity resulting from analytical variation.
Both measures suggest that substantial variability among effect sizes was attributable to the
analytical decisions of analysts.
The total absolute level of variance beyond what would typically be expected due to sampling error,
τ² (Table 2), among all usable blue tit effects was 0.088 and for Eucalyptus effects was 0.267. This is
similar to or exceeding the median value (0.105) of τ² found across 31 recent meta-analyses
(calculated from the data in [48]). The similarity of our observed values to values from meta-analyses
of different studies based on different data suggests the potential for a large portion of heterogeneity
to arise from analytical decisions. For further discussion of interpretation of τ² in our study, please
consult discussion of post hoc analyses below.
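To make the meaning of τ² concrete, the sketch below estimates between-effect variance with the DerSimonian-Laird moment estimator for a simple, single-level random-effects model. This is a deliberately simplified stand-in: the models summarised in Table 2 include random effects for Team and EffectID, which this sketch does not reproduce, and the function name and input values are hypothetical.

```python
def dersimonian_laird_tau2(effects: list[float], variances: list[float]) -> float:
    """Moment estimate of between-effect variance (tau^2), single-level model.

    effects   -- effect sizes (for example Zr values)
    variances -- their sampling variances
    """
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    sum_w = sum(weights)
    # Weighted (fixed-effect) mean of the effects.
    mean_fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from that mean.
    q = sum(w * (y - mean_fixed) ** 2 for w, y in zip(weights, effects))
    k = len(effects)
    c = sum_w - sum(w * w for w in weights) / sum_w
    # Variance in excess of sampling error, truncated at zero.
    return max(0.0, (q - (k - 1)) / c)


# Hypothetical inputs: three effects with equal sampling variance.
print(dersimonian_laird_tau2([-0.5, -0.1, 0.3], [0.01, 0.01, 0.01]))
```

When all effects agree to within sampling error, Q does not exceed k - 1 and the estimate is truncated to zero, which is the sense in which τ² measures variance beyond sampling error.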
Table 2: Heterogeneity in the estimated effects Zr for meta-analyses of the full dataset, as well as
from post hoc analyses including the dataset with outliers removed, the dataset excluding effects
from analysis teams with at least one “unpublishable” rating, the dataset excluding effects from
analysis teams with at least one “major revisions” rating or worse, or the dataset including only
analyses from teams in which at least one analyst rated themselves as "highly proficient" or "expert"
in statistical analysis. τ²_Team is the absolute heterogeneity for the random effect Team, τ²_EffectID is
the absolute heterogeneity for the random effect EffectID, nested under Team, and τ²_Total is the total
absolute heterogeneity. I²_Total is the proportional heterogeneity; the proportion of the variance
among effects not attributable to sampling error, I²_Team is the subset of the proportional
heterogeneity due to differences among Teams and I²_Team,EffectID is the subset of the proportional
heterogeneity attributable to among-EffectID differences.
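| Subset | Dataset | τ²_Total | τ²_Team | τ²_EffectID | I²_Total | I²_Team | I²_Team,EffectID | N. Obs |
|---|---|---|---|---|---|---|---|---|
| All analyses | blue tit | 0.09 | 0.04 | 0.05 | 97.732% | 40.11% | 57.63% | 131 |
| All analyses | Eucalyptus | 0.27 | 0.02 | 0.25 | 98.589% | 6.88% | 91.71% | 79 |
| All analyses, outliers removed | blue tit | 0.07 | 0.05 | 0.02 | 97.030% | 66.90% | 30.13% | 127 |
| All analyses, outliers removed | Eucalyptus | 0.01 | 0.00 | 0.01 | 66.193% | 19.27% | 46.93% | 75 |
| Analyses receiving at least one 'Unpublishable' rating removed | blue tit | 0.08 | 0.03 | 0.05 | 97.601% | 38.10% | 59.50% | 109 |
| Analyses receiving at least one 'Unpublishable' rating removed | Eucalyptus | 0.01 | 0.01 | 0.01 | 79.741% | 28.32% | 51.42% | 55 |
| Analyses receiving at least one 'Unpublishable' and/or 'Major Revisions' rating removed | blue tit | 0.14 | 0.01 | 0.13 | 98.718% | 5.17% | 93.55% | 32 |
| Analyses receiving at least one 'Unpublishable' and/or 'Major Revisions' rating removed | Eucalyptus | 0.03 | 0.03 | 0.00 | 88.915% | 88.91% | 0.00% | 13 |
| Analyses from teams that include highly proficient or expert data analysts | blue tit | 0.10 | 0.04 | 0.06 | 98.058% | 36.27% | 61.78% | 89 |
| Analyses from teams that include highly proficient or expert data analysts | Eucalyptus | 0.58 | 0.02 | 0.56 | 99.412% | 3.49% | 95.93% | 34 |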
In our analyses, I² is a plausible index of how much more variability among effect sizes we have
observed, as a proportion, than we would have observed if sampling error were driving variability.
We discuss our interpretation of I² further in the methods, but in short, it is a useful metric for
comparison to values from published meta-analyses and provides a plausible value for how much
heterogeneity could arise in a normal meta-analysis with similar sample sizes due to analytical
variability alone. In our study, total I² for the blue tit Zr estimates was extremely large, at 97.73%, as
was the Eucalyptus estimate (98.59%, Table 2).
Although the overall I² values were similar for both Eucalyptus and blue tit analyses, the relative
composition of that heterogeneity differed. For both datasets, the majority of heterogeneity in Zr
was driven by differences among effects as opposed to differences among teams, though this was
more prominent for the Eucalyptus dataset, where nearly all of the total heterogeneity was driven by
differences among effects (91.71%) as opposed to differences among teams (6.88%) (Table 2).
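One way to see how total I² divides into team-level and effect-level components is sketched below. The sketch assumes that each component of I² is its variance component divided by the sum of all variance components plus a "typical" sampling variance. The typical sampling variance of 0.002 is a hypothetical value chosen only for illustration, the function name is an assumption, and the τ² inputs are the rounded blue tit values from Table 2, so the output is not expected to reproduce the table exactly.

```python
def partition_i2(tau2_team: float, tau2_effect: float,
                 typical_sampling_variance: float) -> dict[str, float]:
    """Split proportional heterogeneity (I^2) between two nested random effects.

    Each share is a variance component divided by the total variance,
    where total variance = tau2_team + tau2_effect + typical sampling variance.
    """
    total_variance = tau2_team + tau2_effect + typical_sampling_variance
    i2_team = tau2_team / total_variance
    i2_effect = tau2_effect / total_variance
    return {"total": i2_team + i2_effect, "team": i2_team, "effect": i2_effect}


# Rounded blue tit variance components from Table 2; the sampling
# variance (0.002) is a hypothetical placeholder.
shares = partition_i2(tau2_team=0.04, tau2_effect=0.05,
                      typical_sampling_variance=0.002)
print({name: round(value, 3) for name, value in shares.items()})
```

Under this assumption the team and effect shares always sum to the total, so a dataset can have a very high total I² while almost all of it sits in one component, as described for Eucalyptus above.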
Out-of-sample predictions (yi)
We observed substantial heterogeneity among out-of-sample estimates, but the pattern differed
somewhat from the Zr values (Table 3). Among the blue tit predictions, I² ranged from medium-high
for the y25 scenario (68.36%) to low (27.02%) for the y75 scenario. Among the Eucalyptus predictions, I²
values were uniformly high (>82%). For both datasets, most of the existing heterogeneity among
predicted values was attributable to among-team differences, with the exception of the y50 analysis
of the Eucalyptus dataset. We are limited in our interpretation of τ² for these estimates because,
unlike for the Zr estimates, we have no benchmark for comparison with other meta-analyses.
Table 3: Heterogeneity among the out-of-sample predictions yi for both blue tit and Eucalyptus
datasets. τ²_Team is the absolute heterogeneity for the random effect Team, τ²_EffectID is the absolute
heterogeneity for the random effect EffectID, nested under Team, and τ²_Total is the total absolute
heterogeneity. I²_Total is the proportional heterogeneity; the proportion of the variance among
effects not attributable to sampling error, I²_Team is the subset of the proportional heterogeneity due
to differences among Teams and I²_Team,EffectID is the subset of the proportional heterogeneity
attributable to among-EffectID differences.
| Dataset | Scenario | N. Obs | τ²_Total | τ²_Team | τ²_EffectID | I²_Total | I²_Team | I²_Team,EffectID |
|---|---|---|---|---|---|---|---|---|
| blue tit | y25 | 62 | 0.14 | 0.11 | 0.03 | 68.36% | 51.82% | 16.54% |
| blue tit | y50 | 59 | 0.07 | 0.06 | 0.01 | 50.37% | 45.66% | 4.71% |
| blue tit | y75 | 62 | 0.02 | 0.02 | 0.00 | 27.02% | 25.57% | 1.45% |
| Eucalyptus | y25 | 22 | 3.05 | 1.95 | 1.10 | 88.76% | 56.76% | 32.00% |
| Eucalyptus | y50 | 24 | 1.61 | 0.53 | 1.08 | 83.26% | 27.52% | 55.73% |
| Eucalyptus | y75 | 24 | 1.69 | 1.41 | 0.28 | 79.76% | 66.52% | 13.25% |
Post-hoc Analysis: Exploring outlier characteristics and the effect of outlier removal on heterogeneity
Effect Sizes (Zr)
The outlier Eucalyptus Zr values were striking and merited special examination. The three negative
outliers had very low sample sizes and were based on either small subsets of the dataset or, in one case,
extreme aggregation of data. The outliers associated with small subsets had sample sizes (n = 117, 90)
that were less than half of the total possible sample size of 351. The case of extreme aggregation
involved averaging all values within each of the 18 sites in the dataset.
Surprisingly, both the largest and smallest effect sizes in the blue tit analyses (Figure 1a) come from
the same analyst (anonymous ID: Adelong), with identical models in terms of the explanatory
variable structure, but with different response variables. However, the radical change in effect was
primarily due to collinearity with covariates. The primary predictor variable (brood count after
manipulation) was accompanied by several collinear variables, including the highly collinear
(correlation of approximately 0.9 (Supplementary Figure D.2)) covariate (brood size at day 14) in both
analyses. In the analysis of nestling weight, brood count after manipulation showed a strong positive
partial correlation with weight after controlling for brood count at day 14 and treatment category
(increased, decreased, unmanipulated). In that same analysis, the most collinear covariate (the day
14 count) had a negative partial correlation with weight. In the analysis with tarsus length as the
response variable, these partial correlations were almost identical in absolute magnitude, but
reversed in sign and so brood count after manipulation was now the collinear predictor with the
negative relationship. The two models were therefore very similar, but the two collinear predictors
simply switched roles, presumably because of a subtle difference in the distribution of weight and
tarsus length data.
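The role-switching described above can be illustrated with the standard first-order partial correlation formula. In the sketch below, x stands for brood count after manipulation, z for brood count at day 14, and y for the response. The predictor correlation of 0.9 follows the approximate value given in the text; the correlations with the response are hypothetical values chosen only to show how a small difference between them can flip the sign of the partial correlations.

```python
import math


def partial_corr(r_xy: float, r_xz: float, r_zy: float) -> float:
    """Partial correlation of x and y after controlling for z."""
    return (r_xy - r_xz * r_zy) / math.sqrt((1 - r_xz ** 2) * (1 - r_zy ** 2))


R_XZ = 0.9  # correlation between the two collinear predictors

# Hypothetical response 1: z is slightly more strongly related to y than x is.
print(round(partial_corr(r_xy=-0.20, r_xz=R_XZ, r_zy=-0.30), 3))  # x given z
print(round(partial_corr(r_xy=-0.30, r_xz=R_XZ, r_zy=-0.20), 3))  # z given x

# Hypothetical response 2: the two raw correlations are swapped, so the
# partial correlations keep their magnitudes but exchange signs.
print(round(partial_corr(r_xy=-0.30, r_xz=R_XZ, r_zy=-0.20), 3))  # x given z
print(round(partial_corr(r_xy=-0.20, r_xz=R_XZ, r_zy=-0.30), 3))  # z given x
```

Both predictors are negatively correlated with the response in both hypothetical cases; only the partial correlations change sign, which is the pattern expected when two predictors are nearly interchangeable.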
When we dropped the Eucalyptus outliers, I² decreased from high (98.59%), using Higgins’ [36]
suggested benchmark, to between moderate and high (66.19%, Table 2). However, more notably, τ²
dropped from 0.27 to 0.01, indicating that, once outliers were excluded, the observed variation in
effects was similar to what we would expect if sampling error were driving the differences among
effects (since τ² is the variance in addition to that driven by sampling error). The interpretation of this
value of τ² in the context of our many-analyst study is somewhat different than in a typical
meta-analysis, however, since in our study (especially for Eucalyptus, where most analyses used almost
exactly the same data points), there is almost no role for sampling error in driving the observed
differences among the estimates. Thus, rather than concluding that the variability we observed
among estimates (after removing outliers) was due only to sampling error (because τ² became small:
10% of the median from [48]), we instead conclude that the observed variability, which must be due to
the divergent choices of analysts rather than sampling error, is approximately of the same magnitude
as what we would have expected if, instead, sampling error, and not analytical heterogeneity, were at
work. Presumably, if sampling error had actually also been at work, it would have acted as an
additional source of variability and would have led total variability among estimates to be higher.
With total variability higher and thus greater than expected due to sampling error alone, τ² would
have been noticeably larger. Conversely, dropping outliers from the set of blue tit effects did not
meaningfully reduce I², and only modestly reduced τ² (Table 2). Thus, effects at the extremes of the
distribution were much stronger contributors to total heterogeneity for effects from analyses of the
Eucalyptus than for the blue tit dataset.
Table 4: Estimated mean value of the standardised correlation coefficient, Zr, along with its standard
error and 95% confidence intervals. We re-computed the meta-analysis for different post-hoc subsets
of the data: All eligible effects, removal of effects from analysis teams that received at least one peer
rating of ‘deeply flawed and unpublishable’, removal of any effects from analysis teams that received
at least one peer rating of either ‘deeply flawed and unpublishable’ or ‘publishable with major
revisions’, inclusion of only effects from analysis teams that included at least one member who rated
themselves as "highly proficient" or "expert" at conducting statistical analyses in their research area.
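| Subset | Dataset | μ | SE[μ] | 95% CI | statistic | p-value |
|---|---|---|---|---|---|---|
| All analyses | blue tit | -0.35 | 0.03 | [-0.41, -0.28] | -10.49 | <0.001 |
| All analyses | Eucalyptus | -0.09 | 0.06 | [-0.22, 0.03] | -1.47 | 0.14 |
| Analyses receiving at least one 'Unpublishable' rating removed | blue tit | -0.36 | 0.03 | [-0.43, -0.29] | -10.49 | <0.001 |
| Analyses receiving at least one 'Unpublishable' rating removed | Eucalyptus | -0.02 | 0.02 | [-0.07, 0.02] | -1.15 | 0.3 |
| Analyses receiving at least one 'Unpublishable' and/or 'Major Revisions' rating removed | blue tit | -0.37 | 0.07 | [-0.51, -0.23] | -5.34 | <0.001 |
| Analyses receiving at least one 'Unpublishable' and/or 'Major Revisions' rating removed | Eucalyptus | -0.04 | 0.05 | [-0.15, 0.07] | -0.77 | 0.4 |
| All analyses, outliers removed | blue tit | -0.35 | 0.03 | [-0.42, -0.29] | -10.95 | <0.001 |
| All analyses, outliers removed | Eucalyptus | -0.03 | 0.01 | [-0.06, 0.00] | -2.23 | 0.026 |
| Analyses from teams with highly proficient or expert data analysts | blue tit | -0.35 | 0.04 | [-0.44, -0.27] | -8.31 | <0.001 |
| Analyses from teams with highly proficient or expert data analysts | Eucalyptus | -0.17 | 0.13 | [-0.43, 0.10] | -1.24 | 0.2 |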
Out-of-sample predictions (yi)
We did not conduct these post hoc analyses on the out-of-sample predictions as the number of
eligible effects was smaller and the pattern of outliers differed.
Post-hoc analysis: Exploring the effect of removing analyses with poor peer ratings on heterogeneity
Effect Size (Zr)
Removing poorly rated analyses had limited impact on the meta-analytic means (Supplementary
Figure B.3). For the Eucalyptus dataset, the meta-analytic mean shifted from -0.09 to -0.02 when
effects from analyses rated as unpublishable were removed, and to -0.04 when effects from analyses
rated, at least once, as unpublishable or requiring major revisions were removed. Further, the
confidence intervals for all of these means overlapped each of the other means (Table 4). We saw
similar patterns for the blue tit dataset, with only small shifts in the meta-analytic mean, and
confidence intervals of all three means overlapping each other mean (Table 4). Refitting the
meta-analysis with a fixed effect for categorical ratings also showed no indication of differences in group
meta-analytic means due to peer ratings (Supplementary Figure B.1).
For the blue tit dataset, removing poorly-rated analyses led to only negligible changes in I²_Total and
relatively minor impacts on τ². However, for the Eucalyptus dataset, removing poorly-rated analyses
led to notable reductions in I²_Total and substantial reductions in τ². When including all analyses, the
Eucalyptus I²_Total was 98.59% and τ² was 0.27, but eliminating analyses with ratings of
“unpublishable” reduced I²_Total to 79.74% and τ² to 0.01, and removing also those analyses “needing
major revisions” left I²_Total at 88.91% and τ² at 0.03 (Table 2). Additionally, the allocations of I² to the
team versus individual effect were altered for both blue tit and Eucalyptus meta-analyses by
removing poorly rated analyses, but in different ways. For blue tit meta-analysis, between a third and
two-thirds of the total I² was attributable to among-team variance in most analyses until both
analyses rated “unpublishable” and analyses rated in need of “major revision” were eliminated, in
which case almost all remaining heterogeneity was attributable to among-effect differences. In
contrast, for Eucalyptus meta-analysis, the among-team component of I² was less than a third until
both analyses rated “unpublishable” and analyses rated in need of “major revision” were eliminated,
in which case almost 90% of heterogeneity was attributable to differences among teams.
Out-of-sample predictions (yi)
We did not conduct these post hoc analyses on the out-of-sample predictions as the number of
eligible effects was smaller and our ability to interpret heterogeneity values for these analyses was
limited.
Post-hoc analysis: Exploring the effect of including only analyses conducted by analysis teams with at least one member self-rated as “highly proficient” or “expert” in conducting statistical analyses in their research area
Effect sizes (Zr)
Including only analyses conducted by teams that contained at least one member who rated
themselves as “highly proficient” or “expert” in conducting the relevant statistical methods had
negligible impacts on the meta-analytic means (Table 4), the distribution of Zr effects
(Supplementary Figure B.4), or heterogeneity estimates (Table 2), which remained extremely high.
Out-of-sample predictions (yi)
We did not conduct these post hoc analyses on the out-of-sample predictions as the number of
eligible effects was smaller.
Post-hoc analysis: Exploring the effect of excluding estimates of Zr in which we had reduced confidence
As described in our addendum to the methods, we identified a subset of estimates of Zr in which we
had less confidence because of features of the submitted degrees of freedom. Excluding these effects
in which we had lower confidence had minimal impact on the meta-analytic mean and the estimates
of total I² and τ² for both blue tit and Eucalyptus meta-analyses, regardless of whether outliers were
also excluded (Supplementary Table B.1).
Explaining Variation in Deviation Scores
None of the pre-registered predictors explained substantial variation in deviation among submitted
statistical effects from the meta-analytic mean (Table 5, Table 6). Note that the extremely high
R²_Conditional values from the analyses of continuous peer ratings as predictors of deviation scores are a
function of the random effects, not the fixed effect of interest. These high values of R²_Conditional result
from the fact that each effect size was included in the analysis multiple times, to allow comparison
with ratings from the multiple peer reviewers who reviewed each analysis, and therefore when we
included Effect ID as a random effect, the observations within each random effect category were
identical.
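The arithmetic behind this point can be sketched as follows, assuming the commonly used variance-partitioning definitions of the intra-class correlation and of marginal and conditional R² for a mixed model with one random intercept. These definitions are an assumption here, since the exact formulas used are not restated in this section, and the function names are hypothetical. The standard deviations in the example are the rounded Eucalyptus values reported in Table 6 for the model with continuous ratings.

```python
def icc_from_sds(sd_random: float, sd_residual: float) -> float:
    """Share of variance due to the random intercept."""
    var_random = sd_random ** 2
    var_residual = sd_residual ** 2
    return var_random / (var_random + var_residual)


def r2_marginal_conditional(var_fixed: float, var_random: float,
                            var_residual: float) -> tuple[float, float]:
    """Marginal R^2 (fixed effects only) and conditional R^2 (fixed + random)."""
    total = var_fixed + var_random + var_residual
    return var_fixed / total, (var_fixed + var_random) / total


# Rounded SDs from Table 6 (Eucalyptus, continuous ratings):
# random intercept for EffectID = 0.53, residual = 0.01.
print(round(icc_from_sds(0.53, 0.01), 4))

# With essentially no variance explained by the fixed effect, the
# conditional R^2 collapses to the ICC and the marginal R^2 to zero.
print(r2_marginal_conditional(0.0, 0.53 ** 2, 0.01 ** 2))
```

Because repeated rows for the same effect share an identical deviation score, the residual variance within each EffectID is close to zero, which pushes both the ICC and the conditional R² towards 1 regardless of the fixed effect.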
Table 5: Summary metrics for registered models seeking to explain deviation (Box-Cox transformed
absolute deviation scores) from the mean Zr as a function of Sorensen’s Index, categorical peer
ratings, and continuous peer ratings for blue tit and Eucalyptus analyses, and as a function of the
presence or absence of random effects (in the analyst’s models) for Eucalyptus analyses. We report
the coefficient of determination, R², for our models including only fixed effects as predictors of deviation,
and we report R²_Conditional, R²_Marginal and the intra-class correlation (ICC) from our models that included
both fixed and random effects. For all our models, we calculated the residual standard deviation σ
and root mean squared error (RMSE).
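| Model | Dataset | R² | R²_Conditional | R²_Marginal | ICC | σ | RMSE | N. Obs. |
|---|---|---|---|---|---|---|---|---|
| Deviation explained by categorical ratings | blue tit | | 0.0903 | 0.0067 | 0.0842 | 6.52e-01 | 6.32e-01 | 473 |
| Deviation explained by categorical ratings | Eucalyptus | | 0.1319 | 0.0124 | 0.1209 | 1.06e+00 | 1.02e+00 | 346 |
| Deviation explained by continuous ratings | blue tit | | 1.0000 | 2.00e-26 | 1.0000 | 1.63e-05 | 1.56e-12 | 473 |
| Deviation explained by continuous ratings | Eucalyptus | | 0.9998 | 6.57e-30 | 0.9998 | 7.93e-03 | 7.09e-14 | 346 |
| Deviation explained by Sorensen's index | blue tit | 0.0011 | | | | 0.681 | 0.676 | 124 |
| Deviation explained by Sorensen's index | Eucalyptus | 0.0005 | | | | 1.14 | 1.120 | 72 |
| Deviation explained by inclusion of random effects | blue tit | 0.0268 | | | | 0.658 | 0.653 | 131 |
| Deviation explained by inclusion of random effects | Eucalyptus | 8.67e-08 | | | | 1.12 | 1.100 | 79 |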
Table 6: Parameter estimates from models of Box-Cox transformed deviation scores as a function of
continuous and categorical peer ratings, Sorensen scores, and the inclusion of random effects.
Standard Errors (SE) and 95% confidence intervals (95% CI) are reported for all estimates, while t values,
degrees of freedom and p-values are presented for fixed-effects. Note that positive parameter
estimates mean that as the predictor variable increases, so does the absolute value of the deviation
from the meta-analytic mean.
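| Model | Dataset | Parameter | Effect | Coef. | SE | 95% CI | t | df | p-value |
|---|---|---|---|---|---|---|---|---|---|
| Deviation explained by inclusion of random effects | Eucalyptus | Intercept | | -2.53 | 0.27 | -3.06, -1.99 | -9.31 | 77 | <0.001 |
| Deviation explained by inclusion of random effects | Eucalyptus | Random effects | | 0.00 | 0.31 | -0.60, 0.60 | 0.00 | 77 | >0.9 |
| Deviation explained by mean Sorensen’s index | Eucalyptus | Intercept | | -2.75 | 1.07 | -4.85, -0.65 | -2.57 | 70 | 0.010 |
| Deviation explained by mean Sorensen’s index | Eucalyptus | Sorensen Index | | 0.29 | 1.54 | -2.74, 3.32 | 0.19 | 70 | 0.9 |
| Deviation explained by mean Sorensen’s index | blue tit | Intercept | | -1.56 | 0.38 | -2.30, -0.82 | -4.12 | 122 | <0.001 |
| Deviation explained by mean Sorensen’s index | blue tit | Mean Sorensen Index | | 0.23 | 0.63 | -1.00, 1.46 | 0.37 | 122 | 0.7 |
| Deviation explained by continuous ratings | Eucalyptus | Intercept | Fixed | -2.52 | 0.06 | -2.63, -2.40 | -42.58 | 342 | <0.001 |
| Deviation explained by continuous ratings | Eucalyptus | Continuous Rating | Fixed | 6e-17 | 2e-10 | -4e-10, 4e-10 | -3e-07 | 342 | >0.9 |
| Deviation explained by continuous ratings | Eucalyptus | SD (Intercept) | Random (EffectID) | 0.53 | 0.04 | 0.45, 0.62 | | | |
| Deviation explained by continuous ratings | Eucalyptus | SD (Observations) | Random (Residual) | 0.01 | 3e-04 | 0.01, 0.01 | | | |
| Deviation explained by continuous ratings | blue tit | Intercept | Fixed | -1.41 | 0.03 | -1.47, -1.35 | -46.54 | 469 | <0.001 |
| Deviation explained by continuous ratings | blue tit | Continuous Rating | Fixed | -3e-15 | 1e-09 | -2e-09, 2e-09 | -2e-06 | 469 | >0.9 |
| Deviation explained by continuous ratings | blue tit | SD (Intercept) | Random (EffectID) | 0.34 | 0.02 | 0.30, 0.39 | | | |
| Deviation explained by continuous ratings | blue tit | SD (Observations) | Random (Residual) | 2e-05 | 6e-07 | 2e-05, 2e-05 | | | |
| Deviation explained by categorical ratings | Eucalyptus | Intercept | Fixed | -2.66 | 0.27 | -3.18, -2.13 | -9.97 | 340 | <0.001 |
| Deviation explained by categorical ratings | Eucalyptus | Publishable with major revisions | Fixed | 0.29 | 0.29 | -0.27, 0.85 | 1.02 | 340 | 0.3 |
| Deviation explained by categorical ratings | Eucalyptus | Publishable with minor revisions | Fixed | 0.01 | 0.28 | -0.54, 0.56 | 0.04 | 340 | >0.9 |
| Deviation explained by categorical ratings | Eucalyptus | Publishable as is | Fixed | 0.05 | 0.31 | -0.55, 0.66 | 0.17 | 340 | 0.9 |
| Deviation explained by categorical ratings | Eucalyptus | SD (Intercept) | Random (ReviewerID) | 0.39 | 0.09 | 0.25, 0.61 | | | |
| Deviation explained by categorical ratings | Eucalyptus | SD (Observations) | Random (Residual) | 1.06 | 0.04 | 0.98, 1.15 | | | |
| Deviation explained by categorical ratings | blue tit | Intercept | Fixed | -1.21 | 0.15 | -1.50, -0.93 | -8.29 | 467 | <0.001 |
| Deviation explained by categorical ratings | blue tit | Publishable with major revisions | Fixed | -0.23 | 0.15 | -0.53, 0.07 | -1.50 | 467 | 0.13 |
| Deviation explained by categorical ratings | blue tit | Publishable with minor revisions | Fixed | -0.23 | 0.15 | -0.53, 0.07 | -1.52 | 467 | 0.13 |
| Deviation explained by categorical ratings | blue tit | Publishable as is | Fixed | -0.15 | 0.17 | -0.48, 0.18 | -0.89 | 467 | 0.4 |
| Deviation explained by categorical ratings | blue tit | SD (Intercept) | Random (ReviewerID) | 0.20 | 0.05 | 0.13, 0.31 | | | |
| Deviation explained by categorical ratings | blue tit | SD (Observations) | Random (Residual) | 0.65 | 0.02 | 0.61, 0.7 | | | |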
Deviation Scores as explained by reviewer ratings
Effect Sizes (Zr)
We obtained reviews from 128 reviewers who reviewed analyses for a mean of 3.27 (range 1 - 11)
analysis teams. Analyses of the blue tit dataset received a total of 240 reviews; each was reviewed by
a mean of 3.87 (SD 0.71, range 3-5) reviewers. Analyses of the Eucalyptus dataset received a total of
178 reviews; each was reviewed by a mean of 4.24 (SD 0.79, range 3-6) reviewers. We tested for
inter-rater reliability to examine how similarly reviewers reviewed each analysis and found
approximately no agreement among reviewers. When considering continuous ratings, IRR was 0.01,
and for categorical ratings, IRR was -0.14.
Many of the models of deviance as a function of peer ratings faced issues of failure to converge or
singularity due to sparse design matrices with our pre-registered random effects (EffectID and
ReviewerID) (see Supplementary Table C.1). These issues persisted after increasing the tolerance and
changing the optimizer. For both Eucalyptus and blue tit datasets, models with continuous ratings as
a predictor were singular when both pre-registered random effects were included.
When using only categorical ratings as predictors, models converged only when specifying reviewer
ID as a random effect. That model had an R²_C of 0.09 and an R²_M of 0.01. The model using the
continuous ratings converged with either random effect in isolation, but not with both. We present results
for the model using study ID as a random effect because we expected it would be a more important
driver of variation in deviation scores. That model had an R²_C of 1 and an R²_M of 0.01 for the blue tit
dataset and an R²_C of 1 and an R²_M of 0.01 for the Eucalyptus dataset. Neither continuous nor categorical
reviewer ratings of the analyses meaningfully predicted deviance from the meta-analytic mean
(Table 6, Figure 3). We re-ran the multi-level meta-analysis with a fixed-effect for the categorical
publishability ratings and found no difference in mean standardised effect sizes among publishability
ratings (Supplementary Figure B.1).
Figure 3: Violin plot of Box-Cox transformed deviation from meta-analytic mean as a function of
categorical peer rating for a) blue tit and b) Eucalyptus. Grey points for each rating group denote
model-estimated marginal mean deviation, and error bars denote 95% CI of the estimate.
Out-of-sample predictions (yi)
Some models of the influence of reviewer ratings on out-of-sample predictions (yi) had issues with
convergence and singularity of fit (see Supplementary Table C.2) and those models that converged
and were not singular showed no strong relationship (Supplementary Figures C.2, C.3), as with
the Zr analyses.
Deviation scores as explained by the distinctiveness of variables in each analysis
Effect Size (Zr)
We employed Sorensen’s index to calculate the distinctiveness of the set of predictor variables used
in each model (Figure 4). The mean Sorensen’s score for blue tit analyses was 0.69 (range 0.55-0.98),
and for Eucalyptus analyses was 0.59 (range 0.43-0.86).
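For readers unfamiliar with the index, the sketch below computes the Sørensen similarity between two sets of predictor variables, 2|A ∩ B| / (|A| + |B|), and then averages each model's similarity to every other model. The averaging step, the function names, and the variable names are assumptions made for illustration; how the study converts these scores into a measure of distinctiveness is not restated in this section.

```python
def sorensen_similarity(a: set[str], b: set[str]) -> float:
    """Sorensen index: 2 * |A intersect B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


def mean_sorensen(models: dict[str, set[str]]) -> dict[str, float]:
    """Mean pairwise similarity of each model to all other models.

    Requires at least two models.
    """
    scores = {}
    for name, variables in models.items():
        others = [v for other, v in models.items() if other != name]
        scores[name] = sum(sorensen_similarity(variables, o) for o in others) / len(others)
    return scores


# Hypothetical predictor sets for three analyses.
models = {
    "A": {"brood_size", "sex", "year"},
    "B": {"brood_size", "sex"},
    "C": {"brood_size", "nest_id"},
}
print({name: round(score, 2) for name, score in mean_sorensen(models).items()})
```

A model whose predictors overlap heavily with those chosen by other analysts receives a mean score near 1, while a model built from unusual predictors receives a lower score.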
We found no meaningful relationship between distinctiveness of variables selected and deviation
from the meta-analytic mean (Table 6, Figure 4) for either blue tit (mean 0.23, 95% CI -1, 1.46) or
Eucalyptus effects (mean 0.29, 95% CI -2.74, 3.32).
Figure 4: Fitted model of the Box-Cox-transformed deviation score (deviation in effect size from
meta-analytic mean) as a function of the mean Sorensen’s index showing distinctiveness of the set of
predictor variables for a) blue tit, and b) Eucalyptus. Grey ribbons on predicted values are 95% CIs.
Out-of-sample predictions
As with the Zr estimates, we did not observe any convincing relationships between deviation scores
of out-of-sample predictions and Sorensen’s index values. Please see Supplementary Material C.4.2.
Deviation scores as explained by the inclusion of random effects
Effect Size (Zr)
There were only three blue tit analyses that did not include random effects, which is below the
pre-registered threshold for fitting a model of the Box-Cox transformed deviation from the meta-analytic
mean as a function of whether the analysis included random-effects. However, 17 Eucalyptus
analyses included only fixed effects, which crossed our pre-registered threshold. Consequently, we
performed this analysis for the Eucalyptus dataset only. There was no relationship between
random-effect inclusion and deviation from meta-analytic mean among the Eucalyptus analyses (Table 6,
Figure 5).
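The response variable in these models, the Box-Cox transformed deviation score, can be sketched as follows. The sketch takes the absolute deviation of each effect from the meta-analytic mean and applies a Box-Cox transformation with a supplied exponent. The exponent (lambda = 0.5) and the effect sizes are hypothetical; in practice the exponent would be estimated from the data, and the function names are assumptions for illustration.

```python
import math


def box_cox(x: float, lam: float) -> float:
    """Box-Cox transformation of a strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


def deviation_scores(effects: list[float], meta_mean: float, lam: float) -> list[float]:
    """Box-Cox transformed absolute deviations from the meta-analytic mean."""
    return [box_cox(abs(effect - meta_mean), lam) for effect in effects]


# Hypothetical Zr values around a meta-analytic mean of -0.35;
# lam = 0.5 is a placeholder, not an estimated value.
print([round(s, 3) for s in deviation_scores([-0.93, -0.40, 0.19], -0.35, lam=0.5)])
```

Absolute deviations smaller than 1 map to negative values under this transformation, so negative transformed scores indicate effects close to the meta-analytic mean rather than effects below it.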
Figure 5: Violin plot of mean Box-Cox transformed deviation from meta-analytic mean as a function
of random-effects inclusion in Eucalyptus analyses. ‘1’ indicates random-effects were included in
analyst’s model, while ‘0’ indicates no random-effects were included. White points for each group of
analyses denote model-estimated marginal mean deviation, and error bars denote 95% CI of the
estimate.
Out-of-sample predictions
As with the Zr estimates, we did not examine the possibility of a relationship between the inclusion
of random effects and the deviation scores of the blue tit out-of-sample predictions. When we
examined the possibility of this relationship for the Eucalyptus effects, we found consistent evidence
of somewhat higher Box-Cox-transformed deviation values for models including a random effect,
meaning the models including random effects averaged slightly higher deviation from the
meta-analytic means (Supplementary Figure C.5).
Multivariate Analysis
Effect size (Zr) and out-of-sample predictions (yi)
Like the univariate models, the multivariate models did a poor job of explaining deviations from the
meta-analytic mean. Because we pre-registered a multivariate model that contained collinear
predictors that produce results which are not readily interpretable, we present these models in the
supplement. We also had difficulty with convergence and singularity for multivariate models of
out-of-sample (yi) results, and had to adjust which random effects we included (Supplementary Table C.7).
However, no multivariate analyses of Eucalyptus out-of-sample results avoided problems of
convergence or singularity, no matter which random effects we included (Supplementary Table C.7).
We therefore present no multivariate Eucalyptus yi models. We present parameter estimates from
multivariate Zr models for both datasets (Supplementary Tables C.5, C.6) and from yi models from
the blue tit dataset (Supplementary Tables C.8, C.9). We include interpretation of the results from
these models in the supplement, but the results do not change the interpretations we present above
based on the univariate analyses.
Discussion
When a large pool of ecologists and evolutionary biologists analyzed the same two datasets to
answer the corresponding two research questions, they produced substantially heterogeneous sets
of answers. Although the variability in analytical outcomes was high for both datasets, the patterns
of this variability differed distinctly between them. For the blue tit dataset, there was nearly
continuous variability across a wide range of Zr values. In contrast, for the Eucalyptus dataset, there
was less variability across most of the range, but more striking outliers at the tails. Among
out-of-sample predictions, there was again almost continuous variation across a wide range (2 SD) among
blue tit estimates. For Eucalyptus, out-of-sample predictions were also notably variable, with about
half the predicted stem count values at <2 but the other half being much larger, and ranging to
nearly 40 stems per 15 m × 15 m plot. We investigated several hypotheses for drivers of this
variability within datasets, but found little support for any of these. Most notably, even when we
excluded analyses that had received one or more poor peer reviews, the heterogeneity in results
largely persisted. Regardless of what drives the variability, the existence of such dramatically
heterogeneous results when ecologists and evolutionary biologists seek to answer the same
questions with the same data should trigger conversations about how ecologists and evolutionary
biologists analyze data and interpret the results of their own analyses and those of others in the
literature [e.g., 11, 20, 49, 50].
Our observation of substantial heterogeneity due to analytical decisions is consistent with a growing
body of work, much of it from the quantitative social sciences [e.g., 11, 17–21]. In all of these
studies, when volunteers from the discipline analyzed the same data, they produced a worryingly
diverse set of answers to a pre-set question. This diversity always included a wide range of effect
sizes, and in most cases, even involved effects in opposite directions. Thus, our result should not be
viewed as an anomalous outcome from two particular datasets, but instead as evidence from
additional disciplines regarding the heterogeneity that can emerge from analyses of complex
datasets to answer questions in probabilistic science. Not only is our major observation consistent
with other studies, it is, itself, robust because it derived primarily from simple forest plots that we
produced based on a small set of decisions that were mostly registered before data gathering and
which conform to widely accepted meta-analytic practices.
Unlike the strong pattern we observed in the forest plots, our other analyses, both registered and
post hoc, produced either inconsistent patterns, weak patterns, or the absence of patterns. Our
registered analyses found that deviations from the meta-analytic mean by individual effect sizes (Zr)
or the predicted values of the dependent variable (yi) were poorly explained by our hypothesized
predictors: peer rating of each analysis team’s method section, a measurement of the distinctiveness
of the set of predictor variables included in each analysis, or whether the model included random
effects. However, in our post hoc analyses, we found that dropping analyses identified as
unpublishable or in need of major revision by at least one reviewer modestly reduced the observed
heterogeneity among the Zr outcomes, but only for Eucalyptus analyses, apparently because this led
to the dropping of the major outlier. This limited role for peer review in explaining the variability in
our results should be interpreted cautiously because the inter-rater reliability among peer reviewers
was extremely low, and at least some analyses that appeared flawed to us were not marked as
flawed by reviewers. However, the hypothesis that poor quality analyses drove the heterogeneity we
observed was also contradicted by our observation that analysts’ self-declared statistical expertise
appeared unrelated to heterogeneity. When we retained only analyses from teams including at least
one member with high self-declared levels of expertise, heterogeneity among effect sizes remained
high. Thus, our results suggest lack of statistical expertise is not the primary factor responsible for
the heterogeneity we observed, although further work is merited before rejecting a role for
statistical expertise. Not surprisingly, simply dropping outlier values of Zr for Eucalyptus analyses,
which had more extreme outliers, led to less observable heterogeneity in the forest plots, and also
reductions in our quantitative measures of heterogeneity. We did not observe a similar effect in the
blue tit dataset because that dataset had outliers that were much less extreme and instead had more
variability across the core of the distribution.
Our major observations raise two broad questions: why was the variability among results so high,
and why did the pattern of variability differ between our two datasets? One important and plausible
answer to the first question is that much of the heterogeneity derives from the lack of a precise
relationship between the two biological research questions we posed and the data we provided. This
lack of a precise relationship between data and question creates many opportunities for different
model specifications, and so may inevitably lead to varied analytical outcomes [50]. However, we
believe that the research questions we posed are consistent with the kinds of research questions that
ecologists and evolutionary biologists typically work from. When designing the two biological
research questions, we deliberately sought to represent the level of specificity we typically see in
these disciplines. This level of specificity is evident when we look at the research questions posed by
some recent meta-analyses in these fields:
“how [does] urbanisation impact mean phenotypic values and phenotypic variation … [in] paired
urban and non-urban comparisons of avian life-history traits” [51]
“[what are] the effects of ocean acidification on the crustacean exoskeleton, assessing both
exoskeletal ion content (calcium and magnesium) and functional properties (biomechanical
resistance and cuticle thickness)” [52]
“[what is] the extent to which restoration affects both the mean and variability of biodiversity
outcomes … [in] terrestrial restoration” [53]
“[does] drought stress [have] a negative, positive, or null effect on aphid fitness” [54]
“[what is] the influence of nitrogen-fixing trees on soil nitrous oxide emissions” [55]
There is not a single precise answer to any of these questions, nor to the questions we posed to
analysts in our study. And this lack of single clear answers will obviously continue to cause
uncertainty since ecologists and evolutionary biologists conceive of the different answers from the
different statistical models as all being answers to the same general question. A possible response
would be a call to avoid these general questions in favor of much more precise alternatives [50].
However, the research community rewards researchers who pose broad questions [56], and so
researchers are unlikely to narrow their scope without a change in incentives. Further, we suspect
that even if individual studies specified narrow research questions, other scientists would group
these more narrow questions into broader categories, for instance in meta-analyses, because it is
these broader and more general questions that often interest the research community.
Although variability in statistical outcomes among analysts may be inevitable, our results raise
questions about why this variability differed between our two datasets. We are particularly
interested in the differences in the distribution of Zr since the distributions of out-of-sample
predictions were on different scales for the two datasets, thus limiting the value of comparisons. The
forest plots of Zr from our two datasets showed distinct patterns, and these differences are
consistent with several alternative hypotheses. The results submitted by analysts of the Eucalyptus
dataset showed a small average (close to zero) with most estimates also close to zero (± 0.2), though
about a third were far enough above or below zero to cross the traditional threshold of statistical
significance. There were a small number of striking outliers that were very far from zero. In contrast,
the results submitted by analysts of the blue tit dataset showed an average much further from zero
(-0.35) and a much greater spread in the core distribution of estimates across the range of Zr values
(± 0.5 from the mean), with few modest outliers. So, why was there more spread in effect sizes (across
the estimates that are not outliers) in the blue tit analyses relative to the Eucalyptus analyses?
One possible explanation for the lower heterogeneity among most Eucalyptus Zr effects is that weak
relationships may limit the opportunities for heterogeneity in analytical outcome. Some evidence for
this idea comes from two sets of “many labs” studies in psychology [4, 57]. In these studies, many
independent lab groups each replicated a large set of studies, including, for each study, the
experiment, data collection, and statistical analyses. These studies showed that, when the
meta-analytic mean across the replications from different labs was small, there was much less
heterogeneity among the outcomes than when the mean effect sizes were large [4, 57]. Of course, a
weak average effect size would not prevent divergent effects in all circumstances. As we saw with the
Eucalyptus analyses, taking a radically smaller subset of the data can lead to dramatically divergent
effect sizes even when the mean with the full dataset is close to zero.
Our observation that dramatic sub-setting in the Eucalyptus dataset was associated with
correspondingly dramatic divergence in effect sizes leads us towards another hypothesis to explain
the differences in heterogeneity between the Eucalyptus and blue tit analysis sets. It may be that
when analysts often divide a dataset into subsets, the result will be greater heterogeneity in
analytical outcome for that dataset. Although we saw sub-setting associated with dramatic outliers in
the Eucalyptus dataset, nearly all other analyses of Eucalyptus data used very close to the same set
of 351 samples, and as we saw, these effects did not vary substantially. However, analysts often
analyzed only a subset of the blue tit data, and as we observed, sample sizes were much more
variable among blue tit effects, and the effects themselves were also much more variable. Important
to note here is that subsets of data may differ from each other for biological reasons, but they may
also differ due to sampling error. Sampling error is a function of sample size, and sub-samples are, by
definition, smaller samples, and so more subject to variability in effects due to sampling error [58].
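The dependence of sampling error on sample size can be illustrated with a minimal sketch. It assumes the simplest case, a bivariate Pearson correlation converted to Fisher's Zr, for which the approximate standard error is 1/sqrt(n - 3). The effects submitted by analysts came from more complex models, so the numbers are illustrative only. The subset sizes and the function name are hypothetical, and Python is used purely for illustration (the analyses in this manuscript were conducted in R).

```python
import math


def zr_standard_error(n: int) -> float:
    """Approximate standard error of Fisher's Zr for a correlation from n samples."""
    if n <= 3:
        raise ValueError("Fisher's Zr standard error requires n > 3")
    return 1.0 / math.sqrt(n - 3)


# 351 is the approximate full Eucalyptus sample size mentioned in the text;
# the smaller values are hypothetical subset sizes chosen for illustration.
sample_sizes = [351, 176, 88, 44]

for n in sample_sizes:
    se = zr_standard_error(n)
    # Half-width of an approximate 95% confidence interval around Zr.
    half_width = 1.96 * se
    print(f"n = {n:3d}  SE(Zr) ~ {se:.3f}  95% CI half-width ~ {half_width:.3f}")
```

Under this assumption, reducing the sample from 351 to 44 nearly triples the standard error, so effects estimated from small subsets would be expected to scatter more widely even if the underlying relationship were identical.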
Other features of datasets are also plausible candidates for driving heterogeneity in analytical
outcomes, including features of covariates. In particular, relationships between covariates and the
response variable as well as relationships between covariates and the primary independent variable
(collinearity) can strongly influence the modeled relationship between the independent variable of
interest and the dependent variable [59, 60]. Therefore, inclusion or exclusion of these covariates
can drive heterogeneity in effect sizes (Zr). Also, as we saw with the two most extreme Zr values from
the blue tit analyses, in multivariate models with collinear predictors, extreme effects can emerge
when estimating partial correlation coefficients due to high collinearity, and conclusions can differ
dramatically depending on which relationship receives the researcher’s attention. Therefore,
differences between datasets in the presence of strong and/or collinear covariates could influence
the differences in heterogeneity in results among those datasets.
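How collinearity can inflate a partial correlation can be sketched with the standard first-order partial correlation formula for three variables. This is an illustration under stated assumptions, not a reconstruction of any analyst's model: the correlation values below are hypothetical, x stands for the predictor of interest, y for the response, and z for a single covariate.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation between x and y, controlling for z."""
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denominator == 0.0:
        raise ValueError("partial correlation is undefined when |r_xz| or |r_yz| equals 1")
    value = (r_xy - r_xz * r_yz) / denominator
    if abs(value) > 1.0:
        raise ValueError("the three correlations are not mutually consistent")
    return value


# Hypothetical values: x = predictor of interest, y = response, z = covariate.
r_xy = -0.30  # raw correlation between predictor and response
r_yz = -0.05  # weak correlation between covariate and response

# Increase the collinearity between the predictor and the covariate.
for r_xz in [0.0, 0.5, 0.8, 0.95]:
    r_partial = partial_correlation(r_xy, r_xz, r_yz)
    print(f"r_xz = {r_xz:.2f}  partial r_xy.z = {r_partial:.3f}")
```

With these hypothetical inputs, the partial correlation stays close to the raw value of -0.30 when collinearity is low, but exceeds 0.8 in magnitude once r_xz reaches 0.95, because the denominator shrinks as the predictor and covariate become nearly redundant.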
Although it is too early in the many-analyst research program to conclude which analytical decisions
or which features of datasets are the most important drivers of heterogeneity in analytical outcomes,
we must still grapple with the possibility that analytical outcomes may vary substantially based on
the choices we make as analysts. If we assume that, at least sometimes, different analysts will
produce dramatically different statistical outcomes, what should we do as ecologists and
evolutionary biologists? We review some ideas below.
The easiest path forward after learning about this analytical heterogeneity would be simply to
continue with “business as usual”, where researchers report results from a small number of statistical
models. A case could be made for this path based on our results. For instance, among the blue tit
analyses, the precise values of the estimated Zr effects varied substantially, but the average effect
was convincingly different from zero, and a majority of individual effects (84%) were in the same
direction. Arguably, many ecologists and evolutionary biologists appear primarily interested in the
direction of a given effect and the corresponding p-value [61], and so the variability we observed
when analyzing the blue tit dataset may not worry these researchers. Similarly, most effects from the
Eucalyptus analyses were relatively close to zero, and about two-thirds of these effects did not cross
the traditional threshold of statistical significance. Therefore, a large proportion of people analyzing
these data would conclude that there was no effect, and this is consistent with what we might
conclude from the meta-analysis.
However, we find the counter arguments to “business as usual” to be compelling. For blue tits, there
was a substantial minority of calculated effects that would be interpreted by many biologists as
indicating the absence of an effect (28%), and there were three traditionally ‘significant’ effects in
the opposite direction to the average. The qualitative conclusions of analysts also reflected
substantial variability, with fully half of teams drawing a conclusion distinct from the one we draw
from the distribution as a whole. These teams with different conclusions were either uncertain about
the negative relationship between competition and nestling growth, or they concluded that effects
were mixed or absent. For the Eucalyptus analyses, this issue is more concerning. Around two-thirds
of effects had confidence intervals overlapping zero, and of the third of analyses with confidence
intervals excluding zero, almost half were positive, and the rest were negative. Accordingly, the
qualitative conclusions of the Eucalyptus teams were spread across the full range of possibilities. But
even these problems are optimistic.
A potentially larger argument against “business as usual” is that it provides the raw material for
biasing the literature. When different model specifications readily lead to different results, analysts
may be tempted to report the result that appears most interesting, or that is most consistent with
expectation [7, 12]. There is growing evidence that researchers in ecology and evolutionary biology
often report a biased subset of the results they produce [62, 63], and that this bias exaggerates the
average size of effects in the published literature by between 30 and 150% [9, 48]. The bias then
accumulates in meta-analyses, apparently more than doubling the rate of conclusions of “statistical
significance” in published meta-analyses above what would have been found in the absence of bias
[48]. Thus, “business as usual” does not just create noisy results, it helps create systematically
misleading results.
Conclusions
Overall, our results suggest to us that, where there is a diverse set of plausible analysis options, no
single analysis should be considered a complete or reliable answer to a research question. We
contend that ecologists and evolutionary biologists typically do multiple analyses (as many of our
analyst teams did); however, some of these analyses do not make it into the published manuscript.
Further, because of the evidence that ecologists and evolutionary biologists often present a biased
subset of the analyses they conduct [48, 62, 63], we do not expect that even a collection of different
effect sizes from different studies will accurately represent the true distribution of effects
[48]. Therefore, we believe that an increased level of skepticism of the outcomes of single analyses,
or even single meta-analyses, is warranted going forward. We recognize that some researchers have
long maintained a healthy level of skepticism of individual studies as part of sound and practical
scientific practice, and it is possible that those researchers will be neither surprised nor concerned by
our results. However, we doubt that many researchers are sufficiently aware of the potential
problems of analytical flexibility to be appropriately skeptical.
If we are skeptical of single analyses, the path forward may be multiple analyses per dataset. One
possibility is the traditional robustness or sensitivity check [e.g., 64, 65], in which the researcher
presents several alternative versions of an analysis to demonstrate that the result is ‘robust’ [66].
Unfortunately, robustness checks are at risk of the same potential biases of reporting found in other
studies [11], especially given the relatively few models typically presented. However, these risks
could be minimized by running more models and doing so with pre-registration or a registered report.
Another option is model averaging. Averages across models often perform well [e.g., 67], and in
some forms this may be a relatively simple solution. As most often practiced in ecology and
evolutionary biology, model averaging involves first identifying a small suite of candidate models
[see 13], then using Akaike weights, based on Akaike’s Information Criterion (AIC), to calculate
weighted averages for parameter estimates from those models. Again, the small number of models
limits the exploration of specification space, but we can examine a larger number of models.
However, there are more concerning limitations. The largest of these limitations is that averaging
regression coefficients is problematic when models differ in interaction terms or collinear variables
[68]. Additionally, weighting by AIC may often be inconsistent with our modelling goals. AIC balances
the trade-off between model complexity and predictive ability, but penalizing models for complexity
may not be suited for testing hypotheses about causation. So, AIC may often not offer the weight we
want to use for an average, and we may also not wish to just generate an average. Instead, if we
hope to understand an extensive universe of possible modelling outcomes, we could conduct a
multiverse analysis, possibly with a specification curve [10, 49]. This could mean running hundreds or
thousands of models (or more!) to examine the distribution of possible effects, and to see how
different specification choices map onto these effects. However, there is a trade-off between
efficiently exploring large areas of specification space and limiting the analyses to biologically
plausible specifications. Instead of simply identifying modelling decisions and creating all possible
combinations for the multiverse, a researcher could attempt to prevent implausible combinations,
though the more variables in the dataset, the more difficult this becomes. To make this easier, one
could recruit many analysts to each designate one or a few plausible specifications, as with our
‘many analyst’ study [11]. An alternative that may be more labor intensive for the primary analyst,
but which may lead to a more plausible set of models, could involve hypothesizing about causal
pathways with DAGs (directed acyclic graphs [69]) to constrain the model set. Devoting this effort to
thoughtful multiverse specifications, possibly combined with pre-registration to hinder undisclosed
data dredging, seems worthy of consideration.
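The Akaike-weight averaging mentioned in the paragraph above can be sketched in a few lines. This is a minimal illustration, assuming that the parameter of interest appears in every candidate model and has the same interpretation in each, which is exactly the condition that fails when models differ in interaction terms or collinear variables. The AIC values, slope estimates, and function names are hypothetical.

```python
import math


def akaike_weights(aic_values):
    """Convert a list of AIC values into Akaike weights that sum to 1."""
    best = min(aic_values)
    # Relative likelihood of each model given its AIC difference from the best model.
    relative = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative)
    return [value / total for value in relative]


def model_averaged_estimate(estimates, weights):
    """Weighted average of one parameter's estimates across candidate models."""
    return sum(estimate * weight for estimate, weight in zip(estimates, weights))


# Hypothetical candidate models: AIC values and the slope each model estimates.
aics = [1002.3, 1003.1, 1007.8]
slopes = [-0.40, -0.31, -0.12]

weights = akaike_weights(aics)
averaged = model_averaged_estimate(slopes, weights)

for aic, slope, weight in zip(aics, slopes, weights):
    print(f"AIC = {aic:.1f}  slope = {slope:+.2f}  weight = {weight:.3f}")
print(f"model-averaged slope = {averaged:+.3f}")
```

Because the weights depend only on AIC differences, models that fit nearly as well as the best model contribute substantially to the average, while a model several AIC units worse contributes little.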
Although we have reviewed a variety of potential responses to the existence of variability in
analytical outcomes, we certainly do not wish to imply that this is a comprehensive set of possible
responses. Nor do we wish to imply that the opinions we have expressed about these options are
correct. Determining how the disciplines of ecology and evolutionary biology should respond to
knowledge of the variability in analytical outcome will benefit from the contribution and discussion
of ideas from across these disciplines. We look forward to learning from these discussions and to
seeing how these disciplines ultimately respond.
Declarations
Ethics approval and consent to participate
We obtained permission to conduct this research from the Whitman College Institutional Review
Board (IRB). As part of this permission, the IRB approved the consent form (https://osf.io/xyp68/)
that all participants completed prior to joining the study.
Consent for publication
Not applicable
Availability of data and materials
All data cleaning and preparation for our analyses was conducted in R (R Core Team 2022) and is
publicly archived at https://zenodo.org/doi/10.5281/zenodo.10046152. Please see session info for
the full list of packages and their citations used in our analysis pipeline. We built an R package,
ManyEcoEvo, to conduct the analyses described in this chapter. This same package can be used to
reproduce our analyses or replicate the analyses described here using alternate datasets.
Competing interests
The authors declare that they have no competing interests.
Funding
EG’s contributions were supported by an Australian Government Research Training Program
Scholarship, AIMOS top-up scholarship (2022) and Melbourne Centre of Data Science Doctoral
Academy Fellowship (2021). FF’s contributions were supported by ARC Future Fellowship
FT150100297.
Authors’ contributions
HF, THP and FF conceptualized the project. PV provided raw data for Eucalyptus analyses and SG and
THP provided raw data for blue tit analyses. DGH, HF and THP prepared surveys for collecting
participating analysts’ and reviewers’ data. EG, HF, THP, PV, SN and FF planned the analyses of the
data provided by our analysts and reviewers. EG, HF, and THP curated the data. EG and HF wrote the
software code to implement the analyses and prepare data visualisations. EG ensured that analyses
were documented and reproducible. THP and HF administered the project, including coordinating
with analysts and reviewers. FF provided funding for the project. THP, HF, and EG wrote the
manuscript. Authors listed alphabetically contributed analyses of the primary datasets or reviews of
analyses. All authors read and approved the final manuscript.
Acknowledgements
Not applicable
References
1. Arif S, MacNeil MA. Applying the structural causal model framework for observational causal inference in ecology. Ecological Monographs. 2023;93:e1554.
2. Atkinson J, Brudvig LA, Mallen-Cooper M, Nakagawa S, Moles AT, Bonser SP. Terrestrial ecosystem restoration increases biodiversity and reduces its variability, but not to reference levels: A global meta-analysis. Ecology Letters. 2022;25:1725–37.
3. Auspurg K, Brüderl J. Has the credibility of the social sciences been credibly destroyed? Reanalyzing the “many analysts, one data set” project. Socius. 2021;7:23780231211024421.
4. Schloerke B, Cook D, Larmarange J, Briatte F, Marbach M, Thoen E, et al. GGally: Extension to ’ggplot2’. 2022.
5. Baselga A, Orme D, Villeger S, De Bortoli J, Leprieur F, Logez M, et al. Package “betapart”. 2023.
6. Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. Journal of Statistical Software. 2015;67:48.
7. Bolker B, Robinson D, Menne D, Gabry J, Buerkner P, Hau C, et al. Package “broom.mixed”. 2022.
8. Borenstein M, Higgins JPT, Hedges L, Rothstein H. Basics of meta-analysis: I2 is not an absolute measure of heterogeneity. Research Synthesis Methods. 2017;8:5–18.
9. Botvinik-Nezer R, Holzmeister F, Camerer CF, Dreber A, Huber J, Johannesson M, et al. Variability in the analysis of a single neuroimaging dataset by many teams. Nature. 2020;582:84–8.
10. Breznau N, Rinke EM, Wuttke A, Nguyen HHV, Adem M, Adriaans J, et al. Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proceedings of the National Academy of Sciences. 2022;119:e2203150119.
11. Briga M, Verhulst S. Mosaic metabolic ageing: Basal and standard metabolic rates age in opposite directions and independent of environmental quality, sex and life span in a passerine. Functional Ecology. 2021;35:1055–68.
12. Burnham KP, Anderson DR. Model selection and multimodel inference: A practical information-theoretical approach. 2nd edition. Book. New York: Springer-Verlag; 2002.
13. Cade BS. Model averaging and muddled multimodel inferences. Ecology. 2015;96:2370–82.
14. Capilla-Lasheras P, Thompson MJ, Sánchez-Tójar A, Haddou Y, Branston CJ, Réale D, et al. A global meta-analysis reveals higher variation in breeding phenology in urban birds than in their non-urban neighbours. Ecology Letters. 2022;25:2552–70.
15. Coretta S, Casillas JV, Roessig S, Franke M, Ahn B, Al-Hoorie AH, et al. Multidimensional signals and analytic flexibility: Estimating degrees of freedom in human-speech analyses. Advances in Methods and Practices in Psychological Science. 2023;6:25152459231162567.
16. DeKogel CH. Long-term effects of brood size manipulation on morphological development and sex-specific mortality of offspring. Journal of Animal Ecology. 1997;66:167–78.
17. Deressa T, Stern D, Vangronsveld J, Minx J, Lizin S, Malina R, et al. More than half of statistically significant research findings in the environmental sciences are actually not. EcoEvoRxiv. 2023. https://doi.org/10.32942/X24G6Z.
18. Dormann CF, Elith J, Bacher S, Buchmann C, Carl G, Carré G, et al. Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography. 2013;36:27–46.
19. Fanelli D, Costas R, Ioannidis JPA. Meta-assessment of bias in science. Proceedings of the Naonal 1562
Academy of Sciences. 2017;114:3714–9. 1563
20. Fanelli D, Ioannidis JPA. US studies may overesmate eect sizes in soer research. Proceedings 1564
of the Naonal Academy of Sciences. 2013;110:15031–6. 1565
21. Fidler F, Burgman MA, Cumming G, Burose R, Thomason N. Impact of cricism of null-1566
hypothesis signicance tesng on stascal reporng pracces in conservaon biology. Conservaon 1567
Biology. 2006;20:1539–44. 1568
22. Fidler F, Chee YE, Wintle BC, Burgman MA, McCarthy MA, Gordon A. Metaresearch for evaluang 1569
reproducibility in ecology and evoluon. BioScience. 2017;67:282–9. 1570
23. Forstmeier W, Wagenmakers E-J, Parker TH. Detecng and avoiding likely false-posive ndings – 1571
a praccal guide. Biological Reviews. 2017;92:1941–68. 1572
24. Fraser H, Parker T, Nakagawa S, Barne A, Fidler F. Quesonable research pracces in ecology 1573
and evoluon. PLOS ONE. 2018;13:e0200303. 1574
25. Gelman A, Weakliem D. Of beauty, sex, and power. American Scienst. 2009;97:310–6. 1575
26. Gelman A, Loken E. The garden of forking paths: Why mulple comparisons can be a problem, 1576
even when there is no “shing expedion” or “p-hacking” and the research hypothesis was posited 1577
ahead of me. Department of Stascs, Columbia University. 2013. 1578
27. Grueber CE, Nakagawa S, Laws RJ, Jamieson IG. Mulmodel inference in ecology and evoluon: 1579
Challenges and soluons. Journal of Evoluonary Biology. 2011;24:699–711. 1580
28. Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 1581
2003;327:557–60. 1582
29. Hunngton-Klein N, Arenas A, Beam E, Bertoni M, Bloem JR, Burli P, et al. The inuence of hidden 1583
researcher decisions in applied microeconomics. Economic Inquiry. 2021;59:944–60. 1584
30. Jennions MD, Lore CJ, Rosenberg MS, Rothstein HR. Publicaon and related biases. In: Koricheva 1585
J, Gurevitch J, Mengersen K, editors. Handbook of meta-analysis in ecology and evoluon. Princeton, 1586
USA: Princeton University Press; 2013. p. 207–36. 1587
31. Kimmel K, Avolio ML, Ferraro PJ. Empirical evidence of widespread exaggeraon bias and 1588
selecve reporng in ecology. Nature Ecology & Evoluon. 2023. hps://doi.org/10.1038/s41559-1589
023-02144-3. 1590
32. Klein RA, Ratliff KA, Vianello M, Adams RB Jr, Bahník Š, Bernstein MJ, et al. Investigating variation in replicability: A "many labs" replication project. Social Psychology. 2014;45:142–52.
33. Klein RA, Vianello M, Hasselman F, Adams BG, Adams RB, Alper S, et al. Many labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science. 2018;1:443–90.
34. Knight K. Mathematical statistics. New York: Chapman & Hall; 2000.
35. Kou-Giesbrecht S, Menge DNL. Nitrogen-fixing trees increase soil nitrous oxide emissions: A meta-analysis. Ecology. 2021;102:e03415.
36. Kuznetsova A, Brockhoff PB, Christensen RHB. lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software. 2017;82:1–26.
37. Leybourne DJ, Preedy KF, Valentine TA, Bos JIB, Karley AJ. Drought has negative consequences on aphid fitness and plant vigor: Insights from a meta-analysis. Ecology and Evolution. 2021;11:11915–29.
38. Lu X, White H. Robustness checks and robustness tests in applied economics. Journal of Econometrics. 2014;178:194–206.
39. Lüdecke D, Ben-Shachar MS, Patil I, Waggoner P, Makowski D. Performance: An r package for assessment, comparison and testing of statistical models. Journal of Open Source Software. 2021;6:3139.
40. Luke SG. Evaluating significance in linear mixed-effects models in r. Behavior Research Methods. 2017;49:1494–502.
41. Miles C. Testing market-based instruments for conservation in northern victoria. In: Norton T, Lefroy T, Bailey K, Unwin G, editors. Biodiversity: Integrating conservation and production: Case studies from australian farms, forests and fisheries. Melbourne, Australia: CSIRO Publishing; 2008. p. 133–46.
42. Morrissey MB, Ruxton GD. Multiple regression is not multiple regressions: The meaning of multiple regression and the non-problem of collinearity. Philosophy, Theory, and Practice in Biology. 2018;10.
43. Nakagawa S, Cuthill IC. Effect size, confidence interval and statistical significance: A practical guide for biologists. Biological Reviews. 2007;82:591–605.
44. Nakagawa S, Noble DW, Senior AM, Lagisz M. Meta-evaluation of meta-analysis: Ten appraisal questions for biologists. BMC Biology. 2017;15:18.
45. Nicolaus M, Michler SPM, Ubels R, Velde M van der, Komdeur J, Both C, et al. Sex-specific effects of altered competition on nestling growth and survival: An experimental manipulation of brood size and sex ratio. Journal of Animal Ecology. 2009;78:414–26.
46. Noble DWA, Lagisz M, O’Dea RE, Nakagawa S. Nonindependence and sensitivity analyses in ecological and evolutionary meta-analyses. Molecular Ecology. 2017;26:2410–25.
47. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349:aac4716.
48. Parker TH, Forstmeier W, Koricheva J, Fidler F, Hadfield JD, Chee YE, et al. Transparency in ecology and evolution: Real problems, real solutions. Trends in Ecology & Evolution. 2016;31:711–9.
49. Parker TH, Yang Y. Exaggerated effects in ecology. Nature Ecology & Evolution. 2023. https://doi.org/10.1038/s41559-023-02156-z.
50. Pei Y, Forstmeier W, Wang D, Martin K, Rutkowska J, Kempenaers B. Proximate causes of infertility and embryo mortality in captive zebra finches. The American Naturalist. 2020;196:577–96.
51. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2022.
52. Rosenberg MS. Moment and least-squares based approaches to metaanalytic inference. In: Koricheva J, Gurevitch J, Mengersen K, editors. Handbook of meta-analysis in ecology and evolution. Princeton, USA: Princeton University Press; 2013. p. 108–24.
53. Royle NJ, Hartley IR, Owens IPF, Parker GA. Sibling competition and the evolution of growth rates in birds. Proceedings of the Royal Society B-Biological Sciences. 1999;266:923–32.
54. Schweinsberg M, Feldman M, Staub N, Akker OR van den, Aert RCM van, Assen M van, et al. Same data, different conclusions: Radical dispersion in empirical results when independent analysts operationalize and test the same hypothesis. Organizational Behavior and Human Decision Processes. 2021;165:228–49.
55. Senior AM, Grueber CE, Kamiya T, Lagisz M, O’Dwyer K, Santos ESA, et al. Heterogeneity in ecological and evolutionary meta-analyses: Its magnitude and implications. Ecology. 2016;97:3293–9.
56. Shavit A, Ellison AM, editors. Stepping in the same river twice: Replication in biological research. New Haven, Connecticut, USA: Yale University Press; 2017.
57. Siegel KR, Kaur M, Grigal AC, Metzler RA, Dickinson GH. Meta-analysis suggests negative, but pCO2-specific, effects of ocean acidification on the structural and functional properties of crustacean biomaterials. Ecology and Evolution. 2022;12:e8922.
58. Silberzahn R, Uhlmann EL, Martin DP, Anselmi P, Aust F, Awtrey E, et al. Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science. 2018;1:337–56.
59. Simons DJ, Shoda Y, Lindsay DS. Constraints on generality (COG): A proposed addition to all empirical papers. Perspectives on Psychological Science. 2017. https://doi.org/10.1177/174569161770863.
60. Simonsohn U, Simmons JP, Nelson LD. Specification curve: descriptive and inferential statistics on all reasonable specifications. SSRN Electronic Journal. 2015. https://doi.org/10.2139/ssrn.2694998.
61. Simonsohn U, Simmons JP, Nelson LD. Specification curve analysis. Nature Human Behaviour. 2020;4:1208–14.
62. Steegen S, Tuerlinckx F, Gelman A, Vanpaemel W. Increasing transparency through a multiverse analysis. Perspectives on Psychological Science. 2016;11:702–12.
63. Taylor JW, Taylor KS. Combining probabilistic forecasts of COVID-19 mortality in the united states. European Journal of Operational Research. 2023;304:25–41.
64. Dancho M, Vaughan D. Timetk: A tool kit for working with time series. 2023.
65. Vander Werf E. Lack’s clutch size hypothesis: An examination of the evidence using meta-analysis. Ecology. 1992;73:1699–705.
66. Ver Hoef JM. Who invented the delta method? The American Statistician. 2012;66:124–7.
67. Verhulst S, Holveck MJ, Riebel K. Long-term effects of manipulated natal brood size on metabolic rate in zebra finches. Biology Letters. 2006;2:478–80.
68. Vesk PA, Morris WK, McCallum W, Apted R, Miles C. Processes of woodland eucalypt regeneration: Lessons from the bush returns trial. Proceedings of the Royal Society of Victoria. 2016;128:54–63.
69. Viechtbauer W. Conducting meta-analyses in r with the metafor package. Journal of Statistical Software. 2010;36:1–48.
70. Yang Y, Sánchez-Tójar A, O’Dea RE, Noble DWA, Koricheva J, Jennions MD, et al. Publication bias impacts on effect size, statistical power, and magnitude (type m) and sign (type s) errors in ecology and evolutionary biology. BMC Biology. 2023;21:71.