We recommend viewing this manuscript in HTML format at https://egouldo.github.io/ManyAnalysts/

Same data, different analysts: variation in effect sizes due to analytical decisions in ecology and evolutionary biology.
Elliot Gould, School of Agriculture, Food and Ecosystem Sciences, University of Melbourne, Australia
Hannah S. Fraser, School of Historical and Philosophical Studies, University of Melbourne, Australia
Timothy H. Parker, Department of Biology, Whitman College, USA. Author for Correspondence: parkerth@whitman.edu
Shinichi Nakagawa, School of Biological, Earth & Environmental Sciences, University of New South Wales, Australia
Simon C. Griffith, School of Natural Sciences, Macquarie University, Australia
Peter A. Vesk, School of Agriculture, Food and Ecosystem Sciences, University of Melbourne, Australia
Fiona Fidler, School of Historical and Philosophical Studies, University of Melbourne, Australia
Daniel G. Hamilton, School of Public Health and Preventive Medicine, Monash University, Australia
Robin N Abbey-Lee, Länsstyrelsen Östergötland, Sweden
Jessica K. Abbott, Biology Department, Lund University, Sweden
Luis A. Aguirre, Department of Biology, University of Massachusetts, USA
Carles Alcaraz, Marine and Continental Waters, IRTA, Spain
Irith Aloni, Department of Life Sciences, Ben Gurion University of the Negev, Israel
Drew Altschul, Department of Psychology, The University of Edinburgh, UK
Kunal Arekar, Centre for Ecological Sciences, Indian Institute of Science, India
Jeff W. Atkins, Southern Research Station, USDA Forest Service, USA
Joe Atkinson, Center for Ecological Dynamics in a Novel Biosphere (ECONOVO), Department of Biology, Aarhus University, Denmark
Christopher M. Baker, School of Mathematics and Statistics, University of Melbourne, Australia
Meghan Barrett, Biology, Indiana University Purdue University Indianapolis, USA
Kristian Bell, School of Life and Environmental Sciences, Deakin University, Australia
Suleiman Kehinde Bello, Department of Arid Land Agriculture, King Abdulaziz University, Kingdom of Saudi Arabia
Iván Beltrán, Department of Biological Sciences, Macquarie University, Australia
Bernd J. Berauer, Department of Plant Ecology, University of Hohenheim, Institute of Landscape and Plant Ecology, Germany
Michael Grant Bertram, Department of Wildlife, Fish, and Environmental Studies, Swedish University of Agricultural Sciences, Sweden
Peter D. Billman, Department of Ecology and Evolutionary Biology, University of Connecticut, USA
Charlie K. Blake, STEM Center, Southern Illinois University Edwardsville, USA
Shannon Blake, University of Guelph, Canada
Louis Bliard, Department of Evolutionary Biology and Environmental Studies, University of Zurich, Switzerland
Andrea Bonisoli-Alquati, Department of Biological Sciences, California State Polytechnic University, Pomona, USA
Timothée Bonnet, Centre d'Études Biologiques de Chizé, UMR 7372 Université de la Rochelle - Centre National de la Recherche Scientifique, France
Camille Nina Marion Bordes, Faculty of Life Sciences, Bar Ilan University, Israel
Aneesh P. H. Bose, Department of Wildlife, Fish, and Environmental Studies, Swedish University of Agricultural Sciences, Sweden
Thomas Botterill-James, School of Natural Sciences, University of Tasmania, Australia
Melissa Anna Boyd, Whitebark Institute, USA
Sarah A. Boyle, Department of Biology, Rhodes College, USA
Tom Bradfer-Lawrence, Centre for Conservation Science, RSPB, UK
Jennifer Bradham, Environmental Studies, Wofford College, USA
Jack A. Brand, Department of Wildlife, Fish and Environmental Studies, Swedish University of Agricultural Sciences, Sweden
Martin I. Brengdahl, IFM Biology, Linköping University, Sweden
Martin Bulla, Faculty of Environmental Sciences, Czech University of Life Sciences Prague, Czech Republic
Luc Bussière, Biological and Environmental Sciences & Gothenburg Global Biodiversity Centre, University of Gothenburg, Sweden
Ettore Camerlenghi, School of Biological Sciences, Monash University, Australia
Sara E. Campbell, Ecology and Evolutionary Biology, University of Tennessee Knoxville, USA
Leonardo L. F. Campos, Departamento de Ecologia e Zoologia, Universidade Federal de Santa Catarina, Brazil
Anthony Caravaggi, School of Biological and Forensic Sciences, University of South Wales, UK
Pedro Cardoso, Centre for Ecology, Evolution and Environmental Changes (cE3c) & CHANGE - Global Change and Sustainability Institute, Faculdade de Ciências, Universidade de Lisboa, Portugal
Charles J.W. Carroll, Forest and Rangeland Stewardship, Colorado State University, USA
Therese A. Catanach, Department of Ornithology, Academy of Natural Sciences of Drexel University, USA
Xuan Chen, Biology, Salisbury University, USA
Heung Ying Janet Chik, Groningen Institute for Evolutionary Life Sciences, University of Groningen, Netherlands
Emily Sarah Choy, Department of Biology, McMaster University, Canada
Alec Philip Christie, Department of Zoology, University of Cambridge, UK
Angela Chuang, Entomology and Nematology, University of Florida, USA
Amanda J. Chunco, Environmental Studies, Elon University, USA
Bethany L. Clark, BirdLife International, UK
Andrea Contina, School of Integrative Biological and Chemical Sciences, The University of Texas Rio Grande Valley, USA
Garth A. Covernton, Department of Ecology and Evolutionary Biology, University of Toronto, Canada
Murray P. Cox, Department of Statistics, University of Auckland, New Zealand
Kimberly A. Cressman, Catbird Stats, LLC, USA
Marco Crotti, School of Biodiversity, One Health & Veterinary Medicine, University of Glasgow, UK
Connor Davidson Crouch, School of Forestry, Northern Arizona University, USA
Pietro B. D'Amelio, Department of Behavioural Neurobiology, Max Planck Institute for Biological Intelligence, Germany
Alexandra Allison de Sousa, School of Sciences: Center for Health and Cognition, Bath Spa University, UK
Timm Fabian Döbert, Department of Biological Sciences, University of Alberta, Canada
Ralph Dobler, Applied Zoology, TU Dresden, Germany
Adam J. Dobson, School of Molecular Biosciences, College of Medical Veterinary & Life Sciences, University of Glasgow, UK
Tim S. Doherty, School of Life and Environmental Sciences, The University of Sydney, Australia
Szymon Marian Drobniak, Institute of Environmental Sciences, Jagiellonian University, Poland
Alexandra Grace Duffy, Biology Department, Brigham Young University, USA
Alison B. Duncan, Institute of Evolutionary Sciences Montpellier, University of Montpellier, CNRS, IRD, France
Robert P. Dunn, Baruch Marine Field Laboratory, University of South Carolina, USA
Jamie Dunning, Department of Life Sciences, Imperial College London, UK
Trishna Dutta, European Forest Institute, Germany
Luke Eberhart-Hertel, Department of Ornithology, Max Planck Institute for Biological Intelligence, Germany
Jared Alan Elmore, Forestry and Environmental Conservation, National Bobwhite and Grassland Initiative, Clemson University, USA
Mahmoud Medhat Elsherif, Department of Psychology and Vision Science, University of Birmingham, Baily Thomas Grant, UK
Holly M. English, School of Biology and Environmental Science, University College Dublin, Ireland
David C. Ensminger, Department of Biological Sciences, San José State University, USA
Ulrich Rainer Ernst, Apicultural State Institute, University of Hohenheim, Germany
Stephen M. Ferguson, Department of Biology, St. Norbert College, USA
Esteban Fernandez-Juricic, Department of Biological Sciences, Purdue University, USA
Thalita Ferreira-Arruda, Biodiversity, Macroecology & Biogeography, Faculty of Forest Sciences and Forest Ecology, University of Göttingen, Germany
John Fieberg, Department of Fisheries, Wildlife, and Conservation Biology, University of Minnesota, USA
Elizabeth A. Finch, CABI, UK
Evan A. Fiorenza, Department of Ecology and Evolutionary Biology, School of Biological Sciences, University of California, Irvine, USA
David N. Fisher, School of Biological Sciences, University of Aberdeen, UK
Amélie Fontaine, Department of Natural Resource Sciences, McGill University, Canada
Wolfgang Forstmeier, Department of Ornithology, Max Planck Institute for Biological Intelligence, Germany
Yoan Fourcade, Institute of Ecology and Environmental Sciences (iEES), Univ. Paris-Est Creteil, France
Graham S. Frank, Department of Forest Ecosystems and Society, Oregon State University, USA
Cathryn A. Freund, Wake Forest University, USA
Eduardo Fuentes-Lillo, Laboratorio de Invasiones Biológicas (LIB), Instituto de Ecología y Biodiversidad, Chile
Sara L. Gandy, Institute for Biodiversity, Animal Health and Comparative Medicine, University of Glasgow, UK
Dustin G. Gannon, Department of Forest Ecosystems and Society, College of Forestry, Oregon State University, USA
Ana I. García-Cervigón, Biodiversity and Conservation Area, Rey Juan Carlos University, Spain
Alexis C. Garretson, Graduate School of Biomedical Sciences, Tufts University, USA
Xuezhen Ge, Department of Integrative Biology, University of Guelph, Canada
William L. Geary, School of Life and Environmental Sciences (Burwood Campus), Deakin University, Australia
Charly Géron, CNRS, University of Rennes, France
Marc Gilles, Department of Behavioural Ecology, Bielefeld University, Germany
Antje Girndt, Fakultät für Biologie, Arbeitsgruppe Evolutionsbiologie, Universität Bielefeld, Germany
Daniel Gliksman, Chair of Meteorology, Institute for Hydrology and Meteorology, Faculty of Environmental Sciences, Technische Universität Dresden, Germany
Harrison B. Goldspiel, Department of Wildlife, Fisheries, and Conservation Biology, University of Maine, USA
Dylan G. E. Gomes, Department of Biological Sciences, Boise State University, USA
Megan Kate Good, School of Agriculture, Food and Ecosystem Sciences, The University of Melbourne, Australia
Sarah C. Goslee, Pastures Systems and Watershed Management Research Unit, USDA Agricultural Research Service, USA
J. Stephen Gosnell, Department of Natural Sciences, Baruch College, City University of New York, USA
Eliza M. Grames, Department of Biological Sciences, Binghamton University, USA
Paolo Gratton, Dipartimento di Biologia, Università di Roma "Tor Vergata", Italy
Nicholas M. Grebe, Department of Anthropology, University of Michigan, USA
Skye M. Greenler, College of Forestry, Oregon State University, USA
Maaike Griffioen, University of Antwerp, Belgium
Daniel M. Griffith, Earth & Environmental Sciences, Wesleyan University, USA
Frances J. Griffith, Yale School of Medicine, Department of Psychiatry, Yale University, USA
Jake J. Grossman, Biology Department and Environmental Studies Department, St. Olaf College, USA
Ali Güncan, Department of Plant Protection, Faculty of Agriculture, Ordu University, Turkey
Stef Haesen, Department of Earth and Environmental Sciences, KU Leuven, Belgium
James G. Hagan, Department of Marine Sciences, University of Gothenburg, Sweden
Heather A. Hager, Department of Biology, Wilfrid Laurier University, Canada
Jonathan Philo Harris, Natural Resource Ecology and Management, Iowa State University, USA
Natasha Dean Harrison, School of Biological Sciences, University of Western Australia, Australia
Sarah Syedia Hasnain, Department of Biological Sciences, Middle East Technical University, Turkey
Justin Chase Havird, Department of Integrative Biology, University of Texas at Austin, USA
Andrew J. Heaton, Grand Bay National Estuarine Research Reserve, USA
María Laura Herrera-Chaustre, Universidad de los Andes, Colombia
Tanner J. Howard
Bin-Yan Hsu, Department of Biology, University of Turku, Finland
Fabiola Iannarilli, Department of Fisheries, Wildlife and Conservation Biology, University of Minnesota, USA
Esperanza C. Iranzo, Instituto de Ciencia Animal, Facultad de Ciencias Veterinarias, Universidad Austral de Chile, Chile
Erik N. K. Iverson, Department of Integrative Biology, The University of Texas at Austin, USA
Saheed Olaide Jimoh, Department of Botany, University of Wyoming, USA
Douglas H. Johnson, Department of Fisheries, Wildlife, and Conservation Biology, University of Minnesota, USA
Martin Johnsson, Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Sweden
Jesse Jorna, Department of Biology, Brigham Young University, USA
Tommaso Jucker, School of Biological Sciences, University of Bristol, UK
Martin Jung, International Institute for Applied Systems Analysis (IIASA), Austria
Ineta Kačergytė, Department of Ecology, Swedish University of Agricultural Sciences, Sweden
Oliver Kaltz, Université de Montpellier, France
Alison Ke, Department of Wildlife, Fish, and Conservation Biology, University of California, Davis, USA
Clint D. Kelly, Département des Sciences biologiques, Université du Québec à Montréal, Canada
Katharine Keogan, Institute of Evolutionary Biology, University of Edinburgh, UK
Friedrich Wolfgang Keppeler, Center for Limnology, University of Wisconsin-Madison, USA
Alexander K. Killion, Center for Biodiversity and Global Change, Yale University, USA
Dongmin Kim, Department of Ecology, Evolution, and Behavior, University of Minnesota, St. Paul, USA
David P. Kochan, Institute of Environment and Department of Biological Sciences, Florida International University, USA
Peter Korsten, Department of Life Sciences, Aberystwyth University, UK
Shan Kothari, Institut de recherche en biologie végétale, Université de Montréal, Canada
Jonas Kuppler, Institute of Evolutionary Ecology and Conservation Genomics, Ulm University, Germany
Jillian M. Kusch, Department of Biology, Memorial University of Newfoundland, Canada
Malgorzata Lagisz, Evolution & Ecology Research Centre and School of Biological, Earth & Environmental Sciences, University of New South Wales, Australia
Kristen Marianne Lalla, Department of Natural Resource Sciences, McGill University, Canada
Daniel J. Larkin, Department of Fisheries, Wildlife and Conservation Biology, University of Minnesota-Twin Cities, USA
Courtney L. Larson, The Nature Conservancy, USA
Katherine S. Lauck, Department of Wildlife, Fish, and Conservation Biology, University of California, Davis, USA
M. Elise Lauterbur, Ecology and Evolutionary Biology, University of Arizona, USA
Alan Law, Biological and Environmental Sciences, University of Stirling, UK
Don-Jean Léandri-Breton, Department of Natural Resource Sciences, McGill University, Canada
Jonas J. Lembrechts, Department of Biology, University of Antwerp, Belgium
Kiara L'Herpiniere, Natural Sciences, Macquarie University, Australia
Eva J. P. Lievens, Aquatic Ecology and Evolution Group, Limnological Institute, University of Konstanz, Germany
Daniela Oliveira de Lima, Campus Cerro Largo, Universidade Federal da Fronteira Sul, Brazil
Shane Lindsay, School of Psychology and Social Work, University of Hull, UK
Martin Luquet, UMR 1224 ECOBIOP, Université de Pau et des Pays de l'Adour, France
Ross MacLeod, School of Biological & Environmental Sciences, Liverpool John Moores University, UK
Kirsty H. Macphie, Institute of Ecology and Evolution, University of Edinburgh, UK
Kit Magellan, Cambodia
Magdalena M. Mair, Statistical Ecotoxicology, Bayreuth Center of Ecology and Environmental Research (BayCEER), University of Bayreuth, Germany
Lisa E. Malm, Ecology and Environmental Science, Umeå University, Sweden
Stefano Mammola, Molecular Ecology Group (MEG), Water Research Institute (IRSA), National Research Council of Italy (CNR), Italy
Caitlin P. Mandeville, Department of Natural History, Norwegian University of Science and Technology, Norway
Michael Manhart, Center for Advanced Biotechnology and Medicine, Rutgers University Robert Wood Johnson Medical School, USA
Laura Milena Manrique-Garzon, Departamento de Ciencias Biológicas, Universidad de los Andes, Colombia
Elina Mäntylä, Department of Biology, University of Turku, Finland
Philippe Marchand, Institut de recherche sur les forêts, Université du Québec en Abitibi-Témiscamingue, Canada
Benjamin Michael Marshall, Biological and Environmental Sciences, University of Stirling, UK
Charles A. Martin, Université du Québec à Trois-Rivières, Canada
Dominic Andreas Martin, Institute of Plant Sciences, University of Bern, Switzerland
Jake Mitchell Martin, Department of Wildlife, Fish, and Environmental Studies, Swedish University of Agricultural Sciences, Sweden
April Robin Martinig, School of Biological, Earth and Environmental Sciences, University of New South Wales, Australia
Erin S. McCallum, Department of Wildlife, Fish and Environmental Studies, Swedish University of Agricultural Sciences, Sweden
Mark McCauley, Whitney Laboratory for Marine Bioscience, University of Florida, USA
Sabrina M. McNew, Ecology and Evolutionary Biology, University of Arizona, USA
Scott J. Meiners, Biological Sciences, Eastern Illinois University, USA
Thomas Merkling, Centre d'Investigations Clinique Plurithématique - Institut Lorrain du Coeur et des Vaisseaux, Université de Lorraine, Inserm 1433 CIC-P CHRU de Nancy, France
Marcus Michelangeli, Department of Wildlife, Fish and Environmental Studies, Swedish University of Agricultural Sciences, Sweden
Maria Moiron, Evolutionary Biology Department, Bielefeld University, Germany
Bruno Moreira, Department of Ecology and Global Change, Centro de Investigaciones sobre Desertificación, Consejo Superior de Investigaciones Científicas (CIDE-CSIC/UV/GV), Spain
Jennifer Mortensen, Department of Biological Sciences, University of Arkansas, USA
Benjamin Mos, School of the Environment, Faculty of Science, The University of Queensland, Australia
Taofeek Olatunbosun Muraina, Department of Animal Health and Production, Oyo State College of Agriculture and Technology, Nigeria
Penelope Wrenn Murphy, Department of Forest & Wildlife Ecology, University of Wisconsin-Madison, USA
Luca Nelli, School of Biodiversity, One Health and Veterinary Medicine, University of Glasgow, UK
Petri Niemelä, Organismal and Evolutionary Biology Research Programme, Faculty of Biological and Environmental Sciences, University of Helsinki, Finland
Josh Nightingale, South Iceland Research Centre, University of Iceland, Iceland
Gustav Nilsonne, Department of Clinical Neuroscience, Karolinska Institutet, Sweden
Sergio Nolazco, School of Biological Sciences, Monash University, Australia
Sabine S. Nooten, Animal Ecology and Tropical Biology, University of Würzburg, Germany
Jessie Lanterman Novotny, Biology, Hiram College, USA
Agnes Birgitta Olin, Department of Aquatic Resources, Swedish University of Agricultural Sciences, Sweden
Chris L. Organ, Department of Earth Sciences, Montana State University, USA
Kate L. Ostevik, Department of Evolution, Ecology, and Organismal Biology, University of California, Riverside, USA
Facundo Xavier Palacio, Sección Ornitología, Universidad Nacional de La Plata, Argentina
Matthieu Paquet, Department of Ecology, Swedish University of Agricultural Sciences, Sweden
Darren James Parker, Bangor University, UK
David J. Pascall, MRC Biostatistics Unit, University of Cambridge, UK
Valerie J. Pasquarella, Harvard Forest, Harvard University, USA
John Harold Paterson, Biological and Environmental Sciences, University of Stirling, Scotland
Ana Payo-Payo, Departamento de Biodiversidad, Ecología y Evolución, Universidad Complutense de Madrid, Spain
Karen Marie Pedersen, Biology Department, Technische Universität Darmstadt, Germany
Grégoire Perez, UMR 1309 ASTRE, CIRAD, France
Kayla I. Perry, Department of Entomology, The Ohio State University, USA
Patrice Pottier, Evolution & Ecology Research Centre, School of Biological, Earth and Environmental Sciences, The University of New South Wales, Australia
Michael J. Proulx, Department of Psychology, University of Bath, UK
Raphaël Proulx, Chaire de recherche en intégrité écologique, Université du Québec à Trois-Rivières, Canada
Jessica L Pruett, Mississippi Based RESTORE Act Center of Excellence, University of Southern Mississippi, USA
Veronarindra Ramananjato, Department of Integrative Biology, University of California, Berkeley, USA
Finaritra Tolotra Randimbiarison, Mention Zoologie et Biodiversité Animale, Université d'Antananarivo, Madagascar
Onja H. Razafindratsima, Department of Integrative Biology, University of California, Berkeley, USA
Diana J. Rennison, Department of Ecology, Behavior and Evolution, University of California, San Diego, USA
Federico Riva, Institute for Environmental Sciences, VU Amsterdam, The Netherlands
Sepand Riyahi, Department of Evolutionary Anthropology, University of Vienna, Austria
Michael James Roast, Konrad Lorenz Institute for Ethology, University of Veterinary Medicine, Austria
Felipe Pereira Rocha, School of Biological Sciences, The University of Hong Kong, China
Dominique G. Roche, Institut de biologie, Université de Neuchâtel, Switzerland
Cristian Román-Palacios, School of Information, University of Arizona, USA
Michael S. Rosenberg, Center for Biological Data Science, Virginia Commonwealth University, USA
Jessica Ross, University of Wisconsin, USA
Freya E. Rowland, School of the Environment, Yale University, USA
Deusdedith Rugemalila, Institute of the Environment, Florida International University, USA
Avery L. Russell, Department of Biology, Missouri State University, USA
Suvi Ruuskanen, Department of Biological and Environmental Science, University of Jyväskylä, Finland
Patrick Saccone, Institute for Interdisciplinary Mountain Research, OeAW (Austrian Academy of Sciences), Austria
Asaf Sadeh, Department of Natural Resources, Newe Ya'ar Research Center, Agricultural Research Organization (Volcani Institute), Israel
Stephen M. Salazar, Department of Animal Behaviour, Bielefeld University, Germany
Kris Sales, Office for National Statistics, UK
Pablo Salmón, Institute of Avian Research "Vogelwarte Helgoland", Germany
Alfredo Sánchez-Tójar, Department of Evolutionary Biology, Bielefeld University, Germany
Leticia Pereira Santos, Ecology Department, Universidade Federal de Goiás, Brazil
Francesca Santostefano, University of Exeter, UK
Hayden T. Schilling, New South Wales Department of Primary Industries Fisheries, Australia
Marcus Schmidt, Research Data Management, Leibniz Centre for Agricultural Landscape Research (ZALF), Germany
Tim Schmoll, Evolutionary Biology, Bielefeld University, Germany
Adam C. Schneider, Biology Department, University of Wisconsin-La Crosse, USA
Allie E. Schrock, Department of Evolutionary Anthropology, Duke University, USA
Julia Schroeder, Department of Life Sciences, Imperial College London, UK
Nicolas Schtickzelle, Earth and Life Institute, Ecology and Biodiversity, UCLouvain, Belgium
Nick L. Schultz, Future Regions Research Centre, Federation University Australia, Australia
Drew A. Scott, United States Department of Agriculture - Agricultural Research Service, USA
Michael Peter Scroggie, Arthur Rylah Institute for Environmental Research, Australia
Julie Teresa Shapiro, Epidemiology and Surveillance Support Unit, University of Lyon - French Agency for Food, Environmental and Occupational Health and Safety (ANSES), France
Nitika Sharma, UCLA Anderson Center for Impact, University of California, Los Angeles, USA
Caroline L. Shearer, Department of Evolutionary Anthropology, Duke University, USA
Diego Simón, Facultad de Ciencias, Universidad de la República, Uruguay
Michael I. Sitvarin, Independent researcher, USA
Fabrício Luiz Skupien, Programa de Pós-Graduação em Ecologia, Instituto de Biologia, Centro de Ciências da Saúde, Universidade Federal do Rio de Janeiro, Brazil
Heather Lea Slinn, Vive Crop Protection, Canada
Grania Polly Smith, University of Cambridge, UK
Jeremy A. Smith, British Trust for Ornithology, UK
Rahel Sollmann, Department of Wildlife, Fish, and Conservation Biology, University of California, Davis, USA
Kaitlin Stack Whitney, Science, Technology & Society Department, Rochester Institute of Technology, USA
Shannon Michael Still, Nomad Ecology, USA
Erica F. Stuber, Wildland Resources Department, Utah State University, USA
Guy F. Sutton, Center for Biological Control, Department of Zoology and Entomology, Rhodes University, South Africa
Ben Swallow, School of Mathematics and Statistics and Centre for Research in Ecological and Environmental Modelling, University of St Andrews, UK
Conor Claverie Taff, Department of Ecology and Evolutionary Biology, Cornell University, USA
Elina Takola, Department of Computational Landscape Ecology, Helmholtz Centre for Environmental Research – UFZ, Germany
Andrew J. Tanentzap, Ecosystems and Global Change Group, School of the Environment, Trent University, Canada
Rocío Tarjuelo, Instituto Universitario de Investigación en Gestión Forestal Sostenible (iuFOR), Universidad de Valladolid, Spain
Richard J. Telford, Department of Biological Sciences, University of Bergen, Norway
Christopher J. Thawley, Department of Biological Science, University of Rhode Island, USA
Hugo Thierry, Department of Geography, McGill University, Canada
Jacqueline Thomson, Integrative Biology, University of Guelph, Canada
Svenja Tidau, School of Biological and Marine Sciences, University of Plymouth, UK
Emily M. Tompkins, Biology Department, Wake Forest University, USA
Claire Marie Tortorelli, Plant Sciences, University of California, Davis, USA
Andrew Trlica, College of Natural Resources, North Carolina State University, USA
Biz R. Turnell, Institute of Zoology, Technische Universität Dresden, Germany
Lara Urban, Helmholtz AI, Helmholtz Zentrum Muenchen, Germany
Stijn Van de Vondel, Department of Biology, University of Antwerp, Belgium
Jessica Eva Megan van der Wal, FitzPatrick Institute of African Ornithology, University of Cape Town, South Africa
Jens Van Eeckhoven, Department of Cell & Developmental Biology, Division of Biosciences, University College London, UK
Francis van Oordt, Natural Resource Sciences, McGill University, Canada
K. Michelle Vanderwel, Biology, University of Saskatchewan, Canada
Mark C. Vanderwel, Department of Biology, University of Regina, Canada
Karen J. Vanderwolf, Biology, University of Waterloo, Canada
Juliana Vélez, Department of Fisheries, Wildlife and Conservation Biology, University of Minnesota, USA
Diana Carolina Vergara-Florez, Department of Ecology & Evolutionary Biology, University of Michigan, USA
Brian C. Verrelli, Center for Biological Data Science, Virginia Commonwealth University, USA
Marcus Vinícius Vieira, Dept. Ecologia, Instituto de Biologia, Universidade Federal do Rio de Janeiro, Brazil
Nora Villamil, Lothian Analytical Services, Public Health Scotland, UK
Valerio Vitali, Institute for Evolution and Biodiversity, University of Muenster, Germany
Julien Vollering, Department of Environmental Sciences, Western Norway University of Applied Sciences, Norway
Jeffrey Walker, Department of Biological Sciences, University of Southern Maine, USA
Xanthe J. Walker, Center for Ecosystem Science and Society, Northern Arizona University, USA
Jonathan A. Walter, Center for Watershed Sciences, University of California, Davis, USA
Pawel Waryszak, School of Agriculture and Environmental Science, University of Southern Queensland, Australia
Ryan J. Weaver, Department of Ecology, Evolution, and Organismal Biology, Iowa State University, USA
Ronja E. M. Wedegärtner, Fram Project AS, Norway
Daniel L. Weller, Department of Food Science & Technology, Virginia Polytechnic Institute and State University, USA
Shannon Whelan, Department of Natural Resource Sciences, McGill University, Canada
Rachel Louise White, School of Applied Sciences, University of Brighton, UK
David William Wolfson, Department of Fisheries, Wildlife and Conservation Biology, University of Minnesota, USA
Andrew Wood, Department of Biology, University of Oxford, UK
Scott W. Yanco, Department of Integrative Biology, University of Colorado, Denver, USA
Jian D. L. Yen, Arthur Rylah Institute for Environmental Research, Australia
Casey Youngflesh, Ecology, Evolution, and Behavior Program, Michigan State University, USA
Giacomo Zilio, ISEM, University of Montpellier, CNRS, France
Cédric Zimmer, Laboratoire d'Ethologie Expérimentale et Comparée, LEEC, UR4443, Université Sorbonne Paris Nord, France
Gregory Mark Zimmerman, Department of Science and Environment, Lake Superior State University, USA
Rachel A. Zitomer, Department of Forest Ecosystems and Society, Oregon State University, USA
Abstract

Although variation in effect sizes and predicted values among studies of similar phenomena is inevitable, such variation far exceeds what might be produced by sampling error alone. One possible explanation for variation among results is differences among researchers in the decisions they make regarding statistical analyses. A growing array of studies has explored this analytical variability in different (mostly social science) fields, and has found substantial variability among results, despite analysts having the same data and research question. We implemented an analogous study in ecology and evolutionary biology, fields in which there has been no empirical exploration of the variation in effect sizes or model predictions generated by the analytical decisions of different researchers. We used two unpublished datasets, one from evolutionary ecology (blue tit, Cyanistes caeruleus, to compare sibling number and nestling growth) and one from conservation ecology (Eucalyptus, to compare grass cover and tree seedling recruitment), and the project leaders recruited 174 analyst teams, comprising 246 analysts, to investigate the answers to prespecified research questions. Analyses conducted by these teams yielded 141 usable effects for the blue tit dataset, and 85 usable effects for the Eucalyptus dataset. We found substantial heterogeneity among results for both datasets, although the patterns of variation differed between them. For the blue tit analyses, the average effect was convincingly negative, with less growth for nestlings living with more siblings, but there was near continuous variation in effect size from large negative effects to effects near zero, and even effects crossing the traditional threshold of statistical significance in the opposite direction. In contrast, the average relationship between grass cover and Eucalyptus seedling number was only slightly negative and not convincingly different from zero, and most effects ranged from weakly negative to weakly positive, with about a third of effects crossing the traditional threshold of significance in one direction or the other. However, there were also several striking outliers in the Eucalyptus dataset, with effects far from zero. For both datasets, we found substantial variation in the variable selection and random effects structures among analyses, as well as in the ratings of the analytical methods by peer reviewers,
but we found no strong relationship between any of these and deviation from the meta-analytic mean. In other words, analyses with results that were far from the mean were no more or less likely to have dissimilar variable sets, use random effects in their models, or receive poor peer reviews than those analyses that found results that were close to the mean. The existence of substantial variability among analysis outcomes raises important questions about how ecologists and evolutionary biologists should interpret published results, and how they should conduct analyses in the future.

Key Words

credibility revolution, heterogeneity, meta-analysis, metascience, replicability, reproducibility
Introduction

One value of science derives from its production of replicable, and thus reliable, results. When we repeat a study using the original methods we should be able to expect a similar result. However, perfect replicability is not a reasonable goal. Effect sizes will vary, and even reverse in sign, by chance alone [1]. Observed patterns can differ for other reasons as well. It could be that we do not sufficiently understand the conditions that led to the original result, so when we seek to replicate it, the conditions differ due to some ‘hidden moderator’. This hidden moderator hypothesis is described by meta-analysts in ecology and evolutionary biology as ‘true biological heterogeneity’ [2]. This idea of true heterogeneity is popular in ecology and evolutionary biology, and there are good reasons to expect it in the complex systems in which we work [3]. However, despite similar expectations in psychology, recent evidence in that discipline contradicts the hypothesis that moderators are common obstacles to replicability, as variability in results in a large ‘many labs’ collaboration was mostly unrelated to commonly hypothesized moderators such as the conditions under which the studies were administered [4]. Another possible explanation for variation in effect sizes is that researchers often present biased samples of results, thus reducing the likelihood that later studies will produce similar effect sizes [5–9].
It also may be that although researchers did successfully replicate the conditions, the experiment, and measured variables, analytical decisions differed sufficiently among studies to create divergent results [10, 11].

Analytical decisions vary among studies because researchers have many options. Researchers need to decide how to exclude possibly anomalous or unreliable data, how to construct variables, which variables to include in their models, and which statistical methods to use. Depending on the dataset, this short list of choices could encompass thousands or millions of possible alternative specifications [10]. However, researchers making these decisions presumably do so with the goal of doing the best possible analysis, or at least the best analysis within their current skill set. Thus it seems likely that some specification options are more probable than others, possibly because they have previously been shown (or claimed) to be better, or because they are more well known. Of course, some of these different analyses (maybe many of them) may be equally valid alternatives. Regardless, on probably any topic in ecology and evolutionary biology, we can encounter differences in choices of data analysis. The extent of these differences in analyses and the degree to which these differences influence the outcomes of analyses and therefore studies’ conclusions are important empirical questions. These questions are especially important given that many papers draw conclusions after applying a single method, or even a single statistical model, to analyze a dataset.

The possibility that different analytical choices could lead to different outcomes has long been recognized [12], and various efforts to address this possibility have been pursued in the literature. For instance, one common method in ecology and evolutionary biology involves creating a set of candidate models, each consisting of a different (though often similar) set of predictor variables, and then, for the predictor variable of interest, averaging the slope across all models (i.e. model averaging) [13, 14]. This method reduces the chance that a conclusion is contingent upon a single model specification, though use and interpretation of this method is not without challenges [14]. Further, the models compared to each other typically differ only in the inclusion or exclusion of certain predictor variables and not in other important ways, such as methods of parameter estimation.
More explicit examination of outcomes of differences in model structure, model type, data exclusion, or other analytical choices can be implemented through sensitivity analyses [e.g., 15]. Sensitivity analyses, however, are typically rather narrow in scope, and are designed to assess the sensitivity of analytical outcomes to a particular analytical choice rather than to a large universe of choices. Recently, however, analysts in the social sciences have proposed extremely thorough sensitivity analysis, including ‘multiverse analysis’ [16] and the ‘specification curve’ [10], as a means of increasing the reliability of results. With these methods, researchers identify relevant decision points encountered during analysis and conduct the analysis many times to incorporate many plausible decisions made at each of these points. The study’s conclusions are then based on a broad set of the possible analyses and so allow the analyst to distinguish between robust conclusions and those that are highly contingent on particular model specifications. These are useful outcomes, but specifying a universe of possible modelling decisions is not a trivial undertaking. Further, the analyst’s knowledge and biases will influence decisions about the boundaries of that universe, and so there will always be room for disagreement among analysts about what to include. Including more specifications is not necessarily better. Some analytical decisions are better justified than others, and including biologically implausible specifications may undermine this process. Regardless, these powerful methods have yet to be adopted, and even more limited forms of sensitivity analyses are not particularly widespread. Most studies publish a small set of analyses and so the existing literature does not provide much insight into the degree to which published results are contingent on analytical decisions.
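To make the idea of a specification universe concrete, here is a minimal Python sketch (ours, not taken from the study) that enumerates combinations of analytical choices; the decision points and labels are invented for illustration, and the commented-out fit_model call is a hypothetical placeholder for fitting each specification to the data.

```python
from itertools import product

# Hypothetical decision points an analyst might face (illustrative only)
outlier_rules  = ["keep all", "drop > 3 SD", "drop flagged measurements"]
predictor_sets = [("brood size",),
                  ("brood size", "hatch date"),
                  ("brood size", "hatch date", "parent tarsus")]
model_families = ["gaussian LMM", "robust regression"]

specifications = list(product(outlier_rules, predictor_sets, model_families))
print(f"{len(specifications)} specifications in this tiny universe")

# In a full multiverse or specification-curve analysis, each specification
# would be fitted and the focal effect size stored, e.g.:
# results = [fit_model(data, rule, predictors, family)   # fit_model is hypothetical
#            for rule, predictors, family in specifications]
```

Even this toy grid of three decision points yields 18 specifications; adding a handful more choices multiplies the universe into the thousands, which is why the boundaries of the universe matter so much.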
Despite the potential major impacts of analytical decisions on variance in results, the outcomes of different individuals’ data analysis choices have received limited empirical attention. The only formal explorations of this that we were aware of when we submitted our Stage 1 manuscript were (1) an analysis in social science that asked whether male professional football (soccer) players with darker skin tone were more likely to be issued red cards (ejection from the game for rule violation) than players with lighter skin tone [11]
and (2) an analysis in neuroimaging which evaluated nine separate hypotheses involving the neurological responses detected with fMRI in 108 participants divided between two treatments in a decision-making task [17]. Several others have been published since [e.g., 18, 19–21]. In the red card study, twenty-nine teams designed and implemented analyses of a dataset provided by the study coordinators [11]. Analyses were peer reviewed (results blind) by at least two other participating analysts; a level of scrutiny consistent with standard pre-publication peer review. Among the final 29 analyses, odds ratios varied from 0.89 to 2.93, meaning point estimates varied from having players with lighter skin tones receive more red cards (odds ratio < 1) to a strong effect of players with darker skin tones receiving more red cards (odds ratio > 1). Twenty of the 29 teams found a statistically significant effect in the predicted direction of players with darker skin tones being issued more red cards. This degree of variation in peer-reviewed analyses from identical data is striking, but the generality of this finding has only just begun to be formally investigated.

In the neuroimaging study, 70 teams evaluated each of the nine different hypotheses with the available fMRI data [17]. These 70 teams followed a divergent set of workflows that produced a wide range of results. The rate of reporting of statistically significant support for the nine hypotheses ranged from 21% to 84%, and for each hypothesis on average, 20% of research teams observed effects that differed substantially from the majority of other teams. Some of the variability in results among studies could be explained by analytical decisions such as choice of software package, smoothing function, and parametric versus non-parametric corrections for multiple comparisons. However, substantial variability among analyses remained unexplained, and presumably emerged from the many different decisions each analyst made in their long workflows. Such variability in results among analyses from this dataset and from the very different red-card dataset suggests that sensitivity of analytical outcome to analytical choices may characterize many distinct fields, as several more recent many-analyst studies also suggest [18–20].
To further develop the empirical understanding of the effects of analytical decisions on study outcomes, we chose to estimate the extent to which researchers’ data analysis choices drive differences in effect sizes, model predictions, and qualitative conclusions in ecology and evolutionary biology. This is an important extension of the meta-research agenda of evaluating factors influencing replicability in ecology, evolutionary biology, and beyond [22]. To examine the effects of analytical decisions, we used two different datasets and recruited researchers to analyze one or the other of these datasets to answer a question we defined. The first question was “To what extent is the growth of nestling blue tits (Cyanistes caeruleus) influenced by competition with siblings?” To answer this question, we provided a dataset that includes brood size manipulations from 332 broods conducted over three years at Wytham Wood, UK. The second question was “How does grass cover influence Eucalyptus spp. seedling recruitment?” For this question, analysts used a dataset that includes, among other variables, number of seedlings in different size classes, percentage cover of different life forms, tree canopy cover, and distance from canopy edge from 351 quadrats spread among 18 sites in Victoria, Australia.
We explored the impacts of data analysts’ choices with descriptive statistics and with a series of tests to attempt to explain the variation among effect sizes and predicted values of the dependent variable produced by the different analysis teams for both datasets separately. To describe the variability, we present forest plots of the standardized effect sizes and predicted values produced by each of the analysis teams, estimate heterogeneity (both absolute, τ², and proportional, I²) in effect size and predicted values among the results produced by these different teams, and calculate a similarity index that quantifies variability among the predictor variables selected for the different statistical models constructed by the different analysis teams. These descriptive statistics provide the first estimates of the extent to which explanatory statistical models and their outcomes in ecology and evolutionary biology vary based on the decisions of different data analysts.
We then quantified the degree to which the variability in effect size and predicted values could be explained by (1) variation in the quality of analyses as rated by peer reviewers and (2) the similarity of the choices of predictor variables between individual analyses.
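For readers unfamiliar with these heterogeneity statistics, the following minimal Python sketch shows one conventional way to compute absolute (τ²) and proportional (I²) heterogeneity from a set of effect sizes and sampling variances, using the DerSimonian-Laird moment estimator and Cochran's Q. This is an illustrative calculation under standard random-effects assumptions, not necessarily the estimator used in this study, and the example effect sizes are made up.

```python
import numpy as np

def dersimonian_laird(effects, variances):
    """DerSimonian-Laird tau^2 and Higgins' I^2 for a set of effect sizes.

    effects   : standardized effect sizes, one per analysis team
    variances : corresponding sampling variances
    Returns (tau2, I2), i.e., absolute and proportional heterogeneity.
    """
    y = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                                   # inverse-variance weights
    mu = np.sum(w * y) / np.sum(w)                # fixed-effect mean
    Q = np.sum(w * (y - mu) ** 2)                 # Cochran's Q
    df = len(y) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - df) / c)                 # absolute heterogeneity
    i2 = max(0.0, (Q - df) / Q) if Q > 0 else 0.0 # proportional heterogeneity
    return tau2, i2

# Made-up effect sizes and variances from five hypothetical analysis teams
tau2, i2 = dersimonian_laird([-0.4, -0.1, 0.05, -0.6, -0.2],
                             [0.02, 0.03, 0.02, 0.05, 0.03])
print(round(tau2, 3), round(i2, 2))
```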
Methods

This project involved a series of steps (1-6) that began with identifying datasets for analyses and continued through recruiting independent groups of scientists to analyze the data, allowing the scientists to analyze the data as they saw fit, generating peer review ratings of the analyses (based on methods, not results), evaluating the variation in effects among the different analyses, and producing the final manuscript.
Step 1: Select Datasets

We used two previously unpublished datasets, one from evolutionary ecology and the other from ecology and conservation.

Evolutionary Ecology

Our evolutionary ecology dataset is relevant to a sub-discipline of life-history research which focuses on identifying costs and trade-offs associated with different phenotypic conditions.

These data were derived from a brood-size manipulation experiment imposed on wild birds nesting in boxes provided by researchers in an intensively studied population.

Understanding how the growth of nestlings is influenced by the numbers of siblings in the nest can give researchers insights into factors such as the evolution of clutch size, determination of provisioning rates by parents, and optimal levels of sibling competition (Vander Werf 1992; DeKogel 1997; Royle et al. 1999; Verhulst, Holveck, and Riebel 2006; Nicolaus et al. 2009). Data analysts were provided this dataset and instructed to answer the following question: “To what extent is the growth of nestling blue tits (Cyanistes caeruleus) influenced by competition with siblings?”

Researchers conducted brood size manipulations and population monitoring of blue tits at Wytham Wood, a 380 ha woodland in Oxfordshire, UK (1°20'W, 51°47'N).
Researchers regularly checked approximately 1100 artificial nest boxes at the site and monitored the 330 to 450 blue tit pairs occupying those boxes in 2001-2003 during the experiment. Nearly all birds made only one breeding attempt during the April to June study period in a given year. At each blue tit nest, researchers recorded the date the first egg appeared, clutch size, and hatching date. For all chicks alive at age 14 days, researchers measured mass and tarsus length and fitted a uniquely numbered, British Trust for Ornithology (BTO) aluminium leg ring. Researchers attempted to capture all adults at their nests between day 6 and day 14 of the chick-rearing period. For these captured adults, researchers measured mass, tarsus length, and wing length and fitted a uniquely numbered BTO leg ring. During the 2001-2003 breeding seasons, researchers manipulated brood sizes using cross fostering. They matched broods for hatching date and brood size and moved chicks between these paired nests one or two days after hatching. They sought to either enlarge or reduce all manipulated broods by approximately one fourth. To control for effects of being moved, each reduced brood had a portion of its brood replaced by chicks from the paired increased brood, and vice versa. Net manipulations varied from plus or minus four chicks in broods of 12 to 16 to plus or minus one chick in broods of 4 or 5. Researchers left approximately one third of all broods unmanipulated. These unmanipulated broods were not selected systematically to match manipulated broods in clutch size or laying date. We have mass and tarsus length data from 3720 individual chicks divided among 167 experimentally enlarged broods, 165 experimentally reduced broods, and 120 unmanipulated broods. The full list of variables included in the dataset is publicly available (https://osf.io/hdv8m), along with the data (https://osf.io/qjzby).

Additional explanation:
Shortly after beginning to recruit analysts, several analysts noted a small set of related errors in the blue tit dataset. We corrected the errors, replaced the dataset on our OSF site, and emailed the analysts on 19 April 2020 to instruct them to use the revised data. The email to analysts is available here (https://osf.io/4h53z). The errors are explained in that email.
Ecology and Conservation

Our ecology and conservation dataset is relevant to a sub-discipline of conservation research which focuses on investigating how best to revegetate private land in agricultural landscapes. These data were collected on private land under the Bush Returns program, an incentive system where participants entered into a contract with the Goulburn Broken Catchment Management Authority and received annual payments if they executed predetermined restoration activities. This particular dataset is based on a passive regeneration initiative, where livestock grazing was removed from the property in the hopes that the Eucalyptus spp. overstorey would regenerate without active (and expensive) planting. Analyses of some related data have been published (Miles 2008; Vesk et al. 2016) but those analyses do not address the question analysts answered in our study. Data analysts were provided this dataset and instructed to answer the following question: “How does grass cover influence Eucalyptus spp. seedling recruitment?”

Researchers conducted three rounds of surveys at 18 sites across the Goulburn Broken catchment in northern Victoria, Australia in winter and spring 2006 and autumn 2007. In each survey period, a different set of 15 x 15 m quadrats was randomly allocated across each site within 60 m of existing tree canopies. The number of quadrats at each site depended on the size of the site, ranging from four at smaller sites to 11 at larger sites. The total number of quadrats surveyed across all sites and seasons was 351. The number of Eucalyptus spp. seedlings was recorded in each quadrat along with information on the GPS location, aspect, tree canopy cover, distance to tree canopy, and position in the landscape. Ground layer plant species composition was recorded in three 0.5 x 0.5 m sub-quadrats within each quadrat. Subjective cover estimates of each species as well as bare ground, litter, rock and moss/lichen/soil crusts were recorded. Subsequently, this was augmented with information about the precipitation and solar radiation at each GPS location. The full list of variables included in the dataset is publicly available (https://osf.io/r5gbn), along with the data (https://osf.io/qz5cu).
Step 2: Recruitment and initial survey of analysts

The lead team (TP, HF, SN, EG, SG, PV, DH, FF) created a publicly available document providing a general description of the project (https://osf.io/mn5aj/). The project was advertised at conferences, via Twitter, using mailing lists for ecological societies (including Ecolog, Evoldir, and lists for the Environmental Decisions Group, and Transparency in Ecology and Evolution), and via word of mouth. The target population was active ecology, conservation, or evolutionary biology researchers with a graduate degree (or currently studying for a graduate degree) in a relevant discipline. Researchers could choose to work independently or in a small team. For the sake of simplicity, we refer to these as ‘analysis teams’ though some comprised one individual. We aimed for a minimum of 12 analysis teams independently evaluating each dataset (see sample size justification below). We simultaneously recruited volunteers to peer review the analyses conducted by the other volunteers through the same channels. Our goal was to recruit a similar number of peer reviewers and analysts, and to ask each peer reviewer to review a minimum of four analyses. If we were unable to recruit at least half the number of reviewers as analysis teams, we planned to ask analysts to serve also as reviewers (after they had completed their analyses), but this was unnecessary. All analysts and reviewers were offered the opportunity to share co-authorship on this manuscript and we planned to invite them to participate in the collaborative process of producing the final manuscript. All analysts signed [digitally] a consent (ethics) document (https://osf.io/xyp68/) approved by the Whitman College Institutional Review Board prior to being allowed to participate.
Preregistration Deviation:
Due to the large number of recruited analysts and reviewers and the anticipated challenges of receiving and integrating feedback from so many authors, we limited analyst and reviewer participation in the production of the final manuscript to an invitation to call attention to serious problems with the manuscript draft.
We identified our minimum number of analysts per dataset by considering the number of effects needed in a meta-analysis to generate an estimate of heterogeneity (τ²) with a 95% confidence interval that does not encompass zero. This minimum sample size is invariant regardless of τ². This is because the same t-statistic value will be obtained by the same sample size regardless of the variance (τ²). We see this by first examining the formula for the standard error of the variance, SE(τ²), assuming normality in an underlying distribution of effect sizes [30]:

$$\mathrm{SE}(\tau^2) = \tau^2 \sqrt{\frac{2}{n-1}}$$

and then rearranging the above formula to show how the t-statistic is independent of τ², as seen below:

$$t = \frac{\tau^2}{\mathrm{SE}(\tau^2)} = \sqrt{\frac{n-1}{2}}$$

We then find a minimum n = 12 according to this formula.
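As a check on this reasoning, the short Python sketch below (ours, assuming the normal-theory standard error given above and a two-sided 95% t-based interval) confirms that n = 12 effects satisfy the criterion that the confidence interval for τ² excludes zero; the function name is an illustrative choice.

```python
import math
from scipy import stats

def tau2_ci_excludes_zero(n: int, alpha: float = 0.05) -> bool:
    """Check whether a normal-theory 95% CI for tau^2 based on n effects
    would exclude zero, using SE(tau^2) = tau^2 * sqrt(2 / (n - 1)).

    The t-statistic tau^2 / SE(tau^2) = sqrt((n - 1) / 2) does not depend
    on the value of tau^2 itself.
    """
    t_stat = math.sqrt((n - 1) / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return t_stat > t_crit

print(tau2_ci_excludes_zero(12))  # True: 12 effects satisfy the criterion
```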
Step 3: Primary Data Analysis

Analysis teams registered and answered a demographic and expertise survey (https://osf.io/seqzy/). We then provided them with the dataset of their choice and requested that they answer a specific research question. For the evolutionary ecology dataset that question was “To what extent is the growth of nestling blue tits (Cyanistes caeruleus) influenced by competition with siblings?” and for the conservation ecology dataset it was “How does grass cover influence Eucalyptus spp. seedling recruitment?” Once their analysis was complete, they answered a structured survey (https://osf.io/neyc7/), providing analysis technique, explanations of their analytical choices, quantitative results, and a statement describing their conclusions. They also were asked to upload their analysis files (including the dataset as they formatted it for analysis and their analysis code [if applicable]) and a detailed journal-ready statistical methods section.

Preregistration Deviation:
We originally planned to have analysts complete a single survey (https://osf.io/neyc7/), but after we evaluated the results of that survey, we realized we would need a second survey (https://osf.io/8w3v5/) to adequately collect the information we needed to evaluate heterogeneity of results (step 5). We provided a set of detailed instructions with the follow-up survey, and these instructions are publicly available and can be found within the following files (blue tit: https://osf.io/kr2g9, Eucalyptus: https://osf.io/dfvym).
Step 4: Peer Review of Analysis
At minimum, each analysis was evaluated by four different reviewers, and each volunteer peer reviewer was randomly assigned methods sections from at least four analyst teams (the exact number varied). Each peer reviewer registered and answered a demographic and expertise survey identical to that asked of the analysts, except we did not ask about 'team name' since reviewers did not work in teams. Reviewers evaluated the methods of each of their assigned analyses one at a time in a sequence determined by the project leaders. We systematically assigned the sequence so that, if possible, each analysis was allocated to each position in the sequence for at least one reviewer. For instance, if each reviewer were assigned four analyses to review, then each analysis would be the first analysis assigned to at least one reviewer, the second analysis assigned to another reviewer, the third analysis assigned to yet another reviewer, and the fourth analysis assigned to a fourth reviewer. Balancing the order in which reviewers saw the analyses controls for order effects, e.g. a reviewer might be less critical of the first methods section they read than the last.
The process for a single reviewer was as follows. First, the reviewer received a description of the methods of a single analysis. This included the narrative methods section, the analysis team's answers to our survey questions regarding their methods, including analysis code, and the dataset. The reviewer was then asked, in an online survey (https://osf.io/4t36u/), to rate that analysis on a scale of 0-100 based on this prompt: "Rate the overall appropriateness of this analysis to answer the research question (one of the two research questions inserted here) with the available data. To help you calibrate your rating, please consider the following guidelines:

100: A perfect analysis with no conceivable improvements from the reviewer
75: An imperfect analysis but the needed changes are unlikely to dramatically alter outcomes
50: A flawed analysis likely to produce either an unreliable estimate of the relationship or an over-precise estimate of uncertainty
25: A flawed analysis likely to produce an unreliable estimate of the relationship and an over-precise estimate of uncertainty
0: A dangerously misleading analysis, certain to produce both an estimate that is wrong and a substantially over-precise estimate of uncertainty that places undue confidence in the incorrect estimate.

*Please note that these values are meant to calibrate your ratings. We welcome ratings of any number between 0 and 100."
After providing this rating, the reviewer was presented with this prompt, in multiple-choice format: "Would the analytical methods presented produce an analysis that is (a) publishable as is, (b) publishable with minor revision, (c) publishable with major revision, (d) deeply flawed and unpublishable?" The reviewer was then provided with a series of text boxes and the following prompts: "Please explain your ratings of this analysis. Please evaluate the choice of statistical analysis type. Please evaluate the process of choosing variables for and structuring the statistical model. Please evaluate the suitability of the variables included in (or excluded from) the statistical model. Please evaluate the suitability of the structure of the statistical model. Please evaluate choices to exclude or not exclude subsets of the data. Please evaluate any choices to transform data (or, if there were no transformations, but you think there should have been, please discuss that choice)." After submitting this review, a methods section from a second analysis was made available to the reviewer. This same sequence was followed until all analyses allocated to a given reviewer were provided and reviewed. After providing the final review, the reviewer was simultaneously provided with all four (or more) methods sections the reviewer had just completed reviewing, the option to revise their original ratings, and a text box to provide an explanation. The invitation to revise the original ratings was as follows: "If, now that you have seen all the analyses you are reviewing, you wish to revise your ratings of any of these analyses, you may do so now." The text box was prefaced with this prompt: "Please explain your choice to revise (or not to revise) your ratings."

Additional Explanation:
To determine how consistent peer reviewers were in their ratings, we assessed inter-rater reliability among reviewers for both the categorical and quantitative ratings, combining blue tit and Eucalyptus data, using Krippendorff's alpha for ordinal and continuous data respectively. This provides a value between -1 (total disagreement between reviewers) and 1 (total agreement between reviewers).
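As a minimal illustration (not the authors' code), an inter-rater reliability estimate of this kind can be obtained in R with the irr package; the ratings matrix below is entirely hypothetical.

    library(irr)

    # rows = reviewers, columns = analyses; NA where a reviewer did not rate an analysis
    ratings_0_100 <- rbind(c(80, 55, NA, 70),
                           c(75, 60, 40, NA),
                           c(NA, 50, 45, 65))

    kripp.alpha(ratings_0_100, method = "interval")   # continuous (0-100) ratings
    # for the four publishability categories coded 1-4, method = "ordinal" would be used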
Step 5: Evaluate Variation
The lead team conducted the analyses outlined in this section. We described the variation in model specification in several ways. We calculated summary statistics describing variation among analyses, including the mean, SD, and range of the number of variables per model included as fixed effects, the number of interaction terms, the number of random effects, and the mean, SD, and range of sample sizes. We also present the number of analyses in which each variable was included. We summarized the variability in standardized effect sizes and predicted values of dependent variables among the individual analyses using standard random-effects meta-analytic techniques. First, we derived standardized effect sizes from each individual analysis. We did this for all linear models or generalized linear models by converting the t value and the degrees of freedom (df) associated with regression coefficients (e.g. the effect of the number of siblings [predictor] on growth [response] or the effect of grass cover [predictor] on seedling recruitment [response]) to the correlation coefficient (r), using the following:

r = t / √(t² + df)
This formula can only be applied if t and df values originate from linear or generalized linear models [GLMs; 31]. If, instead, linear mixed-effects models (LMMs) or generalized linear mixed-effects models (GLMMs) were used by a given analysis, the exact df cannot be estimated. However, adjusted df can be estimated, for example, using the Satterthwaite approximation of df, dfS [note that SAS uses this approximation to obtain df for LMMs and GLMMs; 32]. For analyses using either LMMs or GLMMs that did not produce dfS, we planned to obtain dfS by rerunning the same (G)LMMs using the lmer() or glmer() function in the lmerTest package in R [33, 34].
Preregistration Deviation:
Rather than re-run these analyses ourselves, we sent a follow-up survey (referenced above under "Primary data analyses") to analysts and asked them to follow our instructions for producing this information. The instructions are publicly available within the following files (blue tit: https://osf.io/kr2g9, Eucalyptus: https://osf.io/dfvym).

We then used the t values and dfS from the models to obtain r as per the formula above. All r and accompanying df (or dfS) were converted to Zr and its sampling variance, 1/(n − 3), where n = df + 1. Any analyses from which we could not derive a signed Zr, for instance one with a quadratic function in which the slope changed sign, were excluded from the analyses of Fisher's Zr. We expected such analyses would be rare. In fact, most submitted analyses excluded from our meta-analysis of Zr were excluded because the analyst team provided insufficient information rather than because they used effects that could not be converted to Zr. Regardless, as we describe below, we generated a second set of standardized effects (predicted values) that could (in principle) be derived from any explanatory model produced by these data.
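As a hedged illustration of this conversion (not the authors' pipeline), the sketch below fits an LMM with lmerTest, which reports Satterthwaite-approximated df, and converts the resulting t and df to r, Zr, and its sampling variance; the sleepstudy data shipped with lme4 stand in for an analyst's dataset.

    library(lmerTest)   # loads lme4 and adds Satterthwaite df to lmer() summaries

    m     <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
    coefs <- summary(m)$coefficients          # includes "df" (Satterthwaite) and "t value"
    t_val <- coefs["Days", "t value"]
    df_s  <- coefs["Days", "df"]

    r   <- t_val / sqrt(t_val^2 + df_s)       # correlation coefficient from t and df
    Zr  <- atanh(r)                           # Fisher's Zr
    VZr <- 1 / ((df_s + 1) - 3)               # sampling variance, with n = df + 1
    c(r = r, Zr = Zr, VZr = VZr)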
Besides Zr, which describes the strength of a relationship based on the amount of variation in a dependent variable explained by variation in an independent variable, we also examined differences in the shape of the relationship between the independent and dependent variables. To accomplish this, we derived a point estimate (out-of-sample predicted value) for the dependent variable of interest for each of three values of our primary independent variable. We originally described these three values as associated with the 25th percentile, median, and 75th percentile of the independent variable and any covariates.
Preregistration Deviation:
The original description of the out-of-sample specifications did not account for the facts that (a) some variables are not distributed in a way that allowed division into percentiles and that (b) variables could be either positively or negatively correlated with the dependent variable. We provide a more thorough description here: We derived three point-estimates (out-of-sample predicted values) for the dependent variable of interest, one for each of three values of our primary independent variable that we specified. We also specified values for all other variables that could have been included as independent variables in analysts' models so that we could derive the predicted values from a fully specified version of any model produced by analysts. For all potential independent variables, we selected three values or categories. Of the three we selected, one was associated with small, one with intermediate, and one with large values of one typical dependent variable (day 14 chick weight for the blue tit data and total number of seedlings for the Eucalyptus data; analysts could select other variables as their dependent variable, but the others typically correlated with the two identified here). For continuous variables, this means we identified the 25th percentile, median, and 75th percentile and, if the slope of the linear relationship between this variable and the typical dependent variable was positive, we left the quartiles ordered as is. If, instead, the slope was negative, we reversed the order of the independent variable quartiles so that the 'lower' quartile value was the one associated with the lower value for the dependent variable. In the case of categorical variables, we identified categories associated with the 25th percentile, median, and 75th percentile values of the typical dependent variable after averaging the values for each category. However, for some continuous and categorical predictors, we also made selections based on the principle of internal consistency between certain related variables, and we fixed a few categorical variables as identical across all three levels where doing so would simplify the modelling process (specification tables available: blue tit: https://osf.io/86akx; Eucalyptus: https://osf.io/jh7g5).
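A minimal sketch (not the authors' code) of this scenario-value selection for one continuous predictor is shown below; the data frame and column names (dat, grass_cover, seedling_count) are hypothetical placeholders, not the distributed datasets.

    set.seed(42)
    dat <- data.frame(grass_cover    = runif(100, 0, 100),
                      seedling_count = rpois(100, 2))

    q <- quantile(dat$grass_cover, probs = c(0.25, 0.50, 0.75))   # candidate scenario values

    # reverse the order if the simple linear slope against the typical dependent
    # variable is negative, so the 'low' scenario pairs with low predicted values
    slope <- coef(lm(seedling_count ~ grass_cover, data = dat))["grass_cover"]
    if (slope < 0) q <- rev(q)
    q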
We used the 25th and 75th percentiles rather than minimum and maximum values to reduce the chance of occupying unrealistic parameter space. We planned to derive these predicted values from the model information provided by the individual analysts. All values (predictions) were first transformed to the original scale along with their standard errors (SE); we used the delta method (Ver Hoef 2012) for the transformation of SE. We used the square of the SE associated with predicted values as the sampling variance in the meta-analyses described below, and we planned to analyze these predicted values in exactly the same ways as we analyzed Zr in the following analyses.
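To make the delta-method step concrete, here is a hedged sketch (not the authors' code) of back-transforming a link-scale prediction and its SE to the response scale for a log-link GLM; it continues the hypothetical dat and q objects from the sketch above.

    m_glm <- glm(seedling_count ~ grass_cover, family = poisson, data = dat)

    p <- predict(m_glm, newdata = data.frame(grass_cover = unname(q["50%"])),
                 type = "link", se.fit = TRUE)

    pred_resp <- exp(p$fit)                 # prediction on the original (count) scale
    se_resp   <- exp(p$fit) * p$se.fit      # delta method: SE scaled by the derivative of exp()
    c(prediction = unname(pred_resp), SE = unname(se_resp))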
Preregistration Deviation:
Because analysts of blue tit data chose different dependent variables on different scales, after transforming out-of-sample values to the original scales, we standardized all values as z scores ('standard scores') to put all dependent variables on the same scale and make them comparable. This involved taking each relevant value on the original scale (whether a predicted point estimate or a SE associated with that estimate) and subtracting the value in question from the mean value of that dependent variable derived from the full dataset, and then dividing this difference by the standard deviation, SD, corresponding to the mean from the full dataset. Thus, all our out-of-sample prediction values from the blue tit data are from a distribution with a mean of 0 and SD of 1. We did not add this step for the Eucalyptus data because (a) all responses were on the same scale (counts of Eucalyptus stems) and were thus comparable and (b) these data, with many zeros and high skew, are poorly suited for z scores.

We plotted individual effect size estimates (Zr) and predicted values of the dependent variable (yi), with their corresponding 95% confidence / credible intervals, in forest plots to allow visualization of the range and precision of effect sizes and predicted values. Further, we included these estimates in random-effects meta-analyses [36, 37] using the metafor package in R [34, 38]:

Zr ~ 1 + 1|analysisId
yi ~ 1 + 1|analysisId

where yi is the predicted value for the dependent variable at the 25th percentile, median, or 75th percentile of the independent variables. The individual Zr effect sizes were weighted by the inverse of the sampling variance for Zr. The individual predicted values for the dependent variable (yi) were weighted by the inverse of the associated SE² (the original registration omitted "inverse of the" in error). These analyses provided an average Zr score or an average yi with a corresponding 95% confidence interval and allowed us to estimate two heterogeneity indices, τ² and I². The former, τ², is the absolute measure of heterogeneity, or the between-study variance (in our case, between-effect variance), whereas I² is a relative measure of heterogeneity. We obtained the estimate of relative heterogeneity (I²) by dividing the between-effect variance by the sum of the between-effect and within-effect variance (sampling error variance). I² is thus, in a standard meta-analysis, the proportion of variance that is due to heterogeneity as opposed to sampling error. When calculating I², within-study variance is amalgamated across studies to create a "typical" within-study variance which serves as the sampling error variance [36, 37]. Our goal here was to visualize and quantify the degree of variation among analyses in effect size estimates [31]. We did not test for statistical significance.
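The random-effects meta-analysis and the I² calculation described here could be sketched in metafor roughly as follows. This is a hedged illustration with made-up numbers, not the authors' code, and the 'typical' sampling variance below is simplified to a plain mean rather than the weighted form used in standard I² formulas.

    library(metafor)

    dat_zr <- data.frame(Zr         = c(-0.42, -0.31, -0.15, 0.05, -0.28),
                         VZr        = c(0.012, 0.015, 0.010, 0.020, 0.011),
                         analysisId = factor(1:5))

    m_zr <- rma.mv(yi = Zr, V = VZr, random = ~ 1 | analysisId, data = dat_zr)
    summary(m_zr)                      # intercept = meta-analytic mean Zr; sigma^2 = tau^2

    tau2      <- sum(m_zr$sigma2)
    v_typical <- mean(dat_zr$VZr)      # simplified 'typical' within-effect variance
    I2 <- tau2 / (tau2 + v_typical)    # proportion of variance beyond sampling error
    I2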
Additional Explanation:
Our use of I² to quantify heterogeneity violates an important assumption, but this violation does not invalidate our use of I² as a metric of how much heterogeneity can derive from analytical decisions. In standard meta-analysis, the statistic I² quantifies the proportion of variance that is greater than we would expect if differences among estimates were due to sampling error alone [39]. However, it is clear that this interpretation does not apply to our value of I² because I² assumes that each estimate is based on an independent sample (although these analyses can account for non-independence via hierarchical modelling), whereas all our effects were derived from largely or entirely overlapping subsets of the same dataset. Despite this, we believe that I² remains a useful statistic for our purposes. This is because, in calculating I², we are still setting a benchmark of expected variation due to sampling error based on the variance associated with each separate effect size estimate, and we are assessing how much (if at all) the variability among our effect sizes exceeds what would be expected had our effect sizes been based on independent data. In other words, our estimates can tell us how much proportional heterogeneity is possible from analytical decisions alone when sample sizes (and therefore meta-analytic within-estimate variance) are similar to the ones in our analyses. Among other implications, our violation of the independent-sample assumption means that we (dramatically) over-estimate the variance expected due to sampling error, and because I² is a proportional estimate, we thus underestimate the actual proportion of variance due to differences among analyses other than sampling error. However, correcting this underestimation would create a trivial value since we designed the study so that much of the variance would derive from analytic decisions as opposed to differences in sampled data. Instead, retaining the I² value as typically calculated provides a useful comparison to I² values from typical meta-analyses.
Interpretation of τ² also differs somewhat from traditional meta-analysis, and we discuss this further in the Results.
Finally, we assessed the extent to which deviations from the meta-analytic mean by individual effect sizes (Zr) or the predicted values of the dependent variable (yi) were explained by the peer rating of each analysis team's methods section, by a measurement of the distinctiveness of the set of predictor variables included in each analysis, and by the choice of whether or not to include random effects in the model. The deviation score, which served as the dependent variable in these analyses, is the absolute value of the difference between the meta-analytic mean Zr (or yi) and the individual Zr (or yi) estimate for each analysis. We used the Box-Cox transformation on the absolute values of deviation scores to achieve an approximately normal distribution [c.f. 40, 41]. We described variation in this dependent variable with both a series of univariate analyses and a multivariate analysis. All these analyses were general linear (mixed) models. These analyses were secondary to our estimation of variation in effect sizes described above. We wished to quantify relationships among variables, but we had no a priori expectation of effect size and made no dichotomous decisions about statistical significance.
When examining the extent to which reviewer ratings (on a scale from 0 to 100) explained deviation from the average effect (or predicted value), each analysis had been rated by multiple peer reviewers, so for each reviewer score to be included, we included each deviation score in the analysis multiple times. To account for the non-independence of multiple ratings of the same analysis, we planned to include analysis identity as a random effect in our general linear mixed model in the lme4 package in R [34, 42]. To account for potential differences among reviewers in their scoring of analyses, we also planned to include reviewer identity as a random effect:

DeviationScorej = BoxCox(abs(DeviationFromMeanj))
DeviationScoreij ~ Ratingij + ReviewerIDi + AnalysisIDj
ReviewerIDi ~ N(0, σ²)
AnalysisIDj ~ N(0, σ²)

where DeviationFromMeanj is the deviation from the meta-analytic mean for the jth analysis, ReviewerIDi is the random intercept assigned to each reviewer i, and AnalysisIDj is the random intercept assigned to each analysis j, both of which are assumed to be normally distributed with a mean of 0 and a variance of σ². Absolute deviation scores were Box-Cox transformed using the step_box_cox() function from the timetk package in R [34, 43].
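A hedged sketch (not the authors' code) of this deviation-score model is given below; the data frame and its columns are hypothetical, and the Box-Cox lambda is simply assumed rather than estimated.

    library(lme4)

    set.seed(1)
    ratings <- data.frame(
      AnalysisID = factor(rep(1:10, each = 4)),
      ReviewerID = factor(rep(1:8, length.out = 40)),
      rating     = runif(40, 20, 100),
      deviation  = abs(rnorm(40, 0, 0.3))          # |Zr - meta-analytic mean Zr|
    )

    lambda <- 0.3                                  # assumed Box-Cox lambda (estimated in practice)
    ratings$dev_bc <- (ratings$deviation^lambda - 1) / lambda

    m_dev <- lmer(dev_bc ~ rating + (1 | ReviewerID) + (1 | AnalysisID), data = ratings)
    summary(m_dev)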
We conducted a similar analysis with the four categories of reviewer ratings ((1) deeply flawed and unpublishable, (2) publishable with major revision, (3) publishable with minor revision, (4) publishable as is) set as ordinal predictors numbered as shown here. As with the analyses above, we planned for these analyses to also include random effects of analysis identity and reviewer identity. Both of these analyses (1: 0-100 ratings as the fixed effect, 2: categorical ratings as the fixed effect) were planned to be conducted eight times for each dataset. Each of the four responses (Zr, y25, y50, y75) was to be compared once to the initial ratings provided by the peer reviewers, and again to the revised ratings provided by the peer reviewers.
Preregistration Deviation:
1. We planned to include random effects of both analysis identity and reviewer identity in these models comparing reviewer ratings with deviation scores. However, after we received the analyses, we discovered that a subset of analyst teams had either conducted multiple analyses and/or identified multiple effects per analysis as answering the target question. We therefore faced an even more complex potential set of random effects. We decided that including team ID, analysis ID, and effect ID along with reviewer ID as random effects in the same model would almost certainly lead to model fit problems, and so we started with simpler models including just effect ID and reviewer ID. However, even with this simpler structure, our dataset was sparse, with reviewers rating a small number of analyses, resulting in models with singular fit (Section C.2). Removing one of the random effects was necessary for the models to converge. The models that included the categorical quality rating converged when including reviewer ID, and the models that included the continuous quality rating converged when including effect ID.
2. We conducted analyses only with the final peer ratings after the opportunity for revision, not with the initial ratings. This was because when we recorded the final ratings, they over-wrote the initial ratings, and so we did not have access to those initial values.

The next set of univariate analyses sought to explain deviations from the mean effects based on a measure of the distinctiveness of the set of variables included in each analysis. As a 'distinctiveness' score, we used Sorensen's Similarity Index (an index typically used to compare species composition across sites), treating variables as species and individual analyses as sites. To generate an individual Sorensen's value for each analysis required calculating the pairwise Sorensen's value for all pairs of analyses (of the same dataset), and then taking the average across these Sorensen's values for each analysis. We calculated the Sorensen's index values using the betapart package [44] in R:

βSorensen = (b + c) / (2a + b + c)
where a is the number of variables common to both analyses, b is the number of variables that occur in the first analysis but not in the second, and c is the number of variables that occur in the second analysis but not in the first. We then used the per-model average Sorensen's index value as an independent variable to predict the deviation score in a general linear model in R [34], and included no random effect since each analysis is included only once:

DeviationScore ~ βSorensen
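As a minimal sketch (not the authors' code) of the 'distinctiveness' score, the inclusion matrix below treats analyses as sites and candidate predictor variables as species (1 = variable included); the variable names are invented placeholders.

    library(betapart)

    incl <- matrix(c(1, 1, 0, 1,
                     1, 0, 1, 1,
                     0, 1, 1, 0), nrow = 3, byrow = TRUE,
                   dimnames = list(paste0("analysis", 1:3),
                                   c("grass", "precip", "fire", "plot")))

    pair <- beta.pair(incl, index.family = "sorensen")   # pairwise Sorensen dissimilarities
    sor  <- as.matrix(pair$beta.sor)
    diag(sor) <- NA
    mean_sor <- rowMeans(sor, na.rm = TRUE)              # per-analysis average Sorensen value
    mean_sor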
Additional Explanation:
When we planned this analysis, we anticipated that analysts would identify a single primary effect from each model, so that each model would appear in the analysis only once. Our expectation was incorrect because some analysts identified >1 effect per analysis, but we still chose to specify our model as registered and not use a random effect. This is because most models produced only one effect, and so we expected that specifying a random effect to account for the few cases where >1 effect was included for a given model would prevent model convergence.
Note that this analysis contrasts with the analyses in which we used reviewer ratings as predictors, because in the analyses with reviewer ratings each effect appeared in the analysis approximately four times due to multiple reviews of each analysis, and so it was much more important to account for that variance through a random effect.

Finally, we conducted a multivariate analysis with the five predictors described above (peer ratings 0-100 and peer ratings of publishability 1-4, both original and revised, and Sorensen's index, plus a sixth, the presence/absence of random effects) with random effects of analysis identity and reviewer identity in the lme4 package in R [34, 42]. We had stated here in the text that we would use only the revised (final) peer ratings in this analysis, so the absence of the initial ratings is not a deviation from our plan:

DeviationScorej ~ RatingsContinuousij + RatingsCategoricalij + βSorensenj + AnalysisIDj + ReviewerIDi
ReviewerIDi ~ N(0, σ²)
AnalysisIDj ~ N(0, σ²)
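Extending the hypothetical `ratings` data frame from the earlier sketch, the multivariate model could be specified roughly as follows; this is again a sketch under invented column names (rating_cat, mean_sor, uses_ranef), not the authors' specification.

    # placeholder predictors added to the earlier hypothetical data frame
    ratings$rating_cat <- sample(1:4, nrow(ratings), replace = TRUE)   # publishability category
    ratings$mean_sor   <- runif(nrow(ratings), 0.1, 0.6)               # mean Sorensen's index
    ratings$uses_ranef <- rbinom(nrow(ratings), 1, 0.8)                # random effects included?

    m_multi <- lmer(dev_bc ~ rating + rating_cat + mean_sor + uses_ranef +
                      (1 | AnalysisID) + (1 | ReviewerID), data = ratings)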
We conducted all the analyses described above eight times: for each of the four responses (Zr, y25, y50, y75), once for each of the two datasets.
We have publicly archived all relevant data, code, and materials on the Open Science Framework (https://osf.io/mn5aj/). Archived data include the original datasets distributed to all analysts, any edited versions of the data analyzed by individual groups, and the data we analyzed with our meta-analyses, which include the effect sizes derived from separate analyses, the statistics describing variation in model structure among analyst groups, and the anonymized answers to our surveys of analysts and peer reviewers. Similarly, we have archived both the analysis code used for each individual analysis (where available) and the code from our meta-analyses. We have also archived copies of our survey instruments for analysts and peer reviewers.
Our rules for excluding data from our study were as follows. We excluded from our synthesis any individual analysis submitted after we had completed peer review, or any analysis unaccompanied by analysis files that allowed us to understand what the analysts did. We also excluded any individual analysis that did not produce an outcome that could be interpreted as an answer to our primary question (as posed above) for the respective dataset. For instance, this means that in the case of the data on blue tit chick growth, we excluded any analysis that did not include something that could be interpreted as growth or size as a dependent (response) variable, and in the case of the Eucalyptus establishment data, we excluded any analysis that did not include a measure of grass cover among the independent (predictor) variables. Also, as described above, any analysis that could not produce an effect that could be converted to a signed Zr was excluded from analyses of Zr.
Preregistration Deviation:
Some analysts had difficulty implementing our instructions to derive the out-of-sample predictions, and in some cases (especially for the Eucalyptus data), they submitted predictions with implausibly extreme values. We believed these values were incorrect and thus made the conservative decision to exclude out-of-sample predictions where the estimates were > 3 standard deviations from the mean value from the full dataset.
Additional Explanation:
1. Evaluating model fit.
We evaluated all fitted models using the performance() function from the performance package [45] and the glance() function from the broom.mixed package [46]. For all models, we calculated the square root of the residual variance (Sigma) and the root mean squared error (RMSE). For GLMMs, performance() calculates the marginal and conditional R² values as well as the contribution of random effects (ICC), based on Nakagawa et al. [47]. The conditional R² accounts for both the fixed and random effects, while the marginal R² considers only the variance of the fixed effects. The contribution of random effects is obtained by subtracting the marginal R² from the conditional R² (see the sketch following this box).
2. Exploring outliers and analysis quality.
After seeing the forest plots of Zr values and noticing the existence of a small number of extreme outliers, especially from the Eucalyptus analyses, we wanted to understand the degree to which our heterogeneity estimates were influenced by these outliers. To explore this question, we removed the highest two and lowest two values of Zr in each dataset and re-calculated our heterogeneity estimates.
To help understand the possible role of the quality of analyses in driving the heterogeneity we observed among estimates of Zr, we recalculated our heterogeneity estimates after removing all effects from analysis teams that had received at least one rating of "deeply flawed and unpublishable", and then again after removing all effects from analysis teams with at least one rating of either "deeply flawed and unpublishable" or "publishable with major revisions". We also used self-identified levels of statistical expertise to examine heterogeneity when we retained analyses only from analysis teams that contained at least one member who rated themselves as "highly proficient" or "expert" (rather than "novice" or "moderately proficient") in conducting statistical analyses in their research area in our intake survey.
3. Exploring possible impacts of lower quality estimates of degrees of freedom.
Our meta-analyses of variation in Zr required variance estimates derived from estimates of the degrees of freedom in the original analyses from which Zr estimates were derived. While processing the estimates of degrees of freedom submitted by analysts, we identified a subset of these estimates in which we had lower confidence because two or more effects from the same analysis were submitted with identical degrees of freedom. We therefore conducted a second set of (more conservative) meta-analyses that excluded these Zr estimates with identical estimates of degrees of freedom, and we present these analyses in the supplement.
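As a minimal sketch of the model-fit step described in item 1 (not the authors' code), reusing the hypothetical lmerTest model m from the earlier sketch:

    library(performance)
    library(broom.mixed)

    performance(m)      # includes conditional/marginal R2, ICC, RMSE and Sigma for mixed models
    glance(m)           # one-row summary: sigma, logLik, AIC, BIC, ...
    r2_nakagawa(m)      # marginal and conditional R2 following Nakagawa et al.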
Step 6: Facilitated Discussion and Collaborative Write-Up of Manuscript
We planned for analysts and initiating authors to discuss the limitations, results, and implications of the study and collaborate on writing the final manuscript for review as a stage-2 Registered Report.

Preregistration Deviation:
As described above, due to the large number of recruited analysts and reviewers and the anticipated challenges of receiving and integrating feedback from so many authors, we limited analyst and reviewer participation in the production of the final manuscript to an invitation to call attention to serious problems with the manuscript draft.
Results
Summary Statistics
In total, 173 analyst teams, comprising 246 analysts, contributed 182 usable analyses of the two datasets examined in this study, which yielded 215 effects. Analysts produced 135 distinct effects that met our criteria for inclusion in at least one of our meta-analyses for the blue tit dataset. Analysts produced 81 distinct effects meeting our criteria for inclusion for the Eucalyptus dataset. Excluded analyses and effects either did not answer our specified biological questions, were submitted with insufficient information for inclusion in our meta-analyses, or were incompatible with production of our effect size(s). We expected this final scenario (incompatible analyses); for instance, we cannot extract a Zr from random forest models, which is why we analyzed two distinct types of effects, Zr and out-of-sample predictions (yi). Effects included in only a subset of our meta-analyses provided sufficient information for inclusion in only that subset (see Table A.1). For both datasets, most submitted analyses incorporated mixed effects. Submitted analyses of the blue tit dataset typically specified normal error and analyses of the Eucalyptus dataset typically specified a non-normal error distribution (Supplementary Table A.1).
For both datasets, the composition of models varied substantially in regards to the number of fixed and random effects, interaction terms, and the number of data points used, and these patterns differed somewhat between the blue tit and Eucalyptus analyses (see Supplementary Table A.2).
Focusing on the models included in the Zr analyses (because this is the larger sample), blue tit models included a similar number of fixed effects on average (mean 5.2 ± 2.92 SD) as Eucalyptus models (mean 5.01 ± 3.83 SD), but the standard deviation in the number of fixed effects was somewhat larger in the Eucalyptus models. The average number of interaction terms was much larger for the blue tit models (mean 0.44 ± 1.11 SD) than for the Eucalyptus models (mean 0.16 ± 0.65 SD), but still under 0.5 for both, indicating that most models did not contain interaction terms. Blue tit models also contained more random effects (mean 3.53 ± 2.08 SD) than Eucalyptus models (mean 1.41 ± 1.09 SD). The maximum possible sample size in the blue tit dataset (3720 nestlings) was an order of magnitude larger than the maximum possible in the Eucalyptus dataset (351 plots), and the means and standard deviations of the sample sizes used to derive the effects eligible for our study were also an order of magnitude greater for the blue tit dataset (mean 2622.07 ± 939.28 SD) relative to the Eucalyptus models (mean 298.43 ± 106.25 SD). However, the standard deviation in sample size for the Eucalyptus models was heavily influenced by a few cases of dramatic sub-setting (described below). Approximately three quarters of Eucalyptus models used sample sizes within 3% of the maximum. In contrast, fewer than 20% of blue tit models relied on sample sizes within 3% of the maximum, and approximately 50% of blue tit models relied on sample sizes 29% or more below the maximum.
Analysts provided qualitative descriptions of the conclusions of their analyses. Each analysis team provided one conclusion per dataset. These conclusions could take into account the results of any formal analyses completed by the team as well as exploratory and visual analyses of the data. Here we summarize all qualitative responses, regardless of whether we had sufficient information to use the corresponding model results in our quantitative analyses below. We classified these conclusions into the categories summarized below (Table 1):
Mixed: some evidence supporting a positive effect, some evidence supporting a negative effect
Conclusive negative: negative relationship described without caveat
Qualified negative: negative relationship, but only in certain circumstances or where analysts express uncertainty in their result
Conclusive none: analysts interpret the results as conclusive of no effect
None qualified: analysts describe finding no evidence of a relationship, but they describe the potential for an undetected effect
Qualified positive: positive relationship described, but only in certain circumstances or where analysts express uncertainty in their result
Conclusive positive: positive relationship described without caveat
For the blue tit dataset, most analysts concluded that there was a negative relationship between measures of sibling competition and nestling growth, though half the teams expressed qualifications or described effects as mixed or absent. For the Eucalyptus dataset, there was a broader spread of conclusions, with at least one analyst team providing conclusions consistent with each conclusion category. The most common conclusion for the Eucalyptus dataset was that there was no relationship between grass cover and Eucalyptus recruitment (either a conclusive or a qualified description of no relationship), but more than half the teams concluded that there were effects: negative, positive, or mixed.
Table 1: Tallies of analysts' qualitative answers to the research questions addressed by their analyses.

Dataset      Mixed   Negative     Negative    None         None        Positive    Positive
                     conclusive   qualified   conclusive   qualified   qualified   conclusive
blue tit     5       37           27          4            1           0           0
Eucalyptus   8       6            12          19           12          4           2
Distribution of Effects
Effect Size Zr
Although the majority (111 of 132) of the usable Zr effects from the blue tit dataset found that nestling growth decreased with sibling competition, and the meta-analytic mean Zr (Fisher's transformation of the correlation coefficient) was convincingly negative (-0.35 ± 0.06 95% CI), there was substantial variability in the strength and the direction of this effect. Zr ranged approximately continuously from -0.93 to 0.19 (Figure 1a and Table 4), and of the 111 effects with negative slopes, 92 had confidence intervals excluding 0. Of the 20 with positive slopes, indicating increased nestling growth in the presence of more siblings, 3 had confidence intervals excluding zero (Figure 1a).
Meta-analysis of the Eucalyptus dataset also showed substantial variability in the strength of effects as measured by Zr, and, unlike with the blue tits, a notable lack of consistency in the direction of effects (Figure 1b, Table 4). Zr ranged from -4.47 (Supplementary Figure A.2), indicating a strong tendency for reduced Eucalyptus seedling success as grass cover increased, to 0.39, indicating the opposite. Although the range of reported effects skewed strongly negative, this was due to a small number of substantial outliers. Most values of Zr were relatively small, with values < 0.2, and the meta-analytic mean effect size was close to zero (-0.09 ± 0.12 95% CI). Of the 79 effects, fifty-three had confidence intervals overlapping zero, approximately a quarter (fifteen) crossed the traditional threshold of statistical significance indicating a negative relationship between grass cover and seedling success, and eleven crossed the significance threshold indicating a positive relationship between grass cover and seedling success (Figure 1b).
Figure 1: Forest plots of meta-analytic estimated standardized effect sizes (Zr) and their 95% confidence intervals for each effect size included in the meta-analysis model for a) blue tit and b) Eucalyptus. The meta-analytic mean effect size is noted in black and as a dashed vertical line, with error bars also representing the 95% confidence interval. The solid black vertical line demarcates an effect size of 0, indicating no relationship between the test variable and the response variable. Note that the Eucalyptus plot omits one extreme outlier with the value of -4.47 (Figure A.2) in order to standardize the x-axes on these two panels.
Out-of-sample predictions (yi)
As with the effect size Zr, we observed substantial variability in the size of out-of-sample predictions derived from the analysts' models. Blue tit predictions (Figure 2a), which were z-score-standardised to accommodate the use of different response variables, always spanned well over one standard deviation. In the y25 scenario, model predictions ranged from -1.85 to 0.42 (a range of 2.68 standard deviations); in the y50 scenario, they ranged from -0.53 to 1.11 (a range of 1.63 standard deviations); and in the y75 scenario, they ranged from -0.03 to 1.58 (a range of 1.9 standard deviations). As should be expected given the existence of both negative and positive Zr values, all three out-of-sample scenarios produced both negative and positive predictions, although, as with the Zr values, there is a clear trend for scenarios with more siblings to be associated with smaller nestlings. This is supported by the meta-analytic means of these three sets of predictions, which were -0.66 (95% CI -0.82 to -0.50) for y25, 0.34 (95% CI 0.20 to 0.48) for y50, and 0.67 (95% CI 0.57 to 0.77) for y75.
Eucalyptus out-of-sample predictions also varied substantially (Figure 2b), but because they were not z-score-standardised and are instead on the original count scale, the types of interpretations we can make differ. The predicted Eucalyptus seedling counts per 15 x 15 m plot ranged from 0.04 to 33.66 for the y25 scenario, from 0.03 to 13.02 for the y50 scenario, and from 0.05 to 21.93 for the y75 scenario. The meta-analytic mean predictions for these three scenarios were similar: 0.58 (95% CI 0.21 to 1.37) for y25, 0.92 (95% CI 0.36 to 1.65) for y50, and 1.67 (95% CI 0.80 to 2.83) for y75.
Figure 2: Forest plots of meta-analytic estimated out-of-sample predictions, yi (z-score standardized for the blue tit data), for a) blue tit and b) Eucalyptus. Triangles represent individual estimates, circles represent the meta-analytic mean for each prediction scenario. Error bars are 95% confidence intervals.
Quantifying Heterogeneity
Effect Size (Zr)
We quantified both absolute (τ²) and relative (I²) heterogeneity resulting from analytical variation. Both measures suggest that substantial variability among effect sizes was attributable to the analytical decisions of analysts.
The total absolute level of variance beyond what would typically be expected due to sampling error, τ² (Table 2), among all usable blue tit effects was 0.088, and for Eucalyptus effects it was 0.267. This is similar to or exceeds the median value (0.105) of τ² found across 31 recent meta-analyses (calculated from the data in [48]). The similarity of our observed values to values from meta-analyses of different studies based on different data suggests the potential for a large portion of heterogeneity to arise from analytical decisions. For further discussion of the interpretation of τ² in our study, please consult the discussion of post hoc analyses below.
Table 2: Heterogeneity in the estimated effects Zr for meta-analyses of the full dataset, as well as from post hoc analyses including the dataset with outliers removed, the dataset excluding effects from analysis teams with at least one "unpublishable" rating, the dataset excluding effects from analysis teams with at least one "major revisions" rating or worse, and the dataset including only analyses from teams in which at least one analyst rated themselves as "highly proficient" or "expert" in statistical analysis. τ²Team is the absolute heterogeneity for the random effect Team, τ²EffectID is the absolute heterogeneity for the random effect EffectID, nested under Team, and τ²Total is the total absolute heterogeneity. I²Total is the proportional heterogeneity (the proportion of the variance among effects not attributable to sampling error), I²Team is the subset of the proportional heterogeneity due to differences among Teams, and I²Team,EffectID is the subset of the proportional heterogeneity attributable to among-EffectID differences.
Dataset      τ²Total  τ²Team  τ²EffectID  I²Total    I²Team   I²Team,EffectID  N. Obs

All analyses
blue tit     0.09     0.04    0.05        97.732%    40.11%   57.63%           131
Eucalyptus   0.27     0.02    0.25        98.589%     6.88%   91.71%            79

All analyses, outliers removed
blue tit     0.07     0.05    0.02        97.030%    66.90%   30.13%           127
Eucalyptus   0.01     0.00    0.01        66.193%    19.27%   46.93%            75

Analyses receiving at least one 'Unpublishable' rating removed
blue tit     0.08     0.03    0.05        97.601%    38.10%   59.50%           109
Eucalyptus   0.01     0.01    0.01        79.741%    28.32%   51.42%            55

Analyses receiving at least one 'Unpublishable' and/or 'Major Revisions' rating removed
blue tit     0.14     0.01    0.13        98.718%     5.17%   93.55%            32
Eucalyptus   0.03     0.03    0.00        88.915%    88.91%    0.00%            13

Analyses from teams that include highly proficient or expert data analysts
blue tit     0.10     0.04    0.06        98.058%    36.27%   61.78%            89
Eucalyptus   0.58     0.02    0.56        99.412%     3.49%   95.93%            34
In our analyses, I² is a plausible index of how much more variability among effect sizes we have observed, as a proportion, than we would have observed if sampling error were driving variability. We discuss our interpretation of I² further in the methods, but in short, it is a useful metric for comparison to values from published meta-analyses and provides a plausible value for how much heterogeneity could arise in a normal meta-analysis with similar sample sizes due to analytical variability alone. In our study, total I² for the blue tit Zr estimates was extremely large, at 97.73%, as was the Eucalyptus estimate (98.59%; Table 2).
Although the overall I² values were similar for both Eucalyptus and blue tit analyses, the relative composition of that heterogeneity differed. For both datasets, the majority of heterogeneity in Zr was driven by differences among effects as opposed to differences among teams, though this was more prominent for the Eucalyptus dataset, where nearly all of the total heterogeneity was driven by differences among effects (91.71%) as opposed to differences among teams (6.88%) (Table 2).
Out-of-sample predictions (yi)
We observed substantial heterogeneity among out-of-sample estimates, but the pattern differed somewhat from the Zr values (Table 3). Among the blue tit predictions, I² ranged from medium-high for the y25 scenario (68.36%) to low (27.02%) for the y75 scenario. Among the Eucalyptus predictions, I² values were uniformly high (>82%). For both datasets, most of the existing heterogeneity among predicted values was attributable to among-team differences, with the exception of the y50 analysis of the Eucalyptus dataset. We are limited in our interpretation of τ² for these estimates because, unlike for the Zr estimates, we have no benchmark for comparison with other meta-analyses.
Table 3: Heterogeneity among the out-of-sample predictions yi for both blue tit and Eucalyptus datasets. τ²Team is the absolute heterogeneity for the random effect Team, τ²EffectID is the absolute heterogeneity for the random effect EffectID, nested under Team, and τ²Total is the total absolute heterogeneity. I²Total is the proportional heterogeneity (the proportion of the variance among effects not attributable to sampling error), I²Team is the subset of the proportional heterogeneity due to differences among Teams, and I²Team,EffectID is the subset of the proportional heterogeneity attributable to among-EffectID differences.
Dataset      Scenario  N. Obs  τ²Total  τ²Team  τ²EffectID  I²Total  I²Team  I²Team,EffectID
blue tit     y25       62      0.14     0.11    0.03        68.36%   51.82%  16.54%
             y50       59      0.07     0.06    0.01        50.37%   45.66%   4.71%
             y75       62      0.02     0.02    0.00        27.02%   25.57%   1.45%
Eucalyptus   y25       22      3.05     1.95    1.10        88.76%   56.76%  32.00%
             y50       24      1.61     0.53    1.08        83.26%   27.52%  55.73%
             y75       24      1.69     1.41    0.28        79.76%   66.52%  13.25%
Post-hoc Analysis: Exploring outlier characteristics and the effect of outlier removal on heterogeneity
Effect Sizes (Zr)
The outlier Eucalyptus Zr values were striking and merited special examination. The three negative outliers were based on either small subsets of the dataset or, in one case, extreme aggregation of data, and so had very low sample sizes. The outliers associated with small subsets had sample sizes (n = 117, 90) that were less than half of the total possible sample size of 351. The case of extreme aggregation involved averaging all values within each of the 18 sites in the dataset.
Surprisingly, both the largest and smallest effect sizes in the blue tit analyses (Figure 1a) come from the same analyst (anonymous ID: Adelong), with identical models in terms of the explanatory variable structure, but with different response variables. However, the radical change in effect was primarily due to collinearity with covariates. The primary predictor variable (brood count after manipulation) was accompanied by several collinear variables, including the highly collinear (correlation of approximately 0.9; Supplementary Figure D.2) covariate brood size at day 14, in both analyses. In the analysis of nestling weight, brood count after manipulation showed a strong positive partial correlation with weight after controlling for brood count at day 14 and treatment category (increased, decreased, unmanipulated). In that same analysis, the most collinear covariate (the day 14 count) had a negative partial correlation with weight. In the analysis with tarsus length as the response variable, these partial correlations were almost identical in absolute magnitude, but reversed in sign, and so brood count after manipulation was now the collinear predictor with the negative relationship. The two models were therefore very similar, but the two collinear predictors simply switched roles, presumably because of a subtle difference in the distributions of the weight and tarsus length data.
When we dropped the Eucalyptus outliers, I² decreased from high (98.59%), using Higgins' [36] suggested benchmark, to between moderate and high (66.19%, Table 2). However, more notably, τ² dropped from 0.27 to 0.01, indicating that, once outliers were excluded, the observed variation in effects was similar to what we would expect if sampling error were driving the differences among effects (since τ² is the variance in addition to that driven by sampling error). The interpretation of this value of τ² in the context of our many-analyst study is somewhat different than in a typical meta-analysis, however, since in our study (especially for Eucalyptus, where most analyses used almost exactly the same data points), there is almost no role for sampling error in driving the observed differences among the estimates. Thus, rather than concluding that the variability we observed among estimates (after removing outliers) was due only to sampling error (because τ² became small: 10% of the median from [48]), we instead conclude that the observed variability, which must be due to the divergent choices of analysts rather than sampling error, is approximately of the same magnitude as what we would have expected if, instead, sampling error, and not analytical heterogeneity, were at work. Presumably, if sampling error had actually also been at work, it would have acted as an additional source of variability and would have led total variability among estimates to be higher. With total variability higher and thus greater than expected due to sampling error alone, τ² would have been noticeably larger. Conversely, dropping outliers from the set of blue tit effects did not meaningfully reduce I², and only modestly reduced τ² (Table 2). Thus, effects at the extremes of the distribution were much stronger contributors to total heterogeneity for effects from analyses of the Eucalyptus dataset than for the blue tit dataset.
Table 4: Estimated mean value of the standardised correlation coefficient, Zr, along with its standard error and 95% confidence interval. We re-computed the meta-analysis for different post-hoc subsets of the data: all eligible effects; removal of effects from analysis teams that received at least one peer rating of 'deeply flawed and unpublishable'; removal of any effects from analysis teams that received at least one peer rating of either 'deeply flawed and unpublishable' or 'publishable with major revisions'; and inclusion of only effects from analysis teams that included at least one member who rated themselves as "highly proficient" or "expert" at conducting statistical analyses in their research area.
Dataset      μ̂       SE[μ̂]   95% CI            statistic  p-value

All analyses
blue tit     −0.35   0.03    [−0.41, −0.28]    −10.49     <0.001
Eucalyptus   −0.09   0.06    [−0.22, 0.03]      −1.47      0.14

Analyses receiving at least one 'Unpublishable' rating removed
blue tit     −0.36   0.03    [−0.43, −0.29]    −10.49     <0.001
Eucalyptus   −0.02   0.02    [−0.07, 0.02]      −1.15      0.3

Analyses receiving at least one 'Unpublishable' and/or 'Major Revisions' rating removed
blue tit     −0.37   0.07    [−0.51, −0.23]     −5.34     <0.001
Eucalyptus   −0.04   0.05    [−0.15, 0.07]      −0.77      0.4

All analyses, outliers removed
blue tit     −0.35   0.03    [−0.42, −0.29]    −10.95     <0.001
Eucalyptus   −0.03   0.01    [−0.06, 0.00]      −2.23      0.026

Analyses from teams with highly proficient or expert data analysts
blue tit     −0.35   0.04    [−0.44, −0.27]     −8.31     <0.001
Eucalyptus   −0.17   0.13    [−0.43, 0.10]      −1.24      0.2
Out-of-sample predictions (yi)
We did not conduct these post hoc analyses on the out-of-sample predictions as the number of eligible effects was smaller and the pattern of outliers differed.
Post-hoc analysis: Exploring the effect of removing analyses with poor peer ratings on heterogeneity
Effect Size (Zr)
Removing poorly rated analyses had limited impact on the meta-analytic means (Supplementary Figure B.3). For the Eucalyptus dataset, the meta-analytic mean shifted from -0.09 to -0.02 when effects from analyses rated as unpublishable were removed, and to -0.04 when effects from analyses rated, at least once, as unpublishable or requiring major revisions were removed. Further, the confidence intervals for all of these means overlapped each of the other means (Table 4). We saw similar patterns for the blue tit dataset, with only small shifts in the meta-analytic mean and with the confidence intervals of all three means overlapping each of the other means (Table 4). Refitting the meta-analysis with a fixed effect for categorical ratings also showed no indication of differences in group meta-analytic means due to peer ratings (Supplementary Figure B.1).
For the blue tit dataset, removing poorly-rated analyses led to only negligible changes in I²Total and relatively minor impacts on τ². However, for the Eucalyptus dataset, removing poorly-rated analyses led to notable reductions in I²Total and substantial reductions in τ². When including all analyses, the Eucalyptus I²Total was 98.59% and τ² was 0.27, but eliminating analyses with ratings of "unpublishable" reduced I²Total to 79.74% and τ² to 0.01, and removing also those analyses "needing major revisions" left I²Total at 88.91% and τ² at 0.03 (Table 2). Additionally, the allocations of I² to the team versus the individual effect were altered for both blue tit and Eucalyptus meta-analyses by removing poorly rated analyses, but in different ways. For the blue tit meta-analysis, between a third and two-thirds of the total I² was attributable to among-team variance in most analyses until both analyses rated "unpublishable" and analyses rated in need of "major revision" were eliminated, in which case almost all remaining heterogeneity was attributable to among-effect differences. In contrast, for the Eucalyptus meta-analysis, the among-team component of I² was less than a third until both analyses rated "unpublishable" and analyses rated in need of "major revision" were eliminated, in which case almost 90% of heterogeneity was attributable to differences among teams.
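The sketch below shows, under simplifying assumptions, how a total I² and its team-level and effect-level components can be derived from the variance components of a multilevel model like the one sketched earlier. It uses the mean sampling variance as a crude stand-in for the "typical" within-effect sampling variance, so it approximates rather than reproduces the values reported here.

```r
# Rough sketch: partition I2 from a fitted metafor::rma.mv model with
# random effects for team and individual effect. The mean sampling
# variance is used as a simple proxy for the "typical" sampling variance.
i2_partition <- function(model, vi) {
  sigma2_team   <- model$sigma2[1]          # among-team variance component
  sigma2_effect <- model$sigma2[2]          # among-effect variance component
  v_typical     <- mean(vi)                 # proxy for typical sampling variance
  total         <- sigma2_team + sigma2_effect + v_typical
  c(I2_total  = 100 * (sigma2_team + sigma2_effect) / total,
    I2_team   = 100 * sigma2_team / total,
    I2_effect = 100 * sigma2_effect / total)
}

# e.g., using the placeholder model from the earlier sketch:
# m <- rma.mv(yi = Zr, V = VZr, random = ~ 1 | TeamIdentifier/EffectID, data = dat)
# i2_partition(m, dat$VZr)
```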
Out-of-sample predictions (yi)
We did not conduct these post hoc analyses on the out-of-sample predictions as the number of eligible effects was smaller and our ability to interpret heterogeneity values for these analyses was limited.
Post-hoc analysis: Exploring the effect of including only analyses conducted by analysis teams with at least one member self-rated as "highly proficient" or "expert" in conducting statistical analyses in their research area
Effect sizes (Zr)
Including only analyses conducted by teams that contained at least one member who rated themselves as "highly proficient" or "expert" in conducting the relevant statistical methods had negligible impacts on the meta-analytic means (Table 4), the distribution of Zr effects (Supplementary Figure B.4), or heterogeneity estimates (Table 2), which remained extremely high.
Out-of-sample predictions (yi)
We did not conduct these post hoc analyses on the out-of-sample predictions as the number of eligible effects was smaller.
Post-hoc analysis: Exploring the effect of excluding estimates of Zr in which we had reduced confidence
As described in our addendum to the methods, we identified a subset of estimates of Zr in which we had less confidence because of features of the submitted degrees of freedom. Excluding these effects in which we had lower confidence had minimal impact on the meta-analytic mean and the estimates of total I² and τ² for both blue tit and Eucalyptus meta-analyses, regardless of whether outliers were also excluded (Supplementary Table B.1).
Explaining Variation in Deviation Scores
None of the pre-registered predictors explained substantial variation in the deviation of submitted statistical effects from the meta-analytic mean (Table 5, Table 6). Note that the extremely high R²Conditional values from the analyses of continuous peer ratings as predictors of deviation scores are a function of the random effects, not the fixed effect of interest. These high values of R²Conditional result from the fact that each effect size was included in the analysis multiple times, to allow comparison with ratings from the multiple peer reviewers who reviewed each analysis; therefore, when we included Effect ID as a random effect, the observations within each random-effect category were identical.
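The toy example below, with invented data, illustrates why R²Conditional approaches 1 in these models: because each effect's deviation score is repeated once per reviewer, the EffectID random intercept absorbs essentially all of the variance. The Box-Cox lambda, sample sizes, and column names are arbitrary illustrations, not the values used in our analyses.

```r
# Toy illustration: each effect's (Box-Cox transformed) absolute deviation
# is repeated once per reviewer, so an EffectID random intercept explains
# nearly everything and conditional R2 approaches 1 while marginal R2 ~ 0.
library(lme4)
library(performance)

set.seed(1)
reviews <- data.frame(EffectID = rep(1:50, each = 4),     # 4 reviews per effect
                      rating   = runif(200, 0, 100))      # continuous peer rating
abs_dev <- abs(rnorm(50, 0, 0.4))                         # one deviation score per effect
lambda  <- 0.3                                            # arbitrary Box-Cox lambda
reviews$bc_dev <- (abs_dev[reviews$EffectID]^lambda - 1) / lambda  # Box-Cox transform

m <- lmer(bc_dev ~ rating + (1 | EffectID), data = reviews)  # fit may warn about singularity,
                                                             # mirroring issues noted later
r2(m)   # conditional R2 ~ 1, marginal R2 ~ 0
```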
Table 5: Summary metrics for registered models seeking to explain deviation (Box-Cox transformed absolute deviation scores) from the mean Zr as a function of Sorensen's index, categorical peer ratings, and continuous peer ratings for blue tit and Eucalyptus analyses, and as a function of the presence or absence of random effects (in the analysts' models) for Eucalyptus analyses. We report the coefficient of determination, R², for our models including only fixed effects as predictors of deviation, and we report R²Conditional, R²Marginal and the intra-class correlation (ICC) from our models that included both fixed and random effects. For all our models, we calculated the residual standard deviation σ and root mean squared error (RMSE).
Dataset R2 R
2Condional R2Marginal ICC σ RMSE N. Obs.
Deviaon explained by categorical rangs
blue t 0.0903 0.0067 0.0842 6.52e-01 6.32e-01 473
Eucalyptus 0.1319 0.0124 0.1209 1.06e+00 1.02e+00 346
Deviaon explained by connuous rangs
blue t 1.0000 2.00e-26 1.0000 1.63e-05 1.56e-12 473
Eucalyptus 0.9998 6.57e-30 0.9998 7.93e-03 7.09e-14 346
Deviaon explained by Sorensen's index
blue t 0.0011 0.681 0.676 124
Eucalyptus 0.0005 1.14 1.120 72
Deviaon explained by inclusion of random effects
blue t 0.0268 0.658 0.653 131
Eucalyptus 8.67e-08 1.12 1.100 79
1173
Table 6: Parameter estimates from models of Box-Cox transformed deviation scores as a function of continuous and categorical peer ratings, Sorensen's scores, and the inclusion of random effects. Standard errors (SE) and 95% confidence intervals (95% CI) are reported for all estimates, while t values, degrees of freedom and p-values are presented for fixed effects. Note that positive parameter estimates mean that as the predictor variable increases, so does the absolute value of the deviation from the meta-analytic mean.
Dataset | Parameter | Effect | Coeff. | SE | 95% CI | t | df | p-value

Deviation explained by inclusion of random effects
Eucalyptus | Intercept | | −2.53 | 0.27 | [−3.06, −1.99] | −9.31 | 77 | <0.001
Eucalyptus | Random effects | | 0.00 | 0.31 | [−0.60, 0.60] | 0.00 | 77 | >0.9

Deviation explained by mean Sorensen's index
Eucalyptus | Intercept | | −2.75 | 1.07 | [−4.85, −0.65] | −2.57 | 70 | 0.010
Eucalyptus | Sorensen Index | | 0.29 | 1.54 | [−2.74, 3.32] | 0.19 | 70 | 0.9
blue tit | Intercept | | −1.56 | 0.38 | [−2.30, −0.82] | −4.12 | 122 | <0.001
blue tit | Mean Sorensen Index | | 0.23 | 0.63 | [−1.00, 1.46] | 0.37 | 122 | 0.7

Deviation explained by continuous ratings
Eucalyptus | Intercept | Fixed | −2.52 | 0.06 | [−2.63, −2.40] | −42.58 | 342 | <0.001
Eucalyptus | Continuous Rating | Fixed | 6e-17 | 2e-10 | [−4e-10, 4e-10] | −3e-07 | 342 | >0.9
Eucalyptus | SD (Intercept) | Random (EffectID) | 0.53 | 0.04 | [0.45, 0.62]
Eucalyptus | SD (Observations) | Random (Residual) | 0.01 | 3e-04 | [0.01, 0.01]
blue tit | Intercept | Fixed | −1.41 | 0.03 | [−1.47, −1.35] | −46.54 | 469 | <0.001
blue tit | Continuous Rating | Fixed | −3e-15 | 1e-09 | [−2e-09, 2e-09] | −2e-06 | 469 | >0.9
blue tit | SD (Intercept) | Random (EffectID) | 0.34 | 0.02 | [0.30, 0.39]
blue tit | SD (Observations) | Random (Residual) | 2e-05 | 6e-07 | [2e-05, 2e-05]

Deviation explained by categorical ratings
Eucalyptus | Intercept | Fixed | −2.66 | 0.27 | [−3.18, −2.13] | −9.97 | 340 | <0.001
Eucalyptus | Publishable with major revisions | Fixed | 0.29 | 0.29 | [−0.27, 0.85] | 1.02 | 340 | 0.3
Eucalyptus | Publishable with minor revisions | Fixed | 0.01 | 0.28 | [−0.54, 0.56] | 0.04 | 340 | >0.9
Eucalyptus | Publishable as is | Fixed | 0.05 | 0.31 | [−0.55, 0.66] | 0.17 | 340 | 0.9
Eucalyptus | SD (Intercept) | Random (ReviewerID) | 0.39 | 0.09 | [0.25, 0.61]
Eucalyptus | SD (Observations) | Random (Residual) | 1.06 | 0.04 | [0.98, 1.15]
blue tit | Intercept | Fixed | −1.21 | 0.15 | [−1.50, −0.93] | −8.29 | 467 | <0.001
blue tit | Publishable with major revisions | Fixed | −0.23 | 0.15 | [−0.53, 0.07] | −1.50 | 467 | 0.13
blue tit | Publishable with minor revisions | Fixed | −0.23 | 0.15 | [−0.53, 0.07] | −1.52 | 467 | 0.13
blue tit | Publishable as is | Fixed | −0.15 | 0.17 | [−0.48, 0.18] | −0.89 | 467 | 0.4
blue tit | SD (Intercept) | Random (ReviewerID) | 0.20 | 0.05 | [0.13, 0.31]
blue tit | SD (Observations) | Random (Residual) | 0.65 | 0.02 | [0.61, 0.7]
Deviation Scores as explained by reviewer ratings
Effect Sizes (Zr)
We obtained reviews from 128 reviewers, each of whom reviewed analyses from a mean of 3.27 (range 1-11) analysis teams. Analyses of the blue tit dataset received a total of 240 reviews, and each analysis was reviewed by a mean of 3.87 (SD 0.71, range 3-5) reviewers. Analyses of the Eucalyptus dataset received a total of 178 reviews, and each analysis was reviewed by a mean of 4.24 (SD 0.79, range 3-6) reviewers. We tested for inter-rater reliability to examine how similarly reviewers rated each analysis and found approximately no agreement among reviewers: for continuous ratings, IRR was 0.01, and for categorical ratings, IRR was -0.14.
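The exact inter-rater reliability statistic is not restated in this section, so the snippet below shows just one common way to gauge agreement: an intraclass correlation from a mixed model of ratings grouped by the analysis being rated. All data and column names are invented for illustration.

```r
# Sketch: a simple intraclass correlation (ICC) as one gauge of inter-rater
# agreement; a near-zero ICC means reviewers of the same analysis barely agree.
# Invented data; not necessarily the IRR statistic reported in the text.
library(lme4)

set.seed(2)
ratings <- data.frame(analysis_id = rep(1:30, each = 4),   # each analysis rated 4 times
                      rating      = runif(120, 0, 100))    # continuous peer ratings

fit <- lmer(rating ~ 1 + (1 | analysis_id), data = ratings)
vc  <- as.data.frame(VarCorr(fit))
vc$vcov[1] / sum(vc$vcov)   # between-analysis variance / total variance = ICC
```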
Many of the models of deviation as a function of peer ratings had problems with convergence or singularity, due to sparse design matrices when both of our pre-registered random effects (EffectID and ReviewerID) were included (see Supplementary Table C.1). These issues persisted after increasing the tolerance and changing the optimizer. For both the Eucalyptus and blue tit datasets, models with continuous ratings as a predictor were singular when both pre-registered random effects were included. When using only categorical ratings as predictors, models converged only when specifying reviewer ID as a random effect. That model had an R²C of 0.09 and an R²M of 0.01. The model using the continuous ratings converged with either random effect in isolation, but not with both. We present results for the model using effect ID as a random effect because we expected it would be a more important driver of variation in deviation scores. That model had an R²C of 1 and an R²M of 0.01 for the blue tit dataset, and an R²C of 1 and an R²M of 0.01 for the Eucalyptus dataset. Neither continuous nor categorical reviewer ratings of the analyses meaningfully predicted deviation from the meta-analytic mean (Table 6, Figure 3). We re-ran the multi-level meta-analysis with a fixed effect for the categorical publishability ratings and found no difference in mean standardised effect sizes among publishability ratings (Supplementary Figure B.1).
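As a companion to the subset sketch given earlier, the lines below indicate how a fixed effect (moderator) for a categorical rating can be added to such a multilevel meta-analysis in metafor. This reuses the placeholder data frame `dat` from that earlier sketch; it is an illustration of the approach, not our exact model.

```r
# Sketch: add a fixed effect (moderator) for the categorical rating to the
# multilevel meta-analysis; the omnibus QM test in the output asks whether
# mean Zr differs among rating groups. Reuses the placeholder `dat` above.
library(metafor)
m_mod <- rma.mv(yi = Zr, V = VZr, mods = ~ worst_rating,
                random = ~ 1 | TeamIdentifier / EffectID, data = dat)
summary(m_mod)   # coefficients per rating level plus the QM moderator test
```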
Figure 3: Violin plot of Box-Cox transformed deviation from the meta-analytic mean as a function of categorical peer rating for a) blue tit and b) Eucalyptus. Grey points for each rating group denote model-estimated marginal mean deviation, and error bars denote 95% CI of the estimate.
Out-of-sample predictions (yi)
Some models of the influence of reviewer ratings on out-of-sample predictions (yi) had issues with convergence and singularity of fit (see Supplementary Table C.2), and those models that converged and were not singular showed no strong relationship (Supplementary Figures C.2 and C.3), as with the Zr analyses.
Deviation scores as explained by the distinctiveness of variables in each analysis
Effect Size (Zr)
We employed Sorensen's index to calculate the distinctiveness of the set of predictor variables used in each model (Figure 4). The mean Sorensen's score for blue tit analyses was 0.69 (range 0.55-0.98), and for Eucalyptus analyses was 0.59 (range 0.43-0.86).
We found no meaningful relationship between the distinctiveness of the variables selected and deviation from the meta-analytic mean (Table 6, Figure 4) for either blue tit (mean 0.23, 95% CI [-1.00, 1.46]) or Eucalyptus effects (mean 0.29, 95% CI [-2.74, 3.32]).
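For concreteness, the sketch below shows one way to compute Sorensen dissimilarity between the predictor-variable sets of different analyses, 1 − 2|A ∩ B| / (|A| + |B|). The variable names are invented, and the exact aggregation into a per-analysis "mean Sorensen's index" used in our analyses may differ from this toy summary.

```r
# Sketch: Sorensen dissimilarity between two analyses' predictor sets.
# Higher values indicate more distinct variable choices. Toy variable names.
sorensen_dissim <- function(a, b) {
  1 - 2 * length(intersect(a, b)) / (length(a) + length(b))
}

vars <- list(a1 = c("brood_size", "treatment", "sex"),
             a2 = c("brood_size", "treatment", "tarsus"),
             a3 = c("treatment", "hatch_date"))

pairs <- combn(names(vars), 2)
pairwise <- apply(pairs, 2, function(p) sorensen_dissim(vars[[p[1]]], vars[[p[2]]]))
data.frame(pair = apply(pairs, 2, paste, collapse = " vs "), dissimilarity = pairwise)

# A per-analysis distinctiveness score could then be the mean of its pairwise values.
```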
Figure 4: Fitted model of the Box-Cox-transformed deviation score (deviation in effect size from the meta-analytic mean) as a function of the mean Sorensen's index showing distinctiveness of the set of predictor variables for a) blue tit and b) Eucalyptus. Grey ribbons on predicted values are 95% CIs.
Out-of-sample predictions
As with the Zr estimates, we did not observe any convincing relationships between deviation scores of out-of-sample predictions and Sorensen's index values. Please see Supplementary Material C.4.2.
Deviation scores as explained by the inclusion of random effects
Effect Size (Zr)
Only three blue tit analyses did not include random effects, which is below the pre-registered threshold for fitting a model of the Box-Cox transformed deviation from the meta-analytic mean as a function of whether the analysis included random effects. However, 17 Eucalyptus analyses included only fixed effects, which crossed our pre-registered threshold. Consequently, we performed this analysis for the Eucalyptus dataset only. There was no relationship between random-effect inclusion and deviation from the meta-analytic mean among the Eucalyptus analyses (Table 6, Figure 5).
Figure 5: Violin plot of mean Box-Cox transformed deviation from the meta-analytic mean as a function of random-effects inclusion in Eucalyptus analyses. '1' indicates random effects were included in the analyst's model, while '0' indicates no random effects were included. White points for each group of analyses denote model-estimated marginal mean deviation, and error bars denote 95% CI of the estimate.
Out-of-sample predictions
As with the Zr estimates, we did not examine the possibility of a relationship between the inclusion of random effects and the deviation scores of the blue tit out-of-sample predictions. When we examined the possibility of this relationship for the Eucalyptus effects, we found consistent evidence of somewhat higher Box-Cox-transformed deviation values for models including a random effect, meaning that models including random effects averaged slightly higher deviation from the meta-analytic means (Supplementary Figure C.5).
Multivariate Analysis: Effect size (Zr) and out-of-sample predictions (yi)
Like the univariate models, the multivariate models did a poor job of explaining deviations from the meta-analytic mean. Because we pre-registered a multivariate model that contained collinear predictors, which produce results that are not readily interpretable, we present these models in the supplement. We also had difficulty with convergence and singularity for multivariate models of the out-of-sample (yi) results, and had to adjust which random effects we included (Supplementary Table C.7). However, no multivariate analyses of Eucalyptus out-of-sample results avoided problems of convergence or singularity, no matter which random effects we included (Supplementary Table C.7). We therefore present no multivariate Eucalyptus yi models. We present parameter estimates from multivariate Zr models for both datasets (Supplementary Tables C.5, C.6) and from yi models for the blue tit dataset (Supplementary Tables C.8, C.9). We include interpretation of the results from these models in the supplement, but the results do not change the interpretations we present above based on the univariate analyses.
Discussion
When a large pool of ecologists and evolutionary biologists analyzed the same two datasets to answer the corresponding two research questions, they produced substantially heterogeneous sets of answers. Although the variability in analytical outcomes was high for both datasets, the patterns of this variability differed distinctly between them. For the blue tit dataset, there was nearly continuous variability across a wide range of Zr values. In contrast, for the Eucalyptus dataset, there was less variability across most of the range, but more striking outliers at the tails. Among out-of-sample predictions, there was again almost continuous variation across a wide range (2 SD) among blue tit estimates. For Eucalyptus, out-of-sample predictions were also notably variable, with about half the predicted stem count values at <2 but the other half being much larger, and ranging to nearly 40 stems per 15 m x 15 m plot. We investigated several hypotheses for drivers of this variability within datasets, but found little support for any of these. Most notably, even when we excluded analyses that had received one or more poor peer reviews, the heterogeneity in results largely persisted. Regardless of what drives the variability, the existence of such dramatically heterogeneous results when ecologists and evolutionary biologists seek to answer the same questions with the same data should trigger conversations about how ecologists and evolutionary biologists analyze data and interpret the results of their own analyses and those of others in the literature [e.g., 11, 20, 49, 50].
Our observation of substantial heterogeneity due to analytical decisions is consistent with a growing body of work, much of it from the quantitative social sciences [e.g., 11, 17–21]. In all of these studies, when volunteers from the discipline analyzed the same data, they produced a worryingly diverse set of answers to a pre-set question. This diversity always included a wide range of effect sizes, and in most cases, even involved effects in opposite directions. Thus, our result should not be viewed as an anomalous outcome from two particular datasets, but instead as evidence from additional disciplines regarding the heterogeneity that can emerge from analyses of complex datasets to answer questions in probabilistic science. Not only is our major observation consistent with other studies, it is, itself, robust because it derived primarily from simple forest plots that we produced based on a small set of decisions that were mostly registered before data gathering and which conform to widely accepted meta-analytic practices.
Unlike the strong pattern we observed in the forest plots, our other analyses, both registered and post hoc, produced either inconsistent patterns, weak patterns, or the absence of patterns. Our registered analyses found that deviations from the meta-analytic mean by individual effect sizes (Zr) or the predicted values of the dependent variable (yi) were poorly explained by our hypothesized predictors: peer rating of each analysis team's methods section, a measurement of the distinctiveness of the set of predictor variables included in each analysis, or whether the model included random effects. However, in our post hoc analyses, we found that dropping analyses identified as unpublishable or in need of major revision by at least one reviewer modestly reduced the observed heterogeneity among the Zr outcomes, but only for Eucalyptus analyses, apparently because this led to the dropping of the major outlier. This limited role for peer review in explaining the variability in our results should be interpreted cautiously because the inter-rater reliability among peer reviewers was extremely low, and at least some analyses that appeared flawed to us were not marked as flawed by reviewers. However, the hypothesis that poor quality analyses drove the heterogeneity we observed was also contradicted by our observation that analysts' self-declared statistical expertise appeared unrelated to heterogeneity. When we retained only analyses from teams including at least one member with a high self-declared level of expertise, heterogeneity among effect sizes remained high. Thus, our results suggest that a lack of statistical expertise is not the primary factor responsible for the heterogeneity we observed, although further work is merited before rejecting a role for statistical expertise. Not surprisingly, simply dropping outlier values of Zr for Eucalyptus analyses, which had more extreme outliers, led to less observable heterogeneity in the forest plots, and also to reductions in our quantitative measures of heterogeneity. We did not observe a similar effect in the blue tit dataset because that dataset had outliers that were much less extreme and instead had more variability across the core of the distribution.
Our major observations raise two broad questions: why was the variability among results so high, and why did the pattern of variability differ between our two datasets? One important and plausible answer to the first question is that much of the heterogeneity derives from the lack of a precise relationship between the two biological research questions we posed and the data we provided. This lack of a precise relationship between data and question creates many opportunities for different model specifications, and so may inevitably lead to varied analytical outcomes [50]. However, we believe that the research questions we posed are consistent with the kinds of research questions that ecologists and evolutionary biologists typically work from. When designing the two biological research questions, we deliberately sought to represent the level of specificity we typically see in these disciplines. This level of specificity is evident in the research questions posed by some recent meta-analyses in these fields:
"how [does] urbanisation impact mean phenotypic values and phenotypic variation … [in] paired urban and non-urban comparisons of avian life-history traits" [51]
"[what are] the effects of ocean acidification on the crustacean exoskeleton, assessing both exoskeletal ion content (calcium and magnesium) and functional properties (biomechanical resistance and cuticle thickness)" [52]
"[what is] the extent to which restoration affects both the mean and variability of biodiversity outcomes … [in] terrestrial restoration" [53]
"[does] drought stress [have] a negative, positive, or null effect on aphid fitness" [54]
"[what is] the influence of nitrogen-fixing trees on soil nitrous oxide emissions" [55]
There is not a single precise answer to any of these questions, nor to the questions we posed to analysts in our study. And this lack of single clear answers will obviously continue to cause uncertainty, since ecologists and evolutionary biologists conceive of the different answers from the different statistical models as all being answers to the same general question. A possible response would be a call to avoid these general questions in favor of much more precise alternatives [50]. However, the research community rewards researchers who pose broad questions [56], and so researchers are unlikely to narrow their scope without a change in incentives. Further, we suspect that even if individual studies specified narrow research questions, other scientists would group these narrower questions into broader categories, for instance in meta-analyses, because it is these broader and more general questions that often interest the research community.
Although variability in statistical outcomes among analysts may be inevitable, our results raise questions about why this variability differed between our two datasets. We are particularly interested in the differences in the distribution of Zr, since the distributions of out-of-sample predictions were on different scales for the two datasets, thus limiting the value of comparisons. The forest plots of Zr from our two datasets showed distinct patterns, and these differences are consistent with several alternative hypotheses. The results submitted by analysts of the Eucalyptus dataset showed a small average (close to zero), with most estimates also close to zero (± 0.2), though about a third were far enough above or below zero to cross the traditional threshold of statistical significance. There was a small number of striking outliers that were very far from zero. In contrast, the results submitted by analysts of the blue tit dataset showed an average much further from zero (−0.35) and a much greater spread in the core distribution of estimates across the range of Zr values (± 0.5 from the mean), with a few modest outliers. So, why was there more spread in effect sizes (across the estimates that are not outliers) in the blue tit analyses relative to the Eucalyptus analyses?
One possible explanation for the lower heterogeneity among most Eucalyptus Zr effects is that weak relationships may limit the opportunities for heterogeneity in analytical outcome. Some evidence for this idea comes from two sets of "many labs" studies in psychology [4, 57]. In these studies, many independent lab groups each replicated a large set of studies, including, for each study, the experiment, data collection, and statistical analyses. These studies showed that, when the meta-analytic mean across the replications from different labs was small, there was much less heterogeneity among the outcomes than when the mean effect sizes were large [4, 57]. Of course, a weak average effect size would not prevent divergent effects in all circumstances. As we saw with the Eucalyptus analyses, taking a radically smaller subset of the data can lead to dramatically divergent effect sizes even when the mean with the full dataset is close to zero.
Our observation that dramatic sub-setting in the Eucalyptus dataset was associated with correspondingly dramatic divergence in effect sizes leads us towards another hypothesis to explain the differences in heterogeneity between the Eucalyptus and blue tit analysis sets. It may be that when analysts frequently divide a dataset into subsets, the result will be greater heterogeneity in analytical outcome for that dataset. Although we saw sub-setting associated with dramatic outliers in the Eucalyptus dataset, nearly all other analyses of Eucalyptus data used very close to the same set of 351 samples, and as we saw, these effects did not vary substantially. However, analysts often analyzed only a subset of the blue tit data, and as we observed, sample sizes were much more variable among blue tit effects, and the effects themselves were also much more variable. It is important to note here that subsets of data may differ from each other for biological reasons, but they may also differ due to sampling error. Sampling error is a function of sample size, and sub-samples are, by definition, smaller samples, and so are more subject to variability in effects due to sampling error [58].

Other features of datasets are also plausible candidates for driving heterogeneity in analytical outcomes, including features of covariates. In particular, relationships between covariates and the response variable, as well as relationships between covariates and the primary independent variable (collinearity), can strongly influence the modeled relationship between the independent variable of interest and the dependent variable [59, 60]. Therefore, inclusion or exclusion of these covariates can drive heterogeneity in effect sizes (Zr). Also, as we saw with the two most extreme Zr values from the blue tit analyses, in multivariate models with collinear predictors, extreme effects can emerge when estimating partial correlation coefficients due to high collinearity, and conclusions can differ dramatically depending on which relationship receives the researcher's attention. Therefore, differences between datasets in the presence of strong and/or collinear covariates could influence the differences in heterogeneity in results among those datasets.
Although it is too early in the many-analyst research program to conclude which analytical decisions or which features of datasets are the most important drivers of heterogeneity in analytical outcomes, we must still grapple with the possibility that analytical outcomes may vary substantially based on the choices we make as analysts. If we assume that, at least sometimes, different analysts will produce dramatically different statistical outcomes, what should we do as ecologists and evolutionary biologists? We review some ideas below.
The easiest path forward after learning about this analytical heterogeneity would be simply to continue with "business as usual", where researchers report results from a small number of statistical models. A case could be made for this path based on our results. For instance, among the blue tit analyses, the precise values of the estimated Zr effects varied substantially, but the average effect was convincingly different from zero, and a majority of individual effects (84%) were in the same direction. Arguably, many ecologists and evolutionary biologists appear primarily interested in the direction of a given effect and the corresponding p-value [61], and so the variability we observed when analyzing the blue tit dataset may not worry these researchers. Similarly, most effects from the Eucalyptus analyses were relatively close to zero, and about two-thirds of these effects did not cross the traditional threshold of statistical significance. Therefore, a large proportion of people analyzing these data would conclude that there was no effect, and this is consistent with what we might conclude from the meta-analysis.
However, we find the counter-arguments to "business as usual" compelling. For the blue tits, a substantial minority of calculated effects (28%) would be interpreted by many biologists as indicating the absence of an effect, and there were three traditionally 'significant' effects in the opposite direction to the average. The qualitative conclusions of analysts also reflected substantial variability, with fully half of teams drawing a conclusion distinct from the one we draw from the distribution as a whole. These teams with different conclusions were either uncertain about the negative relationship between competition and nestling growth, or they concluded that effects were mixed or absent. For the Eucalyptus analyses, this issue is more concerning. Around two-thirds of effects had confidence intervals overlapping zero, and of the third of analyses with confidence intervals excluding zero, almost half were positive and the rest were negative. Accordingly, the qualitative conclusions of the Eucalyptus teams were spread across the full range of possibilities. But even this framing of the problems is optimistic.
A potentially larger argument against "business as usual" is that it provides the raw material for biasing the literature. When different model specifications readily lead to different results, analysts may be tempted to report the result that appears most interesting, or that is most consistent with expectation [7, 12]. There is growing evidence that researchers in ecology and evolutionary biology often report a biased subset of the results they produce [62, 63], and that this bias exaggerates the average size of effects in the published literature by between 30 and 150% [9, 48]. The bias then accumulates in meta-analyses, apparently more than doubling the rate of conclusions of "statistical significance" in published meta-analyses above what would have been found in the absence of bias [48]. Thus, "business as usual" does not just create noisy results, it helps create systematically misleading results.
Conclusions
Overall, our results suggest to us that, where there is a diverse set of plausible analysis options, no single analysis should be considered a complete or reliable answer to a research question. We contend that ecologists and evolutionary biologists typically do multiple analyses (as many of our analyst teams did); however, some of these analyses don't make it into the published manuscript. Further, because of the evidence that ecologists and evolutionary biologists often present a biased subset of the analyses they conduct [48, 62, 63], we do not expect that even a collection of different effect sizes from different studies will accurately represent the true distribution of effects [48]. Therefore, we believe that an increased level of skepticism of the outcomes of single analyses, or even single meta-analyses, is warranted going forward. We recognize that some researchers have long maintained a healthy level of skepticism of individual studies as part of sound and practical scientific practice, and it is possible that those researchers will be neither surprised nor concerned by our results. However, we doubt that many researchers are sufficiently aware of the potential problems of analytical flexibility to be appropriately skeptical.
If we are skeptical of single analyses, the path forward may be multiple analyses per dataset. One possibility is the traditional robustness or sensitivity check [e.g., 64, 65], in which the researcher presents several alternative versions of an analysis to demonstrate that the result is 'robust' [66]. Unfortunately, robustness checks are at risk of the same potential reporting biases found in other studies [11], especially given the relatively few models typically presented. However, these risks could be minimized by running more models and doing so with pre-registration or a registered report.

Another option is model averaging. Averages across models often perform well [e.g., 67], and in some forms this may be a relatively simple solution. As most often practiced in ecology and evolutionary biology, model averaging involves first identifying a small suite of candidate models [see 13], then using Akaike weights, based on Akaike's Information Criterion (AIC), to calculate weighted averages for parameter estimates from those models. Again, a small number of models limits the exploration of specification space, although we could examine a larger number of models. However, there are more concerning limitations. The largest of these is that averaging regression coefficients is problematic when models differ in interaction terms or collinear variables [68]. Additionally, weighting by AIC may often be inconsistent with our modelling goals. AIC balances the trade-off between model complexity and predictive ability, but penalizing models for complexity may not be suited to testing hypotheses about causation. So, AIC may often not offer the weight we want to use for an average, and we may also not wish simply to generate an average.

Instead, if we hope to understand an extensive universe of possible modelling outcomes, we could conduct a multiverse analysis, possibly with a specification curve [10, 49]. This could mean running hundreds or thousands of models (or more!) to examine the distribution of possible effects, and to see how different specification choices map onto these effects. However, there is a trade-off between efficiently exploring large areas of specification space and limiting the analyses to biologically plausible specifications. Instead of simply identifying modelling decisions and creating all possible combinations for the multiverse, a researcher could attempt to prevent implausible combinations, though the more variables in the dataset, the more difficult this becomes. To make this easier, one could recruit many analysts to each designate one or a few plausible specifications, as with our 'many analyst' study [11]. An alternative that may be more labor intensive for the primary analyst, but which may lead to a more plausible set of models, could involve hypothesizing about causal pathways with DAGs (directed acyclic graphs) [69] to constrain the model set. Devoting this effort to thoughtful multiverse specifications, possibly combined with pre-registration to hinder undisclosed data dredging, seems worthy of consideration.
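To make two of the options above concrete, the sketch below computes Akaike weights for a small candidate set and then brute-forces a tiny specification multiverse, recording the focal slope under each specification. The data, variable names, and model set are all invented for illustration, and the caveats above about averaging coefficients across models still apply.

```r
# Illustration with invented data of (1) Akaike weights and (2) a small
# brute-force specification multiverse for a focal predictor x.
set.seed(3)
d <- data.frame(y = rnorm(100), x = rnorm(100), z1 = rnorm(100), z2 = rnorm(100))

# (1) Akaike weights: w_i = exp(-0.5 * delta_i) / sum_j exp(-0.5 * delta_j)
cand  <- list(m1 = lm(y ~ x, d), m2 = lm(y ~ x + z1, d), m3 = lm(y ~ x + z1 + z2, d))
delta <- sapply(cand, AIC) - min(sapply(cand, AIC))
w     <- exp(-0.5 * delta) / sum(exp(-0.5 * delta))
round(w, 3)   # model weights (averaging coefficients carries the caveats noted above)

# (2) Multiverse: fit every combination of optional covariates and collect
# the estimated slope for x under each specification.
specs  <- list(character(0), "z1", "z2", c("z1", "z2"))
slopes <- sapply(specs, function(s) {
  f <- reformulate(c("x", s), response = "y")
  coef(lm(f, data = d))["x"]
})
summary(slopes)   # distribution of the focal effect across specifications
```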
Although we have reviewed a variety of potential responses to the existence of variability in analytical outcomes, we certainly do not wish to imply that this is a comprehensive set of possible responses. Nor do we wish to imply that the opinions we have expressed about these options are correct. Determining how the disciplines of ecology and evolutionary biology should respond to knowledge of the variability in analytical outcomes will benefit from the contribution and discussion of ideas from across these disciplines. We look forward to learning from these discussions and to seeing how these disciplines ultimately respond.
Declarations
Ethics approval and consent to participate
We obtained permission to conduct this research from the Whitman College Institutional Review Board (IRB). As part of this permission, the IRB approved the consent form (https://osf.io/xyp68/) that all participants completed prior to joining the study.
Consent for publication
Not applicable
Availability of data and materials
All data cleaning and preparation for our analyses was conducted in R (R Core Team 2022) and is publicly archived at https://zenodo.org/doi/10.5281/zenodo.10046152. Please see the session info for the full list of packages, and their citations, used in our analysis pipeline. We built an R package, ManyEcoEvo, to conduct the analyses described here; this same package can be used to reproduce our analyses or to replicate them using alternative datasets.
Competing interests
The authors declare that they have no competing interests.
Funding
EG's contributions were supported by an Australian Government Research Training Program Scholarship, an AIMOS top-up scholarship (2022) and a Melbourne Centre of Data Science Doctoral Academy Fellowship (2021). FF's contributions were supported by ARC Future Fellowship FT150100297.
Author’s contribuons 1509
HF, THP and FF conceptualized the project. PV provided raw data for Eucalyptus analyses and SG and 1510
THP provided raw data for blue t analyses. DGH, HF and THP prepared surveys for collecng 1511
parcipang analysts and reviewer’s data. EG, HF, THP, PV, SN and FF planned the analyses of the 1512
data provided by our analysts and reviewers, EG, HF, and THP curated the data, EG and HF wrote the 1513
soware code to implement the analyses and prepare data visualisaons. EG ensured that analyses 1514
were documented and reproducible. THP and HF administered the project, including coordinang 1515
with analysts and reviewers. FF provided funding for the project. THP, HF, and EG wrote the 1516
manuscript. Authors listed alphabecally contributed analyses of the primary datasets or reviews of 1517
analyses. All authors read and approved the final manuscript. 1518
Acknowledgements 1519
Not applicable 1520
References
1. Arif S, MacNeil MA. Applying the structural causal model framework for observational causal inference in ecology. Ecological Monographs. 2023;93:e1554.
2. Atkinson J, Brudvig LA, Mallen-Cooper M, Nakagawa S, Moles AT, Bonser SP. Terrestrial ecosystem restoration increases biodiversity and reduces its variability, but not to reference levels: A global meta-analysis. Ecology Letters. 2022;25:1725–37.
3. Auspurg K, Brüderl J. Has the credibility of the social sciences been credibly destroyed? Reanalyzing the "many analysts, one data set" project. Socius. 2021;7:23780231211024421.
4. Schloerke B, Cook D, Larmarange J, Briatte F, Marbach M, Thoen E, et al. GGally: Extension to 'ggplot2'. 2022.
5. Baselga A, Orme D, Villeger S, De Bortoli J, Leprieur F, Logez M, et al. Package "betapart". 2023.
6. Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. Journal of Statistical Software. 2015;67:1–48.
7. Bolker B, Robinson D, Menne D, Gabry J, Buerkner P, Hau C, et al. Package "broom.mixed". 2022.
8. Borenstein M, Higgins JPT, Hedges L, Rothstein H. Basics of meta-analysis: I² is not an absolute measure of heterogeneity. Research Synthesis Methods. 2017;8:5–18.
9. Botvinik-Nezer R, Holzmeister F, Camerer CF, Dreber A, Huber J, Johannesson M, et al. Variability in the analysis of a single neuroimaging dataset by many teams. Nature. 2020;582:84–8.
10. Breznau N, Rinke EM, Wuttke A, Nguyen HHV, Adem M, Adriaans J, et al. Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proceedings of the National Academy of Sciences. 2022;119:e2203150119.
11. Briga M, Verhulst S. Mosaic metabolic ageing: Basal and standard metabolic rates age in opposite directions and independent of environmental quality, sex and life span in a passerine. Functional Ecology. 2021;35:1055–68.
12. Burnham KP, Anderson DR. Model selection and multimodel inference: A practical information-theoretic approach. 2nd edition. New York: Springer-Verlag; 2002.
13. Cade BS. Model averaging and muddled multimodel inferences. Ecology. 2015;96:2370–82.
14. Capilla-Lasheras P, Thompson MJ, Sánchez-Tójar A, Haddou Y, Branston CJ, Réale D, et al. A global meta-analysis reveals higher variation in breeding phenology in urban birds than in their non-urban neighbours. Ecology Letters. 2022;25:2552–70.
15. Coretta S, Casillas JV, Roessig S, Franke M, Ahn B, Al-Hoorie AH, et al. Multidimensional signals and analytic flexibility: Estimating degrees of freedom in human-speech analyses. Advances in Methods and Practices in Psychological Science. 2023;6:25152459231162567.
16. de Kogel CH. Long-term effects of brood size manipulation on morphological development and sex-specific mortality of offspring. Journal of Animal Ecology. 1997;66:167–78.
17. Deressa T, Stern D, Vangronsveld J, Minx J, Lizin S, Malina R, et al. More than half of statistically significant research findings in the environmental sciences are actually not. EcoEvoRxiv. 2023. https://doi.org/10.32942/X24G6Z.
18. Dormann CF, Elith J, Bacher S, Buchmann C, Carl G, Carré G, et al. Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography. 2013;36:27–46.
19. Fanelli D, Costas R, Ioannidis JPA. Meta-assessment of bias in science. Proceedings of the National Academy of Sciences. 2017;114:3714–9.
20. Fanelli D, Ioannidis JPA. US studies may overestimate effect sizes in softer research. Proceedings of the National Academy of Sciences. 2013;110:15031–6.
21. Fidler F, Burgman MA, Cumming G, Buttrose R, Thomason N. Impact of criticism of null-hypothesis significance testing on statistical reporting practices in conservation biology. Conservation Biology. 2006;20:1539–44.
22. Fidler F, Chee YE, Wintle BC, Burgman MA, McCarthy MA, Gordon A. Metaresearch for evaluating reproducibility in ecology and evolution. BioScience. 2017;67:282–9.
23. Forstmeier W, Wagenmakers E-J, Parker TH. Detecting and avoiding likely false-positive findings – a practical guide. Biological Reviews. 2017;92:1941–68.
24. Fraser H, Parker T, Nakagawa S, Barnett A, Fidler F. Questionable research practices in ecology and evolution. PLOS ONE. 2018;13:e0200303.
25. Gelman A, Weakliem D. Of beauty, sex, and power. American Scientist. 2009;97:310–6.
26. Gelman A, Loken E. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University. 2013.
27. Grueber CE, Nakagawa S, Laws RJ, Jamieson IG. Multimodel inference in ecology and evolution: Challenges and solutions. Journal of Evolutionary Biology. 2011;24:699–711.
28. Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 2003;327:557–60.
29. Huntington-Klein N, Arenas A, Beam E, Bertoni M, Bloem JR, Burli P, et al. The influence of hidden researcher decisions in applied microeconomics. Economic Inquiry. 2021;59:944–60.
30. Jennions MD, Lortie CJ, Rosenberg MS, Rothstein HR. Publication and related biases. In: Koricheva J, Gurevitch J, Mengersen K, editors. Handbook of meta-analysis in ecology and evolution. Princeton, USA: Princeton University Press; 2013. p. 207–36.
31. Kimmel K, Avolio ML, Ferraro PJ. Empirical evidence of widespread exaggeration bias and selective reporting in ecology. Nature Ecology & Evolution. 2023. https://doi.org/10.1038/s41559-023-02144-3.
32. Klein RA, Ratliff KA, Vianello M, Adams RB Jr, Bahník Š, Bernstein MJ, et al. Investigating variation in replicability: A "many labs" replication project. Social Psychology. 2014;45:142–52.
33. Klein RA, Vianello M, Hasselman F, Adams BG, Adams RB, Alper S, et al. Many Labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science. 2018;1:443–90.
34. Knight K. Mathematical statistics. New York: Chapman & Hall; 2000.
35. Kou-Giesbrecht S, Menge DNL. Nitrogen-fixing trees increase soil nitrous oxide emissions: A meta-analysis. Ecology. 2021;102:e03415.
36. Kuznetsova A, Brockhoff PB, Christensen RHB. lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software. 2017;82:1–26.
37. Leybourne DJ, Preedy KF, Valentine TA, Bos JIB, Karley AJ. Drought has negative consequences on aphid fitness and plant vigor: Insights from a meta-analysis. Ecology and Evolution. 2021;11:11915–29.
38. Lu X, White H. Robustness checks and robustness tests in applied economics. Journal of Econometrics. 2014;178:194–206.
39. Lüdecke D, Ben-Shachar MS, Patil I, Waggoner P, Makowski D. performance: An R package for assessment, comparison and testing of statistical models. Journal of Open Source Software. 2021;6:3139.
40. Luke SG. Evaluating significance in linear mixed-effects models in R. Behavior Research Methods. 2017;49:1494–502.
41. Miles C. Testing market-based instruments for conservation in northern Victoria. In: Norton T, Lefroy T, Bailey K, Unwin G, editors. Biodiversity: Integrating conservation and production: Case studies from Australian farms, forests and fisheries. Melbourne, Australia: CSIRO Publishing; 2008. p. 133–46.
42. Morrissey MB, Ruxton GD. Multiple regression is not multiple regressions: The meaning of multiple regression and the non-problem of collinearity. Philosophy, Theory, and Practice in Biology. 2018;10.
43. Nakagawa S, Cuthill IC. Effect size, confidence interval and statistical significance: A practical guide for biologists. Biological Reviews. 2007;82:591–605.
44. Nakagawa S, Noble DW, Senior AM, Lagisz M. Meta-evaluation of meta-analysis: Ten appraisal questions for biologists. BMC Biology. 2017;15:18.
45. Nicolaus M, Michler SPM, Ubels R, Velde M van der, Komdeur J, Both C, et al. Sex-specific effects of altered competition on nestling growth and survival: An experimental manipulation of brood size and sex ratio. Journal of Animal Ecology. 2009;78:414–26.
46. Noble DWA, Lagisz M, O'Dea RE, Nakagawa S. Nonindependence and sensitivity analyses in ecological and evolutionary meta-analyses. Molecular Ecology. 2017;26:2410–25.
47. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349:aac4716.
48. Parker TH, Forstmeier W, Koricheva J, Fidler F, Hadfield JD, Chee YE, et al. Transparency in ecology and evolution: Real problems, real solutions. Trends in Ecology & Evolution. 2016;31:711–9.
49. Parker TH, Yang Y. Exaggerated effects in ecology. Nature Ecology & Evolution. 2023. https://doi.org/10.1038/s41559-023-02156-z.
50. Pei Y, Forstmeier W, Wang D, Martin K, Rutkowska J, Kempenaers B. Proximate causes of infertility and embryo mortality in captive zebra finches. The American Naturalist. 2020;196:577–96.
51. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2022.
52. Rosenberg MS. Moment and least-squares based approaches to meta-analytic inference. In: Koricheva J, Gurevitch J, Mengersen K, editors. Handbook of meta-analysis in ecology and evolution. Princeton, USA: Princeton University Press; 2013. p. 108–24.
53. Royle NJ, Hartley IR, Owens IPF, Parker GA. Sibling competition and the evolution of growth rates in birds. Proceedings of the Royal Society B: Biological Sciences. 1999;266:923–32.
54. Schweinsberg M, Feldman M, Staub N, Akker OR van den, Aert RCM van, Assen M van, et al. Same data, different conclusions: Radical dispersion in empirical results when independent analysts operationalize and test the same hypothesis. Organizational Behavior and Human Decision Processes. 2021;165:228–49.
55. Senior AM, Grueber CE, Kamiya T, Lagisz M, O'Dwyer K, Santos ESA, et al. Heterogeneity in ecological and evolutionary meta-analyses: Its magnitude and implications. Ecology. 2016;97:3293–9.
56. Shavit A, Ellison AM. Stepping in the same river twice: Replication in biological research. New Haven, Connecticut, USA: Yale University Press; 2017.
57. Siegel KR, Kaur M, Grigal AC, Metzler RA, Dickinson GH. Meta-analysis suggests negative, but pCO2-specific, effects of ocean acidification on the structural and functional properties of crustacean biomaterials. Ecology and Evolution. 2022;12:e8922.
58. Silberzahn R, Uhlmann EL, Martin DP, Anselmi P, Aust F, Awtrey E, et al. Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science. 2018;1:337–56.
59. Simons DJ, Shoda Y, Lindsay DS. Constraints on generality (COG): A proposed addition to all empirical papers. Perspectives on Psychological Science. 2017. https://doi.org/10.1177/174569161770863.
60. Simonsohn U, Simmons JP, Nelson LD. Specification curve: Descriptive and inferential statistics on all reasonable specifications. SSRN Electronic Journal. 2015. https://doi.org/10.2139/ssrn.2694998.
61. Simonsohn U, Simmons JP, Nelson LD. Specification curve analysis. Nature Human Behaviour. 2020;4:1208–14.
62. Steegen S, Tuerlinckx F, Gelman A, Vanpaemel W. Increasing transparency through a multiverse analysis. Perspectives on Psychological Science. 2016;11:702–12.
63. Taylor JW, Taylor KS. Combining probabilistic forecasts of COVID-19 mortality in the United States. European Journal of Operational Research. 2023;304:25–41.
64. Dancho M, Vaughan D. timetk: A tool kit for working with time series. 2023.
65. Vander Werf E. Lack's clutch size hypothesis: An examination of the evidence using meta-analysis. Ecology. 1992;73:1699–705.
66. Ver Hoef JM. Who invented the delta method? The American Statistician. 2012;66:124–7.
67. Verhulst S, Holveck MJ, Riebel K. Long-term effects of manipulated natal brood size on metabolic rate in zebra finches. Biology Letters. 2006;2:478–80.
68. Vesk PA, Morris WK, McCallum W, Apted R, Miles C. Processes of woodland eucalypt regeneration: Lessons from the Bush Returns trial. Proceedings of the Royal Society of Victoria. 2016;128:54–63.
69. Viechtbauer W. Conducting meta-analyses in R with the metafor package. Journal of Statistical Software. 2010;36:1–48.
70. Yang Y, Sánchez-Tójar A, O'Dea RE, Noble DWA, Koricheva J, Jennions MD, et al. Publication bias impacts on effect size, statistical power, and magnitude (type M) and sign (type S) errors in ecology and evolutionary biology. BMC Biology. 2023;21:71.