Multidimensional signals and analytic flexibility: Estimating degrees of freedom in human speech analyses

Stefano Coretta*1, 2, Joseph V. Casillas3, Simon Roessig4, Michael Franke5, Byron Ahn6, Ali H. Al-Hoorie7, Jalal Al-Tamimi8, Najd E. Alotaibi9, Mohammed K. AlShakhori10, Ruth M. Altmiller11, Pablo Arantes12, Angeliki Athanasopoulou13, Melissa M. Baese-Berk14, George Bailey15, Cheman Baira A Sangma16, Eleonora J. Beier17, Gabriela M. Benavides18, Nicole Benker19, Emelia P. BensonMeyer20, Nina R. Benway21, Grant M. Berry22, Liwen Bing23, Christina Bjorndahl24, Mariska Bolyanatz25, Aaron Braver26, Violet A. Brown27, Alicia M. Brown28, Alejna Brugos29, Erin M. Buchanan30, Tanna Butlin31, Andrés Buxó-Lugo32, Coline Caillol33, Francesco Cangemi34, Christopher Carignan35, Sita Carraturo36, Tiphaine Caudrelier37, Eleanor Chodroff38, Michelle Cohn39, Johanna Cronenberg40, Olivier Crouzet41, Erica L. Dagar42, Charlotte Dawson43, Carissa A. Diantoro44, Marie Dokovova45, Shiloh Drake46, Fengting Du47, Margaux Dubuis48, Florent Duême49, Matthew Durward50, Ander Egurtzegi51, Mahmoud M. Elsherif52, Janina Esser53, Emmanuel Ferragne54, Fernanda Ferreira55, Lauren K. Fink56, Sara Finley57, Kurtis Foster58, Paul Foulkes59, Rosa Franzke60, Gabriel Frazer-McKee61, Robert Fromont62, Christina García63, Jason Geller64, Camille L. Grasso65, Pia Greca66, Martine Grice67, Magdalena S. Grose-Hodge68, Amelia J. Gully69, Caitlin Halfacre70, Ivy Hauser71, Jen Hay72, Robert Haywood73, Sam Hellmuth74, Allison I. Hilger75, Nicole Holliday76, Damar Hoogland77, Yaqian Huang78, Vincent Hughes79, Ane Icardo Isasa80, Zlatomira G. Ilchovska81, Hae-Sung Jeon82, Jacq Jones83, Mágat N. Junges84, Stephanie Kaefer85, Constantijn Kaland86, Matthew C. Kelley87, Niamh E. Kelly88, Thomas Kettig89, Ghada Khattab90, Ruud Koolen91, Emiel Krahmer92, Dorota Krajewska93, Andreas Krug94, Abhilasha A. Kumar95, Anna Lander96, Tomas O. Lentz97, Wanyin Li98, Yanyu Li99, Maria Lialiou100, Ronaldo M. Lima Jr.101, Justin J. H. Lo102, Julio Cesar Lopez Otero103, Bradley Mackay104, Bethany MacLeod105, Mel Mallard106, Carol-Ann Mary McConnellogue107, George Moroz108, Mridhula Murali109, Ladislas Nalborczyk110, Filip Nenadić111, Jessica Nieder112, Dušan Nikolić113, Francisco G. S. Nogueira114, Heather M. Offerman115, Elisa Passoni116, Maud Pélissier117, Scott J. Perry118, Alexandra M. Pfiffner119, Michael Proctor120, Ryan Rhodes121, Nicole Rodríguez122, Elizabeth Roepke123, Jan P. Röer124, Lucia Sbacco125, Rebecca Scarborough126, Felix Schaeffler127, Erik Schleef128, Dominic Schmitz129, Alexander Shiryaev130, Márton Sóskuthy131, Malin Spaniol132, Joseph A. Stanley133, Alyssa Strickler134, Alessandro Tavano135, Fabian Tomaschek136, Benjamin V. Tucker137, Rory Turnbull138, Kingsley O. Ugwuanyi139, Iñigo Urrestarazu-Porta140, Ruben van de Vijver141, Kristin J. Van Engen142, Emiel van Miltenburg143, Bruce Wang144, Natasha Warner145, Simon Wehrle146, Hans Westerbeek147, Seth Wiener148, Stephen Winters149, Sidney G.-J. Wong150, Anna Wood151, Jane Wottawa152, Chenzi Xu153, Germán Zárate-Sández154, Georgia Zellou155, Cong Zhang156, Jian Zhu157, Timo B. Roettger158
Abstract
Recent empirical studies have highlighted the large degree of analytic
flexibility in data analysis, which can lead to substantially different
conclusions based on the same data set. Thus, researchers have
expressed their concerns that these researcher degrees of freedom
might facilitate bias and can lead to claims that do not stand the
test of time. Even greater flexibility is to be expected in fields in
which the primary data lend themselves to a variety of possible
operationalizations. The multidimensional, temporally extended nature
of speech constitutes an ideal testing ground for assessing the
variability in analytic approaches, which derives not only from
aspects of statistical modeling, but also from decisions regarding the
quantification of the measured behavior. In the present study, we gave
the same speech production data set to 46 teams of researchers
and asked them to answer the same research question, resulting in
substantial variability in reported effect sizes and their interpretation.
Using Bayesian meta-analytic tools, we further find little to no evidence
that the observed variability can be explained by analysts’ prior
beliefs, expertise or the perceived quality of their analyses. In light
of this idiosyncratic variability, we recommend that researchers more
transparently share details of their analysis, strengthen the link
between theoretical construct and quantitative system, and calibrate
their (un)certainty in their conclusions.
Keywords
crowdsourcing science, data analysis, scientific transparency, speech,
acoustic analysis
1, 2Department of Linguistics and English Language, University of Edinburgh, United Kingdom
3Department of Spanish and Portuguese, Rutgers University, United States
4Department of Linguistics, Cornell University, United States
5Department of General and Computational Linguistics, University of Tübingen, Germany
6Program in Linguistics, Princeton University, United States
7Jubail English Language and Preparatory Year Institute, Royal Commission for Jubail and Yanbu, Saudi
Arabia
8Laboratoire de Linguistique Formelle (LLF), CNRS, Université Paris Cité, France
9School of Education, Communication and Language Studies - ECLS, Newcastle University, United
Kingdom
10Department of Linguistics, University of Arizona, United States
11Psychological and Brain Sciences, Washington University in Saint Louis, United States
12Departamento de Letras, Universidade Federal de São Carlos, Brazil
13School of Languages, Linguistics, Literatures and Cultures, University of Calgary, Canada
14Department of Linguistics, University of Oregon, United States
15Department of Language and Linguistic Science, University of York, United Kingdom
16School of Languages, Linguistics, Literatures and Cultures, University of Calgary, Canada
17Department of Psychology, University of California, Davis, United States
18Department of Linguistics, University of Arizona, United States
19Institute of Phonetics and Speech Processing, University of Munich, Germany
20University of Pennsylvania, United States
21Department of Communication Sciences and Disorders, Syracuse University, United States
22Department of Spanish, Villanova University, United States
23Department of English Language and Linguistics, University of Birmingham, United Kingdom
24Department of Philosophy, Carnegie Mellon University, United States
25Department of Spanish & French Studies, Occidental College, United States
26Department of English, Texas Tech University, United States
27Psychological and Brain Sciences, Washington University in Saint Louis, United States
28Department of Spanish and Portuguese, University of Arizona, United States
29Division of Mathematics and Computer Science, Simmons University, United States
30Analytics, Harrisburg University of Science and Technology, United States
31School of Languages, Linguistics, Literatures and Cultures, University of Calgary, Canada
32Department of Psychology, University at Buffalo, SUNY, United States
33Université Paris Cité, France
34IfL-Phonetik, University of Cologne, Germany
35Department of Speech, Hearing and Phonetic Sciences, University College London, United Kingdom
36Psychological and Brain Sciences, Washington University in Saint Louis, United States
37Basque Center on Cognition Brain and Language, Spain
38Department of Language and Linguistic Science, University of York, United Kingdom
39Department of Linguistics, University of California, Davis, United States
40Institute of Phonetics and Speech Processing, University of Munich, Germany
41LLING, UMR6310, Nantes Université / CNRS, France
42Department of Linguistics and TESOL, University of Texas at Arlington, United States
43School of Psychology, Newcastle University, United Kingdom
44Department of Linguistics, University of Oregon, United States
45School of Psychological Sciences and Health, University of Strathclyde, United Kingdom
46Department of Linguistics, University of Oregon, United States
47School of English Literature, Language and Linguistics, Newcastle University, United Kingdom
48Department of Comparative Language Science, Universität Zürich, Switzerland
49Basque Center on Cognition Brain and Language, Spain
50Department of Linguistics, University of Canterbury, New Zealand
51IKER (UMR 5478), Centre National de la Recherche Scientifique (CNRS), France
52Department of Neuroscience, Psychology and Behaviour, University of Leicester, United Kingdom
53Statistics Group, Association for Diversity in Linguistics, Germany
54CLILLAC-ARP, Université Paris Cité, France
55Department of Psychology, University of California, Davis, United States
56Department of Music, Max Planck Institute for Empirical Aesthetics, Germany
57Department of Psychology, Pacific Lutheran University, United States
58Department of Linguistics, University of Oregon, United States
59Department of Language and Linguistic Science, University of York, United Kingdom
60Institute of Phonetics and Speech Processing, University of Munich, Germany
61Department of languages, linguistics, and translation, Université Laval, Canada
62New Zealand Institute of Language, Brain and Behaviour, University of Canterbury, New Zealand
63Department of Languages, Literatures, and Cultures, Saint Louis University, United States
64Department of Psychology, Princeton University, United States
65LPC, Aix Marseille Univ, CNRS, France
66Institute of Phonetics and Speech Processing, University of Munich, Germany
67IfL-Phonetik, University of Cologne, Germany
68Department of English Language and Linguistics, University of Birmingham, United Kingdom
69Department of Language and Linguistic Science, University of York, United Kingdom
70School of English Literature, Language and Linguistics, Newcastle University, United Kingdom
71Department of Linguistics and TESOL, University of Texas at Arlington, United States
72New Zealand Institute of Language, Brain and Behaviour, University of Canterbury, New Zealand
73Ao Tawhiti Unlimited Discovery, New Zealand
74Department of Language and Linguistic Science, University of York, United Kingdom
75Department of Speech, Language, and Hearing Sciences, University of Colorado Boulder, United States
76Department of Linguistics and Cognitive Science, Pomona College, United States
77School of Education, Communication and Language Studies - ECLS, Newcastle University, United
Kingdom
78University of California, Los Angeles, United States
79Department of Language and Linguistic Science, University of York, United Kingdom
80Department of Modern and Classical Languages and Literatures, California State University, Northridge,
United States
81School of Psychology, University of Birmingham, United Kingdom
82School of Humanities, Language and Global Studies, University of Central Lancashire, United Kingdom
83Department of Linguistics, University of Canterbury, Aotearoa
84Programa de Pós-Graduação em Letras, Federal University of Rio Grande do Sul, Brazil
85Department of Linguistics, University of Canterbury, New Zealand
86Institute of Linguistics, University of Cologne, Germany
87Department of Linguistics, University of Washington, United States
88School of English Literature, Language and Linguistics, Newcastle University, United Kingdom
89Department of Language and Linguistic Science, University of York, United Kingdom
90School of Education, Communication and Language Studies - ECLS, Newcastle University, United
Kingdom
91Department of Communication and Cognition, Tilburg University, Netherlands
92Department of Communication and Cognition, Tilburg University, the Netherlands
93Department of Linguistics and Basque Studies, University of the Basque Country UPV/EHU, Spain
94School of Education, Communication and Language Studies - ECLS, Newcastle University, United
Kingdom
95Department of Psychology, Bowdoin College, ME, United States
96Linguistic Convergence Laboratory, HSE University, Russia
97Department of Communication and Cognition, Tilburg University, the Netherlands
98School of Psychology, University of Birmingham, United Kingdom
99School of Education, Communication and Language Studies - ECLS, Newcastle University, United
Kingdom
100Institute of German Language I Linguistics, University of Cologne, Germany
101Department of English Language Studies, Federal University of Ceará, Brazil
102Department of Speech, Hearing and Phonetic Sciences, University College London, United Kingdom
103Department of Hispanic Studies, University of Houston, United States
104Department of English and American Studies, University of Salzburg, Austria
105School of Linguistics & Language Studies, Carleton University, Canada
106Psychological and Brain Sciences, Washington University in Saint Louis, United States
107Population Health Sciences Institute - Faculty of Medical Sciences, Newcastle University, United
Kingdom
108HSE University, Russia
109Speech and Language Therapy, University of Strathclyde, United Kingdom
110LPC, Aix Marseille Univ, CNRS, France
111Department of Psychology, Faculty of Media and Communications, Singidunum University, Serbia
112Institute of Linguistics, Heinrich-Heine University Düsseldorf, Germany
113School of Languages, Linguistics, Literatures and Cultures, University of Calgary, Canada
114Graduate Program in Linguistics, Federal University of Ceará, Brazil
115World Languages, Literatures and Cultures Department, University of Arkansas, United States
116Deparment of Linguistics, Queen Mary University of London, United Kingdom
117CLILLAC-ARP, Université Paris Cité, France
118Department of Linguistics, University of Alberta, Canada
119Department of Linguistics, University of California, Berkeley, United States
120Department of Linguistics, Macquarie University, Australia
121Center for Cognitive Science, Rutgers University, United States
122Department of Spanish and Portuguese, Rutgers University, United States
123Department of Speech, Language, and Hearing Sciences, Saint Louis University, United States
124Department of Psychology, Witten/Herdecke University, Germany
125School of Education, Communication and Language Studies - ECLS, Newcastle University, United
Kingdom
126Department of Linguistics, University of Colorado Boulder, United States
127Clinical Audiology, Speech and Language (CASL) Research Centre, Queen Margaret University
Edinburgh, United Kingdom
128Department of English and American Studies, University of Salzburg, Austria
129Department of English and American Studies, Heinrich Heine University Düsseldorf, Germany
130Linguistic Convergence Laboratory, HSE University, Russia
131Department of Linguistics, University of British Columbia, Canada
132Department of Psychiatry and Psychotherapy, University Hospital Cologne, Germany
133Department of Linguistics, Brigham Young University, United States
134Department of Linguistics, University of Colorado Boulder, United States
135Department of Neuroscience, Max Planck Institute for Empirical Aesthetics, Germany
136Department of General Linguistics, University of Tübingen, Germany
137Department of Linguistics, University of Alberta, Canada
138School of English Literature, Language and Linguistics, Newcastle University, United Kingdom
139Department of English & Literary Studies, University of Nigeria, Nsukka, Nigeria
140IKER-UMR5478, Centre National de la Recherche Scientifique (CNRS), France
141Institute of Linguistics, Heinrich-Heine University Düsseldorf, Germany
142Psychological and Brain Sciences, Washington University in Saint Louis, United States
143Department of Communication and Cognition, Tilburg University, the Netherlands
144Chinese and Bilingual Studies, Hong Kong Polytechnic University, Hong Kong SAR, China
145Department of Linguistics, University of Arizona, United States
146IfL-Phonetik, University of Cologne, Germany
147Department of Languages, Literature and Communication, Utrecht University, Netherlands
148Department of Modern Languages, Carnegie Mellon University, United States
149School of Languages, Linguistics, Literatures and Cultures, University of Calgary, Canada
150Geospatial Research Institute, University of Canterbury, New Zealand
151Department of Linguistics, University of Oregon, United States
152Département de Lettres modernes, LIUM, LST, Le Mans Université, France
153University of Oxford, United Kingdom
154Department of Spanish, Western Michigan University, United States
155Department of Linguistics, University of California, Davis, United States
156School of Education, Communication and Language Studies - ECLS, Newcastle University, United Kingdom
157School of Information, University of Michigan, Ann Arbor, United States
158Department of Linguistics and Scandinavian Studies, University of Oslo, Norway

Corresponding author:
Timo B. Roettger
Email: timo.b.roettger@iln.uio.no
Introduction
In order to effectively accumulate knowledge, science needs (i) to
produce data that can be replicated using the original methods and (ii)
to arrive at robust conclusions substantiated by such data. In recent
coordinated efforts to replicate published findings, scientific disciplines
have uncovered surprisingly low success rates (e.g., Open Science
Collaboration 2015; Camerer et al. 2018) leading to what is now
referred to as the replication crisis. Beyond the difficulties of replicating
scientific findings, a growing body of evidence suggests that researchers’
conclusions often vary even when they have access to the same data.
The latter situation has been referred to as the inference crisis (Rotello
et al. 2015; Starns et al. 2019) and is, among other things, rooted in
the inherent flexibility of data analysis (often referred to as researcher
degrees of freedom: Simmons et al. 2011; Gelman and Loken 2014). Data
analysis involves many different steps, such as inspecting, organizing,
transforming, and modeling data, to name a few. Along the way, different
methodological and analytic choices need to be made, all of which may
influence the final interpretation of the data.
These researcher degrees of freedom are both a blessing and a curse.
They are a blessing because they afford us the opportunity to look at
nature from different angles, which, in turn, allows us to make important
discoveries and generate new hypotheses (e.g., Box 1976; Tukey 1977;
De Groot 2014). They are a curse because idiosyncratic choices can lead to
categorically different interpretations, which eventually find their way into
the publication record where they are taken for granted (Simmons et al.
2011). Recent projects have shown that the variability between different
data analysts is vast and can lead independent researchers to draw different
conclusions from the same data set (e.g., Silberzahn et al. 2018; Starns
et al. 2019; Botvinik-Nezer et al. 2020). These studies, however, might
still underestimate the extent to which analysts vary because data analysis
is not restricted to the statistical analysis of ready-made numeric data.
These data can in fact be the result of complex measurement processes
that translate a phenomenon, such as human behavior, into numbers. This
is particularly true for fields that draw conclusions about human behavior
and cognition from multidimensional signals such as audio or video recordings. In
fields working on speech production, for example, researchers need to
make numerous decisions about what to measure and how to measure it, in
other words, how to operationalize the phenomenon under investigation.
This is not trivial, given the temporal extension of the acoustic signal and
its complex structural composition.
In this article, we investigate the impact of analytic choices on
research results when many analyst teams examine the same speech
production data set, a process that involves both decisions regarding
the operationalization of linguistically relevant constructs and decisions
regarding statistical analysis. Specifically, we discuss the degree of
variability in research results obtained by 46 teams who had to choose the
operationalization and statistical procedures to answer the same research
question, on the basis of the same set of raw data (here, speech recordings).
Our goals are twofold: (i) our study conceptually replicates previous
many-analyses projects, by probing the effects of different statistical
analyses and by assessing the generalizability of published findings to
other disciplines (here, the speech sciences); (ii) our study extends the
scope of inquiry to include flexibility in the operationalization of complex
human behavior (here, speech). This is an important addition in that the
increased number of “forking paths” in the “garden of analytic choices”—
derived from the many decisions involved in quantification—might reveal
a higher degree of variability across analysts than previously observed,
thus giving us a more realistic estimate of variability.
Researcher degrees of freedom
Data analysis comes with many decisions, for example how to measure
a given phenomenon or behavior, which data to submit to statistical
modeling and which to exclude in the final analysis, or what inferential
decision-making procedure to apply. This can be problematic because
humans show cognitive biases that can lead to erroneous inferences
(e.g., Tversky and Kahneman 1974): we see coherent patterns in
randomness (Brugger 2001), convince ourselves of the validity of prior
expectations (“I knew it”, Nickerson 1998), and perceive events as
being plausible in hindsight (“I knew it all along”,
Fischhoff 1975). In conjunction with an academic incentive system that
rewards certain discovery processes more than others (Sterling 1959;
Koole and Lakens 2012), we often find ourselves exploring many possible
analytic pipelines, but only reporting a selected few.
This issue is particularly amplified in fields in which the raw data lend
themselves to many possible ways of being measured (Roettger 2019).
Combined with a wide variety of methodological and theoretical traditions
as well as varying levels of quantitative training across subfields, the
inherent flexibility of data analysis might lead to a vast plurality of analytic
approaches that can lead to different scientific conclusions (Roettger et al.
2019). Analytic flexibility has been widely discussed from a conceptual
point of view (Simmons et al. 2011; Wagenmakers et al. 2012; Nosek
and Lakens 2014) and in regard to its application in individual scientific
fields (e.g. Wicherts et al. 2016; Charles et al. 2019; Roettger 2019). This
notwithstanding, there are still many unknowns regarding the extent of
analytic plurality in practice.
Consequently, a substantial body of published papers likely presents
overconfident interpretations of data and statistical results based on
idiosyncratic analytic strategies (e.g., Simmons et al. 2011; Gelman
and Loken 2014). These interpretations, and the conclusions that derive
from them, are thus associated with an unknown degree of uncertainty
(dependent on the strength of evidence provided) and with an unknown
degree of generalizability (dependent on the chosen analysis). Moreover,
the same data could lead to very different conclusions depending on
the analytic path taken by the researcher. However, instead of being
critically evaluated, scientific results often remain unchallenged in the
publication record. Despite recent efforts to improve transparency and
reproducibility (e.g. Miguel et al. 2014; Klein et al. 2018) and the advent
of freely available and accessible infrastructures, such as those provided
by the Open Science Framework (osf.io), critical re-analyses of published
analytic strategies are still uncommon because data sharing remains rare
(Wicherts et al. 2006).
Crowd-sourcing alternative analyses
Recent collaborative attempts have started to shed light on how different
analysts tackle the same data set and have revealed a large amount of
variability. In a pioneering collaborative effort, Silberzahn et al. (2018)
let twenty-nine independent analysis teams address the same research
hypothesis: whether soccer referees are more likely to give red cards
to dark-skin-toned players than to light-skin-toned players. The analytic
approaches and, consequently, the results varied widely between teams.
Twenty teams (69%) found support for the hypothesis, and nine (31%) did
not. Out of the 29 analytic strategies, there were 21 unique combinations
of covariates. Importantly, the observed variability was neither predicted
by the teams’ preconceptions about the phenomenon under investigation
nor by peer ratings of the quality of their analyses. The authors’ results
suggest that analytic plurality may be an inevitable byproduct of the
scientific process and not necessarily driven by different levels of expertise
or bias.
Several other recent studies corroborated this analytic flexibility across
different disciplines. Dutilh et al. (2019) and Starns et al. (2019)
investigated analysts’ choices when inferring theoretical constructs based
on the same data set using computational models. Both studies revealed
vastly different modeling strategies, even though scientific conclusions
were similar across analysis teams (see also Parker et al. 2020 and
Botvinik-Nezer et al. 2020 regarding analytic flexibility in ecology and
neuroimaging data, respectively). Bastiaansen et al. (2020) crowd-sourced
clinical recommendations based on analyses of an individual patient. Their
results suggest that analysts differed substantially regarding decisions
related to both the statistical analysis of the data and the theoretical
rationale behind interpreting the statistical results.
Building on the many-analysts approach, Landy et al. (2020) asked 15
research teams to independently design studies to answer five different
research questions related to moral judgments. Again, they found vast
heterogeneity across researchers’ conclusions. The observed variation was
not predicted by the researchers’ expertise, but seemed to vary across the
five research questions, which might exhibit different degrees
of theoretical underspecification. This is in line with Auspurg and
Brüderl (2021), who re-analyzed the red card study mentioned above. The
authors argue that some of the observed heterogeneity across analysts
in Silberzahn et al. (2018) might have been driven by flexibility in
statistically interpreting the research question.
While these studies attested a large degree of analytic flexibility with
possibly impactful consequences, they focused on analytic decisions
related to the study design, the statistical analysis or the architecture
of computational models. In these studies the data sets were fixed and
neither data collection nor measurement could be changed. Thus the
estimates of variability found in the literature might reflect a lower bound
only, ignoring large parts of the forking paths related to measurement.
However, in many fields the primary raw data are complex signals,
for which theoretical constructs need to be operationalized relative to a
theoretically motivated research question. This is especially true in the
Social Sciences, where the phenomenon under investigation corresponds
to both observable and unobservable human behavior.
Decisions about how to measure theoretical constructs related to
human behavior and cognition might interact with downstream decisions
about statistical modeling and vice versa. For instance, Flake and
Fried (2020) discuss the cascading impact that different practices can
have on psychometric research. The authors highlight, among others,
the following degrees of freedom in the choice and development of
measures: definition of the theoretical construct, justification of the
selected measure, description of the measure and of how it maps onto the
construct, response coding and related transformations, as well as post-
hoc modifications to the chosen measure. Taken together, these aspects
alone dramatically increase the combinations of possible analytic choices,
and hence flexibility in research outcomes.
In those disciplines concerned with communication, human behavior
often corresponds to multidimensional visual and/or acoustic signals. The
complex nature of these signals exponentially increases the number of possible analytic
approaches, thus further increasing analytic flexibility. In order to estimate
this increased flexibility, the present study looks at experimentally elicited
speech production data.
Operationalizing speech
Research on speech lies at the intersection of the cognitive sciences,
informing psychological models of language, categorization, and memory,
guiding methods for diagnosis and treatment of speech disorders, and
facilitating advancement in automatic speech recognition and speech
synthesis. One major challenge in the Speech Sciences is the mapping
between communicative intentions (the unobserved behavior) and their
physical manifestation (the observed behavior).
Speech signals are complex as they are characterized by structurally
different acoustic parameters distributed throughout different temporal
domains. Thus, choosing how to assess a communicative intention of
interest is an important analytic step. Take for example the sentence in
(1).
(1) “I can’t bear another meeting on Zoom.”
Depending on the speaker’s intention, this sentence can be said in different
ways. For instance, if the speaker is exhausted by all their meetings, they
might acoustically highlight the word another or meeting to contrast it
with more pleasant activities. If, on the other hand, the speaker is just
tired of video conferences, as opposed to say face-to-face meetings, they
might acoustically highlight the word Zoom.
If we decide to compare the speech signal associated with these two
intentions, how can we quantify the difference between them? In other
words, given their physical manifestation (speech), what do we measure
and how do we measure it? Because of the continuous and transient nature
of speech, identifying speech parameters and temporal domains within
which to measure those parameters becomes a non-trivial task. Utterances
stretch over several thousand milliseconds and contain different levels
of linguistically relevant units such as phrases, words, syllables, and
individual sounds. The researcher is thus confronted with a considerable
number of parameters and combinations thereof to choose from.
From a phonetic viewpoint, linguistically relevant units are inherently
multidimensional and dynamic: they consist of clusters of parameters that
are modulated over time. The acoustic parameters of units are usually
asynchronous, i.e. they appear at different time points in the unfolding
signal, and overlap with parameters of other units (e.g. Jongman et al.
2000; Lisker 1986; Summerfield 1981; Winter 2014). A classic example is
the distinction between voiced and voiceless stops in English (i.e. /b/ and
/p/ in bear vs. pear). This contrast is manifested by many acoustic features
which can differ depending on several factors, such as the position of the
consonant in the word and context of surrounding sounds (Lisker 1977).
Furthermore, correlates of the contrast can even be found away from the
consonant, in temporally distant speech units. For example, the initial /l/
of the English words led and let is affected by the voicing of the final
consonant (/d, t/) (Hawkins and Nguyen 2004).
The multiplicity of phonetic measurements grows exponentially if we
look at larger temporal domains, as is the case with suprasegmental
aspects of speech. For example, studies investigating acoustic correlates
of word stress (e.g. the difference between ínsight and incíte) use a wide
variety of measurements, including temporal characteristics (duration
of certain segments or sub-segmental intervals), spectral characteristics
(intensity, formants, and spectral tilt), and measurements related to
fundamental frequency (f0) (e.g., Gordon and Roettger 2017). Moving
on to the expression of higher-level communicative functions, like
information structure and discourse pragmatics, relevant acoustic cues
can be distributed throughout even larger domains, such as phrases and
whole utterances (e.g., Ladd 2008). Differences in position, shape, and
alignment of f0 modulations over multiple locations within a sentence
are correlated with differences in discourse functions (e.g., Niebuhr
et al. 2011). The latter can also be expressed by global vs. local pitch
modulations (Van Heuven et al. 2002), as well as acoustic information
within the temporal or spectral domain (e.g., Van Heuven and Van Zanten
2005). Extra-linguistic information, like the speaker’s intentions, levels
of emotional arousal, or social identity, is also conveyed by broad-
domain parameters, such as voice quality, rhythm, and pitch (Foulkes and
Docherty 2006; Ogden 2004; White et al. 2009).
In short, when testing hypotheses on speakers’ intentions using speech
production data, researchers are faced with many choices and possibilities.
The larger the functional domain (e.g. segments vs. words vs. utterances),
the higher the number of conceivable operationalizations. For example,
several decisions have to be made when comparing the two realizations of
the sentence in (1), one of which is intended to signal emphasis on another
and one of which emphasizes Zoom (see 2a and 2b).
(2a) I can’t bear ANOTHER meeting on Zoom.
(2b) I can’t bear another meeting on ZOOM.
Do we compare only the word another in (2a) and (2b), or also the word
Zoom? Do we measure utterance-wide acoustic profiles, whole words, or
just stressed syllables? Do we average across the chosen time domain or
do we measure a specific point in time? Do we measure f0, intensity, or
something else (Stevens 2000)?
When looking at phrase-level temporal domains, the number of possible
alternative analytic pipelines increases substantially. Figure 1A shows a
typical example of a decision tree with which speech researchers are
often confronted. Each of the four analytic decisions in the example has
different possible options. Here, only one particular path has been taken.
A different one would likely produce different results and might lead to
different conclusions. Once we have decided to compare f0 of the word
another across the two utterances, there are still many choices to be made,
all of which need to be justified. As Figures 1B-C illustrate, we could
measure f0 at specific points in time like the onset of the temporal window,
the offset, or the midpoint. We could also measure the value or time of
the f0 minimum or maximum. We could summarize f0 across the entire
window and extract the mean, median or standard deviation of f0, all
of which have been used to analyze speech data in previous work (see
Gordon and Roettger 2017). But the journey in the garden of analytic
paths goes on. Other important operationalization steps could involve
filtering the audio signal, smoothing the extracted f0 track, removing
values that substantially deviate from surrounding values or expectations,
either manually or automatically, and so on.
These decisions are intended to be made prior to any statistical analysis,
but are at times revised a posteriori in light of unforeseen or surprising
outcomes (i.e. after data collection and/or preliminary analyses). This
multitude of possible decisions is multiplied by those researcher degrees
of freedom related to statistical analysis (e.g. Wicherts et al. 2016).

Figure 1. Illustration of the analytic flexibility associated with acoustic analyses. (A) An
example of multiple possible and justifiable decisions when comparing two utterances; (B)
Waveform and f0 track of the utterances I can’t bear ANOTHER meeting on Zoom and I can’t
bear another meeting on ZOOM. The green boxes mark the word another in both sentences;
(C) Spectrogram and f0 track of the word another, exemplifying possible operationalizations
of differences in f0.
In sum, speech data consist of complex physical signals that generate
an as-yet unappreciated amount of analytic flexibility in the choice of
measures and operationalizations. The present paper probes this garden
of forking paths in the analysis of speech. To assess the variability in
data analysis pipelines, including both operationalization and statistical
analysis, across independent researchers, we provide analytic teams
with an experimentally elicited speech production data set. The data
set derives from the unpublished research project Prosodic encoding of
redundant referring expressions, which set out to investigate whether
speakers acoustically modify utterances to signal unexpected referring
expressions.* In the following section, we introduce the research question
and the experimental procedure of said project, and we describe the
resulting data set as used in the current study.

∗ Results of this research project were neither published nor publicly presented and are stored on a private OSF repository.
The data set: The acoustic properties of atypical modifiers
Referring is one of the most basic and prevalent uses of language and
one of the most widely researched areas in Language Science. When
trying to refer to a banana, what does a speaker say and how do they
say it in a given context? The context within which an entity occurs (i.e.,
with other non-fruits, other fruits, or other bananas) plays a large part
in determining the choice of referring expressions. Generally, speakers
aim to be as informative as possible to uniquely establish reference to
the intended object, but they are also resource-efficient in that they avoid
redundancy (Grice 1975). Thus one would expect the use of a modifier,
for example, only if it is necessary for disambiguation. For instance, one
might use the adjective yellow to describe a banana in a situation in which
there are both a yellow and a less ripe green banana available, but not
when there is only one banana.
Despite the coherent idea that speakers are both rational and efficient,
there is much evidence that speakers are often over-informative. Speakers
use referring expressions that are more specific than strictly necessary
for the unambiguous identification of the intended referent (Sedivy
2003; Rubio-Fernández 2016), which has been argued to facilitate object
identification and make communication between speakers and listeners
more efficient (Arts et al. 2011; Paraboni et al. 2007; Rubio-Fernández
2016). Recent findings suggest that the utility of referring expressions
depends on how useful they are for a listener (compared to other referring
expressions) to identify a target object. For example, Degen et al. (2020)
showed that modifiers that are less typical for a given referent (e.g. a
blue banana) are more likely to be used in an over-informative scenario
(e.g. when there is just one banana) (see also Westerbeek et al. 2015). This
account, however, has mainly focused on content selection (Gatt et al.
2013), i.e. what words to use.
Even when morphosyntactically identical expressions are involved,
speakers can modulate utterances via acoustic properties like temporal
and spectral modifications (e.g., Ladd 2008). Most prominently, languages
can use intonation to signal discourse relationships between referents.
Intonation marks discourse-relevant referents for being new or given
information, to guide the listeners’ interpretation of incoming messages.
Beyond structuring information relative to the discourse, a few studies
suggest that speakers might use intonation to signal atypical lexical
combinations (e.g. Dimitrova et al. 2008, 2009). Referential expressions
such as blue banana were produced with greater prosodic prominence than
more typical referents such as yellow banana. These results are in line
with the idea of resource-efficient, rational language users who modulate
their speech in order to facilitate listeners’ comprehension. However, the
above studies are based on a small sample size (10 participants) and on
potentially anti-conservative statistical analyses, leaving reason to doubt
the generalizability of the studies’ conclusions.
To further illuminate the question of whether speakers modify speech to
signal atypical referents, and overcome some of the limitations of previous
work, thirty native German speakers were recorded in a production
study while interacting with a confederate (one of the experimenters)
in a referential game, following experimental procedures typical of the
field. The participants had to verbally instruct the confederate to select
a specified target object out of four objects presented on a screen. The
subject and confederate were seated at the opposite sides of a table, each
facing one of two computer screens. The participant and the experimenter
could not see each other nor each others’ screens. Figure 2 shows
the experimental procedure time-line. After a familiarization phase, the
subject first saw four colored objects in the top left, top right, bottom left,
and bottom right corners of the screen. One of the objects served as the
target, another as the competitor, and the remaining two objects served as
distractors. Objects were referred to using noun phrases consisting of an
adjective modifier denoting color and a modified object (e.g. gelbe Zitrone
‘yellow lemon’, rote Gurke ‘red cucumber’, rote Socken ‘red socks’).
In the center of the screen, a black cube was displayed, which could be
moved by the experimenter. Participants read a sentence prompt out loud
(Du sollst den Würfel auf der COLOR OBJECT ablegen ‘You have to put
the cube on top of the COLOR OBJECT’) to instruct the experimenter to
drag the cube on top of one of the four depicted objects (the competitor)
using the mouse. After the experimenter had moved the cube as instructed,
the subject would read another sentence prompt (Und jetzt sollst du den
Würfel auf der COLOR OBJECT ablegen ‘And now, you have to put
the cube on top of the COLOR OBJECT’) instructing the experimenter
to move the cube on top of a different object (the target). The second
utterance in the trial was the critical trial for analysis.

Figure 2. Experimental procedure. The upper row illustrates the trial sequence for the
speaker (participant) and the lower row illustrates the trial sequence for the confederate. After
a preview of 1500 ms, the speaker sees an arrow indicating one of the referents (b). Reading
the orthographic instructions out loud, the speaker gives the confederate verbal instructions
onto which referent they should drag the cube (c). The confederate, in turn, drags the black
cube onto the target referent (d). Both the arrow and the orthographic instruction disappear
from the speaker’s screen and a new referent is indicated by an arrow on the same display
alongside a new orthographic instruction (e). The speaker gives the confederate verbal
instructions (f), which the confederate follows by dragging the cube onto the next referent (g).
The two sentence prompts were used to create a focus contrast between
the competitor and the target object. Focused units denote the set of
all (contextually relevant) alternatives (e.g., Rooth 1992). Concretely, a
focus contrast marks one or more elements in a sentence as prominent,
by different linguistic means depending on the language (Matić and
Wedgwood 2013; Burdin et al. 2015). For instance, if the competitor
and target objects differ but their color does not (e.g. yellow banana
vs. yellow tomato), the noun is said to be in focus (Noun Focus condition,
NF). If the objects are the same but differ in color (e.g. yellow banana
vs. blue banana), the color adjective is in focus (Adjective Focus
condition, AF). If both the color and the object differ (e.g. yellow banana
vs. blue tomato), then the whole noun phrase is in focus (Adjective/Noun
Focus condition, ANF). The NF condition constituted the experimentally
relevant condition, while the AF and ANF conditions acted as fillers.
Crucially, the color-object combinations in the Noun Focus (NF) condition
were manipulated with respect to their typicality. The combinations were
either typical (e.g. orange mandarin), medium typical (e.g. green tomato),
or atypical (e.g. yellow cherry), as established by a norming study that
was conducted prior to the production experiment just described.† Each
subject produced 15 critical trials (NF condition). Each trial was repeated
twice, yielding a total of 30 trials per participant and a grand total of 900
(15 × 2 × 30 participants) spoken utterances.
For the present study, 46 analysis teams received access to
the entire data set generated by the production study. The data set
consists of audio recordings and annotation files in a format that is
typical for the field. The teams were instructed to answer the following
research question, using the provided data set: Do speakers acoustically
modify utterances to signal atypical word combinations?
†A detailed description of the norming and production studies from the Prosodic encoding of redundant
referring expressions project, which was given to the analysts with the data set, can be found in
methods norm prod.pdf at https://bit.ly/3Ahawc7.
Methods
As outlined in Section Operationalizing speech, researchers are faced with
a large number of analytic choices when analyzing a multidimensional
signal such as speech. Analysts must identify and operationalize relevant
measurements, as well as the temporal domain(s) from which these
measurements are to be taken, and then possibly transform said
measurements before submitting them to statistical models, which must
be chosen alongside inferential criteria. The complexity of speech data
constitutes the ideal testing ground to assess the upper bound of analytic
flexibility that social science might face across disciplines. We employed a
meta-analytic approach to assess (i) the variability of the reported effects,
and (ii) how analytic and researcher-related predictors affect the final
results.
In this study, we followed the procedures proposed by Parker et al.
(2020) and Aczel et al. (2021). The project comprised the following five
phases:
1. RECRUITMENT: We recruited independent groups of researchers to
analyze the data and review others’ data analyses.
2. TEAM ANALYSIS: We gave researchers access to the speech corpus
and let them analyze the data as they saw fit.
3. REVIEW: We asked reviewers to generate peer-review ratings of the
analyses based on methods (not results).
4. META-ANALYSIS: We evaluated variability among the different
analyses and how different predictors affected the outcomes.
5. WRITE-UP: We collaboratively produced the final manuscript.
We initially estimated that this process, from the time of an in-principle
acceptance of the Stage 1 Registered Report to the end of Phase 5,
would take nine months. Phase 4 (meta-analysis) took longer than initially
anticipated and the total duration of the project was approximately 12
months.
The project OSF repository contains all the materials mentioned
in this paper and can be accessed at https://osf.io/3bmcp/.
The repository holds three main OSF components (Data, Teams
analyses, and Questionnaires), and a link to the project’s GitHub
repository. The following sections report the criteria for sample size, data
exclusions, data manipulations, and all the measures in the study.
Phase 1: Recruitment of analysts and initial survey
An online landing page provided a general description of the project,
including a short pre-recorded slide-show that summarizes the data
set and research question (https://many-speech-analyses.
github.io). The project was advertised via social media, using
mailing lists for linguistic and psychological societies, and via word
of mouth. Social media advertising was accompanied by a short
recruitment form (recruitment form.pdf). The target population
comprised active speech science researchers with a graduate/doctoral
degree (or currently studying for a graduate/doctoral degree) in
relevant disciplines. All individuals interested in participating were
asked to complete a questionnaire detailing their familiarity with
numerous analytic approaches common in the speech sciences
(analytic approach quest.pdf). Researchers could choose to
work independently or in small teams. For the sake of simplicity, we
will refer to both single researchers and teams as ANALYSIS TEAMS.‡
Recruitment for this project commenced after having received in-principle
acceptance.
As outlined above, our primary aim is to assess the variability
of the reported effects, rather than the meta-analytic estimate of
the investigated effect per se. To estimate the degree of uncertainty
around effect variability as driven by number of teams, we ran
a series of sample size simulations with values of variability
extracted from Silberzahn et al. (2018). The code is available at
https://many-speech-analyses.github.io/many_
analyses/scripts/r/simulations/simulations, Section
2.§Variability among teams was operationalized as the standard deviation
of the teams’ reported effects from Silberzahn et al. (2018) (which we
z-scored prior to simulations to make it comparable to our study). For the
mean of the teams’ true standard deviation (0.68 z-score), the simulation
indicates that the degree of uncertainty around the estimated teams’
standard deviation will be below 1 SD at any sample size greater than 10
teams. Thus in order to achieve our main goal, i.e. estimating variability
among teams, we considered a minimum sample size of 10 teams as
sufficient. Given the exploratory nature of our study, however, we have
sampled as many analysts as possible. We received initial expressions of
interest to participate from more than 200 analysts, though there was a
substantial drop-out rate (see Section Results).

‡ Terms in small caps in this and later sections are included with their definition in the glossary at the end of the paper for the reader’s convenience.
§ Cached model outputs can be found at https://osf.io/wds2m/.
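To illustrate the logic of such a sample size simulation, the sketch below asks how precisely a between-team standard deviation is recovered for a given number of teams, assuming a true value of 0.68. It is not the registered simulation code (which is available at the URL above), and the registered simulations were Bayesian rather than this simplified stand-in.

```r
# Minimal sketch (R): how does the precision of an estimated between-team SD
# depend on the number of teams? Simplified, non-Bayesian illustration only.
set.seed(1)

true_sd <- 0.68  # between-team SD of z-scored effects, taken from Silberzahn et al. (2018)
n_sims  <- 1000

sd_error <- function(n_teams) {
  # Simulate team-level effects and return the absolute error of the estimated SD.
  replicate(n_sims, {
    effects <- rnorm(n_teams, mean = 0, sd = true_sd)
    abs(sd(effects) - true_sd)
  })
}

for (n_teams in c(5, 10, 20, 40)) {
  cat(sprintf("teams = %2d: mean absolute error of estimated SD = %.2f\n",
              n_teams, mean(sd_error(n_teams))))
}
```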
After submitting their analyses, we asked the analysts to also function as
peer-reviewers. Each team had to review four other analyses. All analysts
involved share co-authorship on this manuscript and participated in the
collaborative process of producing the final manuscript. Informed consent
was obtained as part of the intake form.
Phase 2: Primary Data Analyses
The analysis teams registered for participation and each of the analysts
individually answered a demographic and expertise questionnaire
(intake form.pdf). A PDF version of this and all other question-
naires are available in the repository’s Questionnaires component, at
https://osf.io/h6z8w/. The questionnaire collected information
on the analysts’ current position and self-estimated breadth and level
of statistical expertise and acoustic analysis skills. We then requested
that they answer the research question: Do speakers acoustically modify
utterances to signal atypical word combinations? To do so, they were
given the data generated by the experiment described in Section The data
set. Data included the audio recordings with corresponding time-aligned
transcriptions in the form of Praat TextGrid files. These can be found in
the Data component at https://osf.io/5agn9/.
Once their analysis was complete, they answered a structured
questionnaire (analytic quest.pdf), providing information about
their analysis technique, an explanation of their analytic choices, their
quantitative results, and a statement describing their conclusions. They
also uploaded their analysis files (including the additionally derived data
and text files that were used to extract and pre-process the acoustic data),
their analysis code (if applicable), and a detailed journal-ready analysis
section.
Phase 3: Peer Review of Analyses
The analyses from each team were evaluated by four different teams who
functioned as peer-reviewers. Each peer-reviewer was randomly assigned
to analyses from at least four analysis teams. Reviewers evaluated the
methods of each of their assigned analyses one at a time in a sequence
determined by the initiating authors. The sequences were systematically
assigned so that, if possible, each analysis was allocated to each position in
the sequence for at least one reviewer.
The process for a single reviewer was as follows. First, the reviewer
received a description of the methods of a single analysis. This included
the narrative methods and results sections, the analysis team’s answers to
the questionnaire regarding their methods, including analysis code and
the data set. The reviewer was then asked in an online questionnaire
(peer review quest.pdf) to rate both the acoustic and the statistical
analyses and to provide an overall rating, each on a scale of 0-100.
To help reviewers calibrate their ratings, they were given the
following guidelines:
• 100. A perfect analysis with no conceivable improvements from the
reviewer.
• 75. An imperfect analysis but the needed changes are unlikely to
dramatically alter the final interpretation.
• 50. A flawed analysis likely to produce either an unreliable estimate
of the relationship or an over-precise estimate of uncertainty.
• 25. A flawed analysis likely to produce an unreliable estimate of the
relationship and an over-precise estimate of uncertainty.
• 0. A dangerously misleading analysis, certain to produce both an
estimate that is wrong and a substantially over-precise estimate
of uncertainty that places undue confidence in the incorrect
estimate.
The reviewers were also given the option to include further comments in
a text box for each of the three ratings.
After submitting the review, a methods section from a second analysis
was made available to the reviewer. This same sequence was followed
until all analyses allocated to a given reviewer were provided and
reviewed.¶
¶Initially we planned to present simultaneously all four (or more) methods sections to each reviewer after the
fourth round, with the option to revise their original ratings and provide an explanation. Ultimately, we decided
to skip this step due to time constraints.
Phase 4: Evaluating variation
The initiating authors (SC, JC, TR) conducted the analyses outlined in this
section. We did not conduct confirmatory tests of any a priori hypotheses.
We consider our analyses exploratory.
Descriptive statistics We calculated summary statistics describing
variation among analyses, including (a) the nature and number of acoustic
measures (e.g. f0 or duration), (b) the operationalization and the temporal
domain of measurement (e.g. mean of an interval or value at a specified
point in time), (c) the nature and number of model parameters for both
fixed and random effects (if applicable), (d) the nature and reasoning
behind inferential assessments (e.g. dichotomous decision based on p-
values, ordinal decision based on a Bayes factor), as well as the (e) mean,
(f) standard deviation and (g) range of the standardized effect sizes (see the
next section for the standardization procedure). These summary statistics
are reported in Descriptive statistics of the Results section.
Meta-analytic estimation We investigated the variability in REPORTED EFFECT SIZES using Bayesian meta-analytic techniques. As the measure of variability, we took the meta-analytic GROUP-LEVEL STANDARD DEVIATION (σ_αt, see below), where each analysis team represents a group. As detailed in the Results section below, we also ran further non-preregistered analyses; for these we refer the reader to that section, while the following paragraphs describe only the preregistered analyses.
Based on the common practices currently in place within the field, we anticipated that researchers would use multilevel regression models, so that common measures of effect size, such as Cohen's d, might be inappropriate. Furthermore, Aczel et al. (2021) suggest that directly asking analysts to report standardized effect sizes could bias the choice of analyses towards types that more straightforwardly return a standardized effect. Since the variables used by the analysis teams might have differed substantially in their measurement scales (e.g. Hertz for frequency vs. milliseconds for duration), which was indeed the case, we standardized all reported effects by refitting each REPORTED MODEL with centered and scaled continuous variables (z-scores, i.e. the mean subtracted from the observed values, divided by the standard deviation) and sum-coded factor variables. Each STANDARDIZED MODEL was fitted as a Bayesian regression model with Stan (Stan Development Team 2021), RStan (Stan Development Team 2020),
and brms (Bürkner 2017) in R (R Core Team 2020). Model refitting also constituted a way of validating the reported analyses, a step recommended by Aczel et al. (2021). Details about the refitting procedure can be found at https://many-speech-analyses.github.io/many_analyses/scripts/r/04_refit_workflow. Relative to the registered protocol, we made minor changes to the refitting procedure, specifically to file and variable naming conventions and the use of treatment contrasts instead of sum coding. All models converged (R̂ was approximately 1). Of the models with divergent transitions (n = 10), the number of divergences ranged from 1 to 156 (143 represents 3.9% of the total number of samples), which the authors deemed not to be problematic.
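To make the standardization step concrete, the following minimal R sketch (with hypothetical data and variable names, not the authors' actual refitting code, which is available at the link above) illustrates z-scoring of continuous variables and the coding of factor variables:

# Minimal sketch of the standardization step (hypothetical data and
# variable names; see the linked refitting workflow for the actual code).
dat <- data.frame(
  f0         = c(180, 210, 195, 225),                  # outcome, in Hz
  trial      = c(1, 2, 3, 4),                          # continuous covariate
  typicality = factor(c("typical", "atypical", "typical", "atypical"))
)

# z-score continuous variables: (value - mean) / standard deviation
dat$f0_z    <- as.numeric(scale(dat$f0))
dat$trial_z <- as.numeric(scale(dat$trial))

# The preregistered plan used sum coding for factors; the final refits
# switched to R's default treatment contrasts (see above).
contrasts(dat$typicality) <- contr.sum(nlevels(dat$typicality))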
The coefficients of the critical predictors (i.e. critical according to the analysis teams' self-reported inferential criteria) obtained from the standardized models were used as the STANDARDIZED EFFECT SIZE (η_i) of each reported model. Moreover, to account for the differing degree of uncertainty around each standardized effect size, we used the standard deviation of each standardized effect size as the STANDARDIZED STANDARD ERROR (se_i). This enabled us to fit a so-called "measurement-error" model, in which both the standardized effect sizes and their respective standard errors are entered in the meta-analytic model. As a desired consequence, effect sizes with a greater standard error are weighted less than those with a smaller standard error in the meta-analytic calculations.
After having obtained the standardized effect sizes η_i with related standard errors se_i for each critical predictor in each reported model, we conducted a BAYESIAN RANDOM-EFFECTS META-ANALYSIS using a multi-level (intercept-only) regression model. The outcome variable was the set of standardized effect sizes η_i. The likelihood of η_i was assumed to correspond to a normal distribution (Knight 2000). The analysis teams were entered as a group-level effect (i.e., (1 | team), called a random effect in the frequentist literature). The standard errors se_i were included as the standard deviation of η_i to fit a measurement-error model, as discussed above. We used regularizing weakly-informative priors for the intercept α (Normal(0, 1)) and for the group-level standard deviation σ_αt (Half-Cauchy(0, 1)). We fit this model with 4 chains of Hamiltonian Monte Carlo sampling for the estimation of the joint posterior distribution, using the No-U-Turn Sampler
(NUTS) as implemented in Stan (Stan Development Team 2021), and 4000 iterations (2000 for warm-up) per chain, distributed across 8 processing cores with 2 threads for within-chain parallelization. The model did not incur any divergent transitions, R̂ was not greater than 1, and the effective sample sizes were sufficient. The code used to run the model can be found at https://many-speech-analyses.github.io/many_analyses/scripts/r/06_meta-analysis_prereg.
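For illustration, a minimal brms sketch of this preregistered measurement-error meta-analysis might look as follows (column names such as effect_size, std_error, and team are hypothetical; the actual script is available at the link above):

# Hedged sketch of the preregistered random-effects meta-analysis
# (hypothetical column names; see the linked script for the actual code).
library(brms)

fit_meta <- brm(
  # standardized effect sizes with known standard errors (se()),
  # plus a group-level intercept for each analysis team
  effect_size | se(std_error) ~ 1 + (1 | team),
  data   = effects_data,
  family = gaussian(),
  prior  = c(
    prior(normal(0, 1), class = Intercept),  # regularizing prior on the intercept
    prior(cauchy(0, 1), class = sd)          # half-Cauchy on the group-level SD
  ),
  chains = 4, iter = 4000, warmup = 2000, cores = 8
  # the original analysis additionally used within-chain parallelization
  # (2 threads per chain)
)

summary(fit_meta)  # sd(Intercept) for 'team' is the between-team variability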
The posterior distribution of the population-level intercept α allowed us to estimate the range of probable values of the standardized effect size η̂. The posterior distribution further allowed us to investigate the effect of a set of analytic and researcher-related predictors, detailed in the next section. Crucially, the posterior distribution of the group-level standard deviation σ_αt (i.e. the standard deviation of the group-level effect of team) allowed us to quantify the degree of variation between the teams' analyses on a standardized scale.
Analytic and researcher-related predictors affecting effect sizes As
a second step, we investigated the extent to which the individual
standardized effect sizes are affected by a series of ANALYTIC AND RESEARCHER-RELATED PREDICTORS.
Analytic predictors. We estimated the influence of the following
predictors related to the analytic characteristics of each team’s reported
analysis:
• Measure of uniqueness of individual analyses for the set of predictors in each model [numeric].
• Number of models the teams reported to have run [numeric].
• Major dimension that has been measured to answer the research question [categorical].
• Temporal window that the measurement is taken over [categorical].
• Average peer-review rating, as the mean of the overall peer-review ratings for each analysis [numeric].
Following Parker et al. (2020), the measure of uniqueness of predictors
was assessed by the Sørensen-Dice Index (SDI, Dice 1945; Sørensen
1948). The SDI is an index typically used in ecology research to
compare species composition across sites. It is a distance measure similar to Euclidean distance measures, but it is more sensitive to heterogeneous data sets and deemphasizes outliers. For our purposes, we
treated predictors as species and individual analyses as sites. For each pair of analyses (X, Y) (across and within teams), the SDI was obtained from the Dice coefficient:

SDI = 2|X ∩ Y| / (|X| + |Y|)

where |X ∩ Y| is the number of variables common to both models in the pair, and |X| + |Y| is the sum of the number of variables that occur in each model. Since the index is used here as a measure of uniqueness (beta.pair() returns the Sørensen dissimilarity, i.e. one minus the coefficient above), model pairs that share fewer predictors receive larger SDI values. For example, if two pairs of models differ in either only one predictor (e.g. DV ~ typicality vs. DV ~ typicality + trial) or in two predictors (e.g. DV ~ typicality vs. DV ~ typicality + trial + speech rate), the latter model pair would exhibit a larger SDI than the former. In order to generate a unique SDI for each analysis team, we calculated the average of all pairwise SDIs for all pairs of analyses using the beta.pair() function in the betapart R package (Baselga et al. 2020).
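To make the index concrete, the following toy R sketch (with made-up predictor sets, not the actual submitted analyses) computes pairwise values from the formula above; in the study itself the values were obtained with betapart's beta.pair() on a presence/absence table of predictors:

# Toy illustration of the uniqueness measure (hypothetical predictor sets).
analyses <- list(
  a1 = c("typicality"),
  a2 = c("typicality", "trial"),
  a3 = c("typicality", "trial", "speech_rate")
)

# Dice similarity: 2 * |X intersect Y| / (|X| + |Y|)
dice <- function(x, y) 2 * length(intersect(x, y)) / (length(x) + length(y))

pairs <- combn(names(analyses), 2)
# uniqueness (dissimilarity) = 1 - similarity: larger = less predictor overlap
sdi_values <- apply(pairs, 2, function(p) 1 - dice(analyses[[p[1]]], analyses[[p[2]]]))
mean(sdi_values)  # average pairwise SDI across these toy analyses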
The major measurement dimension of each analysis was categorized according to the following possible groups: duration, intensity, f0, other spectral properties (e.g. frequency, center of gravity, harmonics difference, etc.), and other measures (e.g. derived measures such as principal components, vowel dispersion, etc.). The temporal window that the measurement is taken over is defined by the target linguistic unit. We assume the following relevant linguistic units: segment, syllable, word, phrase, sentence. Since each analysis received more than one peer-review rating, we calculated the mean rating and its standard deviation for each analysis. These were entered in the model formula as a measurement-error term (me(mean, sd) in brms).
Researcher-related factors. We also included the following predictors:
• Research experience, as the time elapsed since receiving the PhD. Negative values indicate that the person is a student or graduate student [numeric].
• Initial belief in the presence of an effect of atypical adjective-noun pairs on acoustics, as answered during the intake questionnaire [numeric].
To obtain an aggregated research experience score and initial belief
score for each team based on the members’ individual scores, we
calculated the mean and standard deviation of these predictors for each team. These were entered in the model formula as a measurement-error term (me(mean, sd) in brms). Using a measurement-error term (which includes the team-level standard deviation) ensures that information about within-team variance is not lost (as it would be if only the mean were included).
We had initially planned to also include a measure of conservativeness of the model specification, operationalized as the number of random/group-level effects included and the number of post-hoc changes to the acoustic measurements the teams reported to have carried out. When fitting the model, we realized that this measure of conservativeness is related to the standard error of the estimates (i.e. more group-level effects = higher standard error). Moreover, no team declared having made post-hoc changes to the analyses, so we decided against including these two preregistered predictors in the model.
Model specification. The model was fitted as a measurement-error model, with the predictors detailed in the preceding paragraphs. The outcome variables of the model were the standardized effect sizes and their related standard deviations. A normal distribution was used as the likelihood function of α_t[i]. The mean of α_t[i] was modeled on the basis of the overall intercept β and on the coefficients of each predictor. The numeric predictors were centered and scaled and the categorical predictors were sum coded. We used a normal distribution with mean 0 and standard deviation 1 as the prior for the intercept and the predictors. The model was run with the same settings as the meta-analytic model. The code used to run the model can be found at https://many-speech-analyses.github.io/many_analyses/scripts/r/06_meta-analysis_prereg.
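A hedged brms sketch of this predictor model (again with hypothetical column names; the actual script is linked above) could look as follows. The me() terms carry the within-team uncertainty of the aggregated predictors, while se() carries the known standard error of each standardized effect size:

# Hedged sketch of the analytic/researcher-related predictor model
# (hypothetical column names; see the linked script for the actual code).
library(brms)

fit_pred <- brm(
  effect_size | se(std_error) ~
    sdi_z + n_models_z +                    # analytic predictors (centered/scaled)
    outcome_dimension + temporal_window +   # categorical predictors (sum coded)
    me(rating_mean, rating_sd) +            # mean peer-review rating with its SD
    me(experience_mean, experience_sd) +    # years since PhD, aggregated by team
    me(belief_mean, belief_sd),             # initial belief, aggregated by team
  data   = effects_data,
  family = gaussian(),
  prior  = c(
    prior(normal(0, 1), class = Intercept),
    prior(normal(0, 1), class = b)
  ),
  chains = 4, iter = 4000, warmup = 2000, cores = 4
)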
Data management All relevant data, code, and materials have
been publicly archived on the Open Science Framework (https:
//osf.io/3bmcp/). Archived data include the original data set
distributed to all analysts, any edited versions of the data analyzed
by individual teams, and the data we analyzed with our meta-
analyses, which include the standardized effect sizes, the statistics
describing variation in model structure among analysis teams, and the analysts' anonymized answers to our questionnaires. Similarly, we
archived both the analysis code used for each individual analysis and
the code from our meta-analyses. We also archived copies of our
survey instruments from analysts and peer-reviewers. Further documents
concerning the collaborative editing of the Registered Report can
be found at https://drive.google.com/drive/folders/
1-DOcj1qtEkvWfzu_FrsxkIGfPS0DyLXB?usp=sharing.
We excluded from our synthesis any individual analysis submitted after peer review (Phase 3) or unaccompanied by the analysis files without which it was not possible to follow the research protocol. We also excluded any individual analysis that did not produce an outcome that could be interpreted as an answer to our primary question, as well as analyses for which we could not extract standardized effect sizes. For a list of exclusion criteria, see the section Descriptive statistics below.
Phase 5: Collaborative Write-Up of Manuscript
The initiating authors discussed the limitations, results, and implications
of the study and collaborated with the analysts on writing the final
manuscript for review as a stage-2 Registered Report.||
Results
The results section is divided into three parts. We first provide a
statistical description of team composition, nature of acoustic analyses
and statistical approaches, and peer-review ratings. Second, we report
the results of the meta-analytic model, focusing on between-team and
between-model variability. Finally, we present the analysis of the effect
of analytic and researcher-related predictors on the meta-analytic effect.
The research compendium of the study, containing all the code and data presented here, can be found in the GitHub repository linked from the project page at https://osf.io/3bmcp/, in the scripts/r/ folder. An interactive web application that allows the
interested reader to explore the data set is available at https://
many-speech-analyses.github.io/shiny.
∥The comment history can be found at https://docs.google.com/document/d/
1CFgRo93mRgifpuFOuQE3vNBeMW-H7ps9eD- -vxH-6CQ/edit?usp=sharing.
Descriptive statistics
In the following sections, we will describe the characteristics of the
analysis teams that participated in the study and the analytic approaches
they adopted. An important aspect that emerges from the descriptive
analysis is the large variation in analytic strategies.
Characteristics of analysis teams Eighty-four teams initially signed up
to participate in the study, comprising 211 analysts. Thirty-eight of the
signed-up teams dropped out during the analysis phase.
Forty-six teams submitted their analyses by the established deadline.
Only analyses from which it was possible to extract an effect size were
included in the meta-analysis. Of the analyses submitted by the 46 teams,
the initiating authors identified 33 teams with submissions meeting the
criteria for inclusion in the meta-analytic model. Reasons for exclusion were: use of Generalized Additive Models (4 teams), which do not lend themselves easily to the meta-analytic methods employed in this study; use of machine learning techniques (3 teams); use of typicality as the outcome variable/response (3 teams); or use of other methods that returned statistics that could not be included in the meta-analytic model. Note that due to the
unforeseen variability across teams, the latter exclusion criteria were not
preregistered and were applied after having seen all analytic strategies.
In what follows, we describe the characteristics of those teams
whose analyses were included in the meta-analytic model. A
complete summary of all the analyses from the 46 submitting
teams is available in the supplementary materials at https:
//many-speech-analyses.github.io/many_analyses/
RR_manuscript/supplementary_materials.pdf.
The included analyses were provided by 33 teams, comprising 120
analysts, with a median of 3.0 individuals per team. Upon sign-up, we
collected background information from each analyst through the intake
form, which was administered during Phase 1, prior to the data being
released to the teams. Analysts had a median of 5.4 years of experience
after completing their PhD, ranging from -3.8 years (i.e. PhD students or less experienced) to 12.4 years, suggesting that, on average, analysts
were experienced researchers. The analysts’ prior belief in the effect under
investigation, on a scale from 0 to 100, ranged from 46.4 to 92.0 with a
median of 70.0. We take this to suggest that, overall, analysts had a rather
high positive prior belief in the investigated relationship between acoustics
and word combination typicality.
At the end of Phase 2 (primary data analysis), the teams had submitted
a grand total of 115 individual models (including 192 critical model
coefficients, given that some models returned more than one critical
coefficient) to answer the research question, with a median of 3 models
per team. Table 1 provides a summary of the contributing teams and their
analyses.
Acoustic analysis The analytic teams differed in their approach to the
acoustic analysis of the speech signal, including choices related to specific
acoustic measures, the temporal window used, and how the measures were
transformed. Thirty-seven percent of the models used f0 as the outcome
variable, 33% used a measure of duration, 13% used vowel formants, 15%
intensity, and 3% other measures.
Forty-five percent of models used acoustic measures taken at the level
of the segment (e.g. comparing the acoustic profile of a vowel), 45% from
the word level (e.g. comparing the acoustic profile of Banane ‘banana’),
3% at the level of the phrase (e.g. the noun phrase including determiner
and adjective, e.g. “the green banana”), 3% from the whole sentence,
and 3% used a different time window. Based on a coarse coding of how
acoustic measures were operationalized, we find a total of 55 different
measurement specifications. For example, if we consider those analyses
that target f0, we find that it is operationalized in many different ways, including as the minimum, the maximum, the mean, the median, a range within an interval, or a ratio between two intervals. The measurement is
sometimes taken from the interval of a vowel in the article, the adjective
or noun; it is sometimes taken from the word interval of the article,
adjective or noun; or it is taken from either the noun phrase interval or
the entire sentence. Some of these measures were normalized relative to
other elements in the sentence or relative to the speaker.
Statistical analysis The large decision space related to how the acoustic
signal was measured is further expanded by the choices in the statistical
analysis, including the chosen inferential framework, the type of model,
and the model specification, including choice of predictors, interactions
and group-level effects.
The mean number of different predictors included in the teams' models was 2 (predictors being defined as variables, i.e. columns, in the data table).
This means that, in addition to the critical predictor (typicality of the adjective-noun combinations), models had on average one additional predictor (range = 1-5). Information used as predictors included the information structure of the sentence, trial number, semantic dimensions of the referent, part of speech, and speaker gender.
The data given to the teams allowed them to operationalize the predictor
of interest, word typicality, in different ways. Among the possible
operationalizations, 69% of models contained typicality as a categorical
variable (e.g. atypical vs. typical), 28% used a continuous typicality scale
from 0-100 by calculating the mean typicality for each word combination
as obtained from the norming study, while 3% of the models used
the median typicality rating. Note that the design of the experiment, alongside its description, indicated that typicality was operationalized categorically, which possibly explains the analysts' strong preference for this choice.
The majority of models were run within a frequentist framework (84%).
Sixteen percent were run within a Bayesian framework. While teams almost exclusively used linear models to analyze their data (98%), they differed drastically in how they accounted for dependencies within the data.
The data contains several dependencies between data points, with
multiple data points coming from the same subject and with multiple
data points being associated with the same adjective or noun. An
appropriate way to account for this non-independence is by using models
that include so-called random or group level effects (e.g., Gelman and
Hill 2006; Schielzeth and Forstmeier 2009), variably known as mixed-
effects, hierarchical, multi-level, or nested models (among other names).
Nine percent of the linear models specified no random effects at all (thus completely pooling the data), effectively ignoring these non-independences (Hurlbert 1984). Sixty-two percent specified random intercepts only, and 29% specified both random intercepts and random slopes to account for the non-independence. On average, teams that specified random effects included 2.5 random terms in their models. Based on the statistical framework, type of model, distribution family, and fixed terms (not counting random effects), there were a total of 52 different model specifications.
When considering both acoustic and statistical analyses, we found a grand total of 119 different analytic pipelines. In other words, each individual analysis submitted was unique.
Our quantitative assessment did not include other degrees of freedom, all of which are additional sources of variation: teams differed with regard to how the acoustic signal was segmented, ranging from fully automated forced alignment with minimal manual correction to complete manual alignment performed by the analysts; teams differed in whether the statistical analysis was based on a subset of the data or the whole data set; and they differed in whether, and if so how, measurements were excluded, on both qualitative grounds (i.e. whether specific speech production instances were excluded or not) and quantitative grounds (i.e. whether data were trimmed or not).
The question arises whether these unique analysis pipelines led to different conclusions. Thirteen of the thirty-three teams (39.4%) reported having found at least one statistically reliable effect (based on the inferential criteria they specified). Of the 192 critical model coefficients, 45 were claimed to show a statistically reliable effect (23.4%).
Review ratings Teams reviewed each other's acoustic and statistical analyses. The mean rating of the acoustic analyses, on a scale from 0 to 100, is 71.5 (SD = 13.5). The mean rating of the statistical analyses is 69.4 (SD = 15.9). For reference, as mentioned in the Methods section, a score of 75 was defined as "an imperfect analysis but the needed changes are unlikely to dramatically alter the final interpretation", indicating that on average reviewers judged the provided analyses to be appropriate, although "imperfect".
Meta-analytic estimation
This section deals with the meta-analytic analysis of the results submitted by the teams. As discussed above, only the analyses of the 33 teams meeting the inclusion criteria were included in the meta-analytic model discussed here. First, we report on the between-team variability estimate (i.e. the meta-analytic group-level standard deviation σ_αt), which is the focus of this study, followed by the meta-analytic estimate (i.e. the intercept of the meta-analytic model, in other words, the estimated effect of typicality on the acoustic production of adjective-noun combinations).
Table 1. Descriptive statistics of teams, acoustic analyses, and statistical analyses included in the meta-analysis. The data set included analyses from 33 teams and 120 analysts.

Team characteristics                      Range          Median
  Team size                               1.0 – 12.0     3.0
  Years after PhD                         -3.8 – 12.4    5.4
  Prior belief                            46.4 – 92.0    70.0
  Acoustic analysis peer rating           41.2 – 88.3    73.8
  Statistical analysis peer rating        33.0 – 93.3    73.2
  Overall peer rating                     39.0 – 88.7    70.8

Acoustic analyses                         n              %
  Outcome
    F0                                    44             37
    Duration                              39             33
    Intensity                             18             15
    Formants                              15             13
    Other                                 3              3
  Temporal window
    Segment                               54             46
    Word                                  53             45
    Sentence                              4              3
    Phrase                                3              3
    Other                                 4              3
  Typicality operationalization
    Categorical                           82             69
    Continuous (mean)                     33             28
    Continuous (median)                   3              3

Statistical analyses                      n              %
  Framework
    Frequentist                           100            84
    Bayesian                              19             16
  Model
    Linear model                          117            98
    GAM                                   1              1
    Other                                 1              1

                                          Range          Median
  N Models                                1 – 16         3
  Predictors                              1 – 5          2
  Random terms                            1 – 10         2
  Intercept                               1 – 10         2
  Slope                                   0 – 4          0
Between-team variability The primary aim of this analysis is to assess
the degree of between-team variability. As a measure of between-
team variability, we chose to use the meta-analytic group-level standard
deviation (σ_αt).
According to the preregistered meta-analytic model, the group-level
standard deviation for teams is between 0.03 and 0.07 standard units at
95% credibility. In other words, the estimated range of variation across
teams lies somewhere between ±0.06 (0.03 * 1.96) and ±0.13 (0.07 *
1.96) standard units with 95% credibility.
Non-preregistered. However, in our preregistration we did not take into account that teams might submit multiple analyses/models which, if unaccounted for, violates the independence assumption. Teams were explicitly instructed to submit only one effect size, but this was not enforced. As a result, some teams followed the instruction and submitted only one model while others submitted multiple models. To account for this added layer of dependency, we ran a model with team and model ID nested within team as group-level effects ((1|team) + (1|team:model_id)), which allows us to estimate both the between-team variation and the between-analysis variation. This analysis was not preregistered and should thus be interpreted with caution.**
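A minimal sketch of this nested specification (with the same hypothetical column names as before) is:

# Sketch of the non-preregistered nested model: analysis ID nested within team.
library(brms)

fit_nested <- brm(
  effect_size | se(std_error) ~ 1 + (1 | team) + (1 | team:model_id),
  data   = effects_data,
  family = gaussian(),
  prior  = c(
    prior(normal(0, 1), class = Intercept),
    prior(cauchy(0, 1), class = sd)
  ),
  chains = 4, iter = 4000, warmup = 2000, cores = 8
)

summary(fit_nested)  # compare sd for 'team' with sd for 'team:model_id'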
The nested model yields a posterior 95% CrI for between-team variability of 0 to 0.04 standard units (β = 0.02, SD = 0.01), corresponding to a mean deviation range of about ±0 to ±0.1 standard units at 95% probability. The posterior 95% CrI for between-analysis variability (nested within teams) is 0.11 to 0.14 standard units (β = 0.132, SD = 0.01). For the sake of illustration, these values would correspond to an estimate of between-model variability in segment and word durations that ranges from 7 to 14 ms for segments and from 7 to 33 ms for words at 95% credibility. We interpret these values in more detail in the Discussion section.
Taken together, the models suggest that the variability of reported effects between individual models (within and across teams) is substantially larger than the variability across individual teams. We return to this important observation later.
∗∗Note that before fitting this model, we fitted a separate one in which model ID was the only (non-nested)
group-level effect. The estimated group-level effect of model ID is identical to that of the nested model, so we
will not discuss it further.
Meta-analytic intercept After assessing the variation between teams and
analyses, we now turn to the meta-analytic estimate of the effect of
typicality on the acoustic realization of sentences with adjective-noun
combinations. The meta-analytic model estimates the range of probable
values of the standardized effect size to be between -0.026 and 0.016
standard units (95% CrI, mean = -0.005). In other words, our best guess is
that speakers might not encode typicality in the acoustic signal (e.g. by
duration, f0, etc.) or, if they do, they do so by a maximum of ±0.03
standard units.
Non-preregistered. As mentioned in the previous section, we ran an additional model, using team and model ID nested within team as group-level effects. In this non-preregistered model, the meta-analytic intercept estimate is between -0.016 and 0.03 standard units (95% CrI, β = 0.008). This suggests that the acoustic measures of typical word combinations are between 0.02 standard units lower and 0.03 standard units higher than the measures of atypical word combinations, at 95% credibility. This result is qualitatively similar to the results obtained with the preregistered model.
The meta-analytic intercept conflates estimates from a variety of responses taken from very different places in the utterance (nouns, adjectives, determiners, entire phrases or sentences, etc.). This means that some of the effects on a particular response as observed in a specific location within the utterance might naturally be positive, while others negative, resulting in a meta-analytic intercept of about zero. We want to stress, however, that our focus is not on the meta-analytic intercept per se, but on the fact that a seemingly straightforward research question led to so many possible outcomes. More on this in the Discussion section.
Figure 3 illustrates the individual intercepts for critical typicality
coefficients across models and teams, sorted in ascending order
based on their mean. Given the nature and wide variety of acoustic
operationalizations, there is no natural interpretation of the scale, so we
cannot interpret the direction of estimates. When looking at the raw
estimates and their variance (grey triangles and lines), it is striking how
much estimates differed. Estimates ranged from -0.7 to 1.01 standard
units.
Figure 3. Standardized effect sizes across all critical coefficients provided by the teams. Raw estimates are displayed in grey. Estimates after shrinkage, as provided by the meta-analytic model, are displayed in black.

While the majority of model estimates and their uncertainty intervals after shrinkage yield inconclusive results (i.e. they are compatible with a point null hypothesis), there are 27 model estimates (14%) for which the 95% credible interval does not contain zero.
Analytic and researcher-related predictors
After assessing the variability across teams and models, we now turn to
estimating the impact of a series of predictors on the reported standardized
effects. There is a large amount of variation between and within teams,
raising the question as to whether we can explain some of this variation or
whether it is purely idiosyncratic (Breznau et al. 2021).
We ran a model as described in the section Analytic and researcher-related predictors affecting effect sizes above. Figure 4, panel C, displays the coefficients for all predictors alongside their 80% and 95% credible intervals. The model suggests that most team-specific predictors yield very small deviations from the meta-analytic estimate and their 95% credible intervals include zero, leaving us highly uncertain about their direction. Neither analysts' prior beliefs in the phenomenon (β = -0.01, 95% CrI = [-0.04, 0.01]) nor their seniority in terms of years after completing their PhD (β = 0.01, 95% CrI = [-0.02, 0.04]) seem to affect model estimates. Similarly, the quality of the analysis as evaluated by their
peers yielded a rather small effect magnitude, again characterized by large uncertainty (β = 0.02, 95% CrI = [-0.01, 0.05]). Interestingly, model uniqueness, i.e. how unique the choice and combination of predictors is, affected the estimates, with more unique models producing higher positive estimates (β = 0.04, 95% CrI = [0.02, 0.07]).
Looking at the most important choices during measurement, both the acoustic parameter under investigation (e.g. f0 or duration) and the choice of measurement window affected the results. Panels A and B of Figure 4 display the posterior estimates for the measurement outcome (i.e. which acoustic dimension was measured, panel A) and the measurement window (i.e. the unit over which the outcome was measured, panel B). If, on the one hand, an acoustic dimension related to f0 was measured, estimates are lower than the meta-analytic estimate. If, on the other hand, duration was measured, estimates are higher than the meta-analytic estimate. Similarly, if acoustic parameters were measured across the entire sentence, estimates are lower than the meta-analytic estimate. In other words, depending on the choice of measurement and the measurement window, analysts might have arrived at different conclusions about whether and how typicality is expressed acoustically.
It is because of the latter patterns that we need to interpret the results of the model with great caution. Since there are combinations of analytic choices that appear to systematically result in lower or higher estimates, and since predictors are not fully crossed (i.e. we do not have the same amount of data for all combinations of, e.g., outcome and measurement window), the estimates for certain predictors might be biased if predictors are collinear. This bias might be amplified by the fact that the scale has no natural interpretation across all teams, with different measurements potentially cancelling each other out. We checked correlations between predictors and, while predictors do not seem to be highly collinear, the estimates might still be biased.
Discussion
Summary
We gave 46 analyst teams the same speech data set to answer the same
research question: Do speakers acoustically modify utterances to signal
atypical word combinations? In order to answer this question, teams had
to interpret the research question by operationalizing constructs within multidimensional signals, operationalizing and choosing appropriate model predictors, and constructing appropriate statistical models.

Figure 4. The effects of analytic and researcher-related predictors on the reported standardized effect sizes. (A) Posterior samples for the four most frequent outcome variables. (B) Posterior samples for the four most frequent temporal windows. Black points indicate medians; shaded areas represent 50/80/95% highest density intervals. (C) Mean posterior samples (white circles) and 80/95% credible intervals for all predictors, grouped into predictors related to (1) temporal window, (2) outcome variable, and (3) team/analysis.

This complex process has led to a vast "garden of forking paths", i.e. to a
wide range of combinations of possible analytic decisions. The submitted analyses exhibited at least 55 unique ways of operationalizing the acoustic signal alongside 52 unique ways of constructing the statistical model.
By multiplying the numbers of acoustic and model specifications, there
are in principle 2860 possible unique combinations. Note that this is
a conservative estimate of the number of possible analytic choices for
our research question, ignoring many other degrees of freedom such as acoustic parameter extraction, outlier treatment, and transformations,
all of which might have an impact on the final results (Breznau et al. 2021).
Different analysis paths led to different categorical conclusions, with 39.4% of teams reporting at least one statistically reliable effect. To gain a better understanding of whether the observed quantitative variability can result in theoretically different claims, we contextualize it in actual acoustic measures. We calculated the standard deviation of a selection of acoustic measurements, as submitted by the analysis teams: duration, f0, and intensity, taken from different time windows. These standard deviations can be considered a coarse indication of the variability in the obtained acoustic measures. We can now use these values to interpret the meta-analytic estimates, which are in standardized units, by transforming the standardized units into measures of duration, f0, and intensity.††
For example, for those analyses that investigated the duration of vowels (e.g. the duration of the stressed vowel in Banane), the reported duration measures exhibit standard deviations that range from 33.4 to 51.4 ms. These standard deviations allow us to convert the meta-analytic estimates into milliseconds by multiplying them with the standard unit values of the meta-analytic estimates. The reported effect estimates from teams varied between -0.7 and 1.01 standard units, which corresponds to estimated segment duration differences (for atypical vs typical combinations) ranging from -23.34 to 33.84 ms. A more conservative approach is to convert the meta-analytic estimates of between-model variation, thus obtaining an estimate of between-model variability in milliseconds that ranges between 7.2 and 14.1 ms at 95% credibility. The calculation is thus: the minimum standard deviation of duration multiplied by the lower limit of the 95% CrI of the between-model variability estimate, times 1.96 to obtain a 95% CrI: 33.4 * 0.11 * 1.96 = 7.2 ms; the maximum standard deviation of duration multiplied by the upper limit of the 95% CrI of the between-model variability estimate, times 1.96: 51.4 * 0.14 * 1.96 = 14.1 ms.
While this might not immediately strike one as highly variable, it crosses
several theoretically relevant thresholds for perception and articulation:
for example, the widely studied phenomenon of incomplete neutralization
involves vowel duration effects ranging from 7 to 15 ms (Nicenboim
et al. 2018). This particular phenomenon has sparked long-lasting
††Note that these categories necessarily refer to a variegated set of measures, for example the domain “word”
includes words that differed along several dimensions, including their length and their metrical structure.
Table 2. Estimated 95% CrIs of deviation from the meta-analytic effect in acoustic measures, based on the lower and upper limits of the between-model variation.

Outcome     Temporal window   Lower       Upper       Unit
Duration    Segment           7–10.8      9.3–14.4    ms
Duration    Word              6.9–25.3    9.1–33.4    ms
f0          Segment           0.9–9.4     1.2–12.4    Hz
f0          Word              0.8–9.9     1.1–13.2    Hz
Intensity   Segment           0.7–1.5     0.9–2       dB
Intensity   Word              0.7–0.9     1–1.2       dB
methodological and theoretical debates about the very nature of linguistic
representations (Port and Leary 2005) and has been replicated several
times in both production and perception. Vowel duration differences
within this range have also been reported across phenomena associated
with segmental contrasts (Coretta 2019), reduction phenomena (Nowak
2006), and biomechanical reflexes of prominence (Mücke and Grice
2014). Thus, variation between different analyst teams of 7.2 to 14.1 ms
in one or the other direction can be theoretically relevant and might lead
to opposing theoretical conclusions.
While one might find it obvious that measuring different parts of
the speech signal can lead to different results, the fact that analysts
(and reviewers alike) considered all these data analytic pipelines valid
ways of answering the same research question points to a lack of
theoretical consensus on what parts of the speech signal correspond to
what types of communicative functions. Importantly, even when analysts chose to measure more or less the same acoustic property within the same measurement window, they arrived at different estimates: for example, six
teams measured f0 in the adjective and predicted f0 based on typicality
as a categorical predictor. Their standardized effect estimates ranged
from -0.11 to 0.38 standard deviations. While these teams in principle
measured the same thing, they differed in analytical details of how f0 was
operationalized (i.e. mean, minimum, maximum, or range) and how their
statistical model was constructed (i.e. the number of predictors ranged
from 1-3 and the number of random effect terms ranged from 1-4). As
shown by Breznau et al. (2021), even seemingly inconsequential analytical
choices can affect conclusions in non-trivial ways.
The observed variation does not seem to be systematic. For example,
variation between teams was not predicted by the analysts’ prior
expectations about the phenomenon. In fact, teams on average rated
the plausibility of the effect as rather high before receiving access to
the data. The observed variation was neither predicted by the analysts’
experience in the field nor by the perceived quality of the analysis as
judged by other teams. Analyses received overall high peer-ratings for
both the acoustic and the statistical analysis, suggesting that reviewers
were generally satisfied with the other teams’ approaches.
These findings are very much in line with previous crowd-sourced
projects that suggest variation between teams is neither driven by
perceived quality of the analysis nor by analysts’ biases or experience
(e.g., Silberzahn et al. 2018; Breznau et al. 2021). Following Breznau
et al. (2021, p. 9), we are bound to conclude that “[. . . ] idiosyncratic
uncertainty is a fundamental feature of the scientific process that is
not easily explained by typically observed researcher characteristics or
analytic decisions”. Idiosyncratic variation across researchers might be
a fact of life which we have to acknowledge and integrate into how we
evaluate and present evidence.
While properties of the teams did not seem to systematically affect the
results, teams’ estimates seem to highly depend on certain measurement
choices. Human speech entails complex multidimensional signals.
Researchers need to make choices about what to measure, how to measure
it and which temporal unit to measure it in. Some of these choices seem to
result in estimates in one direction while others seem to result in estimates in another. For example, measurements related to f0 tended to result in
lower estimates while measurements related to duration tended to yield
higher estimates.
The asymmetry observed in the effect direction of different
measurements can have several causes. First, there could be a true
underlying relationship between typicality and the speech signal that
manifests itself in some measures but not others and/or manifests itself
negatively in one acoustic measure but positively in another.
Second, and orthogonally to a possible true relationship, certain measurement choices might be associated with stronger expectations relative to the research question, which might lead to stronger researcher biases. Many analysts targeted measures related to f0, likely
because similar functional dimensions, such as information structure and predictability, can be expressed by f0 (e.g. Grice et al. 2017; Turnbull
2017). Moreover, prior work has actually suggested a relationship
between typicality and f0 (e.g. Dimitrova et al. 2008, 2009). Participating
analysts could have been aware of those findings, which might have,
subconsciously or otherwise, nudged their choices in one particular direction.
Regardless of the cause of these systematic effects, we have to conclude
that depending on the choice of how the speech signal is operationalized,
researchers might find evidence for or against a theoretically relevant
prediction. This conclusion is further supported by the fact that between-
team variability was lower than between-model variability. This is an important observation when put in the context of the fact that most teams submitted many different models. Teams submitted up to 16 different
models to test for a possible relationship between typicality and the
speech signal. The complexity of the speech signal lends itself to multiple
approaches, but this plurality of hypothesis tests invites bias and can
dramatically increase the rate of falsely claiming the presence of an effect
(Roettger 2019; Simmons et al. 2011). We of course are not arguing that
exploratory analyses should not be employed. Rather, we simply want
to point out that if the theoretical underpinnings of the field were much clearer, different teams would have converged towards a limited set of analyses, even given a less specific research question.
In relation to this aspect, one team coordinator decided to drop out of the project because they considered its approach too top-down. The coordinator also expressed a preference for being able to explore and run a variety of descriptive analyses followed by inferential statistics. We find that this attitude speaks to the main objective of the current study: investigating researchers' degrees of freedom in the speech sciences. Based on our personal
experience with research in the field, it is common practice to test many
different types of models, using many different types of measurements,
to answer one research hypothesis. While this is a valid way to explore
data and generate new hypotheses, it is not suitable for hypothesis testing.
When operating within the frequentist inferential framework, testing the
same hypothesis with different dependent variables is known to increase
the false-positive (Type-I error) rate. The well-established solution to
this problem is to apply a correction for family-wise error (i.e., alpha
correction). However, less clear-cut degrees of freedom such as those observed in the present study cannot be corrected for in a straightforward way. If left uncorrected, these degrees of freedom can nevertheless drastically inflate the false positive rate, even if different choices are highly correlated (Roettger 2019). Another possible outcome of analytic flexibility as seen
in this study is selective reporting of those tests that yield a desirable
outcome (Kerr 1998; John et al. 2012; Simmons et al. 2011), while
null results remain unreported (Sterling 1959; Rosenthal 1979). Fields
such as the speech sciences that make theoretical advances based on
multidimensional data should be aware of this flexibility and calibrate
their confidence in empirical claims accordingly.
Looking at our results, one might argue (and this interpretation has
been articulated by several teams during the collaborative write-up)
that our sample of speech scientists actually converged on a qualitative
conclusion, i.e. there is no evidence for a relationship. However, if
there truly was no underlying relationship, our results would suggest a concerning false positive rate, with 39.4% of teams reporting at least one statistically reliable effect. This rate is substantially higher than the conventionally accepted 5% false positive rate in, for example, null hypothesis significance testing frameworks. If, on the other hand, there actually was an underlying relationship, our results would suggest a concerning false negative rate of 60.6%, with the majority of teams not detecting the effect. If the latter were true, the fact that the majority of teams
arrived at a null result might also simply be a consequence of the sample
size in the data set being too small to reliably detect an effect (which is
unknown to us). Thus, we do not think that our study provides convincing
evidence that speech researchers converged on the same qualitative answer
to a broad research question.
Lessons for the methodological reform movement
The current results point to important barriers to the successful
accumulation of knowledge. The replication crisis has brought attention
to scientific practices that lead to unreliable and biased claims in the
literature (Vazire 2017; Fidler and Wilcox 2018). One of the suggested
paths forward is for researchers to directly replicate previous studies more
often (Open Science Collaboration 2015; Camerer et al. 2018). While we
agree with the importance of direct replications, our study (and similar
crowd-sourced analyses before us) suggests that replicating more is simply
not enough. There is only limited value in learning that a particular
procedure is replicable if the idiosyncratic nature of the procedure itself
might not yield a representative result relative to all possible procedures
that could have been applied to the research question. Thus beyond
a mere replication crisis, quantitative disciplines are going through an
“inference crisis” (Rotello et al. 2015; Starns et al. 2019). As shown
by the peer-ratings of the analyses reported in this study, well-trained
and experienced speech researchers not only applied completely different
approaches to the same research question, but also considered most of
these alternative approaches acceptable. Being aware of this idiosyncratic
variation between analysts should lead to more nuanced claims and a
certain level of epistemic humility (see Campbell 1975, for an overview
of the concept).
A desired outcome of knowing that different but reasonable
measurement choices or statistical approaches might lead to different
interpretations of research data is to calibrate our (un)certainty in
the strength of the collected evidence and, in turn, communicate that
(un)certainty appropriately. The fact that the choice of measurement,
measurement window, and predictor choice affect the answer to
the research question further suggests that research assumptions and
hypotheses should be formulated in much greater detail, particularly with regard to how measurement systems (here, the acoustic signal)
and underlying conceptual constructs (here, the phonetic expression of
typicality) relate to each other.
We should ideally specify the link between conceptual construct and
quantitative system—the “derivation chain” (Dubin 1970; Meehl 1990)—
prior to data collection and analysis, including defining constructs and
their relationship within the quantitative system, specifying auxiliary
assumptions and boundary conditions, and defining target measurements,
statistical expectations and possible (and impossible) effect magnitudes.
Without well-defined derivation chains, we “are not even wrong” (Scheel
2022) because falsified expectations cannot tell us much about the
conceptual constructs they are based on when the relationship between the
two is underspecified. Some of the analysis teams explicitly recognized
and acknowledged the need to formulate a more precise version of the
research question by preregistering their planned data analysis pipeline.
Preregistration, i.e. a time-stamped document in which researchers specify
how they plan to collect their data and/or how they plan to conduct their
confirmatory analysis, can be a useful tool to safeguard researchers
against the urge to explore many different analytical paths before choosing
the one that, in hindsight, seems most justified. However, as long as
the theoretical landscape does not allow for more precise hypotheses,
the value of preregistration is limited and we need to find ways to
appropriately calibrate the confidence in our claims.
Through sharing of materials, data and statistical protocols, we can
make our idiosyncratic choices transparent to others (Munafò et al. 2017;
Vazire 2017). Sharing further enables the evaluation and verification
of underlying claims and allows for the evaluation of empirical,
computational and statistical reproducibility (LeBel et al. 2018). It allows
for alternative analyses to establish analytic robustness (Steegen et al.
2016) and strengthens attempts to synthesize evidence via meta-analyses
(e.g., Nicenboim et al. 2018). Given that minor procedural changes can
sometimes drastically affect the final interpretation of the results (Breznau
et al. 2021), we should ideally share a detailed documentation of the
data collection procedure, the measurement choices, the data extraction,
and statistical analyses. Within fields that deal with speech data, open
source software that permits the extraction of acoustic parameters via
reproducible scripts can help other researchers to trace back seemingly
inconsequential choices during the measurement process (e.g., Praat:
Boersma and Weenink 2021; EMU: Winkelmann et al. 2017; the Montreal
Forced Aligner: McAuliffe et al. 2017).
Making analytic pathways completely re-traceable and preregistering
them in advance does not change the fact that different analysts might
apply different analytic approaches (preregistered or not). Crowd-sourced
projects such as the current one can shed light on the range of degrees of
freedom during analysis and could possibly help produce a consensual
estimated effect if the research hypothesis is specific enough. Crowd-
sourcing analyses is obviously not always feasible in terms of required
resources and time, but could be a consideration for claims that have large
epistemological or practical consequences.
If we develop a good understanding of relevant analytic degrees of
freedom, we could apply all conceivable analytic strategies and compare
the results across all combinations of these choices. Such an analysis can
provide insight into how much the conclusions change due to analytic
choices as well as which choices have negligible or large impact on the result. This approach is called a "multiverse analysis" (e.g., Steegen et al.
2016; Harder 2020) and has recently gained popularity across disciplines.
Finally, neither crowd-sourcing nor multiverse analyses will guarantee
that all relevant pathways are explored. Crowd-sourcing is limited by
the sampled analysts and their biases. Multiverse analyses are limited
even further by the group of researchers who define possible analytic
pathways. Eventually, a mature scientific discipline needs to develop
a set of detailed quantitative hypotheses of how conceptual constructs
manifest themselves in the measured system, i.e. in the present case
how communicative pressures of certain functions are expressed in the
acoustic signal. Possible tools to strengthen theoretical development relate
to mathematically formalizing verbal expectations or using computational
models (e.g., van Rooij and Blokpoel 2020; Guest and Martin 2021;
Scheel et al. 2021; Devezer et al. 2021). Although conceptually promising,
in their current state, such formalized models typically work in spaces
that are much lower in dimensionality than the complex systems in which
we measure. Thus, future research should spend resources on attempting
to quantitatively relate the abstract theoretical space to the complex
measurement space.
Caveats
Our study has several limitations that need to be considered when
interpreting our results.
First, while the total number of analyses is larger than most earlier
crowd-sourcing projects, it is likely to be too small to reliably estimate
the impact of certain predictors. Since predictors’ values were not
systematically distributed across teams, our estimates are characterized
by large uncertainty.
Second, uncertainty is further inflated by the fact that the research
question presented to the teams was vague, despite being of a kind
normally found in the speech science literature: Do speakers acoustically
modify utterances to signal atypical word combinations? Interpreting
the research question/hypothesis differently in terms of its statistical
consequences has recently been shown to explain some variation
between analysis teams in many-analyses projects (Auspurg and Brüderl
2021). The analysts might also have tried to answer different specific
manifestations of the research question that was given to them, leading
to different choices down the line (e.g. Do speakers modify f0 in
atypical adjectives?). It could be argued that some teams would not have specified such a vague research question to begin with, which would have reduced the possible degrees of freedom substantially. However, this very underspecification of research hypotheses in the field of speech science (and beyond, see Scheel 2022) is very common. For example, researchers have not yet agreed on how to acoustically measure cross-linguistically common phenomena such as word stress (e.g. Gordon and Roettger 2017). Research on acoustic markers of clinical conditions such as depression and schizophrenia is often difficult to compare due to the wide variety of different acoustic measures employed (e.g. Cummins et al. 2015; Parola et al. 2022).
Third, the design of this crowd-sourced study has artificially inflated the variability between teams by encouraging anti-coordination strategies. Teams knew that there would be other analyst teams and therefore might have chosen a "less canonical" analysis. Since analysts were guaranteed co-authorship of a publication that was itself (in principle) guaranteed, such an anti-coordination approach was not explicitly disincentivized.
Fourth, our sample is an opportunity sample. We advertised the project through online platforms, which might have led to the exclusion of certain potential researcher groups. The sampling strategy also might have given access to researchers who were less experienced in particular aspects of the data analysis, possibly introducing uncommon analytic choices or poor-quality analyses. However, to our knowledge, neither the peer review
among teams nor the information gathered through our questionnaires
indicated any obvious cases of what one might consider incompetent
analyses.
In light of both the observed large variability between teams, and
possible sources of bias, a field can benefit from explicit positionality
statements (e.g., Jafar 2018;