Magnitude-based inference: What is it? How does it work and is it appropriate?


Abstract

This article provides a short explanation of Magnitude-Based Inference. It also provides a link to a YouTube video that explains Magnitude-Based Inference in more detail, and to a second video where I discuss the critiques of Magnitude-Based Inference. I also suggest some alternative approaches.
B. Van Hooren 1
1NUTRIM School of Nutrition and Translational Research in Metabolism, Maastricht University Medical Centre+, Department of Nutrition and Movement Sciences, Maastricht,
The Netherlands
Magnitude-based inference | Video | Tutorial
Research in the field of sports science is frequently performed on a relatively small number of individuals. We are usually, however, interested in knowing whether the effect found in our sample of individuals also applies to a larger group: the population from which the sample is drawn. For this purpose, we use statistical inferential methods. Several statistical inferential methods are available. The most widely used method is arguably null-hypothesis significance testing. This method has been widely criticized since its introduction, most prominently because statistically significant results are not necessarily clinically relevant, and statistically non-significant results can still be clinically relevant (Figure 1) (1, 2).
Magnitude-Based Inference. Motivated by the limitations of null-hypothesis significance testing, Batterham and Hopkins (1) developed a new statistical inferential method in 2006, entitled “Magnitude-Based Inference”. In this method, confidence intervals are interpreted in relation to a smallest worthwhile change (Figure 1). The method has seen a large uptake in the sports science community and is also increasingly used in other research fields. Despite this large uptake, not all researchers and practitioners fully understand how the method works. Understanding how Magnitude-Based Inference works is, however, important, as it helps researchers and practitioners to correctly interpret the results of studies that have used this method. In a new video, I therefore explain what Magnitude-Based Inference is and how it works. Click here for the link to the video.
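The arithmetic behind Magnitude-Based Inference can be sketched in a few lines of code. The following is a minimal illustration under a flat-prior normal approximation (the published spreadsheets use a t distribution); the function names, the observed 7 kg / 2 kg example values, and the qualitative probability thresholds (the scale commonly cited for MBI) are my illustrative assumptions, not the authors' implementation.

```python
from statistics import NormalDist

def mbi_probabilities(effect, se, swc):
    """Probabilities that the true effect is harmful, trivial or
    beneficial, given an observed effect, its standard error (se) and
    the smallest worthwhile change (swc). Flat-prior normal
    approximation; the original spreadsheets use a t distribution."""
    dist = NormalDist(effect, se)
    p_harm = dist.cdf(-swc)           # P(true effect < -swc)
    p_benefit = 1 - dist.cdf(swc)     # P(true effect > +swc)
    p_trivial = 1 - p_harm - p_benefit
    return p_harm, p_trivial, p_benefit

def qualitative(p):
    """Map a probability to the qualitative scale commonly cited for
    MBI (e.g. >75% = 'likely', >95% = 'very likely')."""
    for cutoff, label in [(0.995, "most likely"), (0.95, "very likely"),
                          (0.75, "likely"), (0.25, "possibly"),
                          (0.05, "unlikely"), (0.005, "very unlikely")]:
        if p >= cutoff:
            return label
    return "most unlikely"

# Illustrative example in the spirit of Figure 1: an observed 1RM gain
# of 7 kg with a standard error of 2 kg, and a smallest worthwhile
# change of 5 kg.
p_harm, p_trivial, p_benefit = mbi_probabilities(7.0, 2.0, 5.0)
```

With these example numbers, P(benefit) is approximately 0.84, so the effect would be reported as “likely beneficial”.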
Several researchers have criticized Magnitude-Based Inference, in particular for interpreting a frequentist confidence interval as a Bayesian credible interval and for having high rates of type I errors (false positives, where you conclude there is a substantial effect while there is none) (3-6). These researchers therefore advised using other statistical inferential methods, such as a full Bayesian analysis (which has recently been performed with Magnitude-Based Inference (7)) or equivalence testing (see for example (8)). Batterham and Hopkins have responded to these criticisms (9-13). They justify their Bayesian inferences from the confidence interval by claiming to use Bayesian methods with a non-informative prior distribution, which makes the Bayesian credible interval equivalent to the frequentist confidence interval when several other assumptions are met (5). Further, they re-analysed the inferential error rates using different definitions of the errors (12) than those used by the other researchers (4), and argued that the definitions of errors used in a recent critique paper (6) are also not entirely appropriate (13).
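The type I error concern can be illustrated with a toy simulation: draw two groups from the same population, so the true effect is exactly zero (and thus trivial), and count how often a “likely substantial” call is still made. This is only a rough sketch of the kind of simulation used in the debate, not a reproduction of the published analyses; the function names, the 75% threshold and the sample sizes are illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, stdev

def likely_substantial(effect, se, swc, threshold=0.75):
    """Flat-prior normal approximation: the true effect is 'likely'
    larger than the smallest worthwhile change (swc) when
    P(true effect > swc) exceeds the threshold."""
    return 1 - NormalDist(effect, se).cdf(swc) > threshold

def false_positive_rate(n_per_group=10, swc=0.2, trials=2000, seed=1):
    """Toy simulation: both groups are drawn from the SAME standard
    normal distribution, so the true effect is zero. Returns the
    proportion of trials still labelled 'likely' substantial in
    either direction."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = mean(b) - mean(a)
        se = (stdev(a) ** 2 / n_per_group
              + stdev(b) ** 2 / n_per_group) ** 0.5
        if (likely_substantial(diff, se, swc)
                or likely_substantial(-diff, se, swc)):
            hits += 1
    return hits / trials
```

With these illustrative settings (10 participants per group) the rate comes out well above the nominal 5%, which is the pattern the critics point to; it shrinks as the sample size grows.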
The debate around these statistical issues may be difficult to follow for a sport scientist who does not have a strong background in statistics. Understanding this debate is, however, important, as it allows researchers to decide whether they should use Magnitude-Based Inference or other statistical inferential methods for their studies. Therefore, in a second video I discuss some of the criticisms of Magnitude-Based Inference and the responses by Batterham and Hopkins in a (hopefully) understandable way. Click here for the link to the second video.
Other approaches
Researchers who like the idea of Magnitude-Based Inference, but who do not want to use it because of the criticisms, can use several other methods that are roughly similar. For example, Mengersen, Drovandi, Robert, Pyne and Gore (7) recently performed a full Bayesian analysis of Magnitude-Based Inference with a flat prior distribution. Other approaches that are roughly similar to Magnitude-Based Inference include using a region of practical equivalence (ROPE) in a Bayesian approach or equivalence testing in a frequentist approach (8, 14).
Twitter: Follow Bas Van Hooren @BasVanHooren
Fig. 1. Magnitude-Based Inference. Decisions in Magnitude-Based Inference are made based on confidence intervals (represented by the blue horizontal lines) in relation to a smallest worthwhile change (represented by the dashed vertical lines on each side of the trivial area). Consider the following example: a study has investigated the effects of 4 weeks of resistance training on back squat 1-repetition-maximum performance. Any increase or decrease larger than 5 kg is considered relevant, while all changes smaller than 5 kg are too small to be of practical relevance (i.e., trivial). In Magnitude-Based Inference, confidence intervals define the likely range of the population value. If the study finds the effect illustrated by the first confidence interval, the conclusion is therefore that the intervention is (very likely) effective, as the confidence interval lies entirely in the beneficial area. For the second interval, the confidence interval overlaps the trivial and beneficial areas. However, the overlap with the beneficial area is larger, and the intervention is therefore more likely to be beneficial than trivial. The conclusion could therefore be to use the intervention, because it might be beneficial and, in the worst-case scenario, it will have a trivial effect. In the 3rd and 4th intervals, the overlap of the confidence interval with the trivial area has increased, so the intervention could have a trivial effect, but it could also be beneficial. If the training intervention does not require much time and money, a coach could still decide to use it. However, the results are not statistically significant. Conversely, in the 5th interval the intervention likely has only a trivial effect, but it is statistically significant. These latter examples illustrate the mismatch between practical relevance and statistical significance. Figure adapted from Batterham and Hopkins (1).
SPSR - 2018 | May | v1
1. Batterham AM, Hopkins WG. Making meaningful inferences about magnitudes. Int J Sports Physiol Perform 2006;1(1):50-7.
2. Stang A, Poole C, Kuss O. The ongoing tyranny of statistical significance testing in biomedical research. Eur J Epidemiol 2010;25(4):225-30.
3. Barker RJ, Schofield MR. Inference about magnitudes of effects. Int J Sports Physiol Perform 2008;3(4):547-57.
4. Welsh AH, Knight EJ. “Magnitude-based inference”: a statistical review. Med Sci Sports Exerc 2015;47(4):874-84.
5. Butson M. Will the numbers really love you back: Re-examining Magnitude-based Inference. 2017.
6. Sainani KL. The Problem with “Magnitude-Based Inference”. Med Sci Sports Exerc 2018.
7. Mengersen KL, Drovandi CC, Robert CP, et al. Bayesian Estimation of Small Effects in Exercise and Sports Science. PLoS One 2016;11(4):e0147311.
8. Lakens D. Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. Soc Psychol Personal Sci 2017;8(4):355-62.
9. Hopkins WG, Batterham AM. An imaginary Bayesian monster. Int J Sports Physiol Perform 2008;3(4):411-2.
10. Batterham AM, Hopkins WG. The case for magnitude-based inference. Med Sci Sports Exerc 2015;47(4):885.
11. Hopkins WG, Batterham AM. Magnitude-based inference under attack. 2014 [Available from: accessed 28-2-
12. Hopkins WG, Batterham AM. Error Rates, Decisive Outcomes and Publication Bias with Several Inferential Methods. Sports Med 2016;46(10):1563-73.
13. Hopkins WG, Batterham AM. The Vindication of Magnitude-Based Inference (draft 2). 2018 [Available from: accessed 16-5-2018.
14. Kruschke JK. Rejecting or Accepting Parameter Values in Bayesian Estimation. Adv Methods Pract Psychol Sci 2018.
Copyright: The articles published in Sport Performance & Science Reports are distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.