1521-0103/372/1/136–147$35.00 https://doi.org/10.1124/jpet.119.264143
THE JOURNAL OF PHARMACOLOGY AND EXPERIMENTAL THERAPEUTICS J Pharmacol Exp Ther 372:136–147, January 2020
Copyright © 2019 by The American Society for Pharmacology and Experimental Therapeutics
Commentary
New Author Guidelines for Displaying Data and Reporting Data
Analysis and Statistical Methods in Experimental Biology
Martin C. Michel, T.J. Murphy, and Harvey J. Motulsky
Department of Pharmacology, Johannes Gutenberg University, Mainz, Germany (M.C.M.); Partnership for the Assessment and
Accreditation of Scientific Practice, Heidelberg, Germany (M.C.M.); Department of Pharmacology and Chemical Biology, Emory
University, Atlanta, Georgia (T.J.M.); and GraphPad Software, Los Angeles, California (H.J.M.)
Received November 22, 2019; accepted November 22, 2019
ABSTRACT
The American Society for Pharmacology and Experimental
Therapeutics has revised the Instructions to Authors for Drug
Metabolism and Disposition, Journal of Pharmacology and
Experimental Therapeutics, and Molecular Pharmacology.
These revisions relate to data analysis (including statistical
analysis) and reporting but do not tell investigators how to
design and perform their experiments. Their overall focus is
on greater granularity in the description of what has been
done and found. Key recommendations include the need to
differentiate between preplanned, hypothesis-testing, and
exploratory experiments or studies; explanations of whether
key elements of study design, such as sample size and choice
of specific statistical tests, had been specified before any
data were obtained or adapted thereafter; and explanation of
whether any outliers (data points or entire experiments) were
eliminated and when the rules for doing so had been defined.
Variability should be described by S.D. or interquartile range,
and precision should be described by confidence intervals;
S.E. should not be used. P values should be used sparingly; in
most cases, reporting differences or ratios (effect sizes) with
their confidence intervals will be preferred. Depiction of data
in figures should provide as much granularity as possible,
e.g., by replacing bar graphs with scatter plots wherever
feasible and violin or box-and-whisker plots when not. This
editorial explains the revisions and the underlying scientific
rationale. We believe that these revised guidelines will lead to
a less biased and more transparent reporting of research
findings.
Introduction
Numerous reports in recent years have pointed out that
published results are often not reproducible, and that pub-
lished statistical analyses are often not performed or inter-
preted properly (e.g., Prinz et al., 2011; Begley and Ellis, 2012;
Collins and Tabak, 2014; Freedman and Gibson, 2015;
National Academies of Sciences Engineering and Medicine,
2019). Funding agencies, journals, and academic societies
have addressed these issues with best practice statements,
guidelines, and researcher checklists (https://acmedsci.ac.uk/
file-download/38189-56531416e2949.pdf; Jarvis and Wil-
liams, 2016).
In 2014, the National Institutes of Health met with editors
of many journals and established Principles and Guidelines
for Reporting Preclinical Research. These guidelines were
rapidly adopted by more than 70 leading biomedical research
journals, including the journals of the American Society for
Pharmacology and Experimental Therapeutics (ASPET). A
statement of support for these guidelines was published in the
Society's journals (Vore et al., 2015) along with updated
Instructions to Authors (ItA). Additionally, a statistical anal-
ysis commentary was simultaneously published in multiple
pharmacology research journals in 2014, including The Journal
of Pharmacology and Experimental Therapeutics, British
Journal of Pharmacology, Pharmacology Research & Perspectives,
and Naunyn-Schmiedeberg's Archives of Pharmacology, to
strengthen best practices in the use of statistics in pharma-
cological research (Motulsky, 2014b).
In its continuing efforts to improve the robustness and
transparency of scientific reporting, ASPET has updated the
portion of the ItA regarding data analysis and reporting for
three of its primary research journals, Drug Metabolism and
Disposition,Journal of Pharmacology and Experimental
Therapeutics, and Molecular Pharmacology.
These ItA are aimed at investigators in experimental
pharmacology but are applicable to most fields of experimen-
tal biology. The new ItA do not tell investigators how to design
and execute their studies but instead focus on data analy-
sis and reporting, including statistical analysis. Here, we
https://doi.org/10.1124/jpet.119.264143.
ABBREVIATIONS: ASPET, American Society for Pharmacology and Experimental Therapeutics; CI, confidence interval; ItA, Instructions to
Authors.
summarize and explain the changes in the ItA and also
include some of our personal recommendations. We wish to
emphasize that guidelines are just that, guidelines. Authors,
reviewers, and editors should use sound scientific judgment
when applying the guidelines to a particular situation.
Explanation of the Guidelines
Guideline: Include Quantitative Indications of Effect
Sizes in the Abstract. The revised ItA state that the
Abstract should quantify effect size for what the authors deem
to be the most important quantifiable finding(s) of their study.
This can be either a numerical effect (difference, percent
change, or ratio) with its 95% confidence interval (CI) or
a general description, such as "inhibited by about half,"
"almost completely eliminated," or "approximately tripled."
It is not sufficient to report only the direction of a difference
(e.g., "increased" or "decreased") or whether the difference
reached statistical significance. A tripling of a response has
different biologic implications than a 10% increase, even if
both are statistically significant. It is acceptable (but not
necessary) to also include a P value in the Abstract. P values in
the absence of indicators of effect sizes should not be reported
in the Abstract (or anywhere in the manuscript) because even
a very small P value in isolation does not tell us whether an
observed effect was large enough to be deemed biologically
relevant.
Guideline: State Which Parts (if Any) of the Study
Test a Hypothesis According to a Prewritten Protocol
and Which Parts Are More Exploratory. The revised ItA
state that authors should explicitly say which parts of the
study present results collected and analyzed according to
a preset plan to test a hypothesis and which were done in
a more exploratory manner. The preset plan should include all
key aspects of study design (e.g., hypothesis tested, number of
groups, intervention, sample size per group), study execution
(e.g., randomization and/or blinding), and data analysis (e.g.,
any normalizing or transforming of data, rules for omitting
data, and choice and configuration of the statistical tests). A
few of these planning elements are discussed below as part of
other recommendations.
All other types of findings should be considered exploratory.
This covers multiple possibilities, including the following:
Analysis of secondary endpoints from a study that was
preplanned for its primary endpoint;
Results from post hoc analysis of previously obtained
data; and
Any findings from experiments in which aspects of
design, conduct, or analysis have been adapted after
initial data were viewed (e.g., when sample size has been
adapted).
A statistically significant finding has a surprisingly high
chance of not being true, especially when the prior probability
is low (Ioannidis, 2005; Colquhoun, 2014). The more un-
expected a new discovery is, the greater the chance that it is
untrue, even if the P value is small. Thus, it is much easier to
be fooled by results from data explorations than by experi-
ments in which all aspects of design, including sample size and
analysis, had been preplanned to test a prewritten hypothesis.
Even if the effect is real, the reported effect sizes are likely to
be exaggerated.
Because exploratory work can lead to highly innovative
insights, exploratory findings are welcome in ASPET journals
but must be identified as such. Only transparency about
preplanned hypothesis testing versus exploratory experi-
ments allows readers to get a feeling for the prior probability
of the data and the likely false positive rate.
Unfortunately, this planning rule is commonly broken in
reports of basic research (Head et al., 2015). Instead, analyses
are often done as shown in Fig. 1 in a process referred to as
"P-hacking" (Simmons et al., 2011), an unacceptable form of
exploration because it is highly biased. Here, the scientist
collects and analyzes some data using a statistical test. If the
outcome is not P < 0.05 but shows a difference or trend in the
hoped-for direction, one or more options are chosen from
the following list until a test yields P < 0.05.
Collect some more data and reanalyze. This will be
discussed below.
Use a different statistical test. When comparing two
groups, switch between the unpaired t test, the Welch-
corrected (for unequal variances) t test, or the Mann-
Whitney nonparametric test. All will have different
P values, and it isn't entirely predictable which will be
smaller (Weissgerber et al., 2015). Choosing the test
with the smallest P value will introduce bias in the
results (a small simulation after this list illustrates this).
Switch from a two-sided (also called two-tailed) P value
to a one-sided P value, cutting the P value in half (in
most cases).
Remove one or a few outliers and reanalyze. This is
discussed in a later section. Although removal of outliers
may be appropriate, it introduces bias if this removal is
not based on a preplanned and unbiased protocol.
Transform to logarithms (or reciprocals) and reanalyze.
Redefine the outcome by normalizing (say, dividing
by each animal's weight) or normalizing to a different
control and then reanalyze.
Use a multivariable analysis method that compares one
variable while adjusting for differences in another.
If several outcome variables were measured (such as
blood pressure, pulse, and cardiac output), switch to
a different outcome and reanalyze.
If the experiment has two groups that can be designated
"control," switch to the other one or to a combination of
the two control groups.
If there are several independent predictor variables, try
fitting a multivariable model that includes different
subsets of those variables.
Separately analyze various subgroups (say male and
female animals) and only report the comparison with the
smaller P value.
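The bias from trying several tests and keeping the smallest P value can be illustrated with a short simulation. The Python sketch below is our illustration, not part of the ItA; the group sizes and distribution are arbitrary assumptions. Both groups are drawn from the same population, so every "significant" result is a false positive.

```python
# Sketch: test-shopping inflates the false-positive rate. Both groups come from
# the same population, so any P < 0.05 is a false positive. Needs numpy and scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sim, n_per_group, hits = 10_000, 8, 0

for _ in range(n_sim):
    a = rng.normal(10.0, 2.0, n_per_group)
    b = rng.normal(10.0, 2.0, n_per_group)  # same population as group a
    p_values = (
        stats.ttest_ind(a, b).pvalue,                   # ordinary unpaired t test
        stats.ttest_ind(a, b, equal_var=False).pvalue,  # Welch-corrected t test
        stats.mannwhitneyu(a, b, alternative="two-sided").pvalue,  # Mann-Whitney
    )
    if min(p_values) < 0.05:  # report whichever test happens to give the smallest P
        hits += 1

print(f"False-positive rate when picking the smallest P: {hits / n_sim:.3f}")
# A single preplanned test would give about 0.05; cherry-picking gives more.
```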
Investigators doing this continue to manipulate the data
and analysis until they obtain a statistically significant result
or until they run out of money, time, or curiosity. This behavior
ignores that the principal goal of science is to find the correct
answers to meaningful questions, not to nudge data until the
desired answer emerges.
In some cases, investigators don't actually analyze data in
multiple ways. Instead, they first look at a summary or graph
of the data and then decide which analyses to do. Gelman and
Loken (2014) point out that this "garden of forking paths" is
a form of P-hacking because alternative analysis steps would
have been chosen had the data looked different.
As might be imagined, these unplanned processes artifi-
cially select for higher effect sizes and lower P values than
would be observed with a preplanned analysis. Therefore, the
ItA require that the authors state their analysis methodology
in detail so that any P-hacking (or practices that border on
P-hacking) are disclosed, allowing reviewers and readers to
take this into account when evaluating the results. In many
cases, it makes sense to remove outliers or to log transform
data. These steps (and others) just need to be part of a planned
analysis procedure and not be done purely because they lower
the P value.
This guideline also ensures that any "HARKing" (Hypothesizing
After the Result is Known; Kerr, 1998) is clearly labeled.
HARKing occurs when many different hypotheses are tested
(say, by using different genotypes or different drugs) and
an intriguing relationship is discovered, but only the data
supporting the intriguing relationship are reported. It
appears that the hypothesis was stated before the data were
collected. This is a form of multiple comparisons in which each
comparison risks some level of type I error (Berry, 2007). This
has been called "double dipping," as the same data are used to
generate a hypothesis and to test it (Kriegeskorte et al., 2009).
Guideline: Report Whether the Sample Size Was
Determined before Collecting the Data. The methods
section or figure legends should state if the sample size was
determined in advance. Only if the sample size was chosen in
advance can readers (and reviewers) interpret the results at
face value.
It is tempting to first run a small experiment and look at the
results. If the effect doesn't cross a threshold (e.g., P < 0.05),
increase the sample size and analyze the data again. This
approach leads to biased results because the experiments
wouldn't have been expanded if the results of the first small
experiment resulted in a small P value. If the first small
experiment had a small P value and the experiment were
extended, the P value might have gotten larger. However, this
would not have been seen because the first small P value
stopped the data collection process. Even if the null hypothesis
were true, more than 5% of such experiments would yield
P < 0.05. The effects reported from these experiments tend
to be exaggerated. The results simply cannot be interpreted at
face value.
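A simple simulation makes the problem concrete. The sketch below is our illustration, not part of the ItA; the sample sizes and the single-look stopping rule are arbitrary assumptions. It applies the rule "test at n = 5 per group; if P >= 0.05, add 5 more per group and test again" to data with no true difference.

```python
# Sketch: optional stopping inflates the type I error even when the null
# hypothesis is true. Needs numpy and scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sim, hits = 10_000, 0

for _ in range(n_sim):
    a = rng.normal(0.0, 1.0, 5)
    b = rng.normal(0.0, 1.0, 5)  # no true difference between groups
    if stats.ttest_ind(a, b).pvalue < 0.05:
        hits += 1  # "significant" at the first look: stop and report
        continue
    # Not significant: extend the experiment and reanalyze the combined data.
    a = np.concatenate([a, rng.normal(0.0, 1.0, 5)])
    b = np.concatenate([b, rng.normal(0.0, 1.0, 5)])
    if stats.ttest_ind(a, b).pvalue < 0.05:
        hits += 1

print(f"False-positive rate with optional stopping: {hits / n_sim:.3f}")
# Well above the nominal 0.05, even though the null hypothesis is always true.
```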
Methods have been developed to adapt the sample size
based on the results obtained. This added flexibility in
sample size results in wider CIs, so larger effects are
required to reach statistical significance (Kairalla et al., 2012;
https://www.fda.gov/media/78495/download). It is fine to use
these specialized "sequential" or "adaptive" statistical techniques
so long as the protocol was preplanned and the details are
reported.
Unlike the British Journal of Pharmacology requirements
(Curtis et al., 2018), equal sample sizes are not required in the
ASPET ItA because there are situations in which it makes
sense to plan for unequal sample size (Motulsky and Michel,
2018). It makes sense to plan for unequal sample size when
comparing multiple treatments to a single control. The control
should have a larger n. It also makes sense to plan for unequal
sample size when one of the treatments is much more
expensive, time consuming, or risky than the others and
therefore should have a smaller sample size than the others.
In most cases, it makes sense to have the same sample size in
each treatment group.
Situations exist in which a difference in sample sizes
between groups was not planned but emerges during an
experiment. For instance, an investigator wishes to measure
ion currents in freshly isolated cardiomyocytes from two
groups of animals. The number of successfully isolated
cardiomyocytes suitable for electrophysiological assessment
from a given heart may differ for many reasons. It is also
possible that some samples from a planned sample size
undergo attrition, such as if more animals in the diseased
group die than in the control group. This difference in attrition
across groups may in itself be a relevant finding.
The following are notes on sample size and power.
Sample size calculations a priori are helpful, sometimes
essential, to those planning a major experiment or to
those who evaluate that plan. Is the proposed sample
size so small that the results are likely to be ambiguous?
If so, the proposed experiment may not be worth the
effort, time, or risk. If the experiment uses animals, the
ethical implications of sacrificing animals to a study that
Fig. 1. P-hacking refers to a series of
analyses in which the goal is not to
answer a specific scientific question, but
rather to find a hypothesis and data
analysis method that results in a P value
less than 0.05.
is unlikely to provide clear results should be considered.
Is the sample size so large that it is wasteful? Evaluating
the sample size calculations is an essential part of a full
review of a major planned experiment. However, once
the data are collected, it doesn't really matter how the
sample size was decided. The method or justification
used to choose sample size won't affect interpretation of
the results. Some readers may appreciate seeing the
power analyses, but it is not required.
Some programs compute "post hoc power" or "observed
power" from the actual effect (difference) and S.D.
observed in the experiment. We discourage reporting of
post hoc or observed power because these values can be
misleading and do not provide any information that is
useful in interpreting the results (Hoenig and Heisey,
2001; http://daniellakens.blogspot.com/2014/12/observed-
power-and-what-to-do-if-your.html; Lenth, 2001; Levine
and Ensom, 2001).
The British Journal of Pharmacology requires a minimum
of n = 5 per treatment group (Curtis et al., 2018). The ItA
do not recommend such a minimum. We agree that in
most circumstances, a sample size < 5 is insufficient for
a robust conclusion. However, we can imagine circum-
stances in which the differences are large compared with
the variability, so smaller sample sizes can provide useful
conclusions (http://daniellakens.blogspot.com/2014/12/
observed-power-and-what-to-do-if-your.html). This es-
pecially applies when the overall conclusion will be
obtained by combining data from a set of different types
of experiments, not just one comparison.
Guideline: Provide Details about Experimental and
Analytical Methods. The methods section of a scientific
article has the following two purposes:
To allow readers to understand what the authors have
done so that the results and conclusions can be
evaluated,
To provide enough information to allow others to repeat
the study.
Based on our experience as editors and reviewers, these
goals are often not achieved. The revised ItA emphasize the
need to provide sufficient detail in the methods section,
including methods used to analyze data. In other words, the
description of data analysis must be of sufficient granularity
that anyone starting with the same raw data and applying the
same analysis will get exactly the same results. This includes
addressing the following points.
Which steps were taken to avoid experimental bias, e.g.,
prespecification, randomization, and/or blinding? If
applicable, this can be a statement that such measures
were not taken.
Have any data points or entire independent experimen-
tal replicates (outliers) been removed from the analy-
sis, and how was it decided whether a data point or
experiment was an outlier? For explanations, see the
next section.
The full name of statistical tests should be stated, for
example, "two-tailed, paired t test" or "repeated-measures
one-way ANOVA with Dunnett's multiple comparisons
test." Just stating that a "t test" or "ANOVA" has been
used is insufficient. When comparing groups, it should be
stated whether P values are one or two sided (same as one
or two tailed). One-sided P values should be used rarely
and only with justification.
The name and full version number of software used to
perform nontrivial analyses of data should be stated.
We realize that including more details in the methods
section can lead to long sections that are difficult to read. To
balance the needs of transparency and readability, general
statements can be placed in the methods section of the main
manuscript, whereas details can be included in an online
supplement.
Guideline: Present Details about Whether and How
"Bad Experiments" or "Bad Values" Were Removed
from Graphs and Analyses. The removal of outliers can be
legitimate or even necessary but can also lead to type I errors
(false positive) and exaggerated results (Bakker and Wicherts,
2014; Huang et al., 2018).
Before identifying outliers, authors should consider the
possibility that the data come from a lognormal distribution,
which may make a value look like an outlier on a linear but not
on a logarithmic scale. With a lognormal distribution, a few
really high values are expected. Deleting those as outliers
would lead to misleading results, whereas testing log-
transformed values is appropriate (Fig. 2).
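The point is easy to check with simulated data. The sketch below uses hypothetical numbers in the spirit of Fig. 2; the Tukey 1.5 × IQR rule is used only as an example of a preset outlier criterion and is not prescribed by the ItA.

```python
# Sketch: lognormal data produce apparent outliers on a linear scale that
# disappear after log transformation. Needs numpy.
import numpy as np

rng = np.random.default_rng(3)
values = rng.lognormal(mean=1.0, sigma=1.0, size=20)

def tukey_outliers(x):
    # Tukey rule (illustration only): flag points beyond 1.5 IQR of the quartiles.
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

print("Flagged on the linear scale:  ", tukey_outliers(values).sum())
print("Flagged after log10 transform:", tukey_outliers(np.log10(values)).sum())
# Typically one or more large values are flagged on the linear scale but none
# after log transformation; analyzing log-transformed values is then preferable
# to deleting the high values.
```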
If outlier removal is done by "gut feel" rather than preset
rules, it can be highly biased. Our brains come up with many
apparently good reasons why a value we do not like in the first
place should be considered an outlier! Therefore, we recom-
mend that outlier removal should be based on prespecified
criteria. If no such rules had been set, a person blinded to
group allocation may be less biased.
The choice of the appropriate method for handling apparent
outliers depends on the specific circumstances and is up to the
investigators. The ItA ask authors to state in the methods or
results section what quality control criteria were used to
remove "bad experiments" or outliers, whether these criteria
were set in advance, and how many bad points or experiments
were removed. It may also make sense to report in an online
supplement the details on every value or experiment removed
as outliers, and to report in that supplement how the results
would differ if outliers were not removed.
Guideline: Report Confidence Intervals to Show
How Large a Difference (or Ratio) Was and How
Precisely It Was Determined. When showing that a treat-
ment had an effect, it is not enough to summarize the response
to control and treatment and to report whether a P value was
smaller than a predetermined threshold. Instead (or addition-
ally), report the difference or ratio between the means (effect
size) and its CI. In some cases, it makes sense to report the
treatment effect as a percent change or as a ratio, but a CI
should still be provided.
The CI provides a range of possible values for an estimate of
some effect size. This has the effect of quantifying the precision
of that estimate. Based on the outer limits of the CI, readers can
determine whether even these can still be considered biologi-
cally relevant. For instance, a novel drug in the treatment of
obesity may lower body weight by a mean of 10%. In this field,
a reduction of at least 5% is considered biologically relevant by
many. But consider two such studies with a different sample
size. The smaller study has a 95% CI ranging from a 0.5% re-
duction to a 19.5% reduction. A 0.5% reduction in weight would
not be considered biologically relevant. Because the CI
ranges from a trivial effect to a large effect, the results are
ambiguous. The 95% CI from the larger study ranges from
8% to 12%. All values in that range are biologically rele-
vant effect sizes, and with such a tight confidence interval,
the conclusions from the data are clearer. Both studies have
P < 0.05 (you can tell because neither 95% confidence
interval includes zero) but have different interpretations.
Reporting the CI is more informative than just the P value.
When using a Bayesian approach, report credible intervals
rather than CI.
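For a simple two-group comparison, the effect size and its CI can be computed directly. The sketch below uses hypothetical percent-weight-change values (placeholders, not data from any cited study) and a standard pooled-variance t interval.

```python
# Sketch: report the difference between means with its 95% CI, not just a P value.
# The numbers are hypothetical. Needs numpy and scipy.
import numpy as np
from scipy import stats

control = np.array([-1.2, 0.5, -2.0, 1.1, -0.4, 0.8])        # % weight change
treated = np.array([-9.5, -12.1, -8.4, -11.0, -10.2, -7.9])   # % weight change

diff = treated.mean() - control.mean()
n1, n2 = len(treated), len(control)
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * treated.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / df
se_diff = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
half_width = stats.t.ppf(0.975, df) * se_diff

print(f"Difference between means: {diff:.1f} percentage points")
print(f"95% CI: {diff - half_width:.1f} to {diff + half_width:.1f}")
# Readers can judge whether even the CI limits would be biologically relevant.
```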
Guideline: Report P Values Sparingly. One of the
most common problems we observe as statistical reviewers is
overuse of P values. Reasons to put less emphasis on P values
were reviewed by Greenland et al. (2016) and Motulsky
(2018).
P values are often misunderstood. Part of the confusion
is that the question a P value answers seems backward.
The P value answers the following question: If the null
hypothesis is true (as well as other assumptions), what is
the chance that an experiment of the same size would
result in an effect (difference, ratio, or percent change) as
large as (or larger than) that observed in the completed
experiment? Many scientists think (incorrectly) that the
P value is the probability that a given result occurred by
chance.
Statisticians don't entirely agree about the best use of P
values. The American Statistical Association published
a statement about P values introduced by Wasserstein
and Lazar (2016) and accompanied it with 21 commen-
taries in an online supplement. This was not sufficient to
resolve the confusion or controversy, so a special issue of
The American Statistician was published in 2019 with 43
articles and commentaries about P values introduced by
Wasserstein et al. (2019).
P values are based on tentatively assuming a null
hypothesis that is typically the opposite of the biologic
hypothesis. For instance, when the question would be
whether the angiotensin-converting enzyme inhibitor
captopril lowers blood pressure in spontaneously hyper-
tensive rats, the null hypothesis would be that it does
not. In most pharmacological research, the null hypoth-
esis is often false because most treatments cause at least
some change to most outcomes, although that effect may
be biologically trivial. The relevant question, therefore,
is how big the difference is. The P value does not address
this question. P values say nothing about how important
or how large an effect is. With a large sample size, the P
value will be tiny even if the effect size is small and
biologically irrelevant.
Even with careful replication of a highly repeatable
phenomenon, the P value will vary considerably from
experiment to experiment (Fig. 3). P values are fickle
(Halsey et al., 2015). It is not surprising that random
sampling of data leads to different P values in different
experiments. However, we and many scientists were
surprised to see how much P values vary from
experiment to experiment (over a range of more than
three orders of magnitude). Note that Fig. 3 is the best
case, when all variation from experiment to experiment
is due solely to random sampling. If there are additional
reasons for experiments to vary (perhaps because of
changes in reagents or subtle changes in experimental
methods), the P values will vary even more between
repeated experiments (see the simulation sketch below).
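The variability shown in Fig. 3 is easy to reproduce. In the sketch below, the S.D. of 5.0 and the true mean difference of 5.0 come from the figure legend, whereas the per-group sample size (here n = 10) is our assumption because the legend does not state it.

```python
# Sketch of a Fig. 3-style simulation: identical experiments, wildly varying P values.
# Needs numpy and scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
p_values = []
for _ in range(2500):
    a = rng.normal(50.0, 5.0, 10)
    b = rng.normal(55.0, 5.0, 10)  # true difference between population means = 5.0
    p_values.append(stats.ttest_ind(a, b).pvalue)

low, high = np.percentile(p_values, [2.5, 97.5])
print(f"Middle 95% of P values: {low:.5f} to {high:.2f}")
# The same true effect yields P values spanning several orders of magnitude.
```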
Guideline: Dichotomous Decisions Based on a Single
P Value Are Rarely Helpful. Scientists commonly use
a P value to make a dichotomous decision (the data are, or
are not, consistent with the null hypothesis). There are many
problems with doing this.
Fig. 2. Hypothetical data illustrating how data points may appear as outliers on a linear scale but not after log transformation. The five data sets are
all randomly drawn from a lognormal distribution. The left panel uses a linear scale. Some of the points look like outliers. The right panel shows the same
data on a logarithmic axis. The distribution is symmetrical, as expected for lognormal data. There are no outliers.
Interpretation of experimental pharmacology data often
requires combining evidence from different kinds of
experiments, so it demands greater nuance than just
looking at whether a P value is smaller or larger than
a threshold.
When dichotomizing, many scientists always set the
threshold at 0.05 for no reason except tradition. Ideally,
the threshold should be set based on the consequences of
false-positive and false-negative decisions. Benjamin
et al. (2018) suggest that the default threshold should
be 0.005 rather than 0.05.
Sometimes, the goal is to show that two treatments are
equivalent. Just showing a large P value is not enough
for this purpose. Instead, it is necessary to use special
statistical methods designed to test for equivalence or
noninferiority. These are routinely applied by clinical
pharmacologists in bioequivalence studies.
Many scientists misinterpret results when the P value is
greater than 0.05 (or any prespecified threshold). It is
not correct to conclude that the results prove there is "no
effect" of the treatment. The corresponding confidence
interval quantifies how large the effect is likely to be.
With small sample sizes or large variability, the P value
could be greater than 0.05, even though the difference
could be large enough to be biologically relevant
(Amrhein et al., 2019).
When dichotomizing, many scientists misinterpret
results when the P value is less than 0.05 or any
prespecified threshold. One misinterpretation is believing
that the chance that a conclusion is a false positive is
less than 5%. That probability is the false-positive rate.
Its value depends on the power of the experiment and
the prior probability that the experimental hypothesis
was correct, and it is usually much larger than 5% (see
the sketch after this list). Because it is hard to estimate
the prior probability, it is hard to estimate the false-positive
rate. Colquhoun (2019) proposed flipping the problem by
computing the prior probability required to achieve a desired
false-positive rate. Another misinterpretation is that a P value
less than 0.05 means the effect was large enough to be
Fig. 3. Variability of P values. If the null hypothesis is true, then the distribution of P values is uniform. Half the P values will be less than 0.50, 5% will
be less than 0.05, etc. But what if the null hypothesis is false? The figure shows data randomly sampled from two Gaussian populations with the S.D.
equal to 5.0 and population means that differ by 5.0. Top: three simulated experiments. Bottom: the distribution of P values from 2500 such simulated
experiments. Not counting the 2.5% highest and lowest P values, the middle 95% of the P values range from 0.00016 to 0.73, a range covering almost
3.5 orders of magnitude!
biologically relevant. In fact, a trivial effect can lead to
a P value less than 0.05.
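The dependence of the false-positive risk on prior probability and power can be written as a one-line calculation. The sketch below is our illustration of the standard relationship; the power of 0.8 and the three priors are arbitrary assumptions.

```python
# Sketch: the chance that a P < 0.05 result is a false positive depends on the
# prior probability that the hypothesis is true and on the power, not just on alpha.
def false_positive_risk(prior, power, alpha=0.05):
    """P(no real effect | result declared significant at the alpha threshold)."""
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return false_positives / (true_positives + false_positives)

for prior in (0.5, 0.1, 0.01):
    risk = false_positive_risk(prior, power=0.8)
    print(f"prior = {prior:4.2f} -> false-positive risk = {risk:.2f}")
# With a low prior probability (an unexpected hypothesis), far more than 5% of
# "significant" findings are false positives.
```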
In an influential Nature paper, Amrhein et al. (2019)
proposed that scientists "retire" the entire concept of making
conclusions or decisions based on a single P value, but this
issue is not addressed in the ASPET ItA.
To summarize, P < 0.05 does not mean the effect is true and
large, and P > 0.05 does not prove the absence of an effect or
that two treatments are equivalent. A single statistical
analysis cannot inform us about the truth but will always
leave some degree of uncertainty. One way of dealing with this
is to embrace this uncertainty, express the precision of the
parameter estimates, base conclusions on multiple lines of
evidence, and realize that these conclusions might be wrong.
Definitive conclusions may only evolve over time with multi-
ple lines of evidence coming from multiple sources. Data can
be summarized without pretending that a conclusion has been
proven.
Guideline: Beware of the Word "Significant." The
word "significant" has two meanings.
In statistics, it means that a P value is less than a preset
threshold (the statistical α).
In plain English, it means "suggestive," "important," or
"worthy of attention." In a pharmacological context, this
means an observed effect is large enough to have
physiologic impact.
Both of these meanings are commonly used in scientific
papers, and this accentuates the potential for confusion. As
scientific communication should be precise, the ItA suggest
not using the term "significant" (Higgs, 2013; Motulsky,
2014a).
If the plain English meaning is intended, it should be
replaced with one of many alternative words, such as "important,"
"relevant," "big," "substantial," or "extreme." If the
statistical meaning is intended, a better wording is "P <
0.05" or "P < 0.005" (or any predetermined threshold), which
is both shorter and less ambiguous than "significant."
Authors who wish to use the word "significant" with its
statistical meaning should always use the phrase "statistically
significant."
Guideline: Avoid Bar Graphs. Because bar graphs only
show two values (mean and S.D.), they can be misleading.
Very different data distributions (normal, skewed, bimodal, or
with outliers) can result in the same bar graph (Weissgerber
et al., 2015). Figure 4 shows alternatives to bar graphs (see
Figs. 7–9 for examples for smaller data sets). The preferred
option for small samples is the scatter plot, but this kind of
graph can get cluttered with larger sample sizes (Fig. 4). In
these cases, violin plots do a great job of showing the spread
and distribution of the values. Box-and-whisker plots show
more detail than a bar graph but can't show bimodal
distributions. Bar graphs should only be used to show data
expressed as proportions or counts or when showing scatter,
violin, or box plots would make the graph too busy. Examples
for the latter include continuous data when comparing many
groups or showing X versus Y line graphs (e.g., depicting
a concentration-response curve). In those cases, showing mean
± S.D. or median with interquartile range is an acceptable
option.
A situation in which bar graphs are not helpful at all is the
depiction of results from a paired experiment, e.g., before-after
comparisons. In this case, before-after plots are preferred, in
which the data points from a single experimental unit are
connected by a line so that the pairing effects become clear (see
Fig. 8). Alternatively, color can be used to highlight individual
replicate groups. Authors may consider additionally plotting
the set of differences between pairs, perhaps with its mean
and CI.
Guideline: Don't Show S.E. Error Bars. The ItA dis-
courage use of S.E. error bars to display variability. The
reason is simple: the S.E. quantifies precision, not variability.
The S.E. of a mean is computed from the S.D. (which
quantifies variability) and the sample size. S.E. error bars are
smaller than S.D., and with large samples, the S.E. is always
tiny. Thus, showing S.E. error bars can be misleading, making
small differences look more meaningful than they are, partic-
ularly with larger sample sizes (Weissgerber et al., 2015)
(Fig. 5).
Variability should be quantified by using the S.D. (which
can only be easily interpreted by assuming a Gaussian
distribution of the underlying population) or the inter-
quartile range.
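The difference between the two quantities is easy to demonstrate. In the sketch below (simulated values; the population mean and S.D. are arbitrary assumptions), the S.D. stays near the population value as n grows, whereas the S.E. of the mean keeps shrinking.

```python
# Sketch: S.D. describes variability among values; S.E. describes precision of
# the mean and shrinks with sample size. Needs numpy.
import numpy as np

rng = np.random.default_rng(5)
for n in (5, 20, 100, 1000):
    sample = rng.normal(100.0, 15.0, n)  # population S.D. is 15
    sd = sample.std(ddof=1)
    sem = sd / np.sqrt(n)
    print(f"n = {n:4d}:  S.D. = {sd:5.1f}   S.E. of the mean = {sem:5.2f}")
# S.E. error bars therefore make scatter look smaller and smaller as n increases.
```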
When figures or tables do not report raw data but instead
report calculated values (differences, ratios, EC50s), it is
important to also report how precisely those values have been
determined. This should be done with CI rather than S.E. for
two reasons. First, CIs can be asymmetric to better show
uncertainty of the calculated value. In contrast, reporting
Fig. 4. Comparison of bar graph (mean and S.D.), box and whiskers, scatter plot, and violin plot for a large data set (n = 1335). Based on data showing
number of micturitions in a group of patients seeking treatment (Amiri et al., 2018). Note that the scale of the y-axis is different for the bar graph than for
the other graphs.
a single S.E. cannot show asymmetric uncertainty. Second,
although a range extending from the computed value minus
one S.E. to that value plus one S.E. is sort of a CI, its exact
confidence level depends on sample size. It is better to report
a CI with a defined confidence level (usually 95% CI).
Guideline: ASPET Journals Accept Manuscripts
Based on the Question They Address and the Quality
of the Methods and Not Based on Results. Although
some journals preferentially publish studies with a positive
result, the ASPET journals are committed to publishing
papers that answer important questions irrespective of
whether a "positive" result has been obtained, so long as the
methods are suitable, the sample size was large enough, and
all controls gave expected results.
Why not just publish "positive results," as they are more
interesting than negative results?
Studies with robust design, e.g., those including ran-
domization and blinding, have a much greater chance of
finding a "neutral" or "negative" outcome (Sena et al.,
2007; Macleod et al., 2008). ASPET doesn't want to
discourage the use of the best methods because they are
more likely to lead to "negative" results.
Even if there is no underlying effect (of the drug or
genotype), it is possible that some experiments will end
up with statistically significant results by chance. If only
these studies, but not the "negative" ones, get published,
there is selection for false positives. Scientists often
reach conclusions based on multiple studies, either
informally or by meta-analysis. If the negative results
are not published, scientists will only see the positive
results and be misled. This selection of positive results in
journals is called "publication bias" (Dalton et al., 2016).
If there is a real effect, the published studies will
exaggerate the effect sizes. The problem is that results
will vary from study to study, even with the same
underlying situation. When random factors make the
effect large, the P value is likely to be < 0.05, so the
paper gets published. When random factors make
the effect small, the P value will probably be > 0.05,
and that paper won't get published. Thus, journals that
only publish studies with small P values select for
studies in which random factors (as well as actual
factors) enhance the effect seen. So, publication bias
leads to overestimation of effect sizes. This is demonstrated
by simulations in Fig. 6 and in the sketch after this list.
Gelman and Carlin (2014) call this a Type M (Magnitude) Error.
If journals only accept positive results, they have
created an incentive to find positive results, even if
P-hacking is needed. If journals accept results that
answer relevant questions with rigorous methodology,
they have created an incentive to ask important
questions and answer them with precision and rigor.
It is quite reasonable for journals to reject manuscripts
because the methodology is not adequate or because the
controls do not give the expected results. These are not
negative data but rather bad data that cannot lead to any
valid conclusions.
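The effect magnification described above can be reproduced with a few lines of code. The parameters below (population means 4.0 and 5.0, S.D. 1.0, n = 5 per group, 1000 simulated experiments) follow the Fig. 6 legend; the code itself is our sketch, not the original simulation.

```python
# Sketch of a Fig. 6-style simulation: publishing only P < 0.05 results
# exaggerates the apparent effect size. Needs numpy and scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
all_diffs, published_diffs = [], []

for _ in range(1000):
    control = rng.normal(4.0, 1.0, 5)
    treated = rng.normal(5.0, 1.0, 5)  # true difference between population means = 1.0
    diff = treated.mean() - control.mean()
    all_diffs.append(diff)
    if stats.ttest_ind(treated, control).pvalue < 0.05:
        published_diffs.append(diff)  # only "significant" experiments get published

print(f"Mean difference, all experiments:  {np.mean(all_diffs):.2f}")
print(f"Mean difference, 'published' only: {np.mean(published_diffs):.2f}")
print(f"Fraction with P < 0.05 (power):    {len(published_diffs) / len(all_diffs):.2f}")
# The 'published' average is clearly larger than the true difference of 1.0.
```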
Examples of How to Present Data and Results
The examples below, all using real data, show how we
recommend graphing data and writing the methods, results,
and figure legends.
Example: Unpaired t Test
Figure 7 shows a reanalysis of published data (Frazier et al.,
2006) comparing maximum relaxation of urinary bladder
Fig. 5. Comparison of error bars. Based on Frazier et al. (2006) showing maximum relaxation of rat urinary bladder by norepinephrine in young and old
rats; the left panel shows the underlying raw data for comparison as scatter plot.
strips in young and old rats by norepinephrine. The left half of
the figure shows the data (and would be sufficient), but the
results are more complete with the addition of the right half
showing an estimation plot (Ho et al., 2019) showing the
difference between means and its 95% CI.
Suggested Wording of Statistical Methods. Relaxa-
tion was expressed as percent, with 0% being the force
immediately prior to the start of adding norepinephrine and
100% a force of 0 mN. Assuming sampling from a Gaussian
distribution, maximum relaxation by norepinephrine in the
two groups was compared by an unpaired two-tailed t test.
Results. The mean maximum relaxation provoked by norepi-
nephrine was 24 percentage points smaller (absolute difference) in
old rats than in young rats (95% CI: 9% to 38%; P = 0.0030).
Fig. 6. How P value selection from underpowered studies and publication bias conspire to overestimate effect size. The simulations draw random data
from a Gaussian (normal) distribution. For controls, the theoretical mean is 4.0. For treated, the theoretical mean is 5.0. So, the true difference between
population means is 1.0. The S.D. of both populations was set to 1.0 for the simulations in (A) and was set to 0.5 for those in (B). For each simulation, five
replicates were randomly drawn for each population, an unpaired t test was run, and both the difference between means and the two-sided P value were
tabulated. Each panel shows the results of 1000 simulated experiments. The left half of each panel shows the difference between means for all the
simulated experiments. Half the simulated experiments have a difference greater than 1.0 (the simulated population difference), and half have
a difference smaller than 1.0. There is more variation in (A) because the S.D. was higher. There are no surprises so far. The right half of each panel shows
the differences between means only for the simulated experiments in which P < 0.05. In (A), this was 32% of the simulations. In other words, the power
was 32%. In (B), there was less experimental scatter (lower S.D.), so the power was higher, and 79% of the simulated experiments had P < 0.05. Focus on
(A). If the sample means were 4.0 and 5.0 and both sample S.D.s were 1.0 (in other words, if the sample means and S.D.s match the population exactly),
the two-sided P value would be 0.1525. P will be less than 0.05 only when random sampling happens to put larger values in the treated group and smaller
values in the control group (or random sampling leads to much smaller S.D.s). Therefore, when P < 0.05, almost all of the effect sizes (the symbols in the
figure) are larger than the true (simulated) effect size (the dotted line at Y = 1.0). On average, the observed differences in (A) were 66% larger than the
true population value. (B) shows that effect magnification also occurs, but to a lesser extent (11%), in an experimental design with higher power. If only
experiments in which P < 0.05 (or any threshold) are tabulated or published, the observed effect is likely to be exaggerated, and this exaggeration is
likely to be substantial when the power of the experimental design is low.
Fig. 7. Unpaired t test example.
Figure Legend. In the left panel, each symbol shows data
from an individual rat. The lines show group means. The
analysis steps had been decided before we looked at the data.
The right panel shows the difference between the means and its
95% CI. The sample sizes were unequal at the beginning of the
experiment (because of availability), and the sample sizes did
not change during the experiment.
Example: Paired t Test
Figure 8 shows a reanalysis of published data (Okeke et al.,
2019) on cAMP accumulation in CHO cells stably transfected
with human β3-adrenoceptors pretreated for 24 hours with
10 μM isoproterenol (treated) or vehicle (control) and then
rechallenged with freshly added isoproterenol.
Suggested Wording of Statistical Methods. The log
Emax of freshly added isoproterenol (as calculated from a full
concentration-response curve) was determined in each experiment
from cells pretreated with isoproterenol or vehicle. The
two sets of Emax values were compared with a two-tailed
ratio-paired t test (equivalent to a paired t test on log of Emax).
Suggested Wording of Results. Pretreating with isoproterenol
substantially reduced maximum isoproterenol-stimulated
cAMP accumulation. The geometric mean of the
ratio of Emax values (pretreated with isoproterenol divided
by control) was 0.26 (95% confidence interval: 0.16 to 0.43;
P = 0.0002 in two-tailed ratio-paired t test).
Suggested Wording of Figure Legend. In the left panel,
each symbol shows the Emax of isoproterenol-stimulated cAMP
accumulation from an individual experiment. Data points
from the same experiment, pretreated with isoproterenol
versus pretreated with vehicle, are connected by a line.
Sample size had been set prior to the experiment based on
previous experience with this assay. In the right panel, each
symbol shows the ratio from an individual experiment. The
error bar shows the geometric mean and its 95% confidence
interval.
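A ratio-paired t test is simply a paired t test on log-transformed values, so it can be run with standard tools. The sketch below uses hypothetical Emax values (placeholders, not the published data) and reports the geometric mean ratio with its 95% CI.

```python
# Sketch of a ratio-paired t test: a paired t test on the logs, summarized as a
# geometric mean ratio with its 95% CI. The values are hypothetical. Needs scipy.
import numpy as np
from scipy import stats

control = np.array([12.0, 9.5, 15.2, 11.1, 8.7])  # vehicle-pretreated Emax (hypothetical)
treated = np.array([3.1, 2.6, 4.0, 2.9, 2.2])     # agonist-pretreated Emax (hypothetical)

log_diffs = np.log10(treated) - np.log10(control)
res = stats.ttest_rel(np.log10(treated), np.log10(control))  # paired t test on the logs

half_width = stats.t.ppf(0.975, df=len(log_diffs) - 1) * stats.sem(log_diffs)
ratio = 10 ** log_diffs.mean()
ci_low = 10 ** (log_diffs.mean() - half_width)
ci_high = 10 ** (log_diffs.mean() + half_width)
print(f"Geometric mean ratio = {ratio:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f}); P = {res.pvalue:.4f}")
```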
Example: Nonlinear Regression
Figure 9 shows a reanalysis of published data (Michel,
2014) comparing relaxation of rat urinary bladder by iso-
proterenol upon a 6-hour pretreatment with vehicle or 10 μM
isoproterenol.
Suggested Wording of Statistical Methods. Relaxa-
tion within each experiment was expressed as percent, with
0% being the force immediately prior to start of adding
isoproterenol and 100% a force of 0 mN. Concentration-
response curves were fit to the data of each experiment (where
X is the logarithm of concentration) based on the equation
Y = bottom + (top − bottom)/(1 + 10^(logEC50 − X))
to determine top (maximum effect, Emax) and log EC50, with
bottom constrained to equal zero. Note that this equation does
not include a slope factor, which is effectively equal to one.
Unweighted nonlinear regression was performed by Prism (v.
8.1; GraphPad, San Diego, CA). Values of Emax and −log EC50
fit to each animal's tissue with pretreatment with vehicle or
10 μM isoproterenol were compared by unpaired, two-tailed
t test.
Suggested Wording of Results. Freshly added isopro-
terenol was similarly potent in bladder strips pretreated with
vehicle or 10 μM isoproterenol but was considerably less
effective in the latter. The mean difference in potency (−log
EC50) between control and pretreated samples was −0.22
(95% CI: −0.52 to +0.09; P = 0.1519). The mean absolute
difference for efficacy (Emax) was 26 percentage points
(95% CI: 14% to 37%; P = 0.0005).
Suggested Wording of Figure Legend. The left panel
shows mean of relaxation from all experiments and a curve fit
to the pooled data (for illustration purposes only). The center
Fig. 8. Paired t test example.
Fig. 9. Nonlinear regression example.
and right panels show −log EC50 and Emax, respectively, for
individual experiments. The lines show means of each group.
All analysis steps and the sample size of n = 6 per group had
been decided before we looked at the data. Note that the y-axis
is reversed, so bottom (baseline) is at the top of the graph.
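The equation above can also be fit outside of Prism. The sketch below uses scipy.optimize.curve_fit on hypothetical concentration-response values (not the published data), with bottom fixed at zero as described.

```python
# Sketch: fitting Y = bottom + (top - bottom)/(1 + 10**(logEC50 - X)) with
# bottom fixed at 0, using unweighted nonlinear regression. Data are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def relaxation(log_conc, top, log_ec50):
    # Concentration-response model with bottom constrained to zero and slope = 1.
    return top / (1.0 + 10.0 ** (log_ec50 - log_conc))

log_conc = np.arange(-9.0, -4.5, 0.5)                      # log10 [agonist], M
response = np.array([2, 5, 12, 30, 55, 75, 88, 93, 95.0])  # % relaxation (hypothetical)

(top, log_ec50), _ = curve_fit(relaxation, log_conc, response, p0=[100.0, -7.0])
print(f"Emax = {top:.0f}%   -log EC50 = {-log_ec50:.2f}")
# Emax and -log EC50 from each experiment could then be compared across groups
# with an unpaired, two-tailed t test, as described above.
```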
Example: Multiple Comparisons
Figure 10 is based on a reanalysis of published data (Michel,
2014) comparing relaxation of rat urinary bladder by iso-
proterenol upon a 6-hour pretreatment with vehicle or 10 μM
isoproterenol, fenoterol, CL 316,243, or mirabegron.
Suggested Wording of Statistical Methods. Relaxa-
tion within each experiment was expressed as percent, with
0% being the force immediately prior to the start of adding
isoproterenol and 100% a lack of contractile force (measured
tension 0 mN). Concentration-response curves were fit to the
data of each experiment (where X is logarithm of concentra-
tion) based on the equation
Y = bottom + (top − bottom)/(1 + 10^(logEC50 − X))
to determine top (maximum effect; Emax) and log EC50, with
bottom constrained to equal zero. Note that this equation does
not include a slope factor, which is effectively equal to one.
Unweighted nonlinear regression was performed by Prism (v.
8.1; GraphPad). Emax and −log EC50 values from tissue with
pretreatment with a β-adrenergic agonist were compared with
those from preparations pretreated with vehicle by one-way
analysis of variance, followed by Dunnett's multiple comparison
test and reporting of multiplicity-adjusted P values and
confidence intervals. As sample sizes had been adapted during
the experiment, the P values and CIs should be considered
descriptive and not as hypothesis-testing.
Suggested Wording of Results. Freshly added isopro-
terenol was similarly potent in bladder strips pretreated with
vehicle or any of the β-adrenergic agonists. The Emax of freshly
added isoproterenol was reduced by pretreatment with isoproterenol
(absolute mean difference 22 percentage points;
95% CI: 6% to 39%; P = 0.0056) or fenoterol (24% points; 8%
to 41%; P = 0.0025) but less so by pretreatment with CL
316,243 (11% points; −6% to 28%; P = 0.3073) or mirabegron
(14% points; −1% to 29%; P = 0.076).
Suggested Wording of Figure Legend. The left panel
shows Emax for individual rats, and the lines show means of
each group. The right panel shows the mean difference
between treatments with multiplicity-adjusted 95% confi-
dence intervals. All analysis steps had been decided before
we looked at the data, but the sample size had been adapted
during the course of the experiments.
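A one-way ANOVA followed by Dunnett's test, with multiplicity-adjusted P values and confidence intervals, can be run as sketched below. The Emax values are hypothetical placeholders (not the published data), and scipy.stats.dunnett requires SciPy 1.11 or newer.

```python
# Sketch: one-way ANOVA followed by Dunnett's test comparing each pretreatment
# to the vehicle control. Hypothetical Emax values; needs SciPy >= 1.11.
import numpy as np
from scipy import stats

vehicle    = np.array([85, 90, 78, 88, 92, 80.0])
isoprot    = np.array([60, 65, 70, 58, 66, 72.0])
fenoterol  = np.array([58, 63, 68, 61, 66, 70.0])
cl_316243  = np.array([75, 80, 70, 83, 77, 79.0])
mirabegron = np.array([72, 78, 69, 80, 75, 77.0])

anova = stats.f_oneway(vehicle, isoprot, fenoterol, cl_316243, mirabegron)
print(f"One-way ANOVA: P = {anova.pvalue:.4f}")

dunnett = stats.dunnett(isoprot, fenoterol, cl_316243, mirabegron, control=vehicle)
ci = dunnett.confidence_interval(confidence_level=0.95)
names = ["isoproterenol", "fenoterol", "CL 316,243", "mirabegron"]
for name, p, low, high in zip(names, dunnett.pvalue, ci.low, ci.high):
    print(f"{name:>13}: adjusted P = {p:.4f}, 95% CI {low:5.1f} to {high:5.1f}")
```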
Summary
The new ItA of ASPET journals do not tell investigators how
to design and execute their studies but instead focus on data
analysis and reporting, including statistical analysis. Some of
the key recommendations are as follows.
Describe how data were analyzed in enough
detail that the work can be reproduced. Include details
about normalization, transforming, subtracting base-
lines, etc. as well as statistical analyses.
Identify whether the study (or which parts of it) was
testing a hypothesis in experiments with a prespecified
design, which includes sample size and data analysis
strategy, or was exploratory.
Explain whether sample size or number of experiments
was determined before any results were obtained or had
been adapted thereafter.
Explain whether statistical analysis, i.e., which specific
tests to use and which groups were to be compared statisti-
cally, was determined before any results were obtained
or had been adapted thereafter.
Explain whether any outliers (single data points or
entire experiments) were removed from the analysis. If
so, state the criteria used and whether the criteria had
been defined before any results were obtained.
Describe variability around the mean or median of
a group by reporting S.D. or interquartile range; describe
precision, e.g., when reporting effect sizes, with CIs. S.E.
should not be used.
Use P values sparingly. In most cases, reporting effect
sizes (difference, ratio, etc.) with their CIs will be
sufficient.
Fig. 10. Multiple comparisons example.
Make binary decisions based on a P value rarely and
define that decision.
Beware of the word "significant." It can mean that a P
value is less than a preset threshold or that an observed
effect is large enough to be biologically relevant. Either
avoid the word entirely (our preference) or make sure its
meaning is always clear to the reader.
Create graphs with as much granularity as is reasonable
(e.g., scatter plots).
Acknowledgments
Work on data quality in the laboratory of M.C.M. is funded by the
European Quality In Preclinical Data (EQIPD) consortium as part of
the Innovative Medicines Initiative 2 Joint Undertaking [Grant
777364], and this Joint Undertaking receives support from the
European Union's Horizon 2020 research and innovation program
and European Federation of Pharmaceutical Industry Associations.
M.C.M. is an employee of the Partnership for Assessment and
Accreditation of Scientific Practice (Heidelberg, Germany), an orga-
nization offering services related to data quality. T.J.M. declares no
conflict of interest in matters related to this content. H.J.M. is founder,
chief product officer, and a minority shareholder of GraphPad
Software LLC, the creator of the GraphPad Prism statistics and
graphing software.
Authorship Contributions
Wrote or contributed to the writing of the manuscript: All authors.
References
Amiri M, Murgas S, and Michel MC (2018) Do overactive bladder symptoms exhibit a Gaussian distribution? Implications for reporting of clinical trial data. Neurourol Urodyn 37 (Suppl 5):S397–S398.
Amrhein V, Greenland S, and McShane B (2019) Scientists rise up against statistical significance. Nature 567:305–307.
Bakker M and Wicherts JM (2014) Outlier removal, sum scores, and the inflation of the Type I error rate in independent samples t tests: the power of alternatives and recommendations. Psychol Methods 19:409–427.
Begley CG and Ellis LM (2012) Drug development: raise standards for preclinical cancer research. Nature 483:531–533.
Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers EJ, Berk R, Bollen KA, Brembs B, Brown L, Camerer C, et al. (2018) Redefine statistical significance. Nat Hum Behav 2:6–10.
Berry DA (2007) The difficult and ubiquitous problems of multiplicities. Pharm Stat 6:155–160.
Collins FS and Tabak LA (2014) Policy: NIH plans to enhance reproducibility. Nature 505:612–613.
Colquhoun D (2014) An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci 1:140216.
Colquhoun D (2019) The false positive risk: a proposal concerning what to do about p-values. Am Stat 73 (Suppl 1):192–201.
Curtis MJ, Alexander S, Cirino G, Docherty JR, George CH, Giembycz MA, Hoyer D, Insel PA, Izzo AA, Ji Y, et al. (2018) Experimental design and analysis and their reporting II: updated and simplified guidance for authors and peer reviewers. Br J Pharmacol 175:987–993.
Dalton JE, Bolen SD, and Mascha EJ (2016) Publication bias: the elephant in the review. Anesth Analg 123:812–813.
Frazier EP, Schneider T, and Michel MC (2006) Effects of gender, age and hypertension on β-adrenergic receptor function in rat urinary bladder. Naunyn Schmiedebergs Arch Pharmacol 373:300–309.
Freedman LP and Gibson MC (2015) The impact of preclinical irreproducibility on drug development. Clin Pharmacol Ther 97:16–18.
Gelman A and Carlin J (2014) Beyond power calculations: assessing type S (Sign) and type M (Magnitude) errors. Perspect Psychol Sci 9:641–651.
Gelman A and Loken E (2014) The statistical crisis in science. Am Sci 102:460–465.
Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, and Altman DG (2016) Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 31:337–350.
Halsey LG, Curran-Everett D, Vowler SL, and Drummond GB (2015) The fickle P value generates irreproducible results. Nat Methods 12:179–185.
Head ML, Holman L, Lanfear R, Kahn AT, and Jennions MD (2015) The extent and consequences of p-hacking in science. PLoS Biol 13:e1002106.
Higgs MD (2013) Macroscope: do we really need the S-word? Am Sci 101:6–9.
Ho J, Tumkaya T, Aryal S, Choi H, and Claridge-Chang A (2019) Moving beyond P values: everyday data analysis with estimation plots. Nat Methods 16:565–566, doi: 10.1038/s41592-019-0470-3.
Hoenig JM and Heisey DM (2001) The abuse of power. Am Stat 55:19–24.
Huang M-W, Lin W-C, and Tsai C-F (2018) Outlier removal in model-based missing value imputation for medical datasets. J Healthc Eng 2018:1817479.
Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2:e124.
Jarvis MF and Williams M (2016) Irreproducibility in preclinical biomedical research: perceptions, uncertainties, and knowledge gaps. Trends Pharmacol Sci 37:290–302.
Kairalla JA, Coffey CS, Thomann MA, and Muller KE (2012) Adaptive trial designs: a review of barriers and opportunities. Trials 13:145.
Kerr NL (1998) HARKing: hypothesizing after the results are known. Pers Soc Psychol Rev 2:196–217.
Kriegeskorte N, Simmons WK, Bellgowan PSF, and Baker CI (2009) Circular analysis in systems neuroscience: the dangers of double dipping. Nat Neurosci 12:535–540.
Lenth RV (2001) Some practical guidelines for effective sample size determination. Am Stat 55:187–193.
Levine M and Ensom MHH (2001) Post hoc power analysis: an idea whose time has passed? Pharmacotherapy 21:405–409.
Macleod MR, van der Worp HB, Sena ES, Howells DW, Dirnagl U, and Donnan GA (2008) Evidence for the efficacy of NXY-059 in experimental focal cerebral ischaemia is confounded by study quality. Stroke 39:2824–2829.
Michel MC (2014) Do β-adrenoceptor agonists induce homologous or heterologous desensitization in rat urinary bladder? Naunyn Schmiedebergs Arch Pharmacol 387:215–224.
Motulsky H (2014a) Opinion: never use the word "significant" in a scientific paper. Adv Regen Biol 1:25155.
Motulsky H (2018) Intuitive Biostatistics, 4th ed, Oxford University Press, Oxford, UK.
Motulsky HJ (2014b) Common misconceptions about data analysis and statistics. J Pharmacol Exp Ther 351:200–205, doi: 10.1124/jpet.114.219170.
Motulsky HJ and Michel MC (2018) Commentary on the BJP's new statistical reporting guidelines. Br J Pharmacol 175:3636–3637.
National Academies of Sciences, Engineering, and Medicine (2019) Reproducibility and Replicability in Science, The National Academies Press, Washington, DC.
Okeke K, Michel-Reher MB, Gravas S, and Michel MC (2019) Desensitization of cAMP accumulation via human β3-adrenoceptors expressed in human embryonic kidney cells by full, partial, and biased agonists. Front Pharmacol 10:596.
Prinz F, Schlange T, and Asadullah K (2011) Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov 10:712–713.
Sena E, van der Worp HB, Howells D, and Macleod M (2007) How can we improve the pre-clinical development of drugs for stroke? Trends Neurosci 30:433–439.
Simmons JP, Nelson LD, and Simonsohn U (2011) False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci 22:1359–1366.
Vore M, Abernethy D, Hall R, Jarvis M, Meier K, Morgan E, Neve K, Sibley DR, Traynelis S, Witkin J, et al. (2015) ASPET journals support the National Institutes of Health principles and guidelines for reporting preclinical research. J Pharmacol Exp Ther 354:88–89.
Wasserstein RL and Lazar NA (2016) The ASA statement on p-values: context, process, and purpose. Am Stat 70:129–133.
Wasserstein RL, Schirm AL, and Lazar NA (2019) Moving to a world beyond "p < 0.05." Am Stat 73:1–19.
Weissgerber TL, Milic NM, Winham SJ, and Garovic VD (2015) Beyond bar and line graphs: time for a new data presentation paradigm. PLoS Biol 13:e1002128.
Address correspondence to: Dr. Harvey J. Motulsky, GraphPad Software, 1100 Glendon Ave., 17th floor, Los Angeles, CA 90024. E-mail: hmotulsky@graphpad.com