The American Statistician
ISSN: 0003-1305 (Print) 1537-2731 (Online) Journal homepage: https://www.tandfonline.com/loi/utas20
How Large Are Your G-Values? Try Gosset’s
Guinnessometrics When a Little “p” Is Not Enough
Stephen T. Ziliak
To cite this article: Stephen T. Ziliak (2019) How Large Are Your G-Values? Try Gosset’s
Guinnessometrics When a Little “p” Is Not Enough, The American Statistician, 73:sup1, 281-290,
DOI: 10.1080/00031305.2018.1514325
To link to this article: https://doi.org/10.1080/00031305.2018.1514325
© 2019 The Authors. Published with license
by Taylor & Francis Group, LLC.
Published online: 20 Mar 2019.
THE AMERICAN STATISTICIAN
2019, VOL. 73, NO. S1, 281–290: Statistical Inference in the 21st Century
https://doi.org/10.1080/00031305.2018.1514325
How Large Are Your G-Values? Try Gosset's Guinnessometrics When a Little "p" Is Not Enough
Stephen T. Ziliak
Roosevelt University, College of Arts and Sciences, Department of Economics, Chicago, IL; Newcastle Business School, University of Newcastle, NSW,
Australia
Abstract
A crisis of validity has emerged from three related crises of science, that is, the crises of statistical significance and complete randomization, of replication, and of reproducibility. Guinnessometrics takes commonplace assumptions and methods of statistical science and stands them on their head, from little p-values to unstructured Big Data. Guinnessometrics focuses instead on the substantive significance which emerges from a small series of independent and economical yet balanced and repeated experiments. Originally developed and market-tested by William S. Gosset aka "Student" in his job as Head Experimental Brewer at the Guinness Brewery in Dublin, Gosset's economic and common sense approach to statistical inference and scientific method has been unwisely neglected. In many areas of science and life, the 10 principles of Guinnessometrics or G-values outlined here can help. Other things equal, the larger the G-values, the better the science and judgment.
By now a colleague, neighbor, or YouTube junkie has probably shown you one of those wacky psychology
experiments in a video involving a gorilla, and testing the limits of human cognition. In one video, a person
wearing a gorilla suit suddenly appears on the scene among humans, who are themselves engaged in
some ordinary, mundane activity such as passing a basketball. The funny thing is, prankster researchers
have discovered, when observers are asked to think about the mundane activity (such as by counting the
number of observed passes of a basketball), the unexpected gorilla is frequently unseen (for discussion see
Kahneman 2011). The gorilla is invisible. People don’t see it.
ARTICLE HISTORY
Received May 2018
Revised August 2018

KEYWORDS
Balanced design; Economic approach to uncertainty; External validity; Fisher's p; Randomization; Reproducibility; Statistical significance; "Student's" t
1. Invisible Giant of Statistics Inside The Archives
(And Out)
I am an economic statistician and historian who has studied the history of statistical significance and experimental design in what is considered by most people to be a foreign and ancient land—the archives, that is, historical archival libraries—for nearly 25 years. I've logged countless hours in some of the grandest and oldest reading rooms and their opposites in England, Ireland, and the United States. On and off since 2008, my research efforts have been especially focused in Dublin, Ireland, primarily in the archives of the Guinness Storehouse Museum (wicked, I know: nice work if you can get it).
I have found by comparing the archives with published literature a similar pattern in the history of statistics, an "unseeing" of a statistical giant whose methods, when heeded, could help out much of the economy, from Big Pharma and agri-business to the World Bank and higher education, more or less instantly. Statisticians and their clients have narrowed their focus on the wrong thing, the back and forth of statistical significance testing, "pass" or "no pass." We have become so routinized in our focus on bright-line significance levels such as p < 0.05 (an
example of what the behavioral economist Daniel Kahneman calls "thinking fast") that we are often blinded to larger, perhaps more important things—such as estimation of the substantive meaning of our results (requiring "slower", deliberate thought). In fact, I argue that this unseen giant of statistics holds the keys for unlocking our current crises in science: the crises of replication, of reproducibility, and of statistical significance; in sum, close study of this giant's methods can help to undo the crisis of validity currently haunting science and society, including business and law (Goodman 2002; Ziliak and McCloskey 2008; Greenland et al. 2016; Wasserstein and Lazar 2016; Pierson, Broman, et al. 2017).
2. How a Giant Statistician and Brewer Sidestepped
the Crisis of Validity
Who is this invisible giant with a hero's cape? "Student" is the name which graces Student's t, the most popular test of statistical significance in the world, and the foundation from which Fisher's p-values were originally derived (Student 1908a, 1925; Fisher 1925; Ziliak 2008). Most statisticians have heard
that “Student” is in reality the pen name of William Sealy Gosset
(1876–1937), a brewer of Guinness beer.1
But even professional statisticians struggle to distinguish "Student's" actual methods from those of Fisher, erroneously conflating the ideas and opinions of the two men (Ziliak and McCloskey 2008, Chaps. 20–23). The much older and far more experienced "Student," who had been comparing random with balanced layouts in agricultural plots since 1905, was not, as some observers seem to think, just another "Fisher"-replicate donning a brewer's costume, a minor figure appearing on the scene to spice up a boring statistics lecture. Brewing and shipping unpasteurized beer for profit, taste, and quality assurance on the global scale is serious business—the economic element is fundamental, and a lot can be lost, "Student" knew and warned. Conflation of Student's methods with the admittedly hackneyed mathematical and antieconomic re-interpretation of them by Fisher (1925, 1933, 1935) and most other textbook authors is both a historical blind spot and scientific blunder with, as I have shown elsewhere, enormous practical and human consequences, many of them far from good.2 I call this valuable if neglected approach to statistics, "Guinnessometrics".
3. G-Values: The 10 Principles of Guinnessometrics
Guinnessometrics is an experimental philosophy of inference and decision-making innovated and market-tested between 1904 and 1937 by William S. Gosset, a self-trained statistician who rose to Head Brewer of Guinness during the decades when Guinness was the largest brewery in the world (Ziliak 2008). At Guinness the scientific brewers, including Gosset, were allowed by the company to publish research so long as they did not mention (1) beer, (2) Guinness, or (3) their own surname. Ironically the hundred-million gallon a year brewery did not rely on statistical significance, Student's t, randomized trials, or Big Data. Closer to the opposite. Guinnessometrics takes a repeated small-sample economic approach to experimental statistics and decisions, in cooperation with agents up and down supply chains, all with real "skin in the game" (Taleb 2018). Gosset's Guinnessometrics inverts the usual matrix of statistical science and emerges with little or no need for a p-value or placebo-controlled randomized trial. And as any Guinness drinker can tell you, results of this science are not only repeatable and replicable, they are reproducible. Crisis averted.
The crisis of validity in the statistical sciences has been caused largely, though not entirely, by the following common yet erroneous practices undermining trust in data-based decisions:
Attempted falsification of an assumed-to-be-true null hypothesis without a loss function;
1 Harold Hotelling (1930, p. 189), a vice president of the American Statistical Association and a teacher of many leading economists and statisticians, wrote: "I have heard guesses in this country, identifying 'Student' with Egon S. Pearson and the Prince of Wales."
2 See also Student 1942; Pearson 1990; Ziliak 2014, 2010a, 2010b; Ziliak and Teather-Posadas 2016.
Statistical significance testing at a bright-line level (such as p < 0.05 or t > 1.96) independent of the substantive meaning of the result, coefficient, or model;
Running a randomized, placebo-controlled trial (RCT) assuming the independence of observations;
Making decisions based on one, large-scale experiment, random or not ("One and done");
Accepting or rejecting a hypothesis based on a single sample of convenience and arbitrary size (the majority of social science, and much of life science);
Promoting what turn out to be irreproducible results;
Investing prematurely in "Big Data";
Assuming in statistical tests there is "no prior" subject-matter knowledge; and
Making false binary and merely qualitative assessments based on the alleged bright line "significance" or "insignificance" of a result (important/unimportant, yes/no, exists/does not exist).
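The bright-line problem named in the practices above can be seen in a small simulation; every number below is invented for illustration. With a large enough sample, a substantively trivial effect clears t > 1.96 with ease, which is exactly why the bright line answers the wrong question.

```python
import math
import random

random.seed(7)

n = 200_000
# Hypothetical outcome scale: the mean is ~100 units and, say, only
# shifts of 1 unit or more matter economically. The "treatment" shifts
# the mean by a trivial 0.1 units.
control = [random.gauss(100.0, 5.0) for _ in range(n)]
treated = [random.gauss(100.1, 5.0) for _ in range(n)]

mean_c = sum(control) / n
mean_t = sum(treated) / n
var_c = sum((x - mean_c) ** 2 for x in control) / (n - 1)
var_t = sum((x - mean_t) ** 2 for x in treated) / (n - 1)

diff = mean_t - mean_c                 # substantive size: ~0.1% of the mean
se = math.sqrt(var_c / n + var_t / n)  # standard error of the difference
t_stat = diff / se                     # comfortably exceeds the 1.96 bright line

print(f"difference = {diff:.3f} units, t = {t_stat:.1f}")
```

The test "passes" while the estimated effect stays far below any economically meaningful threshold; only a loss function, not a p-value, can say whether 0.1 units is worth acting on.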
Guinnessometrics reverses these misleading yet widespread
practices and replaces them with 10 principles or G-values,
just as Gosset always said. In listicle form, the Ten Principles of
Guinnessometrics are:
G-10 Consider the Purpose of the Inquiry, and Compare with Best Practice
Falsification of a null hypothesis is not the main purpose of the experiment or observational study. Making money or beer or medicine—ideally more and better than the competition and best practice—is. Estimating the importance of your coefficient relative to results reported by others, is. To repeat, as the 2016 ASA Statement on Statistical Significance and P-values makes clear, merely falsifying a null hypothesis with a qualitative yes/no, exists/does not exist, significant/not significant answer, is not itself significant science, and should be eschewed.
G-9 Estimate the Stakes (or Eat Them)
Estimation of magnitudes of effects, and demonstrations of their substantive meaning, should be the center of most inquiries. Failure to specify the stakes of a hypothesis is the first step toward eating them (gulp).
G-8 Study Correlated Data: ABBA, Take a Chance on Me
Most regression models assume "i.i.d." error terms—independently and identically distributed—yet most data in the social and life sciences are correlated by systematic, nonrandom effects—and are thus not independent. Gosset solved the problem of correlated soil plots with the "ABBA" layout, maximizing the correlation of paired differences between the As and Bs with a perfectly balanced chiasmic arrangement (Ziliak 2014).
G-7 Minimize "Real Error" with the 3 R's: Represent, Replicate, Reproduce
A test of significance on a single set of data is nearly valueless. Fisher's p, Student's t, and other tests should only be used when there is actual repetition of the experiment. "One and done" is scientism, not scientific. Random error is not equal to real error, and is usually smaller and less important than the sum of nonrandom errors. Measurement error, confounding, specification error, and bias of the auspices, are frequently larger in all the testing sciences, agronomy to medicine. Guinnessometrics minimizes real error by repeating trials on stratified and balanced yet independent experimental units, controlling as much as possible for local fixed effects.
G-6 Economize With "Less Is More": Small Samples of Independent Experiments
Small-sample analysis and distribution theory has an economic origin and foundation: changing inputs to the beer on the large scale (for Guinness, enormous global scale) is risky, with more than money at stake. But smaller samples, as Gosset showed in decades of barley and hops experimentation, do not mean "less than", and Big Data is in any case not the solution for many problems.
G-5 Keep Your Eyes on the Size-Matters/How-Much? Question
There will be distractions but the expected loss and/or profit functions rule, or should. Are regression coefficients or differences between means large or small? Compared to what? How do you know?
G-4 Visualize
Parameter uncertainty is not the same thing as model uncertainty. Does the result hit you between the eyes? Does the study show magnitudes of effects across the entire distribution? Advances in visualization software continue to outstrip advances in statistical modeling, making more visualization a no-brainer.
G-3 Consider Posteriors and Priors too ("It Pays to Go Bayes")
The sample on hand is rarely the only thing that is "known". Subject-matter expertise is an important prior input to statistical design and affects analysis of "posterior" results. For example, Gosset at Guinness was wise to keep quality assurance metrics and bottom-line profit at the center of his inquiry. How does prior information fit into the story and evidence? Advances in Bayesian computing software make it easier and easier to do a Bayesian analysis, merging prior and posterior information, values, and knowledge.
G-2 Cooperate Up, Down, and Across (Networks and Value Chains)
For example, where would brewers be today without the continued cooperation of farmers? Perhaps back on the farm and not at the brewery making beer. Statistical science is social, and cooperation helps. Guinness financed a large share of modern statistical theory, and not only by supporting Gosset and other brewers with academic sabbaticals (Ziliak and McCloskey 2008, Chap. 22). And last but not least:
G-1 Answer the Brewer's Original Question ("How Should You Set the Odds?")
No bright-line rule of statistical significance can answer the brewer's question. As Gosset said way back in 1904, how you set the odds depends on "the importance of the issues at stake" (the expected benefit and cost, for example) together with the cost of obtaining new material.
No one could plausibly claim that the 10 G-values are the end-all, be-all of statistical science. Only that for a great variety of business, medical, and scientific purposes, the Guinnessometric approach to data and decision-making answers far more questions, and far more satisfactorily, than the conventional, unrepeated observational study or placebo-controlled RCT judged by the level of a p-value (compare Ziliak 2010a, 2010b; Senn 2010).
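The merging of prior and posterior information mentioned under G-3 can be sketched with a minimal conjugate (beta-binomial) update. The pass rates and prior strengths below are invented for illustration, not taken from Guinness records.

```python
# Prior: suppose long brewing experience says a batch passes quality
# checks about 90% of the time; encode that as Beta(alpha=18, beta=2),
# i.e. roughly 20 batches' worth of prior evidence.
alpha, beta = 18.0, 2.0

# New small sample: 7 of 10 test batches pass.
passes, trials = 7, 10

# Conjugacy: the posterior is Beta(alpha + passes, beta + failures).
post_a = alpha + passes
post_b = beta + (trials - passes)

prior_mean = alpha / (alpha + beta)      # 0.90
sample_rate = passes / trials            # 0.70
post_mean = post_a / (post_a + post_b)   # 25/30, about 0.83

print(f"prior {prior_mean:.2f}, sample {sample_rate:.2f}, posterior {post_mean:.2f}")
```

The posterior mean sits between prior belief and the small sample, rather than letting 10 observations overrule two decades' worth of subject-matter knowledge.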
4. How to Get Large G-Values
Gosset's routine was to produce as many of the 10 G-values as possible, and to labor at them at maximum possible strength, so long as doing so does not subtract too much from one or more of the other G-values. We briefly illustrate below the value of each G-value. For example, throwing too many resources into G-value Number 7 by replicating too much (such as by spending $10 million on a 12th replication of the overrated "invisible gorilla" experiment made famous by YouTube) would encroach too negatively on G-value Number 6: the need to "Economize." Economizing in Gosset's sense means studying a series of small but independent and well-structured samples (rather than plunking down millions for Big, unstructured Data, today's fashion) to learn about regression input X or dependent variable Y. It should be said that the validity of G-values is completely general and does not depend on whether or not one fancies a beverage by Guinness. Thus:
5. G-10 Consider The Purpose Of The Inquiry, And
Compare With Best Practice
Falsication of a null hypothesis is not, we have said, the “pur-
pose” of a study or experiment helped along by statistical meth-
ods. Saving lives or money or malted barley, or advancing the
health and wealth of schools and nations, is. In his magiste-
rial Theory of Probability Harold Jereys (1961 [1961], p. 377)
wrote in agreement with Gosset that the null test-and-p-value
procedure advocated by the Fisher School “is merely something
to set up like a coconut to stand until it is hit.” “Hence the
hypotheses made by “Student” are completely equivalent to
mine; they have merely been introduced in a dierent order”
(Jereys 1961, p. 380).
Presumably there is prior interest, otherwise, why test? Why
invest in an experiment that is thought in advance to make no
dierence whatsoever? In their hearts, most investigators do
not. But in their statistical science most are in Jereyss sense
dedicated coconut swatters.
Whatever the purpose of the experiment, best-practice research compares a novel treatment or variable with best practice and/or prevailing wisdom, not with an assumed-to-be-true null hypothesis or blank placebo. At the largest brewery in the world measured by annual output and sales, Gosset was determined, indeed he was incentivized, to mash the beer, not a low-hanging coconut. At Guinness, he could not afford to spend his days taking whacks at easy-to-hit coconuts. At stake was nothing less than 100 million gallons of Guinness stout produced and sold annually by one of the most recognized brands in the world, first established by Arthur Guinness in 1759.
In an important letter of 1905, Gosset told Karl Pearson that one can "aim at" the odds of attaining some result by way of repeated experimentation. He told Pearson he decided that one cannot judge the "significance" of results—or decide a course of business action—without, in effect, employing some scale of human values capable of balancing the utility of expected gains against the disutility of losses (Gosset 1905). The point is fundamental:
When I first reported on the subject [of "The Application of the 'Law of Error' to the Work of the Brewery" (1904)], I thought that perhaps there might be some degree of probability which is conventionally treated as sufficient in such work as ours and I advised that some outside authority [in mathematics, such as Karl Pearson] should be consulted as to what certainty is required to aim at in large scale work. However it would appear that in such work as ours the degree of certainty to be aimed at must depend on the pecuniary advantage to be gained by following the result of the experiment, compared with the increased cost of the new method, if any, and the cost of each experiment.
Gosset upped the odds of farming and brewing success by designing and repeating small samples of stratified and balanced experiments, reducing what he called the "real error." The object or purpose of an experiment has little or nothing to do with the "significance" of a null hypothesis in and of itself, independent of some scale of values and range of action.
In a seminal Biometrika article, "On Testing Varieties of Cereals," Gosset wrote (Student 1923):
The object of testing varieties of cereals is to find out which will pay the farmer best. This may depend on quality, but in general it is an increase of yield which is profitable, and since yield is very variable from year to year and from farm to farm it is a difficult matter upon which to obtain conclusive evidence.
Citing his own research undertaken in cooperation with the Irish Department of Agriculture and Technical Instruction (IDATI), Gosset continued:
Yet it is certain that very considerable improvements in yield have been made as the result of replacing the native cereals by improved varieties; as an example of this I may cite the case of Ireland, where varieties of barley have been introduced which were shown by experiment [since 1898] to have an average yield of 15 to 20% above those which they replaced. This represents, probably, a gain to the country of not less than £250,000 per year. As the cost of experiments from the commencement to the present time [namely, about 25 years] cannot have reached £40,000 the money has been well spent.
In the same article Gosset told the assembled at the Royal Statistical Society—including Fisher, Neyman, and Egon Pearson—that the net gain to Ireland alone after 25 years of continuous experimentation using balanced (not RCT) designs was approximately:
Gain: +£250,000 per year × 25 years
Cost: −£40,000
That is, a ballpark net gain = £6.21 million on a £40,000 investment.
In his last published paper, "Comparison of Random and Balanced Arrangements of Field Plots," Gosset said (Student 1938, p. 206):
I personally choose the method which is most likely to be profitable when designing the experiment rather than use Prof. Fisher's system of a posteriori choice* which has always seemed to me to savour rather too much of "heads I win, tails you lose."
6. G-9 Estimate The Stakes (Or Eat Them)
Thus, the purpose of an experiment puts something substantive at stake—a belief, an input, a technology, or policy variable. The purpose of the experiment is not to reject a null hypothesis; the purpose of a study is usually not even "statistical" in meaning. Science is substantive. Its meaning lies in magnitudes of bees, or beer, or basketball, of love, or family, or famous movie stars. Not in columns and rows of Student's t or Fisher's p.
The stakes of a study are normally speaking the substantive stakes of life—a gamble on a new pill or product or pilot study—with expected magnitudes of gains and losses expressed in the form of an expected loss function (Manski 2018; Press 2005), Jeffreys's minimum difference (Jeffreys 1961), a Ramsey bet (Ramsey 1926) or Gosset gamble (Gosset 1905) on a scale of values not captured by tables of probability alone. As Ramsey (1926, p. 51) wrote in "Truth and Probability":
We all agree that a man who did not make inductions would be unreasonable: the question is only what this means.
Like Gosset, Ramsey believed that probability "is a measurement of belief qua basis of action."3 Quantities of substantive gain or loss are central to the Guinnessometric approach.
The much-admired Frank Plumpton Ramsey (1926)—who by chance had attended the same school as Gosset (Winchester College) and studied mathematics with one of Gosset's friends (the mountain-climber, Robert Irving)—observed that the "expected value" of an experiment or a set of observations is not a reliable guide for action. (Most economists first learned the point from Savage, though Savage credits Ramsey.) Expected value is an unreliable metric for gauging human behavior, though many social and life scientists interpret regression models as if. As the stakes rise, or as the gamble is rephrased or otherwise altered rhetorically and thus psychologically, risk-taking behavior changes, Ramsey perceived. And this fact has led economists and statisticians to favor "expected utility" over expected value (in statistics, see Lindley (1991) and Press (2005); also see Taleb (2018) for the weakness of utility under "ruin probabilities"). The expected value of something is the sum of all the possible outcomes (in nominal terms) weighted by their respective probabilities of occurrence.
It is easy to see Ramsey's point in a comparison of three different gambles subject to one random flip of a fair coin (illustrated by Frank 2017, pp. 179–180):
Gamble 1: If a coin flip shows "heads", win $100; if tails, lose $0.50.
Gamble 2: If a coin flip shows "heads", win $200; if tails, lose $100.
Gamble 3: If a coin flip shows "heads", win $20,000; if tails, lose $10,000.
Now Gamble 1 would be accepted by many. Assuming as convention does that the probability of heads and tails are equal (at 1/2 each), the expected value of the gamble is (1/2)(win $100) + (1/2)(lose $0.50), for an expected win of $49.75. The worst-case scenario is "lose 50 cents", which many would be willing to risk for an equal chance at winning $100.
Gamble 2 will draw fewer contestants, though many will still accept the gamble. Gamble 3 has the highest expected value but this bet would be rejected by many (though not all) even though the payoff structure is identical to Gamble 2 (both have a 2-to-1 win–loss ratio). The reason is that, although winning $20,000 could be lovely, losing $10,000 from a single coin flip could be tragic.
3 https://plato.stanford.edu/entries/probability-interpret/
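The arithmetic of the three gambles can be written out in a few lines; the helper `expected_value` is illustrative, not code from the article.

```python
# Expected value of a fair-coin gamble: probability-weighted average of
# the win and the loss.
def expected_value(win, lose, p_win=0.5):
    return p_win * win + (1 - p_win) * (-lose)

ev1 = expected_value(100, 0.50)       # (1/2)(100) + (1/2)(-0.50) = 49.75
ev2 = expected_value(200, 100)        # = 50.00
ev3 = expected_value(20_000, 10_000)  # = 5000.00

print(ev1, ev2, ev3)
```

Gamble 3's expected value dwarfs the others, yet many people refuse it, which is Ramsey's point: expected value alone is a poor guide to action.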
Suppose now that subjects are expected utility maximizers instead of expected value maximizers. Utility theory has problems and limitations, too, but it offers something more. Expected utility is a scale of values which accounts for, among other things, attitudes toward risk and the diminishing marginal utility of money (for example Savage 1954; Lindley 1991; Press 2005). Suppose you are a conventional risk-averse person with a strictly concave utility function, such as the square-root function, which is concave in money accumulation (EU = √money).
Your initial amount of money (or wealth) is $10,000. What is the expected utility of each of the three gambles? Notice that the "value" ranking of the gambles is now reversed.
The expected utility of Gamble 1 is highest even though its expected value is lowest ((1/2)√10,100 + (1/2)√9,999.50 = 100.248 utils), while the expected utility of Gamble 3 is lowest even though its expected value is by far the highest ((1/2)√30,000 + (1/2)√0 = 86.60 utils). "The old-established way of measuring a person's belief is to propose a bet, and see what are the lowest odds which he will accept. This method I regard as fundamentally sound" (Ramsey 1926, p. 34). "[B]ut it suffers from being insufficiently general, and from being necessarily inexact. It is inexact partly because of the diminishing marginal utility of money, partly because the person may have a special eagerness or reluctance to bet, because he either enjoys or dislikes excitement or for any other reason, e.g., to make a book" (Ramsey, pp. 34–35).
Yet from the point of view of a statistical scientist who is charged with practical estimation and interpretation in a particular context, Ramsey's approach is the backward way around, and too abstract. What Gosset's expected loss function approach lacks in abstract shine, it gains in economic profit and plain common sense.
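The ranking reversal under square-root utility can be checked directly; the function below is an illustrative sketch under the wealth and payoff assumptions stated above.

```python
import math

# Expected sqrt-utility of terminal wealth under one fair coin flip,
# starting from initial wealth `wealth`.
def expected_utility(wealth, win, lose, p_win=0.5):
    return p_win * math.sqrt(wealth + win) + (1 - p_win) * math.sqrt(wealth - lose)

w = 10_000
eu1 = expected_utility(w, 100, 0.50)       # ~100.248 utils
eu2 = expected_utility(w, 200, 100)        # ~100.247 utils
eu3 = expected_utility(w, 20_000, 10_000)  # ~86.603 utils

print(f"{eu1:.3f} {eu2:.3f} {eu3:.3f}")
```

Concavity does the work: Gamble 1 now ranks first and Gamble 3 last, the exact opposite of the expected-value ordering, because the pain of falling to $0 outweighs the pleasure of reaching $30,000.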
7. G-8 Study Correlated Data: ABBA, Take a Chance on Me
Completely randomized studies, where treatment and control groups are determined as if by random coin flip, are now fashionable. Yet, statisticians have long known that stratification or blocking adds precision and efficiency to a study otherwise based on complete randomization. Gosset (Student 1911) used blocking or stratification long before the synonymous words existed in the statisticians' vocabulary. He called his balanced approach to field layouts "ABBA," the closely arranged mirror pattern of the layout, treatments, and controls, the As and Bs being compared. Student (1923, p. 273) said:
The art of designing all experiments lies even more in arranging matters so that ρ [the correlation coefficient] is as large as possible than in reducing σx² and σy² [the variances].
The peculiar difficulties of the problem lie in the fact that the soil in which the experiments are carried out is nowhere really uniform; however little it may vary from eye to eye, it is found to vary not only from acre to acre but from yard to yard, and even from inch to inch. This variation is anything but random [Gosset himself noted], so the ordinary formulae for combining errors of observation which are based on randomness are even less applicable than usual.
As Deming (1938, p. 879), an admirer of Gosset, noted: "Stratification is equivalent to blocking in the design of an experiment." Box, Hunter, and Hunter (2005, p. 92) explain that
A block is a portion of the experimental material (the two shoes of one boy, two seeds in the same pot) that is expected to be more homogeneous than the aggregate (the shoes of all the boys, all the seeds not in the same pot). By confining comparisons to those within blocks (boys, girls), greater precision is usually obtained because the differences associated between the blocks are eliminated.
Deming, who before turning to manufacturing did a long stint as researcher at the U.S. Department of Agriculture, agreed with Gosset's nonrandom point: random sampling and randomized experiments are at best preliminary steps to scientific study. Complete randomization has a purpose when the investigator knows little or nothing at all about strata or when the cost of being wrong is negligible. Said Deming (p. 879):
The primary aim of stratified sampling is to increase the amount of information per unit of cost. A further aim may be to obtain adequate information about certain strata of special interest. One way to carry out stratification is to rearrange the sampling units in the frame so as to separate them into classes, or strata, and then to draw sampling units from each class. The goal should be to make each stratum as homogeneous as possible, within limitations of time and cost.4
Likewise in his book, Planning of Experiments, David Cox (1958) recommends "completely randomized arrangement …[only] in experiments in which no reasonable grouping into blocks suggests itself"—that is, when ignorance prevails, or priors are flat.
Normally speaking, ignorance does not prevail, and real economic and statistical gains can be found by stratifying. Deming (1938) and Tippett (1952) simplified Student's (1911, 1923) proof that stratification (blocking) can reduce sample size requirements by 40% or more, holding variance constant.5 And as Tippett noted, "At the worst"—assuming the rare case that calculated variance between strata is zero—"sampling in strata is no better than random sampling, but it is never worse."
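The precision gain from Gosset's paired "ABBA" blocking over complete randomization can be illustrated with a small simulation; the fertility gradient, noise levels, and treatment effect below are invented for illustration, not Gosset's data.

```python
import random
import statistics

random.seed(42)

n_pairs = 2000
effect = 1.0  # true treatment effect on yield, in hypothetical units

paired_diffs, unpaired_diffs = [], []
for i in range(n_pairs):
    fertility = 0.01 * i  # smooth, systematic (nonrandom) trend along the strip

    a = fertility + effect + random.gauss(0, 1)  # treated plot
    b = fertility + random.gauss(0, 1)           # adjacent control plot
    paired_diffs.append(a - b)  # the shared local fertility cancels in the pair

    # Complete randomization: the comparison plot comes from an unrelated
    # location, so the fertility trend no longer cancels and inflates variance.
    b_far = 0.01 * random.randrange(n_pairs) + random.gauss(0, 1)
    unpaired_diffs.append(a - b_far)

var_paired = statistics.variance(paired_diffs)
var_unpaired = statistics.variance(unpaired_diffs)
est_effect = statistics.mean(paired_diffs)
print(f"paired variance {var_paired:.1f}, unpaired variance {var_unpaired:.1f}")
print(f"estimated effect from pairs {est_effect:.2f}")
```

Both comparisons estimate the same treatment effect, but the paired differences have far smaller variance, so the same precision needs far fewer plots. This is the spirit, though not the exact magnitude, of the 40%-or-more sample-size saving attributed above to Deming and Tippett.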
8. G-7 Minimize "Real Error" With The 3 R's: Represent, Replicate, Reproduce
Recently the ASA has adopted the following widely used definitions of reproducibility and replicability in a set of "Recommendations" designed to assuage the reproducibility "crisis" which
4Deming (1978, p. 879). Deming said he learned the technique from Neyman (1934). In the seminal article Neyman demonstrates the statistical and economic advantages of stratified sampling over random sampling (Neyman 1934, pp. 579–585). Neyman credits the idea of "purposive selection" to earlier writers, such as Bowley and Gini and Galvani.
5Deming (1978, pp. 880–881), Tippett (1958, p. 356). In a Riesling vine-and-wine experiment, Meyers, Sacks, van Es, and Vanden Heuvel (2011) used blocking, balancing, and repetition (at n = 3 vineyards) to reduce sample size requirements by up to 60%.
286 S. T. ZILIAK
is currently being observed throughout the sciences (Pierson,
Broman, et al. 2017):
1. Reproducibility: A study is reproducible if you can take the original data and the computer code used to analyze the data and reproduce all of the numerical findings from the study. This may initially sound like a trivial task but experience has shown that it's not always easy to achieve this seemingly minimal standard.
For Gosset and Guinnessometrics, "reproducibility" is the ability to brew Guinness stout or ale (Smithwick's) in such a way as to taste and to otherwise behave the same, pint after pint, gallon after gallon, millions of times over worldwide.
2. Replicability: This is the act of repeating an entire study, independently of the original investigator without the use of original data (but generally using the same methods).
In 35 years of barley yield and quality trials, Gosset and Guinness commissioned experiments with Irish barley farmers scattered across the different barley growing regions of the country. On average there were 10 farmers running simultaneously the same or nearly identical experimental design on new barley (treatment) and one or more "old" barleys. And for insurance each planted a replicate locally; thus there were approximately 20 replications in the average year.
These denitions suit well the Guinnessometric practice of
replication and reproducibility. Gosset explained the 3 R’s of
minimum real error in a letter of April 1937 to Egon S. Pearson,
who was his close friend and editor of Biometrika (quoted in
Pearson 1939, pp. 247–248):
Many thanks for yours of [April] 10th; I feel I'm wasting your time but as long as you ask questions you must expect to get answers …. Now I was talking about Cooperative experiments and obviously the important thing in such is to have a low real error, not to have a "significant" result at a particular station.
The latter seems to me to be nearly valueless in itself. Even when experiments are carried out only at a single station, if they are not mere five finger exercises, they will have to be part of a series in time …. But in fact experiments at a single station [that is, tests of statistical significance on a single set of data] are almost valueless; you can say "In heavy soils like Rabbitsbury potatoes cannot utilise potash manures", but when you are asked "What are heavy soils like Rabbitsbury?" you have to admit—until you have tried elsewhere—that what you mean is "At Rabbitsbury etc." And that, according to X may mean only "In the old cow field at Rabbitsbury". What you really want to find out is "In what soil and under what conditions of weather do potatoes utilise the addition of potash manures?"
To do that you must try it out at a representative sample of the farms of the country and correlate with the characters of the soil and weather. It may be that you have an easy problem, like our barleys which come out in much the same order whatever—in reason—you grow them or like Crowther's cotton which benefitted very appreciably from nitro-chalk in seven stations out of eight, but even then what you really want is a low real error. You want to be able to say not only "We have significant evidence that if farmers in general do this they will make money by it", but also "we have found it so in nineteen cases out of twenty and we are finding out why it doesn't work in the twentieth". To do that you have to be as sure as possible which is the 20th—your real error must be small (emphasis added).
Representation, the third "R," we can define in a number of different ways but something like this: coverage and stratification of all systematic sources of fluctuation, whether the source be a temporal, a spatial, or other exogenous force (such as weather: rainy or not rainy, or soil quality: loamy or clay). These definitions work as a first approximation. It's not about random error: that's the main point. We are trying to minimize and control for the systematic errors, which are of larger importance and number than are the assumed-to-be "random sampling errors." These are the 3 R's of Guinnessometrics and G-value No. 7.
The reason for G-value Number 7, the 3 R's of Minimum Real Error, is simple: the out-of-sample experience of life is difficult to predict, for both systematic and random reasons. Unless you are a late night psychic broadcasting on cable television to well sedated individuals, you do not hold the crystal ball for ensuring external validity and economic profit should we decide to "scale up" from an unbalanced, un-stratified village-level experiment on eyeglasses or nets to, let's say, a whole nation (the lack doesn't stop some from donning a purple scarf at The World Bank: Banerjee and Duflo 2011; Glewwe et al. 2012).
Yet most statistical studies in the social sciences, economics and psychology included, and many more in health and medicine, are of the "one and done" variety: one RCT (which assumes independence) is conducted on a single sample; one approach is taken to regression modeling, followed by 100 tests of significance on the data using NHST and p-values on the single sample on offer. P-hack until you publish the paper (Ziliak and Teather-Posadas 2016 document the ethical side of this). This treating of single samples as if they are repeated samples is old news to older psychologists, many of whom remember the great Sterling (1959) survey showing that the probability of replication decreased with the level of statistical significance (the lower the p-value, the less likely the study is to be replicated).
From 1901 until the start of World War II, Guinness (and thus Gosset) invested heavily in the 3 R's of G-values: that is, by repeating annually a small series of independent, representative, and balanced experiments (what sociologists call repeated, stratified-random samples). The positive results of the Guinness and Irish investment are nothing short of astonishing, with one of the byproducts being that Gosset invented or inspired half of the toolkit of modern statistics and experimental science.
9. G-6 Economize With “Less is More”: Small Samples
of Independent Experiments
Replication and reproduction do not always require millions of dollars in grant money, though at the NIH, NSF, and elsewhere they can.
Small samples are an economic and scientific choice, not a mathematical problem to be solved in abstract mathematical terms. For example, in a small-sample analysis, a brewer may wish to know with 10 to 1 or better odds how many samples of malt extract he needs to mix to be confident that the saccharine level of the beer stays within 0.5 degrees of the 133 degree standard he is targeting. The example is "Student's": brewing
THE AMERICAN STATISTICIAN 287
over 100 million gallons of Guinness stout per annum, "Student" and Guinness stakeholders needed to know (Ziliak 2008, p. 206). "Real" errors in this context include uneven temperature changes, heterogeneous barley malt, and mismeasurement of saccharine levels—adding up to more error than is allegedly described by p or t.
Student rejected artificial rules about significance from the beginning of his inquiries at the Brewery—at least four years before he published the first table and small sample test of significance (Gosset 1904; Student 1908a). In November 1904, Gosset—he would not be known as Student until three years later—discussed his first breakthrough on the economic meaning of statistical significance, in an internal report titled "The Application of the 'Law of Error' to the Work of the Brewery."
The Apprentice Brewer said:
Results are only valuable when the amount by which they probably differ from the truth is so small as to be insignificant for the purposes of the experiment. What the odds should be depends—
1. On the degree of accuracy which the nature of the experiment
allows, and
2. On the importance of the issues at stake.
Comparing the level of saccharine content in a series of malt extracts which he and others mixed in the Experimental Brewery with that found in malts being used in the Main Brewery, Gosset brought attention to a positive correlation he found between "the square root of the number of observations"—that is, the number of calculated differences in saccharine content between Experimental and Main Brewery malts—and the level of statistical significance. Other things equal, he said, "the greater the number of observations of which means are taken [the larger the sample size of extract differences], the smaller the [probable or standard] error" of the estimates. "And the curve which represents their frequency of error," he showed in a graph and plot drawing, "becomes taller and narrower."
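Gosset's "taller and narrower" curve is the square-root law for the standard error of a mean, which a two-line sketch can trace (the unit standard deviation is arbitrary):

```python
import math

def se_of_mean(sigma, n):
    """Standard error of the mean of n independent observations."""
    return sigma / math.sqrt(n)

# Quadrupling the number of observations only halves the error: the
# frequency curve grows "taller and narrower" at a square-root rate.
for n in (1, 4, 16, 64):
    print(f"n = {n:2d}: standard error = {se_of_mean(1.0, n):.3f}")
```

The diminishing returns in that schedule are why Gosset asked how few observations would suffice, rather than how many he could afford.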
Prior to Gosset the relation between sample size and the level of statistical significance was rarely explored. For example, while looking at biometric samples with up to thousands of observations, Karl Pearson declared that a result departing by more than three standard deviations is "definitely significant." The normal tables assumed very large samples. Yet Gosset, self-trained in statistics, found by experiment that at such large samples nearly everything is statistically "significant"—though not, in Gosset's terms, economically or scientifically "important". And, likewise, Gosset found that a small number of observations can be profitable, though not statistically significant in Pearson's conventional sense. Regardless, Gosset did not have the luxury of large samples. One of his earliest experiments employed a sample size of n = 2, which helps to explain why in the original 1908 article Gosset calculated a z statistic for n = 2 (Student 1908a).
His 1904 article is worth exploring a bit further—especially for the econometrician and real-world firm that wants to earn more with less. Guinness malt was produced in Gosset's time primarily from Irish and English barley stock—Old Irish, Prentice, Plumage Archer, and Spratt Archer were effective varieties. Malt extract was measured by "degrees saccharine" per barrel of 168 pounds malt weight.
An extract in the neighborhood of 133 degrees saccharine gave the targeted level of alcohol for Guinness's beer. A much higher degree of saccharine would affect the stability and life of the beer, but it also increases alcohol content—which in turn increases the excise tax which Guinness owes to the British government, which—sad to say—ups the price of Dad's pint. If, on the other hand, the alcohol content comes in too low, if the degree of saccharine is insufficient, customers would riot, or switch to Beamish and Beck's. In Gosset's view, ±0.5 degrees saccharine was a difference or error in malt extract which Guinness and its customers could swallow. "It might be maintained," he said, that malt extract "should be [estimated] within 0.5 of the true result with a probability of 10 to 1." Gosset
calculated the odds of observing the stipulated accuracy for
small and then large numbers of extracts. He found that:
Odds in favour of smaller error than 0.5 [are with:]
2 observations 4:1
3 " 7:1
4 " 12:1
5 " 19:1
82 " practically infinite
Thus, Gosset concluded, "In order to get the accuracy we require [that is, 10 to 1 odds with 0.5 accuracy], we must, therefore, take the mean of [at least] four determinations." The Guinness Board cheered. The Apprentice Brewer found an economical way to assess the behavior of population parameters, using very small samples.
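Gosset's little table can be roughly reconstructed under a normal "law of error." The per-observation standard deviation below (about 0.55 degrees) is an assumption chosen to land near his published odds, not a figure from the 1904 report:

```python
import math

def odds_within(tol, sigma, n):
    """Odds that the mean of n observations lies within ±tol of the truth,
    assuming independent normal errors with standard deviation sigma."""
    z = tol * math.sqrt(n) / sigma
    p = math.erf(z / math.sqrt(2))  # P(|standardized mean error| < z)
    return p / (1 - p)

sigma = 0.55  # assumed per-observation error, degrees saccharine
for n in (2, 3, 4, 5):
    print(f"{n} observations: odds about {odds_within(0.5, sigma, n):.0f} to 1")
```

On this assumption the mean of four determinations is the first to clear the 10-to-1 threshold, matching Gosset's conclusion; his own figures rested on probable-error tables, so the exact odds differ slightly.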
Small samples and their analysis originate from a fundamental economic cause: scarcity—the economic scarcity and expense of gaining new information about barley, malt, hops, and other beer inputs. Thus, one can say in general that Gosset took an economic approach to the logic of uncertainty, from the choice of sample size on up.
10. G-5 Keep Your Eyes on the "Size Matters/How Much?" Questions
We're all in search of that "Goldilocks" zone, not too high, not too low; not too hot, not too cold, et cetera. In 1995, some cancer epidemiologists made history (discussed by Ziliak and McCloskey 2008, pp. 184–186). The authors of 10 independent and randomized clinical trials involving thousands of patients in treatment and control groups had come to an agreement on an effect size. Consensus on a mere direction of effect—up or down, positive or negative—is rare enough in science. After four centuries of public assistance for the poor in the United States and Western Europe, for example, economists do not speak with one voice on the direction of effect on labor supply exerted by tax-financed income subsidies. Medicine is no different. Disagreement on the direction of effect—let alone the size of effect—is more rule than exception.
So the Prostate Cancer Trialists' Collaborative Group was understandably eager to publicize the agreement. Each of the 10 studies showed that a certain drug "flutamide"—for the treatment of prostate cancer—can increase the likelihood of patient survival by an average of 12% (the 95% confidence interval in the pooled data put an upper bound on flutamide-enhanced survival at about 20% [Rothman, Johnson, and Sugano 1999]).
Odds of 5 in 100 is not the best news to deliver to a prostate
patient. But if castration followed by death is the next best
alternative, a noninvasive 12-to-20% increase in survival sounds
good.
But in 1998 the results of still another, 11th trial were published in the New England Journal of Medicine (Eisenberger et al. 1998, pp. 1036–1042). The authors of the new study found a similar size effect. But when the two-sided p-value for their odds ratio came in at .14 they dismissed the efficacious drug, concluding "no clinically meaningful improvement" (pp. 1036, 1039). Kenneth Rothman, Eric Johnson, and David Sugano examined the individual and pooled results of the 11 separate studies, including the study conducted by Eisenberger et al.:

One might suspect that [Eisenberger et al.'s] findings were at odds with the results from the previous ten trials, but that is not so. From 697 patients randomised to flutamide and 685 randomised to placebo, Eisenberger and colleagues found an OR of 0.87 (95% CI 0.70–1.10), a value nearly identical to that from the ten previous studies. Eisenberger's interpretation that flutamide is ineffective was based on absence of statistical significance. (Rothman, Johnson, and Sugano 1999, p. 1184)
Rothman and coauthors display the flutamide effect graphically in a manner consistent with a Gosset-Deming-and-Savage approach to visualization. Does the effect hit you between the eyes? Does it cause interocular trauma? Rothman and others pool data from the separate studies and plot the flutamide effect (measured by an odds ratio, or the negative of the survival probability in a hazard function) together with the p-value function. With the graphical approach, Rothman and his coauthors are able to show pictorially how the p-values vary with increasingly positive and increasingly negative large effects of flutamide on patient survival. And what they show is substantively significant: Eisenberger's new data only reinforce the findings from the earlier studies that flutamide provides a small clinical benefit. Adding the latest data makes the p-value function narrower, which is to say that the overall estimate is now more precise, and points even more clearly to a benefit of about 12% in the odds of surviving for patients receiving flutamide.
Rothman and others conclude: "the real lesson" from the latest study is "that one should eschew statistical significance testing and focus on the quantitative measurement of effects." That sounds right. Statistical significance is hurting people, indeed killing them. It is leaving their illnesses and a defective notion of significance "unexplained." Still, a recent correspondent points out that although the relative risk is high, the measured relative difference is small, adding only a month or so extra life. (Other things equal, perhaps even a little life is better than none.) If you want to draw a line in your science, in any case, draw a line of minimally important effect or minimally important magnitude of a regression coefficient. Keep your eyes on the size.
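A p-value function of the kind Rothman and coauthors plot can be sketched from a point estimate and confidence interval alone. Here the single-trial figures quoted above (OR 0.87, 95% CI 0.70–1.10) stand in for the curve, with a normal approximation on the log-odds scale assumed:

```python
import math

def norm_sf(z):
    """Upper-tail probability of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Recover the log-scale standard error from the reported 95% CI,
# then trace the two-sided p-value at a range of hypothesized ORs.
est = math.log(0.87)
se = (math.log(1.10) - math.log(0.70)) / (2 * 1.96)

for or_hyp in (0.70, 0.80, 0.87, 1.00, 1.10, 1.25):
    z = abs(est - math.log(or_hyp)) / se
    print(f"H: OR = {or_hyp:.2f}  ->  two-sided p = {2 * norm_sf(z):.2f}")
```

The function peaks (p = 1) at the point estimate of 0.87 and shows that the null OR of 1.00 is only weakly contradicted—exactly the "absence of significance" that Eisenberger et al. over-read as absence of effect.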
11. G-4 Visualize
And visualize. Model uncertainty is not the same thing as parameter uncertainty. From farming to pharmaceuticals, we want to know what the entire distribution looks like from the point of view of oomph and precision, magnitudes of relationships, and attendant uncertainty. Not just the point mean or median, with a superscript of asterisks declaring "significant" or "highly" so. Remember Stephen Jay Gould's far-above-the-median experience with surviving stomach cancer, discussed in his essay "The Median Isn't the Message" (Gould 1985). Gould's doctor cited a median survival time from diagnosis of about 8 months; but the prolific scholar and writer looked at the graphs showing estimates. He noticed a thin but long right-hand tail, changed his lifestyle and eating habits, and lived and worked for another 22 years (rather more than 8 months)!
An illuminating study by Soyer and Hogarth (2011) tested the predictive ability of more than 200 econometricians using linear models. Prediction was most accurate when the experts were given only a theoretical regression line and scatter plot of the data. Take away the plots and their ability to relate model error to levels of the dependent variable fell dramatically. For novice and seasoned alike, the several books by Tufte on the art and science of visualization are invaluable.
The variance can vary, error bounds vary, degree and direction of uncertainty vary. Pictures help us to see by how much. Like Gosset, we should pay more attention to the actual distribution of our data, not only to rejections of the normal or exponential distribution (which few deign to discover) but also to simulations or other confirmations of the actual distribution.
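A quick simulation shows why the whole distribution, and not a lone median, matters. The lognormal shape and the 8-month median below are illustrative, not Gould's clinical data:

```python
import math
import random
import statistics

random.seed(3)

# Invented right-skewed survival times (months): lognormal with a
# median of 8 months, echoing the shape Gould confronted.
times = [random.lognormvariate(math.log(8), 1.0) for _ in range(10_000)]

median = statistics.median(times)
mean = statistics.mean(times)
p95 = statistics.quantiles(times, n=20)[-1]  # 95th percentile

print(f"median survival: {median:5.1f} months")
print(f"mean survival:   {mean:5.1f} months (pulled up by the tail)")
print(f"95th percentile: {p95:5.1f} months")
```

The long right tail that a bare "median of 8 months" conceals is exactly what a plot of the actual distribution reveals at a glance.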
12. G-3 Consider Posteriors And Priors Too ("It Pays To Go Bayes")
Be explicit about how prior information is or is not incorporated. Perhaps the most commonly used alternative to classical t and p is the Bayes factor (Carlin and Louis 2008; Press 2003). Gosset, I have mentioned, was a Bayesian who for reasons of efficiency, and in a world without a supercomputer, resorted frequently to frequentist methods (especially power: Student 1938). For discrete data and simple hypotheses, the Bayes factor represents the ratio between the probability assigned to the data under an alternative hypothesis and the null hypothesis (Johnson 2013). One big advantage of Bayesian analysis is that one can compute the probability of a hypothesis, given the evidence, whereas with the null hypothesis test of significance, measured by a p-value, one can only speak to the probability of seeing data more extreme than have actually obtained, assuming the null hypothesis of "no difference" (or whatever) to be true. As the Bayesian Jeffreys noted (1961, p. 409):

Whether statisticians like it or not, their results are used to decide between hypotheses, and it is elementary that if p entails q, q does not necessarily entail p. We cannot get from "the data are unlikely given the hypothesis" to "the hypothesis is unlikely given the data" without some additional rule of thought. Those that reject inverse probability have to replace it by some circumlocution, which leaves it to the student to spot where the change of data has been slipped in, in the hope that it will not be noticed.
Jereys went on to explain that if one assigns prior odds
between the alternative and null hypotheses, multiplication
THE AMERICAN STATISTICIAN 289
of the Bayes factor by these prior odds yields the posterior
odds between the hypotheses. From the posterior odds between
hypotheses, scientists can compute the posterior probability that
anullhypothesisistrue(orinanycaseusefulorpersuasive)
relative to an explicit alternative. Classical tests of signicance,
measured by tand p,cannot.
6Johnson (2013)observesthat
in certain hypothesis tests the alternative hypothesis can be
specied so that an equivalence between Bayes factors and
p-values can be established. Technically speaking, Johnson
and others have shown, in one parameter exponential family
models in which a point null hypothesis has been specied
on the model parameter, specifying the size of the test is
equivalent to specifying a rejection threshold for the Bayes
factor, provided that it is further assumed that the alternative
hypothesis is specied so as to maximize the power of the test.
The correspondence between Bayes factors and p-values in this
setting is just one example of the false demarcation line between
objective and subjective.
When an alternative hypothesis exists—and that's the usual situation of science: otherwise, why test?—Bayes factors can be easily reported. Bayes factors permit individual scientists and consumers to use prior information or the principle of insufficient reason together with new evidence to compute the posterior probability that a given hypothesis, H, is true (or, to repeat, useful or persuasive) based on the prior probability that they assign to each hypothesis. After all—fortunately—we do not have to begin every new observation or experiment from tabula rasa; we know some stuff, but we want to know more stuff, however imperfectly. Bayes factors add that information into the calculation comparing the likelihood of alternative hypotheses. For example, Bayes factors provide a clear interpretation of the evidence contained in the data in favor of or against the null: a Bayes factor of 10 simply means that the data were 10 times more likely under the alternative hypothesis than they were under the null hypothesis. Better than mushy p's.
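The arithmetic is short enough to fit in a minimal discrete example; the coin data and the simple alternative below are invented for illustration, not drawn from any study cited here:

```python
from math import comb

# Invented discrete data: 60 successes in 100 trials. H0: p = 0.5;
# H1: p = 0.6, a simple pre-specified alternative (both illustrative).
k, n = 60, 100

def likelihood(p):
    """Binomial probability of the observed data given success rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

bayes_factor = likelihood(0.6) / likelihood(0.5)  # evidence for H1 over H0

prior_odds = 1.0  # indifferent 1:1 prior between the hypotheses
posterior_odds = bayes_factor * prior_odds
posterior_p_h1 = posterior_odds / (1 + posterior_odds)

print(f"Bayes factor (H1 vs H0): {bayes_factor:.1f}")
print(f"posterior P(H1 | data):  {posterior_p_h1:.2f}")
```

Multiplying the Bayes factor by the prior odds and renormalizing yields the posterior probability of the hypothesis—the quantity a p-value cannot deliver.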
13. G-2 Cooperate Up, Down, And Across (Networks
And Value Chains)
Cooperation was a hallmark of Irish agricultural development (economic historians will think of Horace Plunkett, John Bennett, William Gosset and others), the growth of the Guinness brewery, and—in one very large positive externality—the development of modern statistics. Cooperation includes the human side, such as the academic Karl Pearson inviting the unknown brewer, Mr. Gosset, to London for sabbatical in 1906–1907 to work and study at University College London Biometrics Lab. Despite a strong-man reputation which Guinness has held in some quarters, for trying to monopolize, the legendary brewery cooperated up and down supply chains, and sometimes to a remarkable extent. The 40-something years of repeated experiments on barley, already mentioned, is just one example: Guinness subsidized one-half (50%) of barley and malt investment by the Irish Department of Agriculture. There is something to be learned from that: remember the yield and net profit figures shared by Gosset above. Guinness sending
6Lavine and Schervish (1999) caution that Bayes factors can sometimes lead to incoherence in the technical statistical sense of that term.
scientic brewers (Gosset was hardly the only one) on sab-
batical for postgraduate studies in statistics and chemistry and
genetics is another example. Sharing seeds and technical know-
how with the Carlsberg Brewery, in Denmark, is another (and
Carlsberg reciprocated generously by oering a stellar brewing
barleycalled“Prentice”forGuinnesstotry:itturnedoutto
be the mother of the greatest barley in early and midcentury
Europe).Onecouldwriteabookontheimportanceofcoop-
eration to the history and development of science and statistics.
And nally:
14. G-1 Answer The Brewer’s Original Question (“How
Should You Set The Odds?”)
Finally, how should you set the odds? As a 28-year-old brewer and self-trained statistician told Karl Pearson way back in 1905: "the degree of certainty to be aimed at must depend on the pecuniary advantage to be gained by following the result of the experiment, compared with the increased cost of the new method, if any, and the cost of each experiment." Add to Gosset's pecuniary advantage "lives saved," "jobs gained," "racism abolished," "health crisis averted," and the like and we have the makings of a general approach.
Acknowledgment
Sincere thanks to the editors and reviewers, together with Roosevelt colleagues Gary Langer and Justin Shea, for commenting on a previous version of this paper. At the Guinness Storehouse Museum (Diageo) I have enjoyed for many years the benefit of assistance from archivist Eibhlin Colgan. The seeds of the present paper were sown during a 2017 sabbatical visit at Trinity College Dublin (TRISS Research Institute) and the University of Oxford, New College. Many thanks to Ronan Lyons (Trinity College) and to Warden Miles Young (New College) for making those visits possible. Any errors are my own.
References
Banerjee, A., and Duflo, E. (2011), Poor Economics, New York: Public Affairs. [286]
Carlin, B., and Louis, T. (2008), Bayes and Empirical Bayes Methods for Data Analysis (3rd rev. ed.), London: Chapman and Hall/CRC Press. [288]
De Finetti, B. (1971 [1976]), Comments on Savage's "On Rereading R. A. Fisher," Annals of Statistics, 4, 486–487.
Deming, W. E. (1938 [1943]), Statistical Adjustment of Data, New York: Dover. [285]
——— (1961), Sample Design in Business Research, New York: Wiley.
——— (1982), Out of the Crisis, Cambridge, MA: MIT Center for Advanced Engineering Study.
Eisenberger, M. A., et al. (1998), "Bilateral Orchiectomy with or Without Flutamide for Metastatic Prostate Cancer," New England Journal of Medicine, 339, 1036–1042. [288]
Fisher, R. A. (1925 [1928]), Statistical Methods for Research Workers, Edinburgh, UK: Oliver and Boyd. [281,282]
——— (1933), "The Contributions of Rothamsted to the Development of the Science of Statistics," Annual Report of Rothamsted Experimental Station, 43–50. [282]
——— (1935), The Design of Experiments, Edinburgh, UK: Oliver & Boyd. [282]
Frank, R. (2017), Microeconomics and Behavior (9th ed.), New York: McGraw-Hill. [284]
Glewwe, P., Park, A., and Zhao, M. (2012), "Visualizing Development: Eyeglasses and Academic Performance in Rural Primary Schools in China," Working Paper WP12-2, Center for International Food and Agricultural Policy, University of Minnesota. [286]
Goodman, S. (2017), "Why is Eliminating P-Values So Hard? Reflections on Science and Statistics," ASA Symposium on Statistical Inference, Oct. 11–13, 2017.
——— (2002), "A Comment on Replication, P Values, and Evidence," Statistics in Medicine, 11, 875–879. [281]
Gosset, W. S. (1904), "The Application of the 'Law of Error' to the Work of the Brewery," Laboratory Report, 8, Arthur Guinness & Son, Ltd., Diageo, Guinness Archives, 3–16 and unnumbered appendix. [287]
——— (1905), Letter from W. S. Gosset to K. Pearson, Guinness Archives, GDB/BRO/1102. [283,284]
——— (1936), "Co-Operation in Large-Scale Experiments," Supplement to the Journal of the Royal Statistical Society, 3, 115–136.
——— (1962), Letters of William Sealy Gosset to R. A. Fisher, Vols. 1–5, Eckhart Library, University of Chicago. Private circulation.
Greenland, S., et al. (2016), "Statistical Tests, P-values, Confidence Intervals, and Power: A Guide to Misinterpretations," The American Statistician, 70, 1–12. [281]
Hotelling, H. (1930), "British Statistics and Statisticians Today," Journal of the American Statistical Association, 25, 186–190. [282]
Jeffreys, H. (1939 [1961]), Theory of Probability (3rd ed.), London: Oxford University Press. [283,284,288]
Johnson, V. (2013), "Revised Standards for Statistical Evidence," PNAS, 110, 19313–19317. [288,289]
Kahneman, D. (2011), Thinking Fast and Slow, New York: Farrar, Straus and Giroux. [281]
Lavine, M., and Schervish, M. (1999), "Bayes Factors: What They Are and What They Are Not," The American Statistician, 53, 119–122. [289]
Lew, M. (2012), "Bad Statistical Practice in Pharmacology (And Other Basic Biomedical Disciplines): You Probably Don't Know P," British Journal of Pharmacology, 166, 1559–1567.
Lindley, D. (1991), Making Decisions, New York: Wiley. [284,285]
Manski, C. (2018), "Treatment Choice with Trial Data: Statistical Decision Theory Should Supplant Hypothesis Testing," The American Statistician, this issue, DOI: 10.1080/00031305.2018.1513377. [284]
McCloskey, D. N., and Ziliak, S. T. (1996), "The Standard Error of Regressions," Journal of Economic Literature, 34, 97–114.
——— (2010), Brief of Amici Curiae Statistics Experts Professors Deirdre N. McCloskey and Stephen T. Ziliak in Support of Respondents, Matrixx Initiatives Inc. et al. v. Siracusano et al. (No. 09–1156, 22 pp.), Washington, DC: Supreme Court of the United States, ed. Edward Labaton et al., Counsel of Record.
McCloskey, D. N., and Ziliak, S. T. (2009), "Signifying Nothing: Reply to Hoover and Siegler," Journal of Economic Methodology, 15, 39–55.
Meyers, J., Sacks, G., van Es, H., and Vanden Heuvel, J. (2011), "Improving Vineyard Sampling Efficiency via Dynamic Spatially Explicit Optimisation," Australian Journal of Grape and Wine Research, 17, 306–315. [285]
Neyman, J. (1934), "On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection," Journal of the Royal Statistical Society, 97, 558–625. [285]
Pearson, E. S. (1939), "'Student' as Statistician," Biometrika, 30, 210–250. [286]
——— (1990), Student: A Statistical Biography of William Sealy Gosset, eds. R. L. Plackett and G. A. Barnard, Oxford: Clarendon Press. [282]
Pierson, S., et al. (2017), "Recommendations to Funding Agencies for Supporting Reproducible Research," American Statistical Association, available at https://www.amstat.org/ASA/News/ASA-Develops-Reproducible-Research-Recommendations.aspx [281,286]
Press, S. J. (2003), Subjective and Objective Bayesian Statistics, New York: Wiley. [288]
Press, S. J. (1972 [2005]), Applied Multivariate Analysis: Using Bayesian and Frequentist Methods of Inference (2nd ed.), Mineola, NY: Dover. [284,285]
Ramsey, F. P. (1926), "Truth and Probability," reprinted in H. Kyburg and H. Smokler, eds., Studies in Subjective Probability (New York: R. E. Krieger, 1980), 25–52. [284,285]
Rothman, K. J., Johnson, E. S., and Sugano, D. S. (1999), "Is Flutamide Effective in Patients with Bilateral Orchiectomy?" Lancet, 353, 1184. [287,288]
Savage, L. (1954), The Foundations of Statistics, New York: Dover. [285]
——— (1971 [1976]), "On Re-Reading R. A. Fisher," Annals of Statistics, 4, 441–500.
Soyer, E., and Hogarth, R. (2011), "The Illusion of Predictability: How Regression Statistics Mislead Experts," International Journal of Forecasting, 28, 695–711. [288]
Student (1907), "On the Error of Counting with a Haemacytometer," Biometrika, 5, 351–360.
——— (1908a), "The Probable Error of a Mean," Biometrika, VI, 1–24. [281,287]
——— (1908b), "The Probable Error of a Correlation Coefficient," Biometrika, 2/3, 300–310.
——— (1923), "On Testing Varieties of Cereals," Biometrika, 15, 271–293. [284,285]
——— (1925), "New Tables for Testing the Significance of Observations," Metron, V, 105–108. [281]
——— (1938), "Comparison between Balanced and Random Arrangements of Field Plots," Biometrika, 29, 363–378. [284,288]
——— (1942), Student's Collected Papers, eds. E. S. Pearson and J. Wishart, London: Biometrika Office. [282]
Supreme Court of the United States (2011), "Matrixx Initiatives, Inc., et al., No. 09–1156, Petitioner v. James Siracusano et al.," On Writ of Certiorari to the United States Court of Appeals for the Ninth Circuit, March 22nd, 25 pp., syllabus.
Taleb, N. N. (2018), Skin in the Game: Hidden Asymmetries in Daily Life, New York: Random House. [282,284]
The Guinness Archives (Diageo), Guinness Storehouse Museum, Dublin; Special Collections Library, University College London; Cork County (Ireland) Archives; Museum of English Rural Life; National Library of Ireland; University of Oxford, Bodleian Library and New College Library; Winchester College (UK) Archives; and University of Chicago, Crerar Library, Eckhart Library, and Regenstein Library.
Wasserstein, R., and Lazar, N., eds. (2016), “ASA Statement on Statistical Significance and P-Values,” The American Statistician, 70, 129–132. [281]
Ziliak, S. (2008), “Guinnessometrics: The Economic Foundation of ‘Stu-
dent’s’ t,” Journal of Economic Perspectives, 22, 199–216. [281,282,287]
(2010a), “The Validus Medicus and a New Gold Standard,” The Lancet, 376, 324–325. [282,283]
(2010b), “Significant Errors—Reply to Stephen Senn,” The Lancet, 376, 1391. [282,283]
(2014), “Balanced versus Randomized Field Experiments in Economics: Why W. S. Gosset Matters,” Review of Behavioral Economics, 1, 167–208. [282]
(2016), “The Significance of the ASA Statement on Statistical Significance and P-Values,” The American Statistician, 70, 1–2.
Ziliak, S., and McCloskey, D. (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives, Ann Arbor, MI: University of Michigan Press. [281,282,283,287]
Ziliak, S., and Teather-Posadas, E. (2016), “The Unprincipled Randomization Principle in Economics and Medicine,” in Oxford Handbook of Professional Economic Ethics, eds. G. DeMartino and D. McCloskey, Oxford: Oxford University Press, 423–452. [282,286]
... However, significance testing is not the only valid method for separating signal (in this case the effect of restoration) from noise, and in fact in some cases it may be inappropriate (Wasserstein, Schirm and Lazar, 2019). There is widespread agreement amongst statisticians that one should not conclude that an effect does not exist only because it was not statistically significant; instead, results should be contextualised in the range of existing knowledge, and in doing so the level of effect that is scientifically or practically meaningful can be determined (Pogrow, 2019; Ziliak, 2019). Following this philosophy, a qualitative assessment of trends was able to reveal useful information about the effect of wood restoration on transport and storage. ...
Thesis
Full-text available
Poor water quality is a ‘wicked problem’ – an uncertain and complex problem with no optimal solution – which poses risks to planetary and public health. Nature-based solutions (NBS) are vital to address water quality challenges on a sufficient spatial and temporal scale to realise long-term water security. The restoration of instream wood has been recognised as a particularly promising NBS to nutrient pollution, one of the most pervasive water quality challenges. Research was conducted to address knowledge gaps about the coupled hydrological and biogeochemical processes that control nutrient removal in the river corridor, and to evaluate the efficacy of instream wood restoration in different environmental settings. A laboratory experiment evaluated protocols of the resazurin-resorufin smart tracer system, which can be used to measure coupled hydrological and biogeochemical processes, showing that concentrations can change by up to 22.5% in 24 hours but in certain conditions samples can be stored for up to 14 days, increasing the geographical and experimental scope in which it can be applied. A microcosm experiment showed that streambed wood can lead to significant increases in microbial metabolic activity, nitrate removal rate and greenhouse gas production. This demonstrates the often-neglected contribution of streambed wood to fundamental biogeochemical processes and the impacts on associated ecosystem (dis)services, with consequences for global models of carbon and nitrogen cycles, and for restoration practice. A before-after-control-impact field experiment, using conservative tracer methods coupled with a transport and storage model analysis, investigated the effects of installing instream wood in a lowland sandy stream on transient storage. The results suggest that in a lowland stream wood restoration could decrease transient storage, contrary to what has been observed in upland settings.
For the first time, insights from hyporheic zone research are distilled, synthesised, and presented in a framework which is suitable to directly inform river restoration design.
... Gosset was a pioneer of modern statistics in small sample experimental design and analysis. As a beer brewer at Guinness, he developed practical approaches to experimentation to compare barley varieties and beer brewing practices [16]. ...
Article
Full-text available
To derive insights from data, researchers working on agricultural experiments need appropriate data management and analysis tools. To ensure that workflows are reproducible and can be applied on a routine basis, programmatic tools are needed. Such tools are increasingly necessary for rank-based data, a type of data that is generated in on-farm experimentation and data synthesis exercises, among others. To address this need, we developed the R package gosset, which provides functionality for rank-based data and models. The gosset package facilitates data preparation, modeling and results presentation stages. It introduces novel functions not available in existing R packages for analyzing ranking data. This paper demonstrates the package functionality using the case study of a decentralized on-farm trial of common bean (Phaseolus vulgaris L.) varieties in Nicaragua.
... The sample size (N = 19) could be viewed as a further limitation; however, this was exploratory research testing the efficacy of a new informant interview model (RWITS-US) through the use of a novel experimental role-play paradigm during the COVID-19 pandemic (which brought about enforced new ways of working). In particular, our study was designed to inform practitioner application and, when conducting research of this nature, it is often preferable to utilise a smaller sample whilst prioritising ecological validity and replicability (see Ziliak, 2019). Ultimately, caution must be employed when drawing overall conclusions from a single study of this nature. ...
Article
The covert use of civilian informants leaves law enforcement agencies open to accusations of unethical conduct. The use of a structured interview protocol is a recognised method of promoting ethical interactions between police and public citizens, however, there is no known interview model specifically designed to meet informant handler objectives. The current study adopts a holistic view of the interaction between ‘informant’ and ‘handler’ to develop a bespoke informant interview model (RWITS-US: Review and Research, Welfare, Information, Tasking, Security, Understanding Context, Sharing). This model is compared to the PEACE model of interviewing as part of a novel experimental paradigm using mock-informants (N = 19), measuring levels of motivation, rapport, cooperation and intelligence gain. Results indicate that the RWITS-US model generated significantly greater levels of self-reported rapport without having any detrimental effect on the other measured variables. Whilst the results are encouraging, we suggest that the RWITS-US model should be tested in handler training environments before being recommended for widespread use in the field.
... 47 For this reason, the interpretation of a result must be focused mainly on the effect size's confidence interval and statistical power, which can be improved (confidence interval narrowed and power increased) with an appropriate sample size. 46,48 The single P-value is not the probability of whether a difference exists, and according to the American Statistical Association, it can lead to erroneous beliefs and poor decision-making. 44 A conclusive result was obtained with the ormocer experimental group (power of 81.6%, and small effect size Cohen's H of 0.413), favoring the control group. ...
Article
Full-text available
Purpose: To evaluate the clinical longevity of bulk-fill resins and ormocer composites compared to conventional nanofill and nanohybrid resins in posterior permanent teeth. Methods: PubMed, Web of Science, Scopus, Science Direct, Cochrane Library, and Scielo were electronically searched for randomized clinical trials, without language restrictions. The extracted data were analyzed using Review Manager, comparing the clinical behavior of bulk-fill or ormocer restorations with nanofill or nanohybrid resins. Statistical analysis was performed with a significance level of 5% for all analyses (P = 0.05). The risk of bias was assessed using the Cochrane assessment tool. Results: 11 randomized clinical trials were included, with an average follow-up time of 40.36 months. A total of 812 restorations were evaluated and 58 failures were analyzed: 18 of the 253 bulk-fill restorations (7.11%), 21 of the 173 (12.3%) ormocer restorations, and 20 of the 386 (5.18%) control group (nanofill or nanohybrid composites) restorations failed. In the meta-analysis, there was no significant difference between the bulk-fill and the control group (statistical power = 24.38%; P = 0.206; CI = 95%); whereas, when comparing between ormocer and control group, the control group exhibited better performance (statistical power = 81.62%; P = 0.0042; CI = 95%). Clinical significance: Conventional nanofill and nanohybrid resins exhibited better clinical longevity than ormocer composites in posterior restorations, but when compared to bulk-fill, they had similar performance.
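The Cohen's h cited in the snippet above is the standard arcsine-transformed difference between two proportions. A minimal sketch follows; the inputs are the pooled failure rates quoted in the abstract (12.3% vs. 5.18%), used purely for illustration — the study's own h of 0.413 was evidently computed from different pooled inputs, so this sketch is not expected to reproduce it.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for two proportions, via the
    arcsine (angular) transformation phi = 2*asin(sqrt(p))."""
    phi1 = 2 * math.asin(math.sqrt(p1))
    phi2 = 2 * math.asin(math.sqrt(p2))
    return abs(phi1 - phi2)

# Illustrative only: pooled failure rates from the abstract above
h = cohens_h(0.123, 0.0518)
print(round(h, 3))
```

Equal proportions give h = 0; by the usual rule of thumb, h near 0.2 is a small effect.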
... But we stuck together and to the evidence and truth, emphasizing economic and substantive significance and urging abandonment of statistical significance. Our view is starting to prevail, including in the Supreme Court of the United States (McCloskey and Ziliak 2010) and at the American Statistical Association (Ziliak 2019; 2016). We are proud of many things, The Cult of Statistical Significance (Ziliak and McCloskey 2008) and "The Standard Error of Regressions" (McCloskey and Ziliak 1996) especially, an article which has been imitated and applied by other researchers to analyze the use of statistical significance in every field of science and business and governance, from agriculture and medicine to accounting and law. ...
Article
Full-text available
Chapter
This opening chapter lays the groundwork for a comprehensive exploration of crowd behavior, addressing the rationale for its continued study despite extensive existing literature. It suggests that crowds are intricate, non-linear amalgamations of individuals, often yielding surprising outcomes despite their apparent simplicity. Central to this discussion is the concept of “learned ignorance,” a mode of inquiry that eschews absolute truth. Emphasis is placed on uncovering foundational principles associated with crowds, primarily focusing on identifying opportunities for the individual to make contrarian decisions against prevailing crowd sentiment. The chapter culminates in an overview of the book's structure, inviting the reader to embark on a journey of contemplation and introspection amidst the turbulent sea of collective existence.
Technical Report
Full-text available
I do not imagine that the concept of statistical significance is strange or new to anyone working in statistics, econometric studies, or quantitative techniques generally. In this translated version of the original report "Moving to a World Beyond p < 0.05" — a summary of 43 scientific articles by today's leading mathematical and applied statisticians, and the subject of a scientific meeting organized by the American Statistical Association — both the general reader and the specialist will find ideas, scientific debate, and a radical reform movement in the field of statistical inference. In this context, I felt obliged to produce an Arabic translation (with revision and clarification of the technical ideas) of this important report. The translated report will be a foundational building block for thinking and discussion about the state of statistical practice in the Arab countries, at both the academic and official levels. The central idea of the report and its articles is the rebuilding and correction of accumulated errors and the misuse of statistical inference, especially on the applied, practical side of statistics. Our work consisted mainly of translating the report into Arabic with care for the technical concepts, together with explanations of some of the terms and ideas appearing in the text. The work was long and tiring, and I do not claim it is perfect (especially on the linguistic side of the translation). As a reminder, the full report and all 43 articles are freely available for download, and I advise readers and specialists to consult them all.
Chapter
In this paper, Asaph Keikara Muhumuza, Karl Lundengård, Anatoliy Malyarenko, Sergei Silvestrov, John Magero Mango, and Godwin Kakuba demonstrate the extreme points of the Wishart joint eigenvalue probability distributions in higher dimensions, based on the boundary points of the symmetric cones in Jordan algebras. The extreme points of the Vandermonde determinant are defined to be a set of boundary points of the symmetric cones that occur in both the discrete and continuous parts of the Gindikin set. The symmetric cones form a basis for the construction of the degenerate and non-degenerate Wishart ensembles in Herm(m,C), Herm(m,H), and Herm(3,O), which denote, respectively, the Jordan algebras of all Hermitian matrices of size m×m with complex entries, with entries in the skew field H of quaternions, and with entries in the algebra O of octonions.
Article
JEDI stands for justice, equity, diversity, and inclusion. JEDI is a global movement, with networks connecting academic, business, and grass roots organizations. A definition of “JEDI statistics” and “impermissible inequality” is proposed and illustrated with stories from government work, university teaching, and academic research regarding race, ethics, and social justice in statistics. I recently had the pleasure of discussing these ideas on a panel with Wendy Martinez, Safiya Umoja Noble, Donna LaLonde and participants in a plenary session of SDSS 2021, “Equitable and Inclusive Data and Technology” (https://ww2.amstat.org/meetings/sdss/2021/onlineprogram/AbstractDetails.cfm?AbstractID=309823). I thank them for their comments, and Wendy Martinez, notably. There are in front of us unlimited possibilities for good by exploring the Venn diagram-overlaps of JEDI philosophy and statistics, JEDI and economic statistics, JEDI and department culture; JEDI medicine, JEDI coding, JEDI wealth and ownership, JEDI history and the historians of statistics, and so forth, striding toward our future for an antiracist and inclusive statistics and society. (To explain a little more, in 1996 I earned a PhD Certificate in the Rhetoric of the Human Sciences at the same time I completed the PhD in Economics. I teach Rhetoric and Moral Philosophy, too, and since 2007 I have taught annually a course on “Theories of Justice in Economics and Philosophy” to PhD, MA, and BA students at Roosevelt University and in short courses at several universities in Europe.)
Article
Full-text available
A central objective of empirical research on treatment response is to inform treatment choice. Unfortunately, researchers commonly use concepts of statistical inference whose foundations are distant from the problem of treatment choice. It has been particularly common to use hypothesis tests to compare treatments. Wald’s development of statistical decision theory provides a coherent frequentist framework for use of sample data on treatment response to make treatment decisions. A body of recent research applies statistical decision theory to characterize uniformly satisfactory treatment choices, in the sense of maximum loss relative to optimal decisions (also known as maximum regret). This article describes the basic ideas and findings, which provide an appealing practical alternative to use of hypothesis tests. For simplicity, the article focuses on medical treatment with evidence from classical randomized clinical trials. The ideas apply generally, encompassing use of observational data and treatment choice in nonmedical contexts.
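The decision-theoretic idea the abstract describes can be shown in a toy sketch: pick the treatment whose worst-case regret (loss relative to the best decision in each candidate state of the world) is smallest. All numbers below are invented for illustration, and this is a bare Wald-style skeleton, not Manski's actual procedure for using trial data.

```python
# Toy minimax-regret treatment choice under ambiguity about the
# true state. Higher outcome() is better; states enumerate the
# candidate truths the decision maker cannot distinguish.

def minimax_regret(choices, states, outcome):
    """Return the choice whose maximum regret across states is smallest."""
    best_in_state = {s: max(outcome(c, s) for c in choices) for s in states}
    def max_regret(c):
        return max(best_in_state[s] - outcome(c, s) for s in states)
    return min(choices, key=max_regret)

# Hypothetical states: (mean success under treatment A, under treatment B)
states = [(0.5, 0.4), (0.5, 0.9)]
choice = minimax_regret(["A", "B"], states,
                        lambda c, s: s[0] if c == "A" else s[1])
print(choice)  # prints "B": its worst-case regret is 0.1 vs. 0.4 for A
```

Note the contrast with hypothesis testing: B is chosen not because its superiority is "significant," but because choosing A risks a much larger loss if the second state is true.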
Article
Full-text available
Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so, and yet these misinterpretations dominate much of the scientific literature. In light of this problem, we provide definitions and a discussion of basic statistics that are more general and critical than typically found in traditional introductory expositions. Our goal is to provide a resource for instructors, researchers, and consumers of statistics whose knowledge of statistical theory and technique may be limited but who wish to avoid and spot misinterpretations. We emphasize how violation of often unstated analysis protocols (such as selecting analyses for presentation based on the P values they produce) can lead to small P values even if the declared test hypothesis is correct, and can lead to large P values even if that hypothesis is incorrect. We then provide an explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with guidelines for improving statistical interpretation and reporting.
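One mechanism the abstract names — selecting analyses for presentation based on the P values they produce — is easy to see in a small simulation: report the smallest of several p-values computed on pure noise, and "significant" findings appear far more often than the nominal 5%. This is a synthetic sketch with arbitrary parameters, using a normal approximation in place of the exact t distribution.

```python
import math
import random
import statistics

def two_sided_p(xs):
    """Approximate two-sided p-value for H0: mean = 0, using a
    one-sample t statistic with a normal-tail approximation
    (adequate here: Gaussian data, moderate n)."""
    n = len(xs)
    t = statistics.mean(xs) / (statistics.stdev(xs) / math.sqrt(n))
    return math.erfc(abs(t) / math.sqrt(2))

random.seed(1)
trials, analyses, n = 2000, 5, 30
hits = 0
for _ in range(trials):
    # Analyst runs 5 analyses of null data, keeps the smallest p
    pmin = min(two_sided_p([random.gauss(0, 1) for _ in range(n)])
               for _ in range(analyses))
    hits += pmin < 0.05
print(hits / trials)  # far above the nominal 0.05
```

With five independent looks, roughly 1 - 0.95**5, about 23%, of null experiments yield at least one p < 0.05.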
Chapter
This entry provides an overview of the United States Supreme Court. Specifically, it discusses the creation and jurisdiction of the Court, the composition and procedures of the Court, and a history of case law that the Court has established.
Chapter
This chapter is a reprint of Frank P. Ramsey’s seminal paper “Truth and Probability,” written in 1926 and first published posthumously in 1931 in The Foundations of Mathematics and Other Logical Essays, ed. R. B. Braithwaite, London: Routledge & Kegan Paul Ltd. The paper lays the foundations for the modern theory of subjective probability. Ramsey argues that degrees of belief may be measured by the acceptability of odds on bets, and provides a set of decision-theoretic axioms which jointly imply the laws of probability.
Article
Pooling data from multiple sources plays an increasingly vital role in today’s world. By using a popular Sudoku game, we propose a new type of design, called a Samurai Sudoku-based space-filling design to address this issue. Such a design is an orthogonal array-based Latin hypercube design with the following attractive properties: (i) the complete design achieves uniformity in both univariate and bivariate margins; (ii) it can be divided into groups of subdesigns with overlaps such that each subdesign achieves uniformity in both univariate and bivariate margins; and (iii) each of the overlaps achieves uniformity in both univariate and bivariate margins. Examples are given to illustrate the properties of the proposed design, and to demonstrate the advantages of using the proposed design for pooling data from multiple sources.
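For readers unfamiliar with the underlying object, a plain Latin hypercube design can be sketched as follows: n points in d dimensions such that each one-dimensional margin has exactly one point per equal-width stratum. This is the textbook construction only, not the proposed design — the paper adds orthogonal-array structure and the Sudoku-style overlapping subdesigns on top of it.

```python
import random

def latin_hypercube(n: int, d: int, rng: random.Random):
    """One random Latin hypercube sample: in every dimension,
    each of the n strata [k/n, (k+1)/n) contains exactly one point."""
    cols = []
    for _ in range(d):
        strata = list(range(n))
        rng.shuffle(strata)                      # assign strata to points
        cols.append([(k + rng.random()) / n for k in strata])
    return [tuple(col[i] for col in cols) for i in range(n)]

rng = random.Random(0)
pts = latin_hypercube(9, 2, rng)

# Univariate uniformity check: each of the 9 strata holds one point
for dim in range(2):
    assert sorted(int(p[dim] * 9) for p in pts) == list(range(9))
print(len(pts))  # prints 9
```

Property (i) in the abstract strengthens this: an orthogonal-array-based variant also spreads points evenly over two-dimensional margins, which the plain construction above does not guarantee.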