Scientometrics
https://doi.org/10.1007/s11192-024-05104-1
Opium in science and society: numbers and other quantifications
Lutz Bornmann1 · Julian N. Marewski2
Received: 11 October 2023 / Accepted: 1 July 2024
© The Author(s) 2024
Abstract
In science and beyond, quantifications are omnipresent when it comes to justifying
judgments. Which scientific author, hiring committee member, or advisory board panelist has not been confronted with page-long publication manuals, assessment reports, or evaluation guidelines calling for p-values, citation rates, h-indices, or other numbers to judge the ‘quality’ of findings, applicants, or institutions? Yet, many of those of us relying on and calling for quantifications may not understand what information numbers can convey, and what they cannot. Focusing on the uninformed usage of bibliometrics as a worrisome outgrowth of the increasing quantification of science, in this opinion essay we place the abuse of quantifications into historical contexts and trends. These
are characterized by mistrust in human intuitive judgment, obsessions with control and
accountability, and a bureaucratization of science. We call for bringing common sense
back into scientific (bibliometric-based) judgment exercises. Despite all number crunching,
many judgments—be it about empirical findings or research institutions—will neither be
straightforward, clear, and unequivocal, nor can they be ‘validated’ or ‘objectified’ by
external standards. We conclude that assessments in science ought to be understood as and
be made as judgments under uncertainty.
Keywords Research evaluation · Bibliometrics · Judgements
Lutz Bornmann and Julian N. Marewski have contributed equally to this work.
* Lutz Bornmann
bornmann@gv.mpg.de
Julian N. Marewski
julian.marewski@unil.ch
1 Science Policy and Strategy Department, Administrative Headquarters of the Max Planck Society, Hofgartenstr. 8, 80539 Munich, Germany
2 Faculty of Business and Economics, University of Lausanne, Quartier UNIL-Dorigny, Bâtiment Internef, 1015 Lausanne, Switzerland
Introduction
The focus of this opinion essay is a powerful means to shape—and often cloud—people’s
judgments: quantifications1. Numbers have been turned into, so we argue in allusion to
Karl Marx’ famous statement about religion, the new opium of the people, namely when it
comes to making judgments about uncertain things. Numbers affect the public as much as
those experts who dedicate their lives to producing, processing, and scrutinizing quantifications: scientists. Statistics-driven thinking is characteristic of modern society (Desrosières, 1998); “...we are living in a world informed by indicators” (Zenker, 2015, p. 103). Give people numbers, and they will have something to hold on to, to be blinded by, or to argue against. In one way or another, “[n]umbers rule the world” (Gigerenzer et al., 1989, p.
235):
1. Numbers can dramatically change how we make judgments under uncertainty
about multiple aspects of science and society. Particularly p-values (and nowadays
increasingly Bayesian statistics, too) have metamorphosed the way psychologists and
other social scientists make inferences (for basic readings, see Gigerenzer et al., 1989,
and Gigerenzer & Murray, 1987; for a critical discussion of the Bayesian twist, see
Gigerenzer & Marewski, 2015). When it comes to building theories of cognition, “…
methods of statistical inference have [even] turned into metaphors of mind” (Gigerenzer,
1991, p. 254). Bibliometric statistics transform judgments about scientists, their work
and institutions, when it comes to ‘assessments’ of productivity, value, or quality:
quantitative science evaluation.
2. Decision making in science is concerned with uncertainty (Pfeffer et al., 1976; Salancik & Pfeffer, 1978). The routine use of quantifications to target uncertainty aids governance, fuels bureaucratization, and cements social conventions (e.g., h-indices to evaluate scientists and p-values to evaluate research findings). Bibliometric indicators seemingly help establish ‘objective’ facts. Indicators are used to “...produce relationships among
the things or people that they measure by sorting them according to a shared metric”
(Espeland, 2015, p. 59). Those, in turn, serve to justify decisions (e.g., about funding scientific work or hiring senior scientists), and, if need be, also put decision makers (e.g., scientists, administrators, or politicians) in a position to defend themselves (e.g., against being accused of having made arbitrary, biased, nepotistic, or otherwise flawed decisions). The problem with indicators is, however, that they measure but “do not explain” (Porter, 2015, p. 34). For those numbers to become meaningful, it is necessary to fill them with life – as can be done when scientific experts interpret them.
3. The advance of numbers as a substitute for judgment comes with rather old ideals, including those of ‘rational, analytic decision making’, dating back, at least, to the Enlightenment. The past century has seen a twist of those ideals, with psychological research (e.g., Kahneman et al., 1982) documenting how human judgment deviates from alleged gold standards for rationality, thereby seeking to establish intuition’s flawed nature (Hoffrage & Marewski, 2015). Views that human judgment cannot be trusted
certainly do not help when it comes to stopping the twentieth century’s Zeitgeist of using
1 An earlier draft version of this paper is available (Marewski & Bornmann, 2018).
seemingly objective (= judgment-free) statistics in research (e.g., p-values) or the more
recent digital wave of ‘objectifiers’ in science evaluation (e.g., h-indices).
4. In a way it is telling that the notion of objectivity in science itself is an object of study
and critical reflection (e.g., Douglas, 2004, 2009; Gelman & Hennig, 2017; Gigerenzer,
1993). Yet, there is more: John Q. Public—including ourselves—is prepared to trust and use quantitative data to understand and manage all kinds of objects and phenomena—from our finances to life expectancy and “human needs” (e.g., Glasman,
2020, p. 2), with the title of Porter’s (1995) classic, “Trust in numbers: The pursuit of
objectivity in science and public life”, beautifully reflecting more general trends that
engulf academic activities, including their evaluation.
5. Cohen (1994) titled a paper “The earth is round (p < .05)”. While the “[m]indless”
(Gigerenzer, 2004a, p. 587) abuse of p-values and seemingly judgment-free indicators
such as the h-index is nowadays prevalent in virtually all branches of academia, the
decision sciences, statistics, and their history inspire us to both question this state
of affairs and to point to antidotes against the harmful side effects of the increasing
quantification of science evaluation.2 In our view, the mindlessness can be overcome
if science evaluations are actually made and understood as good human judgments under uncertainty, where not everything is known or knowable, and where surprises can disrupt routines and other seeming givens (see e.g., Hafenbrädl et al., 2016). This view suggests that mistakes are inevitable and need to be managed, or that different statistical judgment tools ought to be chosen mindfully, in an informed way, as a function of the context at hand (see e.g., Gigerenzer, 1993, 2018). We believe that science evaluation under uncertainty may be aided if those using numbers (e.g., citation counts) in research evaluation have expertise in bibliometrics and statistics and are ideally active in the to-be-evaluated area of research. Such expertise can aid both (i) understanding when good human judgment ought to be trusted even when numbers speak against that judgment, and (ii) realizing why good human judgment and intuition are what matters, even when there is no number attached to them. Put differently, common sense
should rule numbers, not vice versa.
In what follows, we will first sketch out historical contexts and societal trends that
come with the increasing quantification of science and society. Second, we will turn to
those developments’ latest outgrowth: the exaggerated and uninformed use of bibliometric
statistics for research evaluation purposes. Third, we explore how the mindless use of
bibliometric numbers can be overcome. We close by calling for bringing common sense
back into scientific judgment exercises.
2 Gigerenzer (2004b) once wrote a paper titled “Striking a blow for sanity in theories of rationality”. Striking a blow for sanity in science evaluation is our goal in writing the present opinion essay. In pursuing
this goal, our essay does not aspire to put forward novel theses or other types of marketable, “distinctive
turf” (Mischel, 2008). In pulling together and spreading existing ideas, the goal is to increase awareness of
issues that affect the sciences as a whole. Researchers interested in more details, context, or cross-links are
invited to turn to earlier work on statistics, their history, and decision-making—most notably the writings
of Gigerenzer (e.g., 1993, 2002a, 2002b, 2004a, 2014, 2018; Gigerenzer & Marewski, 2015; Gigerenzer &
Murray, 1987; Gigerenzer et al., 1989). While Gigerenzer’s work formed an obvious source of inspiration
for us—not only here, but also elsewhere (e.g., Bornmann & Marewski, 2019)—it is needless to say that
many others have written on those and/or related points, too (e.g., Gelman & Hennig, 2017; Porter, 1992,
1993, 1995; Smaldino & McElreath, 2016; Weingart, 2005). We stand on the shoulders of giants, as the
proverbial expression goes.
Before we begin, let us add a commentary. One of us once co-edited a special issue on
human intuition. The issue’s introduction (Hoffrage & Marewski, 2015) tried to capture
the elusive nature of human intuition with contrasting poles, including Enlightenment
thinking and the “culture of objectivity” (p. 148) as well as poetry by polymath Johann
Wolfgang von Goethe and a painting by Romanticist artist Caspar David Friedrich. Yet,
while pictures, poetry, and other artwork (e.g., stories, films, songs) may trigger intuitions
about intuition, until about that time, the author had spent little time on reflecting that there
may be things numbers and algorithms cannot capture; to the contrary, in our respective fields we both repeatedly argued for approaching judgment quantitatively (e.g., through algorithmic models).3 So while we caricature the quest for quantification in this opinion essay, following the proverbial expression “Let any one … who is without sin be the first to throw a stone …” (The Bible, John, 8:7), we need to stone ourselves; and as suggested by
the essay’s title, we are, perhaps, stoned already.
Significant numbers
Quantifications aid governance
The quest for quantifications is not new. Numbers, written on papyrus, coins, or milestones, have aided in governing societies and their activities—ranging from trade to war—for thousands of years. The Roman Empire offers examples (see e.g., Vindolanda Tablets Online, 2018); and so does, for instance, Prussia later (e.g., Hacking, 1990). Quantifications became part of the Deoxyribonucleic Acid (DNA) of states (e.g., the German welfare state), making it possible to compute contributions to pension funds and insurance schemes; they served to record commercial and demographic activities as well as military assets. Indeed, the word statistics likely originates in states’ quest for data, with, for instance, the “English Political Arithmetic” (Desrosières, 1998, p. 23) and the German equivalent, “Statistik” (Desrosières, 1998, p. 16) being traceable to the 1600s (for more historical discussion, see e.g., Daston, 1995; Krüger et al., 1987; Porter, 1995).4
Big data is one of the latest reflections of the old proverbial insight ‘knowledge is
power’—an insight that exists in various forms (e.g., The Bible, Book of Proverbs, 24:5;
Hobbes, 1651, Part I-X) and languages (e.g., in German: ‘Wissen ist Macht’),5 but that may
gain yet other meanings with massive digitalization, in the future possibly implying e-governance, digital democracy, or the dictatorship of numbers (e.g., Helbing et al., 2017; Marewski & Hoffrage, 2021; O’Neil, 2016). This is a development one could subsume under a motto commonly (mis)attributed to Galileo Galilei (e.g., by Hoffrage & Marewski, 2015, p. 149): “measure what is measurable, and make measurable what is not so”6; or,
4 On the etymology of the word, the Merriam-Webster.com dictionary (2023) states: “German Statistik
study of political facts and figures, from New Latin statisticus of politics, from Latin status state”.
5 While this expression exists in different forms with different connotations and backgrounds, it is often
prominently attributed to Francis Bacon. Yet that attribution may be incorrect, although indeed two variants
of it can be found in his work (Vickers, 1992).
6 The attribution to Galileo may not be correct (see Kleinert, 2009).
3 We also (try to) sell numbers in our non-academic lives. One of us finds himself bringing the latest bibliometric reports to the doctors who treat him; the other would do something similar with statistics from
medical studies.
reformulated in terms of science evaluation, evaluate what is evaluable and make evaluable
what is not.
Databases featuring numbers of publications, grants, and other ‘output’ can be used—
and if need be—instrumentalized for academic governance purposes. Much like outside of
science, even the mere ability—expertise—to quantify can be a source of power or claims
thereof. And this holds not only for the science evaluator, but also for researchers themselves. For instance, in the twentieth century, psychologists like Edward Thorndike spearheaded the quantification of their field; and that quantification came with a side effect. As Danziger
(1990) points out, “Quantitative data … could be transformed into…power for those who
controlled their production and interpreted their meaning to the nonexpert public” (p.
147), equating “[t]he keepers of that [quantitative] knowledge … [with] a new kind of
priesthood, which was to replace the traditional philosopher or theologian” (p. 147)—a
“religion of numbers” (p. 144). Conflicts such as those between quantitatively and qualitatively working social scientists are probably not alien to some readers of this essay either. Sir Ronald Fisher (1990a) put it bluntly in his “Statistical methods for research
workers”—a bible, published originally in 1925: “Statistical methods are essential to social
studies, and it is principally by the aid of such methods that these studies may be raised to
the rank of sciences” (p. 2). Numbers create science.
Quantifications offer seemingly universal and automatic means to ends
Using analysis and reason to understand (and rule) the world are old ideals: traces of
them can be found in the Enlightenment, an epoch that has brought forward thinkers
such as Immanuel Kant and Pierre-Simon de Laplace. Gottfried Wilhelm Leibniz
(1677/1951), an Enlightenment pioneer, pointed out that “…most disputes arise from
the lack of clarity in things, that is, from the failure to reduce them to numbers” (p. 24).
Arguing that “…there is nothing which is not subsumable under number” (p. 17), he
proposed to develop a “universal characteristic” (p. 17, in original fully capitalized) that
“…will reduce all questions to numbers…” (p. 24). Modern counterparts of such ideals
seem to be universalism and automatism (or some variant thereof; see e.g., Gigerenzer,
1987; Gigerenzer & Marewski, 2015; see also Gelman & Hennig, 2017). By automatic are meant ‘neutral’ measurements and quantitative evaluation procedures that—independently of the people (e.g., scientists, evaluators, judges) using them—should yield ‘unbiased, objective judgments’, say, for better decision making. One can think of such automatic processes as “mechanical” (Gigerenzer, 1993, p. 331) input–output relations. Universality complements mechanical automatism and refers to the pretension that the corresponding judgment procedures are serviceable for all problems. Omnipresent in
universal and automatic procedures are numbers. Numbers can be conveniently used across
contexts (one can enumerate anything). They seem to lend objectivity (e.g., to observations,
inferences, and decisions) that is independent of people (it does not matter who counts the
number of citations; everybody should arrive at the same number; see also Porter, 1993).
In scientific research, a prominent example of universalism and automatism is the
usage of ‘null hypothesis significance testing’ (NHST) for all statistical inferences (e.g.,
Gigerenzer, 2004a).7 Statistical inferences are judgments under uncertainty. In making
those judgments, social science researchers unreflectingly report p < 0.05, as if the p-value did not depend on their own intentions (see Kruschke, 2010, for dependencies of the
p-value), or as if that number were equally informative for all problems. As Meehl (1978)
pointed out almost half a century ago in an article titled “Theoretical risks and tabular
asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology”: “In the typical
Psychological Bulletin article … we see a table showing with asterisks … whether this or
that experimenter found a difference in the expected direction at the .05 (one asterisk), .01
(two asterisks!), or .001 (three asterisks!!) levels of significance” (p. 822).
The p-value fetish is not limited to old-school psychology. For instance, Gigerenzer and
Marewski (2015) report that an estimated 99 p-values were computed, on average, per
article in the Academy of Management Journal in 2012. Why would somebody compute
99 p-values? Habits and other factors may play a role in the sustained usage of significance
testing (e.g., Oakes, 1990). But such routine number crunching can also be linked to “…
the satisfying illusion of objectivity: [Historically,] [t]he statistical logic of analyzing data
seemed to eliminate the subjectivity of eyeballing and wishful distortion” (Gigerenzer,
1993, p. 331; see also Ziliak & McCloskey, 2012). A small number (p < 0.05) translates
into a giant leap towards objectivity in scientific (e.g., what to conclude from data) and
editorial decision-making (e.g., what papers to publish). In Danziger’s (1990) words: “By
the end of the 1930s … ‘Statistical significance’ had become a widely accepted token of
factual status for the products of psychological research” (p. 152).
But not only scientists use automatic, universal statistics to make seemingly objective
judgments; science itself is increasingly subjected to context-blind, number-driven
inference—“Surrogate science”, as Gigerenzer and Marewski (2015, p. 436) put it,
spreads fast and widely. Here, an important example of universalism is the routine reliance
on h-indices and journal impact factors (JIFs) by bibliometric laypersons for measuring
and making judgments about the ‘productivity’ or ‘quality’ of scientists, institutions,
and scientific outlets, independent of context. Context can be the discipline, the research
paradigm within a discipline, or the unit investigated (be it a teaching- or research-oriented
professor or institution). Automatism in research evaluation takes on the disguise of legal procedures that come, for instance, with faculty evaluation exercises. JIFs and h-indices should ‘objectively’ tell, independently of who conducts the evaluation, whether a scientist is worth hiring or merits a promotion.
How numbers seem to replace (bad) human judgement and intuitions
In 2011, a paper in the Journal of Personality and Social Psychology, “… reports 9
experiments, involving more than 1,000 participants, that test for retroactive influence by
‘time-reversing’ well-established psychological effects so that the individual’s responses
are obtained before the putatively causal stimulus events occur” (Bem, 2011, p. 407).
Quantifications communicate reassuring objectivity and help establish ‘facts’—
seemingly regardless of whether it comes to the supernatural, the divine, death, the
uncertain future or other more mundane things, and even more so when the numbers come
7 Note that Fisher’s null hypothesis testing is not the same as NHST, which mingles the rationales of different statistical frameworks proposed by Fisher and by Neyman and Pearson. Gigerenzer refers to this mix also
as the “null ritual” or “hybrid” (Gigerenzer, 2004a, p. 588; 2018, p. 202).
embedded into procedures (e.g., Gigerenzer et al., 1989). Indeed, nowadays, scientists,
politicians, doctors, and businesspersons evoke quantifications rather than hunches or
gut feel to motivate and justify their thoughts about scientific ideas, spending policies,
diseases, or mergers. h-indices and JIFs serve to infer scholars’ future performance,
motivating tenure decisions. We even use the absence of large numbers to make judgments,
such as when pointing out that few are known to have died from smoking (a few decades
ago), or from living close to nuclear power plants (still today).8 Quantifications, cast into
statistics and algorithms, seem to allow us to control the uncertainties of the future and to
(cold-bloodedly) justify present-day decisions. In short, calculation seems to have, for better or worse, replaced mere hunches, intuitions, feelings, and personal judgment—in
science and beyond (Hoffrage & Marewski, 2015).
This was not always so, and not in so many contexts—and one does not have to go back to times of shamanic rituals for cases in point. For instance, “…the rise of statistics in therapeutics was part of the process of objectivization through which science entered medicine … and diagnosis became increasingly independent of the patient’s own judgment” (Gigerenzer et al., 1989, p. 47). Before that, medicine was for centuries based on individual judgment rather than on averages and other (e.g., epidemiological)
statistics (for more discussion, see e.g., Gigerenzer, 2002b; Porter, 1993).
The link between calculation and intelligence has also changed—including the meaning of intelligence itself, which differs between the eighteenth and twentieth centuries (Daston,
1994). As Daston (1994) points out,
“The history of calculation in the Enlightenment is a chapter in the cultural history
of intelligence. Calculation had not yet become mechanical, the paradigmatic
example of processes that were mental but not intelligent. Indeed, eighteenth-century
philosophers conceived of intelligence and even moral sentiment to be in their
essence forms of calculation” (p. 185).
Daston (1994) adds that
“When Pierre-Simon Laplace described probability theory as ‘good sense reduced
to a calculus,’ he intended to disparage neither good sense nor probability theory
thereby… Yet by the turn of the nineteenth century, calculation was shifting its field
of associations, drifting from the neighborhood of intelligence to that of something
very like its opposite and sliding from the company of savants and philosophes into
that of unskilled laborers” (p. 186).
Actually, what is the meaning of ‘intelligence’ nowadays? And who is intelligent? Seemingly ‘objective’ assessments of aptitude have become almost synonymous with a number: IQ. Yet, intelligence and similar notions, measured by IQ and other indices, are
inventions of the twentieth century. They served, for instance, U.S. military recruitment,
immigration control, and horrifying policies, with low (e.g., IQ) scores offering arguments
(i) for limiting access to education or even (ii) for sterilization (see e.g., Gigerenzer et al.,
1989; Severson, 2011; Young, 1922). This example sadly illustrates how numbers can not
only displace individual, idiosyncratic judgment but also be turned into easy and universal
criteria to single out individuals.
8 Particularly when it comes to the link between smoking and disease, ‘science’ has served to cast doubt on that link (see Oreskes & Conway, 2011, for the tactics employed).
Seemingly irrational behaviors and subjective, biased cognitions, ranging from feelings
to intuitions, pose prominent targets for investigation when it comes to documenting,
correcting, and even exploiting them. Starting in the 1970s, Kahneman and Tversky’s heuristics-and-biases research program (e.g., Kahneman et al., 1982; Tversky &
Kahneman, 1974) brought irrational, error-prone, and faulty judgment and decision
making into thousands of journal pages, and onto practitioners’ agendas (Hoffrage &
Marewski, 2015; Lopes, 1991). Many of those judgment biases were defined based on
numbers, including by experiments showing how people’s judgments violated Bayes’ theorem—a seemingly universal, single benchmark for “…rationality for all contents and
contexts” (Gigerenzer & Murray, 1987, p. 179). In other research programs, statistics from
behavioral studies fueled similar conclusions, namely that irrational citizens need outside
help and steering.
The libertarian paternalist movement’s emphasis on nudging (ignorant) people (Thaler
& Sunstein, 2009) is one example (see Gigerenzer, 2015, for a critique). Also, caricatures
of homo economicus—an egoistic being by nature, who in the absence of punishment
and control will inevitably maximize her/his own interests, measured by utilities—fit the
widespread view that people’s subjective judgments cannot be trusted. Of course, utilities
can be expressed numerically such that, in principle, everything (e.g., costs and benefits of
crime, marriage, fertility, or discrimination) can be modelled with them (see Becker, 1976).
The most recent outgrowth of the mistrust in good human judgement could be the view that
artificial intelligence—systems based on codes and numbers—will soon outperform human
intelligence or the wishful belief that machines—algorithms operating on numbers—will
always be more ‘objective’ judges than humans (e.g., avoiding biases in hiring decisions).
A potential flip side of the coin: perhaps “Weapons of Math Destruction”—to borrow a
beautifully frightening term from O’Neil (2016, p. 3).
Science evaluation with quantifications
How numbers fuel the quest for objective, unbiased, and justifiable judgment
“The ideal of the classical natural sciences was to consider knowledge as independent of
the scientist and his measuring instruments”, so Gigerenzer (1987, p. 16) points out. His
piece (1987) on the “Fight against Subjectivity” (p. 11) illustrates how the use of statistics
became institutionalized in a social science: psychology. Experimenters invoked them to
make their claims independent (i) from themselves as well as (ii) from the human subjects
they studied. The fight against subjectivity is neither unique to psychology, nor have quests
for objectivity ended in the social sciences. Across disciplines, scientists use quantifications to make their claims about their findings and the value of their work appear independent of themselves and of context (see also Gelman & Hennig, 2017). Academia, so we argue, has been undergoing a significant transformation for many years: as much as quantifications have contributed to transforming other aspects of science and society, they shape science evaluation (Wilsdon et al., 2015). Indicators are used to strive for unbiased, fair, or legitimate judgments.
What is being evaluated varies, ranging from individual journal articles to different
‘producers of science’, including scientists or competing departments and universities. For
instance, one may use the number of citations a paper has accrued on Google Scholar to find
out whether that paper is worth reading and citing. Likewise, when it comes to justifying
promotions of assistant professors or to allocating limited amounts of funds to competing
departments, what frequently counts is the number of papers published in first quartile
journals of the Journal Citation Reports (Clarivate) or in the top 5 journals in economics
(American Economic Review, Econometrica, Journal of Political Economy, Quarterly
Journal of Economics, and Review of Economic Studies) (Heckman & Moktan, 2020).
Similar statistics serve science policy, with ‘taxpayers’ investments’ in academic institutions, personnel, and research seemingly calling for ‘objective’ indicators of success as justification. Research evaluation is characterized by heterogeneous practices (Hug, 2022). A prominent example (with changing practice over the last years) is the Research Excellence Framework (REF) of the United Kingdom. The REF is the United Kingdom’s “…system for assessing the excellence of research in UK higher education providers...” (REF 2029, 2024). It informs the public about the quality of British science (e.g., for 2014, “[t]he overall quality of submissions was judged … 30% world-leading … 46% internationally excellent…”, REF, 2014). The objectives of the framework are to:
• “provide accountability for public investment in research and produce evidence of the
benefits of this investment
• provide benchmarking information and establish reputational yardsticks, for use in the
higher education sector and for public information.
• inform the selective allocation of funding for research” (REF 2029, 2024).
One wonders to what extent in the United Kingdom and other countries (e.g., Australia)
those developments feed and are fed by businesspersons.9 Companies offer a stream of
user-friendly number producers, ranging from search engines (e.g., Google Scholar) and
network applications (e.g., ResearchGate) to bibliometric products (e.g., InCites provided
by Clarivate, and SciVal from Elsevier). Lawyers and journalists may have their share, with public outcries of Corruption! Nepotism! or the latent threat of court trials (e.g., from job candidates) incentivizing academic institutions to implement (e.g., hiring, resource allocation) procedures that are not, first and foremost, sensible, but that are defensible. The
rationale of the number-based defenses would be that quantifications seem harder to argue
with than subjective judgments and intuitions.
The quantification of science evaluation has numerous consequences (see also e.g.,
Weingart, 2005). Scientists can get fixated on producing a certain number of publications (e.g., per year) rather than simply doing research for the sake of the research itself (see also e.g., Gigerenzer & Marewski, 2015; Smaldino & McElreath, 2016). Scholars may worry more about the statistics computed (e.g., the p-value), promising both (a) ‘publishability’ (e.g., in ‘high-impact’ journals) and (b) seeming ‘objectivity’ in
conclusions, than about the meaningfulness of the analysis and research question, the
precision of the underlying theory and its generalizability, or the quality of the data
collected to test the theory. Likewise, when research is evaluated, the focus may be,
once more, on seemingly ‘objective’ h-indices and JIFs, but not on the actual content
and contribution of scientific work—and much the same holds true when researchers
themselves (e.g., job applicants) are under evaluation.
9 The quantification of science evaluation may be a problem especially in universities governed by corporate managers; not every university has yet adopted this leadership model.
Even when it comes to ‘measuring’ career prospects, there may be parallels (to citation-based numbers such as h-indices and JIFs)—at least historically in disciplines such as
psychology. As Rosnow and Rosenthal (1989) point out,
“It may not be an exaggeration to say that for many PhD students, for whom the
.05 alpha has acquired almost an ontological mystique, it can mean joy, a doctoral
degree, and a tenure-track position at a major university if their dissertation p is less
than .05. However, if the p is greater than .05, it can mean ruin, despair, and their
advisor’s suddenly thinking of a new control condition that should be run” (p. 1277).
By that logic, a few decades ago, the number of p-values < 0.05 a young scholar ‘found’
could have served as an early indicator of professional success, similar to how one can
look, nowadays, at the number of high-impact journals she/he publishes in or her/his
h-index.
Each indicator suffers from different problems. For example, there is probably no universal way of citing across fields, and reasons for citing differ (Tahamtan & Bornmann, 2018). So do normalization procedures. Yet, tools and evaluation guidelines centered on single ‘key performance’ indicators (e.g., JIFs, h-indices, or other ‘significant numbers’) can obliterate such diversity in repertoires of methods and indices. A parallel can be found in publication manuals and textbooks on statistics advocating uniform p-value crunching
that brushes across conflicting assumptions of different statistical frameworks, including
diverse meanings of level of significance (from Fisher and Neyman and Pearson; see e.g.,
Gigerenzer, 1993). Arguably, a single ‘key performance’ indicator, mandated to be used in
all circumstances, poses little affordance—to borrow a notion from Gibson (1979/1986)—
for judgment. Hence, it also does not pose an affordance for practicing judgment and
acquiring expertise, such as when to rely on what statistic.10
However, single ‘key performance’ indicators afford being looked up quickly in open
literature databases, with online fora and other digital media (e.g., the press) conveniently
allowing for pressure and control. This ranges from public investigation (potentially
by everybody with an internet connection) to harsh punishment, with the fear of digital
pillorying and ‘shit storms’ potentially further inducing defensive logics of the kind ‘better
publish too much than too little’. Or would you, as a dean or public funder, like to see ‘your
institution’ exposed, on the internet, as consistently producing fewer papers than average,
being low in the rankings, hosting scientist X, publicly accused of being guilty of Y, or
being non-compliant with guidelines A, B, and C? In the aftermath of all this mess, what
matters is
– producing more than an arbitrarily defined number of papers per year,
– having an h-index of a certain magnitude (e.g., 15 or whatever a supposed excellent
score is, see Hirsch, 2005), or
– publishing in a journal in the first quartile of the Journal Citation Reports
(= ‘Q1-journal’).
10 Affordances are the functional properties of elements of the world. Those properties do not just depend
on the elements themselves, but emerge relative to the agent. A chair, for instance, affords sitting, throwing,
defending, burning, and other actions to an adult human, but not to a baby or to a fish. Originally developed by Gibson (1979/1986) to understand visual perception, the notion of affordance has been used in
several domains, including for understanding the selection among different judgment strategies (Marewski
& Schooler, 2011).
How numbers in quantitative science evaluation parallel those in social science research
On other dimensions, too, there seem to be interrelated parallels between the
quantification of research evaluation and social science research more generally
(Gigerenzer, 2018; Gigerenzer & Marewski, 2015):
1. ‘Significant numbers’ simplify life: whether it is 99 p-values, spit out by off-the-shelf
statistical software or citation counts on Google Scholar, nowadays indicators are easy
to obtain. Also, seemingly everybody feels capable of using them: indeed, even a school
child who might not understand the contents of a paper, or know how to recognize
quality work, can count and grasp larger-smaller relationships, which is what indicators
are all about (e.g., 80 > 20 citations; p < 0.05, see e.g., Gigerenzer, 1993, p. 331, on the
common practice of hypothesis testing in social science (i.e., psychology): “...a fairly
mechanical schema that could be taught, learned, and copied by almost anyone”).
2. Indicators speed up ‘production’ in global academic factories: in a research world where
quantifications matter, time saved by relying on a number (e.g., a p-value or JIF) rather
than on more cumbersome activities (e.g., alternative data analyses or reading papers
from a job applicant) helps to play the game (e.g., producing more papers or evaluating
articles fast; see also Bornmann & Marewski, 2019). As Gigerenzer (1993) points out
with respect to the common practice of mechanical hypothesis testing, “[i]t made the
administration of the social science research that exploded since World War 2 easier,
and it facilitated editors’ decisions” (p. 331). The p-value offers “...a simple, ‘objective’ criterion for...” (Gigerenzer & Marewski, 2015, p. 429) judging findings—NHST as a fast, automatic judgment procedure, applicable to all articles, independent of context
and people. Something analogous seems to be happening with h-indices and JIFs when
they are (mis)used in fast and automatic ways, brushing across the idiosyncrasies of
academic life. Indeed, as noted by Gigerenzer (2018) “…null ritual [NHST] can be seen
as an instance of a broader movement toward replacing judgment about the quality of
research with quantitative surrogates” (p. 214).
3. ‘Significant numbers’ are not always understood: in the last century, as textbook writers
intermingled Fisher’s and Neyman and Pearson’s competing statistical frameworks,
social scientists started to use with NHST a “hybrid theory” (Gigerenzer, 2018, p.
212). They made p < 0.05 a magic potion (i.e., a drug), widely consumed, but prone to
misattributions about what that potion can actually do and what not (e.g., “Probability of
replication = 1 – p”, Gigerenzer, 2018, p. 204; see also Oakes, 1990). The Annual Review
of Statistics and Its Application—a journal that aims to “…inform[] statisticians, and
users of statistics…”—once advertised on its principal website to have “…debuted in the
2016 Release of the Journal Citation Report[s] (JCR) with an Impact Factor of 3.045”
(Annual Review of Statistics and Its Application, 2020). Do marketing professionals
working for scientific journals, science administrators, and other practitioners fully grasp
what information h-indices, JIFs, and other indices can convey, and especially what not?
4. Research evaluations and statistical analyses serving research itself (e.g., hypothesis
tests) are both often carried out by non-experts, that is, by non-bibliometricians (e.g.,
administrators) and researchers (e.g., applied psychologists) who are not statisticians
by training. Lack of expertise may drive the quest for automatic, mechanical procedures
(see also e.g., Oakes, 1990), seemingly obliterating the need for personal judgment:
non-experts’ (e.g., statistical) intuitions cannot be trusted, but even a non-expert can
follow simple procedures such as computing a p-value or telling which of two h-indices or JIFs is larger. As Gelman and Hennig (2017) point out, “…statistics is sometimes
said to be the science of defaults: most applications of statistics are performed by non-
statisticians who adapt existing general methods to their particular problems. … It is
then seen as desirable that any required data analytic decisions … are performed in
an objective manner…” (p. 971). We believe that the same holds true for quantitative
research evaluation.
5. Whether it is the JIF, h-index, or p-value, in both research evaluation and applied
statistics, ‘significant numbers’ are at the core of critique and controversies (e.g.,
Benjamin etal., 2018; Callaway, 2016), making their massive consumption a surprising
affair. One may speculate to what extent misconceptions such as false beliefs about the information conveyed by an indicator could contribute to sustaining its use (see e.g.,
Gigerenzer, 2004a, for that conjecture with respect to NHST).
To summarize, most importantly, ‘significant numbers’ seem to help base propositions on seemingly ‘objective’ grounds (see also e.g., Porter, 1993). In science evaluation, that job is done by h-indices and JIFs, which enable one to ‘objectively’ judge the research performance of scientists and institutions or the ‘quality of journals’—across fields and other elements of context, in an automatic way.
How quantifications fuel the bureaucratization of science
The quantification of science evaluation has not ended with mere efforts towards making
judgments look unbiased, objective, and justifiable. The bureaucratization of science is
another (equally worrisome) outcome. Modern science is frequently called post-academic.
According to Ziman (2000), bureaucratization is an appropriate term which describes most
of the processes in post-academic science:
“It is being hobbled by regulations about laboratory safety and informed consent,
engulfed in a sea of project proposals, financial returns and interim reports,
monitored by fraud and misconduct, packaged and repackaged into performance
indicators, restructured and downsized by management consultants, and generally
treated as if it were just another self-seeking professional group” (p. 79).
For example, principal investigators in ERC Grants by the European Research Council
(ERC) “…should … be able to provide evidence for the calculation of their time
involvement…”—time sheets enter academia (European Research Council, 2012, p. 33).
Words such as management, performance, regulation, accountability, and compliance previously had no place in scientific life. The vocabulary was not developed within science but was transferred from the bureaucratized society (Dahler-Larsen, 2012; Ziman, 2000). The bureaucratic virus spreads through the scientific publication process itself: nowadays, many journals require hosts of (e.g., web) forms to be signed (or ticked), ranging from conflict-of-interest statements and ethical regulations to assurances that the data to be published are new and original, and copyright transfer agreements. Certain publication guidelines read
like instructions one would otherwise find in tax forms in public administration, and the
length of those guidelines is sometimes as overwhelming as the endless legal small print of
privacy and licensing policies on internet websites. Data protection regulations add to the
growing mess.
Does bureaucracy suffocate science? It has not come that far. But arguably, research in
post-academic science is characterized by less freedom. Projects are framed by proposals,
employment, and supervision of project staff (PhD students, post-doctoral researchers).
Explorative studies—which may lead to scientific revolutions—can raise questions that are
new and not rooted in the field-specific literature; such studies can, moreover, come with
unconventional approaches, and lead to unforeseeable expenditures of time (Holton et al.,
1996; Steinle, 2008). There is the risk that these elements are negatively assessed in grant
funding and research evaluation processes, because they do not fit into ‘efficient’ project
management schemes. The five most important characteristics of post-academic science
evaluation can be summarized as follows (Moed & Halevi, 2015):
(1) Performance-based institutional funding In many European countries, the number of enrolled students is decreasingly, and performance criteria are increasingly, relevant for the allocation of research funds to universities. Today, some institutions favor the exclusive use of bibliometrics or peer review to determine research performance; others favor mixed approaches combining peer review and bibliometrics (Thomas et al., 2020). The performance criteria are used for accountability purposes (Thonon et al., 2015). According to Moed and Halevi (2015), “[i]n the current … [climate] where budgets are strained and
funding is difficult to secure, ongoing, diverse and wholesome assessment is of immense
importance for the progression of scientific and research programs and institutions” (p.
1988).
(2) International university rankings Universities are confronted with the results of international rankings (Espeland et al., 2016). Although such rankings are heavily criticized (Hazelkorn, 2011), politicians are influenced by ranking numbers in their strategies for funding national science systems. There are even universities incentivizing behavior to influence their positions in rankings, for instance, by institutionally binding highly cited researchers from universities in other countries (Bornmann & Bauer, 2015). Clarivate annually publishes a list of highly cited researchers (https://clarivate.com/hcr/) who have authored the most papers in their disciplines belonging to the 1% most frequently cited papers. This list is constitutive of one of the best-known international rankings, the Academic Ranking of World Universities (ARWU), also known as the Shanghai Ranking. The Financial Times and The Economist undertake rankings of business schools and their programs (e.g., Master, MBA, Executive MBA).
(3) Internal research assessment systems More and more universities and funding
agencies install research information systems to collect relevant data on research input
(e.g., number of researchers) and output (e.g., publications; Biesenbender & Hornbostel,
2016). These numbers are used to monitor performance and efficiency continually.
Problems emerge if those monitoring systems change researchers’ goals in an unintended
way—for instance, leading them to frame a finding or theory as ‘novel’, rather than closely
tying it to previous work, or to dissect ideas into short journal papers to increase the
output (Gigerenzer & Marewski, 2015; Weingart, 2005): the 2013 Nobel Prize laureate
in physics, Peter W. Higgs—who recently passed away—remarked to The Guardian,
“Today I wouldn’t get an academic job … I don’t think I would be regarded as productive
enough”; Higgs noted that he came to be “an embarrassment to the department when they
did research assessment exercises”—when requested, “Please give a list of your recent publications … I would send back a statement: ‘None.’” (Aitkenhead, 2013). Or, in turning
to psychology, speaking with Mischel (2008):
“When the science community is working well, it doesn’t re-label, or at least it
tries not to reward re-labeling. After the structure of DNA was discovered, nobody
renamed and recycled it as QNA (or if they did, it was not published in Science or
Nature). But in at least some areas of psychological science, excellent and honorable
researchers with the best intentions inadvertently create a QNA or two, sometimes
perhaps even a QNA movement.”
(4) Performance-based salaries In certain countries salaries are sometimes connected
to publishing X articles in reputable journals (e.g., Science or Nature) (Reich, 2013). In
business schools, academic faculty may see a reduced teaching load, promotions, or
other benefits linked to publishing in journals on the Financial Times List—the same
newspaper that publishes the MBA and other rankings important to the schools’ prestige
and, ultimately, to profitability of their programs. Such practices widely open the doors to
scientific misconduct (Bornmann, 2013). One is not really surprised to read that papers
are bought from online brokers or that scientists pay for authorships (Hvistendahl, 2013).
(5) The use of metrics to target ambiguity in peer review processes Peer review is
the main quality assurance process in science (Hug, 2022). The meaning of research quality
differs between research fields, the context of evaluation, and the policy context (Langfeldt
et al., 2020). Reviewers use many different criteria for making judgements in different
contexts and integrate the criteria into judgements using complex rules (Hug, 2024; Hug
& Aeschbach, 2020). Bibliometrics is one of these criteria frequently used in the context
of peer review processes (Cruz-Castro & Sanz-Menendez, 2021; Langfeldt et al., 2021).
One reason for the use of metrics is the ambiguity of the peer review process: research is
evaluated against some criteria and some level of achievement. As the study by Langfeldt
et al. (2021) shows, especially reviewers with high scores on metrics “...find metrics to be
a good proxy for the future success of projects and candidates, and rely on metrics in their
evaluation procedures despite the concerns in scientific communities on the use and misuse
of publication metrics” (p. 112). The issue of choice under ‘ambiguity’ is not only specific to research evaluation processes, but is characteristic of areas of policy more broadly (Dahler-Larsen, 2018; Manski, 2011, 2013).
Why quantifications alone are not sufficient when it comes to research evaluation
According to Wilsdon et al. (2015), today three broad approaches are mostly used to assess research in post-academic science: the metrics-based model, which relies on quantitative measures (e.g., counts of publications, prizes, or funded projects), peer review (e.g., of journal or grant submissions), and the combination of both approaches. Quality in science, so the rationale of the peer review process goes, can only be established if research designs and results are assessed by peers (from the same or related fields). In the past decades, the quest
for comparative and continuous evaluations on a higher aggregation level (e.g., institutions
or countries) has fueled preferences for the metrics-based model. Those preferences are
also triggered by the overload of the peer review system: the demand for participation
in peer review processes exceeds the supply.
In the metrics-based model of research evaluation, bibliometrics has a
prominent position (Schatz, 2014; Vinkler, 2010). According to Wildgaard et al.
(2014), “[a] researcher’s reputational status or ‘symbolic capital’ is to a large extent derived
from his or her ‘publication performance’” (p. 126). Bibliometric information is available
in large databases (e.g., Web of Science, Clarivate, or Scopus, Elsevier) and can be used
in many disciplines and on different aggregation levels (e.g., single papers, researchers,
research groups, institutions, or countries). Whereas the number of publications is used as
an indicator of output, the number of citations is relied upon as a proxy for quality.
However, the metrics-based model has several pitfalls (Hicks et al., 2015). Five of those problems stem from the ways in which quantifications are used (Bornmann, 2017).
(1) Skew in bibliometric data Bibliometric data tend to be right-skewed, with only a few highly cited publications and many publications with few or zero citations (Seglen, 1992). There is a tendency for citations to concentrate on a relatively small stratum of publications. Citations are over-dispersed count data (Ajiferuke & Famoye, 2015). Hence, simple arithmetic means—as they are built into mean citation rates or JIFs—should be avoided as a measure of central tendency (Glänzel & Moed, 2013).
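To illustrate this point, here is a minimal Python sketch (our own illustration, not taken from the cited sources) that simulates an over-dispersed, right-skewed citation distribution; the negative-binomial parameters are arbitrary choices for illustration, not estimates from any real database.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate citation counts for 10,000 hypothetical papers with a
# negative binomial distribution, a common model for over-dispersed
# count data; the parameters are illustrative, not fitted to real data.
citations = rng.negative_binomial(n=0.5, p=0.05, size=10_000)

print(f"mean citations:   {citations.mean():.1f}")
print(f"median citations: {np.median(citations):.1f}")
print(f"papers at or below the mean: {(citations <= citations.mean()).mean():.0%}")
top_decile = np.sort(citations)[-1_000:]
print(f"share of all citations held by the top 10% of papers: "
      f"{top_decile.sum() / citations.sum():.0%}")
```

In such a distribution the arithmetic mean sits far above the median and is driven by a handful of highly cited papers, which is why percentile- or median-based summaries are generally preferred over mean citation rates.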
(2) Variability in bibliometric data In line with the ideals of universalism and
automatism, the results of bibliometric studies are typically published as if they were
independent of context or otherwise invariant (Waltman & van Eck, 2016). The results
of bibliometric studies on the same unit can vary between different samples (e.g., from
different publication periods or literature databases).
(3) Time- and field-dependencies of bibliometric data Many bibliometric studies are
based on bare citation counts, although these numbers cannot be used for cross-field and
cross-time comparisons (of researchers or universities). Different publication and citation
cultures lead to different average citation rates in the fields—independently of the quality
of the publications.
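As a sketch of the kind of remedy bibliometricians apply (our illustration; the fields, years, and citation counts below are hypothetical, and real studies use far larger reference sets), citation counts can be related to a field- and year-specific reference value before any comparison is made:

```python
import pandas as pd

# Toy records: raw citation counts are not comparable across fields and
# publication years (all names and numbers here are hypothetical).
papers = pd.DataFrame({
    "paper":     ["A", "B", "C", "D", "E", "F"],
    "field":     ["math", "math", "biomed", "biomed", "math", "biomed"],
    "year":      [2018, 2018, 2018, 2018, 2020, 2020],
    "citations": [12, 3, 60, 25, 5, 15],
})

# Simplified field- and year-normalization: divide each paper's citations
# by the mean of its field-year reference set. Given the skew discussed
# above, percentile-based indicators are often used instead.
papers["reference_mean"] = (
    papers.groupby(["field", "year"])["citations"].transform("mean")
)
papers["normalized"] = papers["citations"] / papers["reference_mean"]

print(papers)
```

A paper with a normalized score above 1 is cited more than the average paper of its own field and year; only such context-relative numbers, not bare counts, permit cross-field or cross-time comparisons.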
(4) Language effect in bibliometric data In bibliometric databases, English
publications dominate. Since English is the most frequently used language in science
communication, the prevalence of English publications comes as no real surprise.
However, the prevalence can influence research evaluation in practice. For example,
there is a language effect in citation-based measurements of university rankings, which
discriminates, particularly, against German and French institutions (van Raan et al., 2011).
Publications not written in English receive—as a rule—fewer citations than publications in
English.
(5) Missing and/or incomplete databases in certain disciplines Bibliometric
analyses can only be applied poorly in certain disciplines (e.g., the social sciences, humanities, or
computer science). The most important reason is that the literature from these disciplines is
insufficiently covered in the major citation databases which focus on international journal
publications (Marx & Bornmann, 2015). In other words, “[b]ibliometric assessment of
research performance is based on one central [but possibly false] assumption: scientists,
who have to say something important, do publish their findings vigorously in the open,
international journal literature” (van Raan, 2008, p. 463).
Three additional problems with bibliometric indicators concern their purpose, how they
are used, and what information the numbers can convey. Those problems are of a more
general nature.
(1) Poorly understood indicators As Cohen (1990) points out with respect to
hypothesis testing in psychology, “Mesmerized by a single all-purpose, mechanized,
‘objective’ ritual in which we convert numbers into other numbers and get a yes–no answer,
we have come to neglect close scrutiny of where the numbers came from” (p. 1310). Likewise, many of those relying on bibliometric indicators do not seem to know where the numbers
come from and/or what they really mean. Today, the JIF is a widely used indicator to infer
‘the impact’ of single publications by a researcher. However, the indicator was originally
developed to decide on the importance of holding journals in libraries. Paralleling how
the historical roots and purposes of Fisher’s and Neyman and Pearson’s respective statistical frameworks and their bitter controversy are buried in the current “hybrid” (Gigerenzer, 2018, p. 202) practice of seemingly universal and automatic NHST, the JIF was applied, with little conceptual foundation, to new judgment tasks. These tasks are
inferences about the quality or relevance of scientific output, made mindlessly across
contexts and people. Similarly problematic, the h-index combines the output and citation
impact of a researcher in a single number. However, with h papers having at least h
citations each, the formula for combining both metrics is arbitrarily chosen: h² citations or 2h citations could have been used as well (see Waltman & van Eck, 2012); just as p < 0.06
or < 0.03 could become a convention instead of 0.05 and 0.01. Rosnow and Rosenthal
(1989) put it like this: “...God loves the 0.06 nearly as much as the 0.05” (p. 1277), to
which Cohen (1990) added “...amen!” (p. 1311).
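To make the definition concrete, the following Python sketch (our own illustration, not taken from Hirsch or from Waltman and van Eck) computes the standard h-index from a list of citation counts; the citation record used is hypothetical.

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the first `rank` papers all have >= rank citations
        else:
            break
    return h

# Hypothetical citation record of a researcher, for illustration only.
record = [45, 20, 11, 9, 8, 3, 2, 0, 0]
print(h_index(record))  # -> 5: five papers have at least 5 citations each
```

The square-shaped threshold (h papers with at least h citations each) is a convention; a rule requiring, say, h² or 2h citations would yield different index values and could rank the same researchers differently, which is the arbitrariness referred to above.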
(2) Amateurs playing experts A physician once remarked, exasperated, to one of
us that, nowadays, fueled by digital media, certain of his patients pretend to be expert
doctors—albeit without caring to know what they don’t know. Until the end of the twentieth
century, bibliometric analyses were frequently conducted by expert bibliometricians
who knew about the typical problems with bibliometric studies, alongside with possible
solutions. Since then, “Desktop Scientometrics” (Katz & Hicks, 1997, p. 142) has become
more and more popular. Here, research managers, administrators, and scientists from fields
other than bibliometrics use “…bibliometric data in a quick, unreliable manner…” (Moed
& Halevi, 2015, p. 1989). Digitalized bibliometric applications are available (e.g., InCites
or SciVal), which provide ready-to-use bibliometric results, foregoing available expertise
and scrutiny from professional bibliometricians (Leydesdorff et al., 2016). As Retzer and
Jurasinski (2009) point out—rightly—“…a review of a scientist’s performance based on
citation analysis should always be accompanied by a critical evaluation of the analysis
itself” (p. 394). Bibliometric applications, like SciVal or InCites, can deliver bibliometric
results, but they cannot replace expert judgment—much like off-the-shelf statistical
software can only deliver p-values and other statistics automatically, but not judgment.
(3) Impact is not equal to quality Gigerenzer (e.g., 2018) and others (e.g., Cohen,
1994; Oakes, 1990) have touched academic researchers’ sore spots when it comes to wishful but false beliefs about statistical hypothesis tests. In research evaluation, wishful thinking
takes place when impact is equated with quality. A citation-based indicator might capture
(some) aspects of quality but is not able to accurately measure quality. Indicators are
“…largely built on sand” (Macilwain, 2013, p. 255), in the view of some. According to
Martin and Irvine (1983) citations reflect scientific impact as just one aspect of quality:
correctness and importance are others. Applicability and real-world relevance are further
aspects scarcely reflected in citations. Moreover, groundbreaking findings—those, indeed,
leading to scientific revolutions—are not necessarily highly cited ones. For example, Marx
and Bornmann (2010, 2013) bibliometrically analyzed publications that were decisive in
revolutionizing our thinking. They analyzed those (1) that replaced the static view with Big
Bang theory in cosmology, or (2) that dispensed with the prevailing fixist point of view
(fixism) in favor of a dynamic view of the Earth in which the continents drift across the
Earth's surface. As those bibliometric analyses show, several publications that propelled the
transition from one theory to another are lowly cited.
What is more, in all areas of science, important publications might be recognized as such
only many years after publication—these articles have been dubbed "sleeping beauties"
(van Raan, 2004, p. 467), except that—unlike in fairy tales—nobody comes to kiss them. The
Shockley-Queisser paper (Shockley & Queisser, 1961)—describing the limited efficiency
of solar cells based on absorption and reemission processes—is one such sleeping beauty
(Marx, 2014). Within the first 40 years after it appeared, this groundbreaking paper
was hardly cited. Even worse, sometimes papers that are highly cited perpetuate factual
mistakes, misconceptions, or misunderstandings contained in them. For instance, a highly
cited paper by Preacher and Hayes (2008) recommends using a certain statistical procedure
to test mediation. This procedure is widely used in the disciplines of psychology and
management; however, the procedure produces biased statistical estimates: it ignores a key
assumption made by the estimator (i.e., that the mediator is not endogenous; see Antonakis
et al., 2010; Kline, 2015).
Science evaluation from a statistical point of view: universal and automatic
classifiers do not exist
Together with his friend and colleague, Allen Newell, the later Nobel laureate Herbert
Simon wrote a visionary paper in 1958 (Simon & Newell, 1958). In that paper, the two
(see also Simon, 1947/1997; 1973) introduced a distinction between two types of problems
decision-makers face. Their distinction was grounded in what was, at the time, becoming
an emerging technology for research—a technology that has become, in addition, an
indispensable tool for many quantification processes as well as a metaphor of the workings
of the human mind itself (Gigerenzer, 2002a, Chapter 2): the computer. Specifically, Simon
and Newell (1958) distinguished between ill-structured and well-structured problems.
According to them:
“A problem is well structured to the extent that it satisfies the following criteria:
1. It can be described in terms of numerical variables, scalar and vector quantities.
2. The goals to be attained can be specified in terms of a well-defined objective function—
for example, the maximization of profit or the minimization of cost.
3. There exist computational routines (algorithms) that permit the solution to be found and
stated in actual numerical terms…
In short, well-structured problems are those that can be formulated explicitly and
quantitatively, and that can then be solved by known and feasible computational
techniques” (Simon & Newell, 1958, pp. 4–5).
Can the problem of recognizing quality science be conceived of as being well-structured?
The difficulty of recognizing quality science is that—statistically speaking—judgments
about the quality of research (e.g., people or institutions) represent classifications
(Bornmann & Marewski, 2019).11 For instance, the various indicators (e.g., JIFs or
h-indices) alluded to above can be thought of as numerical predictor variables to be used
in the classification of research output. Yet, regardless of whether it comes to science
evaluation or other classification tasks (e.g., medical diagnosis or credit scoring), no
classifier will always yield totally accurate results. Instead, false positives (giving ‘poor
research’ laudatory evaluations) and false negatives (giving ‘quality research’ disapproving
evaluations) will occur (Bornmann & Daniel, 2010). That is, mistakes are inevitable. In
hypothesis testing, such mistakes are also known as type I and type II errors.
The fact that mistakes can occur, however, does not necessarily mean that a problem is
not well-structured. Rather, to construct classifiers and assess what level of accuracy they
11 This and the following sections (e.g., on decision making) partially match and partially diverge from
arguments we have made in another paper (Bornmann & Marewski, 2019) that formulates a proposal for
studying and understanding bibliometrics through the lens of a decision making framework, developed by
Gigerenzer and colleagues. Differences in the thrust of those papers reflect differences between our own
evolving views, which do not fully converge on what to make of number-driven science evaluation.
can attain, in areas other than science evaluation, one tests their performance empirically
out of sample (e.g., in cross-validations), with the testing of different classifiers against
each other permitting one to identify what might be the best one, given a learning and test
sample, and a precise performance criterion. Performance can be measured in terms of
classification accuracy, or in terms of the costs and benefits that come with making correct
(i.e., correct-positive, correct-negative) and incorrect (i.e., false-positive, false-negative)
classifications. Granted, differences between calibration data and validation data may
introduce some weak, or—in case of predictions out-of-population rather than out-of-
sample—stronger uncertainty12 (see also Marewski & Hoffrage, 2021, 2024). Yet as long
as one stays within the realm of clearly defined populations, the problem of identifying a
classifier that produces the most accurate and/or best cost–benefit ratio seems reasonably
well-structured.
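For readers who prefer to see the logic of that procedure spelled out, the following sketch illustrates it on purely simulated data; the two 'indicators', the binary criterion, and both classifiers are our invented assumptions, not a recipe for actual research evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: two numerical 'indicators' and a binary criterion
# (1 = state present, 0 = absent). Everything here is invented for illustration.
n = 1000
X = rng.normal(size=(n, 2))
latent = 1.2 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(scale=1.0, size=n)
y = (latent > 0).astype(int)

# Split into a learning (calibration) and a test (validation) sample.
split = n // 2
X_learn, y_learn = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

def fit_threshold_rule(X, y, column=0):
    """Classifier 1: a single cut-off on one indicator, chosen on the learning sample."""
    candidates = np.quantile(X[:, column], np.linspace(0.05, 0.95, 19))
    accuracies = [np.mean((X[:, column] > t).astype(int) == y) for t in candidates]
    return candidates[int(np.argmax(accuracies))]

def fit_linear_rule(X, y):
    """Classifier 2: least-squares weights on both indicators, cut at 0.5."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

threshold = fit_threshold_rule(X_learn, y_learn)
beta = fit_linear_rule(X_learn, y_learn)

pred_threshold = (X_test[:, 0] > threshold).astype(int)
pred_linear = (np.column_stack([np.ones(len(X_test)), X_test]) @ beta > 0.5).astype(int)

for name, pred in [("threshold rule", pred_threshold), ("linear rule", pred_linear)]:
    tp = int(np.sum((pred == 1) & (y_test == 1)))
    fp = int(np.sum((pred == 1) & (y_test == 0)))
    tn = int(np.sum((pred == 0) & (y_test == 0)))
    fn = int(np.sum((pred == 0) & (y_test == 1)))
    accuracy = (tp + tn) / len(y_test)
    print(f"{name}: out-of-sample accuracy {accuracy:.2f}, "
          f"false positives {fp}, false negatives {fn}")
```

In the validation sample, accuracy, false positives, and false negatives can be counted because the criterion is known by construction—precisely what is missing in research evaluation.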
In research evaluation, taking that approach is hardly feasible—conceptually, research
evaluation does not present itself as a well-structured problem:
(1) To determine the accuracy of classifications (i.e., of the judgments), one needs at least
one meaningful criterion variable, which scarcely exists in research evaluation. That
is, one does not really know for sure how good research is, because a valid yardstick is missing.
Instead, the same indicators that could be used as predictor variables in classifiers must
be interchangeably used as criterion to define, seemingly objectively, what counts as
quality. In certain areas, in contrast, it is relatively more straightforward to establish
self-standing and meaningful outside criteria. A cancer or an unpaid loan (i.e., a debt)
might be present or not. If present and if the classifier predicts that state to be present,
one has a correct positive. If not present, and if the classifier predicts the state to be
present, one has a false positive, and so on.
(2) That said, one may take the view that citations and other numbers are valid criteria. For
instance, just as one may be able to use past cancers or debts to classify individuals with
respect to how likely it is that they will develop future cancers or debts, respectively,
one could use past citations as predictor variable for making best guesses about
future citations. One may, furthermore, accept the obvious: namely that the specific
set of predictor variables used, how one combines them (i.e., the functional form),
and the population at hand will shape classification accuracy. However, when it
comes to research evaluation, the criterion one ought to be interested in is not just
(likely unmeasurable) classification accuracy, but the costs and benefits associated
with correct positives, correct negatives, false negatives, and false positives. What is
the price to pay (e.g., by society) if just one brilliant, groundbreaking piece of work
on cancer research is not recognized as such (false negative) or, conversely, if a million
papers with weak theory and non-replicable findings are classified, false-positively, as
‘commendable’ (e.g., simply because they received some self-citations)? What is more,
counterfactuals may be hard, if not impossible, to observe (e.g., how would the world look
today if major discoveries X, Y, and Z had gained attention). In science evaluation
and many other classification tasks, the real costs and benefits of classifications are
hard to estimate or fully unknowable. And even if one worked with fully hypothetical
cost-benefit structures, different stakeholders would place more or less importance on
different costs and benefits. For example, different scientists, evaluators, or politicians
12 A calibration data set serves to fit a model’s (e.g., a classifier’s) free parameters. With those parameter
values fixed, performance is then tested in a validation data set.
might have diverging agendas when it comes to evaluating the research of colleagues
or institutions. Assumptions about cost–benefit structures will also vary across contexts
(e.g., a false negative in cancer research is not the same as one in psychology). Finally,
time may matter. For instance, as March (1991) points out, discussing trade-offs
between exploration and exploitation in organizational learning:
“The certainty, speed, proximity, and clarity of feedback ties exploitation to its
consequences more quickly and more precisely than is the case with exploration.…
Basic research has less certain outcomes, longer time horizons, and more diffuse
effects than does product development. The search for new ideas, markets, or
relations has less certain outcomes, longer time horizons, and more diffuse effects
than does further development of existing ones” (p. 73).
In short, even if one tries to treat science evaluation as a well-structured problem, it becomes
clear that this problem comes with massive uncertainties that go beyond those due to
making predictions out of sample or out of population. Before defining more precisely what
the term uncertainty actually means, let us now treat science evaluation as an ill-structured
problem. Said Simon and Newell (1958):
“Problems are ill-structured when they are not well-structured. In some cases, for
example, the essential variables are not numerical at all, but symbolic or verbal. An
executive who is drafting a sick-leave policy is searching for words, not numbers.
Second, there are many important situations in everyday life where the objective
function, the goal, is vague and nonquantitative. How, for example, do we evaluate
the quality of an educational system or the effectiveness of a public relations
department? Third, there are many practical problems – it would be accurate to
say ‘most practical problems’ – for which computational algorithms simply are not
available” (p. 5).
That problem description seems, in our view, a better fit for most research-evaluation
tasks. But what then are tools for tackling ill-structured problems? Simon and Newell
(1958) believed that particularly simple problem-solving tools, called heuristics, would
permit tackling such problems. They had started to implement heuristics “...similar to
those that have been observed in human problem solving activity” (Newell & Simon, 1956,
p. 1) in a computer program that modeled scientific discoveries (“proofs for theorems in
symbolic logic”; Newell & Simon, 1956, p. 1), the logic theory machine or logic theorist
(Newell et al., 1958). As Newell et al. (1958) commented, the "[logic theorist's] success
does not depend on the ‘brute force’ use of a computer’s speed, but on the use of heuristic
processes like those employed by humans” (p. 156). While the program itself can be
considered foundational for modern-day AI and computational cognitive psychology, it
also reflects Simon’s research program on bounded rationality (e.g., Simon, 1955a, 1956)
and foreshadows a vast body of later neo-Simonian research on heuristics that emerged
in Simon’s footsteps (see Gigerenzer, 2002a, Chapter2; Marewski & Hoffrage, 2024).13
13 Simon’s work on bounded rationality is often connected to that of a later Nobel laureate: Daniel Kahne-
man and his co-workers. Yet, that connection between the heuristics-and-biases framework and Simon’s
earlier work may have arisen from hindsight. Kahneman etal. (1982) do acknowledge Simon in the preface
to their anthology, but as Lopes (1992) notices, “...there are no citations at all to Simon in Tversky and Kah-
neman’s major early papers, all of which appear in the cited anthology” (p. 232).
Heuristics are models of how ordinary people—who face limits in knowledge available to
them, information-processing capacity, and time—make judgments and decisions.
At the close of this essay, let us take a look at how heuristics may represent a key for
opening the door for good judgment and decision-making in science evaluation.
What is the remedy?
Before we start, another comment is warranted. In the introduction to this essay, we
wrote that quantifications have been turned into the new opium of the people. We did not
write that they are opium. Opium is a dangerous drug; quantifications are not drugs, but
they can cloud one’s thinking like drugs. Like many drugs, quantifications can not only do
harm but, when insightfully and diligently used, they can be extremely beneficial.
So let us be clear: in what follows, we do not advocate getting rid of quantifications
in judgement. From experimentation to computer simulation, for countless judgments,
quantifications are absolutely indispensable. What we do advocate is that those who invoke
bibliometric numbers to assess the ‘quality’ of research, people, or institutions do not
pretend, at the same time, to also avoid judgment and the uncertainties such judgments
entail, particularly when it comes to ill-structured problems. Likewise, we advocate that
those who dare to rely on their judgment are not automatically forced to work with some
form of quantitative tool (e.g., statistics serving as ‘facts’) even when doing so does not
make much sense.
A toolbox for handling uncertainty may aid good judgment
To take a look at how heuristics may represent a key for opening the door for good
judgment in science evaluation, let us return to our initial idea of conceptualizing such
evaluation as a well-structured problem. As we have pointed out, with each change in the
assumed cost–benefit structure, a classifier’s performance can change. As a consequence,
there will be no universal and automatic classifier for such problems. In that regard, science
evaluation is by no means special: Benjamin et al. (2018), for instance, point out "…that
the significance threshold selected for claiming a new discovery should depend on … the
relative cost of type I versus type II errors, and other factors that vary by research topic” (p.
8).
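A small numerical illustration of that point, with all confusion counts and costs invented: depending on how heavily one weighs a missed discovery (a false negative) against a false alarm (a false positive), a different classifier comes out on top.

```python
# Hypothetical out-of-sample confusion counts for two classifiers (all numbers invented).
classifiers = {
    "classifier A": {"tp": 80, "fp": 40, "tn": 860, "fn": 20},  # liberal: few misses, more false alarms
    "classifier B": {"tp": 60, "fp": 10, "tn": 890, "fn": 40},  # conservative: few false alarms, more misses
}

def total_cost(counts, cost_fp, cost_fn):
    """Total error cost under a given cost structure (correct calls cost nothing here)."""
    return counts["fp"] * cost_fp + counts["fn"] * cost_fn

# Two stakeholders with different assumptions about what each kind of error costs.
for cost_fp, cost_fn in [(1.0, 1.0),    # errors weighted equally
                         (1.0, 10.0)]:  # a missed 'groundbreaking' case counts ten times as much
    ranking = sorted(classifiers,
                     key=lambda name: total_cost(classifiers[name], cost_fp, cost_fn))
    print(f"cost_fp = {cost_fp}, cost_fn = {cost_fn}: best classifier = {ranking[0]}")
```

Under equal error costs the conservative rule wins; once a miss is taken to be ten times as costly, the liberal rule does.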
This non-universality parallels what Gigerenzer (2004a), in writing about NHST,
stresses for inductive inference more generally: “There is no uniformly most powerful
test…” (p. 604). And in his “Statistical methods and scientific inference”, first published in
1956, Fisher (1990b) remarked:
“The concept that the scientific worker can regard himself as an inert item in a vast
co-operative concern working according to accepted rules, is encouraged by directing
attention away from his duty to form correct scientific conclusions, to summarize
them and to communicate them to his scientific colleagues, and by stressing his
supposed duty mechanically to make a succession of automatic ‘decisions’, deriving
spurious authority from the very incomplete mathematics of the Theory of Decision
Functions… the Natural Sciences can only be successfully conducted by responsible
and independent thinkers applying their minds and their imaginations to the detailed
interpretation of verifiable observations. The idea that this responsibility can be
delegated to a giant computer programmed with Decision Functions belongs to the
phantasy of circles rather remote from scientific research” (pp. 104–105).14
But there are more parallels between statistics and research evaluation. To speak with
Simon:
“… I have experienced more frustration from the statistical issue [tests of
significance] than from almost any other problem I have encountered in my scientific
career. (There is a possible competitor – the reaction of economists to suggestions
that human beings may be not global optimizers…) To be accurate, the frustration
lies not in the statistical issue itself, but in the stubbornness with which psychologists
hold to a misapplication of statistical methodology that is periodically and
consistently denounced by mathematical statisticians” (Simon, 1979, p. 261; the text
in parentheses represents a footnote in the original).
What significance testing was for the psychologists Simon had in mind, citation rates
may be for certain research evaluators: administrators concerned with science evaluations,
namely those obsessed with some kind of evaluative number-crunching procedure.
Following the old ideals of universalism and automatism, many treat bibliometric results
as if they could be used as 'classifiers' that are informative in all situations, with
classifications independent of how different people use them.
Importantly, those administrators do not only seem to handle science evaluation and
other judgment problems as if there were just one type of universal tool available (e.g.,
h-indices for all assessments of researchers). Also, the idea that simple, common-sense
judgments could outwit detailed, seemingly rational analysis may seem counterintuitive
to them. Rational decision-making demands full information: searching for all 'facts' and
integrating them is the best approach to judgment, so the logic goes. Here we may see a
reflection of ideals of 'rational, analytic decision making', captured by optimization models
(e.g., expected utility maximization) as model of, tool for, and norm for rational decision-
making in parts of business, economics, psychology, and even biology (e.g., Becker, 1976;
Stephens & Krebs, 1986).
Yet, all ‘fact’-considerations notwithstanding, do we know which technology will be
invented tomorrow? Is it knowable whether the scientist with the h-index X will in five,
ten, fifteen years from now make the discovery that revolutionizes the field? Science
evaluation and many other judgment tasks we face in our lives do not resemble gambles
where all information—all possible decisional options, their consequences, and the
probability that each consequence occurs—can, in theory, be assessed and used to calculate
best bets—‘rational expectations’ of sorts. Instead, real-world judgment is characterized
by uncertainty. In neo-Simonian research on heuristics the term uncertainty has come to
be employed to refer to such situations where the unexpected can happen and “...where
not all alternatives, consequences, and probabilities are known, and/or where the available
information is not sufficient to estimate these reliably" (Hafenbrädl et al., 2016, p. 217).15
14 Those remarks were made in the context of a discussion of others' (e.g., Neyman's) work. Yet, as Yates
(1990) points out, in general "Fisher had little interest in computers… and referred to their calculations as
'Mecano arithmetic'" (p. xxxi).
15 The notion of uncertainty as Gigerenzer (e.g., 2014; see also e.g., Mousavi & Gigerenzer, 2017) uses
it can be distinguished from other (related) notions in different disciplines (see e.g., Bookstein, 1997,
for an example from informetrics). Particularly, Gigerenzer and colleagues discuss uncertainty relative to
notions from Knight (1921) and Savage (1954/1972).
Under uncertainty, so the fast-and-frugal heuristics framework (Gigerenzer, Todd, & the
ABC Research Group, 1999) posits, people can make accurate classifications and other
judgments, because they can adaptively draw from a toolbox of heuristics as a function
of a person’s goals and context. That is, shocking with the ideals of automatism and
universalism, under uncertainty the performance of a tool is neither independent from an
individual herself, nor is any tool useful universally. What is more, shocking with pretense
of optimization, computer-simulation work on heuristics has shown that computational
models of heuristics can match or outperform information-greedy statistical tools even in
well-structured problems featuring uncertainty due to differences between calibration and
validation data (e.g., Czerlinski et al., 1999; Gigerenzer & Brighton, 2009; Katsikopoulos
et al., 2020). Many of those information-greedy tools optimize in one way or the other:
for instance, in regressions a form of optimization is the computation of beta-weights that
minimize the distance between the predictions made and data. A key insight from such
work is that knowing when to rely on which tool is at the heart of good decision-making.16
We think that this simple insight could be key to improving science evaluation.
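A toy simulation in the spirit of that computer-simulation work (e.g., Czerlinski et al., 1999)—though with invented data-generating assumptions rather than the original data sets—conveys the flavor: with small learning samples and noisy criteria, a unit-weight tallying rule can come close to, and sometimes beat, a regression that optimizes its beta-weights when both are tested out of sample.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_task(n_learn=30, n_test=500, k=5, noise=2.0):
    """One invented prediction task: k cues, a noisy criterion, a small learning sample."""
    weights = rng.uniform(0.5, 1.5, size=k)  # 'true' cue weights, unknown to the tools
    X_learn = rng.normal(size=(n_learn, k))
    X_test = rng.normal(size=(n_test, k))
    y_learn = X_learn @ weights + rng.normal(scale=noise, size=n_learn)
    y_test = X_test @ weights + rng.normal(scale=noise, size=n_test)
    return X_learn, y_learn, X_test, y_test

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

regression_errors, tallying_errors = [], []
for _ in range(200):
    X_learn, y_learn, X_test, y_test = simulate_task()

    # Information-greedy tool: ordinary least squares, beta-weights fitted to the learning sample.
    Xd = np.column_stack([np.ones(len(X_learn)), X_learn])
    beta, *_ = np.linalg.lstsq(Xd, y_learn, rcond=None)
    pred_regression = np.column_stack([np.ones(len(X_test)), X_test]) @ beta

    # Simple heuristic: tallying with unit weights; only the intercept is estimated.
    pred_tallying = y_learn.mean() + X_test.sum(axis=1)

    regression_errors.append(rmse(pred_regression, y_test))
    tallying_errors.append(rmse(pred_tallying, y_test))

print("mean out-of-sample RMSE, regression:", round(float(np.mean(regression_errors)), 3))
print("mean out-of-sample RMSE, tallying:  ", round(float(np.mean(tallying_errors)), 3))
```

The point is not that tallying always wins, but that which tool predicts better depends on the environment—sample size, noise, redundancy among cues—which is exactly the ecological question.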
If one thinks of science evaluation in terms of judgments under uncertainty, and if one
furthermore understands that many problems of science evaluation—but possibly not
all—belong more to the ill-structured rather than the well-structured spectrum, then the
consequence may be that one needs to be prepared to choose from a repertoire of different
tools. That toolbox may include both indicator-based and non-indicator-based ones. In line
with Simon and Newell’s (1958) vision, some of those tools may, moreover, be cast as
computational models of heuristics, ready to be implemented in computer programs (see
e.g., Bornmann et al., 2022; Bornmann, 2020, for ideas). Elsewhere, we have referred to
such tools as bibliometric-based heuristics (BBH) and suggested a corresponding research
program (Bornmann & Marewski, 2019). Yet, we do believe that those tools are not the
only ones. Others may come as qualitative, common-sensical rules of thumb instead
(Katsikopoulos et al., 2024; Marewski et al., 2024).
To illustrate our point, imagine a large pile of grayish-colored human skulls, with
their dark, empty eye-sockets staring at you, and the black empty mouths motionlessly
exhibiting their rotten teeth, all in silence. There is a sign attached to those skulls, with
just two sentences written on it: "What you are now, we have been. What we are now, you
will turn into".17 Life expectancy and death (e.g., mortality rates) can be quantified, but
numbers may not be able to capture the sentiments and understanding that arise in you, as
your (inner) gaze switches back and forth between the words and the skulls. It is also not clear
16 Neo-Simonian research on fast-and-frugal heuristics is not to be confounded with Kahneman, Tversky,
and colleagues’ heuristics-and-biases program. The tension between those frameworks is reflected and dis-
cussed, for instance, in Gigerenzer (1996) and Kahneman and Tversky (1996). The fast-and-frugal heuris-
tics framework places emphasis on the heuristics’ ecological rationality (Goldstein & Gigerenzer, 2002),
that is, how cognition and task-environments fit to each other (see below), akin to how that ecological fit
is apparent in Simon’s (e.g., 1956) work. Simon’s ecological notion of bounded rationality is captured by a
metaphor, that of scissors, with one blade reflecting cognition and the other the environment (Simon, 1990).
Both blades are needed to cut. Yet, as Petracca (2021) pointed out, Simon treated the blades separately in
two major articles. One, published in an economics flagship journal, stressed cognition (i.e., Simon, 1955a);
the other, published in a psychology flagship journal, stressed the environment (i.e., Simon, 1956). This
audience-specific focus may have given rise to different perspectives on bounded rationality in econom-
ics and in psychology (Petracca, 2021). It could also be that here is a root of “...the persistent confusion
between Simon’s and Kahneman and Tversky’s contribution” (Petracca, 2021, p. 1).
17 One can find that sign (with the corresponding two sentences in German) and its pile of human remnants
in the small German village of Greding.
if a computer will ever be able to do so. However, to you as a human, the bones and
the words will mean something (and a human may then, in consequence, decide to dare to
live her/his life, making use of her/his judgment—or to take opium as a fear-alleviating drug
instead).
So, what exactly, do we mean by qualitative rules of thumb? Newell had been introduced
to heuristics as “...an undergraduate physics major at Stanford...” (Simon, 1996, p. 199),
where he had taken courses with the mathematician Pólya. In his book, “How to solve it”,
Pólya (1945) conceived of heuristics in terms of qualitative guiding principles and thought
of proverbs—such as “Who understands ill, answers ill” (p. 222) or “He thinks not well
that thinks not again. Second thoughts are best” (p. 224)—as heuristics for mathematical
problem-solving. For example, those two, so Pólya (1945) points out, prescribe that “The
very first thing we must do for our problem is to understand it” (p. 222) and “Looking
back at the completed solution is an important and instructive phase of the work” (p. 224),
respectively.
While those proverbs seem applicable as heuristics that could guide the actions of peer
reviewers (e.g., As a first step, make sure you understand the paper you review!, Once done
with the review, re-review your own review!) and larger-scale science evaluators (e.g., As
a first step, understand the discipline and its challenges!), other common-sense rules
of thumb could help as well. An example of one such rule is to read and reflect upon the research
one evaluates: Only use citation-based indicators alongside expert judgements of papers!
A result of that simple rule of thumb would be that those evaluating others would have to
be experts in the same areas. Generalists can, at best, only judge the quality of work from
the outside by relying on citation and publication numbers or other seemingly universal
indicators.
Expertise in a research field and expertise in bibliometrics aid good judgment
The toolbox view on judgment in research evaluation has its equivalent in the toolbox
view on statistics: rather than pretending that the statistical toolbox contained just one type of
procedure for making statistical inferences (e.g., NHST), statistics can be best conceived of
in terms of a repertoire of tools (e.g., Fisher’s null hypothesis testing, Bayes’ rule, Neyman
and Pearson’s framework) for different situations. Good human judgment is needed to
discern when to rely on which tool: as Gigerenzer (2004a) puts it bluntly, “[j]udgment is
part of the art of statistics” (p. 604) and “[p]racticing statisticians rely on … their expertise
to select a proper tool…” (Gigerenzer & Marewski, 2015, p. 422). For instance, despite
all p-value bashing, Neyman-Pearson hypothesis testing can serve model selection, and
the bias-variance dilemma can help one understand the performance of this and other
quantitative tools (e.g., the Bayesian information criterion or the Akaike information criterion; Forster, 2000).
But tests of statistical significance may nevertheless not be suitable when it comes to “...
evaluating the fit of computer programs to data...” as Simon (1992, p. 159) remarked.
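As a hedged illustration of how such criteria trade fit against complexity (the bias-variance idea), consider fitting a linear and a cubic polynomial to simulated data in which only the linear term matters; all numbers here are invented:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data: the criterion depends on x only linearly; cubic terms can only overfit.
n = 60
x = rng.uniform(-2, 2, size=n)
y = 1.5 * x + rng.normal(scale=1.0, size=n)

def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit; returns residual sum of squares and parameter count."""
    X = np.vander(x, degree + 1)  # columns: x**degree, ..., x, 1
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return rss, degree + 1

def aic(rss, k, n):
    # Gaussian likelihood up to an additive constant: -2 ln L = n * ln(rss / n).
    # k counts regression coefficients; counting the error variance too would shift both models equally.
    return n * np.log(rss / n) + 2 * k

def bic(rss, k, n):
    return n * np.log(rss / n) + k * np.log(n)

for degree in (1, 3):
    rss, k = fit_polynomial(x, y, degree)
    print(f"degree {degree}: AIC = {aic(rss, k, n):.1f}, BIC = {bic(rss, k, n):.1f}")
```

With settings like these, both criteria will usually penalize the cubic model's two extra parameters more than its slightly smaller residual sum of squares is worth.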
In our view, at least two kinds of expertise can aid good judgment if metrics are used
by administrators. First, in line with the simple rule of thumb to only use bibliometric
indicators alongside expert judgements of papers, when conducting a bibliometric
evaluation of a scientist, journal, or an institution, judgments should be made within the
field, and not by people outside of that area of research. Judgments made by a field's
experts are necessary to choose the appropriate database for the bibliometric analysis. In
certain areas of research, multi-disciplinary databases such as Scopus do not cover the
corresponding literature, which is why field-specific databases such as Chemical Abstracts (provided
by Chemical Abstracts Service) for chemistry and related areas should be selected. Experts
are also necessary to interpret the numbers in a bibliometric report and to place them in
the institutional and field-specific context. This is what the notion of informed peer review
is about: “…the idea [is] that the judicious application of specific bibliometric data and
indicators may inform the process of peer review, depending on the exact goal and context
of the assessment. Informed peer review is in principle relevant for all types of peer review
and at all levels of aggregation" (Wilsdon et al., 2015, p. 64).
Second, expertise in a field ought to be combined with expertise in bibliometrics.
Measurements involve the careful selection of dimensions within a property space (Bailey,
1972). It is clear that a non-physician or a non-pilot should not attempt to diagnose patients
or fly airplanes even if convenient diagnosis tools are sold on the internet or if flight
simulation software is readily available (Himanen et al., 2024). We think it should similarly
be unimaginable that staff with insufficient bibliometric expertise (e.g., administrators)
are put in positions where those non-experts must assess units or scientists—even if
bibliometric platforms seemingly make those tasks as simple as computing p-values with
statistical software. Either professional bibliometricians should be involved in research
evaluations or people involved in evaluations should be trained in bibliometrics. The
trained staff or the bibliometric experts should then be provided with basic information
alongside a bibliometric analysis such as data sources, definitions of indicators, and reasons
for the selection of a specific indicator set.
Professional bibliometricians not only have access to a repertoire of different
databases and indicators and are used to choosing among their tools; they may also advise
the client against a bibliometric report in cases where bibliometrics can scarcely be applied
(e.g., in the humanities) or point to other problems along with possible solutions. Hicks
et al. (2015) formulated 10 principles—the Leiden Manifesto—guiding experts in the field
of scientometrics (see also Bornmann & Haunschild, 2016). For example, performance
should be measured against the research missions of the institution, group, or researcher
(principle no. 2) or the variation by field in publication and citation practices should be
considered by using field-normalized citation scores in cross-field evaluations (principle
no. 6). Thus, if a report is commissioned from bibliometricians, the responsible administrator
should try to understand the report by discussing it with the bibliometricians and the expert
in the concerned field: Seek understanding, rather than only experts’ rubberstamps.
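To illustrate what such field normalization amounts to in its simplest form (a mean normalized citation score), here is a sketch with invented papers and invented field-and-year baselines; real analyses would draw the reference values from a bibliometric database and involve many further choices (document types, citation windows, and so on):

```python
from statistics import mean

# Invented publication records of one unit: (field, publication year, citations).
papers_under_evaluation = [
    ("chemistry", 2018, 12),
    ("chemistry", 2019, 3),
    ("history", 2018, 2),
]

# Invented reference values that would normally come from a bibliometric database:
# the mean citation count of all papers published in that field and year.
expected_citations = {
    ("chemistry", 2018): 10.0,
    ("chemistry", 2019): 6.0,
    ("history", 2018): 1.5,
}

def field_normalized_scores(papers, expected):
    """Each paper's citations divided by the field-and-year baseline."""
    return [citations / expected[(field, year)] for field, year, citations in papers]

scores = field_normalized_scores(papers_under_evaluation, expected_citations)
print("per-paper normalized scores:", [round(s, 2) for s in scores])
print("mean normalized citation score:", round(mean(scores), 2))
```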
Expertise in statistics (and their history) can aid good judgment
Bringing good judgment into science evaluation calls for more than just expertise in
bibliometrics and in the domain of research. Any individual involved with research
evaluation should have basic knowledge of statistics—at a level as is typical in social
science research in economics or psychology. It is the comprehension of quantifications (1)
that will help administrators to discuss bibliometric reports with the bibliometricians and
scientists from the field, and (2) that will aid administrators to see the bibliometric report’s
limitations. Similarly, it is statistical knowledge that allows playing the devil’s advocate
on quantifications compiled by others or by oneself. Ideally, science evaluators and
scientists know how to program computer simulations to scrutinize their own judgments
and classifications. Just imagine what would happen if everybody followed the simple rule
of thumb to only rely on numbers (e.g., Bayes factors, p-values, bibliometric indicators)
and quantitative models (e.g., regression analysis, classification trees) they are able to fully
calculate and program—respectively—from scratch themselves, that is, without any help from
off-the-shelf software? Likely, there would be all five: more reading, more thinking, more
informed discussions about contents, less uninformed abuse of quantifications, and more
experts in bibliometric methods and statistics.
In line with that rule of thumb, scientists and science evaluators ought to be familiar
with basic principles for statistical reasoning under uncertainty. This might be testing one’s
predictions on data differing from the data that served to develop the predictions
(that is, out of sample or out of population; Gigerenzer, 2004a; for more explanations, see
e.g., Roberts & Pashler, 2000). To give another example, evaluators should be familiar with
different techniques of exploring, representing, visualizing, and communicating data and
results: Look at the very same numbers presented in various meaningful ways, would be
the guiding rule of thumb here. Changing the representation format can aid understanding.
An example is the powerful heuristic to convert probabilities into natural frequencies
(e.g., Hoffrage et al., 2000), which has been found to aid judgment in different fields (e.g.,
medicine).
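The following sketch shows the gist of that representational switch with invented numbers (they are not the figures used by Hoffrage et al.): the same inference looks opaque in conditional probabilities and becomes easy to read once translated into counts of concrete cases.

```python
# Converting conditional probabilities into natural frequencies; all numbers are invented.
prevalence = 0.01            # 1% of cases have the condition
sensitivity = 0.90           # P(positive test | condition)
false_positive_rate = 0.09   # P(positive test | no condition)

# Probability format (Bayes' rule): hard to read for many people.
ppv = (sensitivity * prevalence) / (
    sensitivity * prevalence + false_positive_rate * (1 - prevalence)
)
print(f"P(condition | positive test) = {ppv:.2f}")

# Natural-frequency format: think of 1,000 concrete cases instead.
population = 1000
with_condition = round(population * prevalence)                    # 10 cases
true_positives = round(with_condition * sensitivity)               # 9 of them test positive
without_condition = population - with_condition                    # 990 cases
false_positives = round(without_condition * false_positive_rate)   # 89 test positive anyway

print(f"Of {true_positives + false_positives} positive tests, "
      f"{true_positives} come from actual cases "
      f"({true_positives / (true_positives + false_positives):.2f}).")
```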
In research on heuristics and statistics, repertoires of different judgmental and statistical
tools are not simply invented. Instead, the performance of different tools is investigated
in extensive mathematical analyses and/or computer simulations (see e.g., Gigerenzer
et al., 1999). By analogy, science evaluators ought to be sufficiently familiar with data
analysis techniques to be able to evaluate their own tools for science evaluation. For
instance, evaluators ought to understand the parallels between science evaluations and any
classification problem, as described above. Correspondingly, they ought to know how to
develop and test different classifiers out of sample.
But non-quantitative expertise on statistics and other forms of quantification matters as well.
Knowing about the historical origins of a specific form of quantification can help one better
understand why we are doing what we are doing today, as well as ask critical questions
about what might be widely accepted practices. NHST, the JIF, as well as intelligence testing,
are cases in point we touched upon above. But there is much more. For instance, averages
unquestionably rule, together with their standard deviations (errors), much research on
humans. Yet, it was, among others, a scholar coming from astronomy—Adolphe Quetelet
(1796–1874)—who had paved the way for the ‘average man’ (e.g., Desrosières, 1998;
Hacking, 1990). History can teach one to be humble: We do not know what we do not
know!
A note on how to aid good judgment in practice
We do not advocate for any of those recommendations to be put in practice in isolation, but
rather it is their simultaneous implementation that may aid science evaluation. To illustrate
this, imagine that only the recommendation that expert bibliometricians perform evaluations
were implemented. The result could be that those experts become too powerful in
a research evaluation: decision making may then be based too much on technical
bibliometric criteria rather than on substantive (including qualitative) considerations made
by researchers from the field. In contrast, if those who perform the evaluation (1) are
experts in bibliometrics, (2) are experts in the field under study, (3) understand that they
are making decisions under uncertainty and that, under uncertainty, simple rules of thumb
may help, then there might be room for both quantitative evaluation and substantive (i.e.,
qualitative and field-specific) considerations. Finally, if all involved have good statistical
knowledge then they might additionally be in a better position to know, for instance,
when which of many different quantitative indicators is useful, or how to best analyze
bibliometric data for a given field. Knowledge is power—in several ways!
Importantly, when conceiving of science evaluation from a statistical point of view, it
becomes clear that error management is needed. Designs of institutions, work contracts,
and review procedures aid in dealing with the mistakes (e.g., false negatives) that any
judgement under uncertainty can entail. That means not only cushioning the effects of
errors but also investigating the potential sources of errors to avoid them in the
future. Failures are opportunities for learning.
The word ‘investigating’ is put in italics, since steady empirical research is required to
continuously evaluate the effects of science evaluations. This also includes research on
how the evaluations change people’s behavior and science as a social system (see e.g., de
Rijcke et al., 2016). For instance, as Levinthal and March (1993) remark with respect to
organizations more generally, “...exploitation tends to drive out exploration” (p. 107). As
a rule of thumb, exploring new ideas (be they, e.g., novel products in business-contexts or
new research paradigms in academia) may come with more uncertain, more distant rewards
than continuing to exploit existing ones. That is, it is easy to imagine why indices counting
the numbers of publications a person or an institution produces could lead to a “success
trap” (Levinthal & March, 1993, p. 106), hindering the exploration of new—and potentially
risky—avenues for research, and eventually hampering innovation. However, which
qualitative rules of thumb for research evaluation could—perhaps counterintuitively—have
exactly the same harmful effects as their quantitative, indicator-based counterparts?
In a sense, it is ironic that those who advocate systematic metrics-based science
evaluation and monitoring do not advocate, with the same fervor, a ceaseless systematic
empirical evaluation of science evaluation itself. We have recently discussed in detail
how the fast-and-frugal heuristics program, as it was originally developed in the decision
sciences, might lend itself to systematically investigate metric-based science evaluation,
notably bibliometric-based heuristics (Bornmann & Marewski, 2019).
In fact, we believe that many of us—as scientists—may operate by our own qualitative
rules of thumb—and possibly transmit them to our students. Examples may come with
heuristics for discovery (e.g., Katsikopoulos et al., 2024) and scientific writing (e.g.,
Marewski et al., 2024: Start your paper with a familiar story; see pp. 297–298), citation
(e.g., Read what you cite!), or data analysis (e.g., First plot your data in many different
ways!). After all, ‘experts’ may develop strategies to manage tasks—and possibly transmit
problem-solving skills—that fall within their domain of expertise. For instance, as
Marewski et al. (2024) argue, corresponding rules of thumb abound in business contexts
(e.g., Bingham & Eisenhardt, 2011). We see no reason why researchers trying to make up
their mind on other researchers’ work should not be in a good position to identify, similarly,
their own rules of thumb for research evaluation that could then be systematically studied.
What could that mean? "As the wind blows you must set your sail" (Pólya, 1945, p.
223). Broad distinctions between situations—such as that between ill-structured and well-
structured problems—can aid in shaping up one's thinking on what broad classes of tools
might work, but they are not sufficient for navigation. Rather, the science of heuristics—
and Simon’s notion of bounded rationality—prescribe understanding the precise nature
of the situation to which each tool applies (e.g., Simon, 1956, 1990). In the context of
research on fast-and-frugal heuristics, typically the term environment is used to refer to
situations, and the fit between each heuristic and its environment is systematically studied
(e.g., Gigerenzer & Brighton, 2009; Martignon & Hoffrage, 1999).
Analogous work would also have to be carried out for the various ‘heuristics’
proposed above as well as those others may come up with. Interestingly, here research
on heuristics may meet research in scientometrics. Notably, the connection between
proverbial expressions and scientometrics is not new. For instance, Ruocco et al. (2017)
discuss a common principle "success breeds success" (p. 2), albeit not viewing it as
a heuristic, but as a means to characterize bibliometric distributions, as in Schubert and
Glänzel’s (1984) work. From the perspective of research on heuristics, this usage of the
expression corresponds to a characterization of a specific type of environment in which
heuristics (e.g., of scientific authors, reviewers, administrators) may operate: one in
which past success leads to future success. This example also illustrates a point that we
have not addressed in this article—namely that the boundary between qualitative (e.g.,
proverbial) and numerical insights is fluid; one complements the other. For instance,
Simon (1955b) himself has, early on, studied such environmental patterns quantitatively.
Conclusions
To conclude, we agree that it would be wonderful if universal norms for establishing
what one might call ‘rational’ judgments existed! Yet, despite all number crunching,
many judgments in science—be it about findings or research institutions—will neither
be straightforward, clear, and unequivocal, nor will it be possible to ‘validate’ and
‘objectify’ all judgments by seemingly universal external standards. To speak with
Simon and Newell (1958), “The basic fact we have to recognize is that no matter how
strongly we wish to treat problems with the tools our science provides us, we can only
do so when the situations that confront us lie in the area to which the tools apply” (p. 6).
We propose replacing the quest for universal tools with an endorsement of judgment—
and the willingness to act as a consequence of that endorsement. This may not be easy,
but for sure it will be easier than trying to treat scientific bureaucratization madness as
well as obsessions with control and accountability with real opium.
Research evaluation is an activity that can be characterized as judgment—and
eventually also decision-making—under uncertainty (and ambiguity). Uncertainty
is an attribute when ‘selecting under unknown future conditions’; ambiguity and
heterogeneous preferences (biases) regarding evaluation criteria (either individual or
emerging from different fields’ practices) are other attributes. Experienced and trained
evaluators (with expertise in the evaluated field, in bibliometrics, and in statistics) could
be less prone to biased judgements in research evaluation. Wisely and professionally
used, numbers (indicator scores) may play the role of ‘cognitive clues’ that help
experienced and trained evaluators ‘to think fast’ (e.g., under workload). Our essay may
be understood as a wake-up call for a more sensible use of (bibliometric) indicators in the
heterogeneous practices of research evaluation.
Acknowledgements We would like to thank Justin Olds and Guido Palazzo for helpful comments as well
as Gerd Gigerenzer for his encouraging feedback on earlier versions of this manuscript. We thank Jérémy
Orsat and Caroline Plumez for additional checks. We are grateful to the helpful and inspiring comments by
two reviewers.
Funding Open Access funding enabled and organized by Projekt DEAL.
Declarations
Conflict of interest Lutz Bornmann is Editorial Board Member of Scientometrics. No funding was received
for writing this paper. The authors declare they have no financial interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Com-
mons licence, and indicate if changes were made. The images or other third party material in this article
are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly
from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
Aitkenhead, D. (2013). Peter Higgs: I wouldn’t be productive enough for today’s academic system.
Retrieved July 5, 2016, from https:// www. thegu ardian. com/ scien ce/ 2013/ dec/ 06/ peter- higgs-
boson- acade mic- system.
Ajiferuke, I., & Famoye, F. (2015). Modelling count response variables in informetric studies: Compari-
son among count, linear, and lognormal regression models. Journal of Informetrics, 9(3), 499–
513. https:// doi. org/ 10. 1016/j. joi. 2015. 05. 001
Annual Review of Statistics and Its Application. (2020). Journal home. Retrieved January 31, 2020,
from https:// www. annua lrevi ews. org/ journ al/ stati stics.
Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making causal claims: A review and
recommendations. The Leadership Quarterly, 21(6), 1086–1120. https:// doi. org/ 10. 1016/j. leaqua.
2010. 10. 010
Bailey, K. D. (1972). Polythetic reduction of monothetic property space. Sociological Methodology, 4,
83–111. https:// doi. org/ 10. 2307/ 270730
Becker, G. S. (1976). The economic approach to human behavior. The University of Chicago Press.
[Paperback edition from 1978, published 1990]
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on
cognition and affect. Journal of Personality and Social Psychology, 100(3), 407–425. https://doi.org/10.1037/a0021524
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., Brembs,
B., Brown, L., Camerer, C., Cesarini, D., & Johnson, V. E. (2018). Redefine statistical signifi-
cance. Nature Human Behaviour, 2(1), 6–10. https:// doi. org/ 10. 1038/ s41562- 017- 0189-z
Bible. (n.d.). Retrieved March 27, 2024, from https:// www. bible gatew ay. com.
Biesenbender, S., & Hornbostel, S. (2016). The research core dataset for the German science system:
Challenges, processes and principles of a contested standardization project. Scientometrics,
106(2), 837–847. https:// doi. org/ 10. 1007/ s11192- 015- 1816-y
Bingham, C. B., & Eisenhardt, K. M. (2011). Rational heuristics: The ‘simple rules’ that strategists learn
from process experience. Strategic Management Journal, 32, 1437–1464.
Bookstein, A. (1997). Informetric distributions. III. Ambiguity and randomness. Journal of the American
Society for Information Science, 48(1), 2–10. https:// doi. org/ 10. 1002/ (SICI) 1097- 4571(199701)
48:1% 3C2:: AID- ASI2% 3E3.0. CO;2-2
Bornmann, L. (2013). Research misconduct—Definitions, manifestations and extent. Publications, 1(3),
87–98. https:// doi. org/ 10. 3390/ publi catio ns103 0087
Bornmann, L. (2017). Measuring impact in research evaluations: A thorough discussion of methods for,
effects of and problems with impact measurements. Higher Education, 73(5), 775–787. https:// doi.
org/ 10. 1007/ s10734- 016- 9995-x
Bornmann, L. (2020). Bibliometrics-based decision trees (BBDTs) based on bibliometrics-based heuris-
tics (BBHs): Visualized guidelines for the use of bibliometrics in research evaluation. Quantitative
Science Studies, 1, 171–182.https:// doi. org/ 10. 1162/ qss_a_ 00012
Bornmann, L., & Bauer, J. (2015). Which of the world’s institutions employ the most highly cited
researchers? An analysis of the data from highlycited.com. Journal of the Association for Informa-
tion Science and Technology, 66(10), 2146–2148. https:// doi. org/ 10. 1002/ asi. 23396
Bornmann, L., & Daniel, H.-D. (2010). The usefulness of peer review for selecting manuscripts for pub-
lication: A utility analysis taking as an example a high-impact journal. PLoS ONE, 5(6), e11344.
https:// doi. org/ 10. 1371/ journ al. pone. 00113 44
Bornmann, L., Ganser, C., & Tekles, A. (2022). Simulation of the h index use at university departments
within the bibliometrics-based heuristics framework: Can the indicator be used to compare indi-
vidual researchers? Journal of Informetrics, 16, 101237.https:// doi. org/ 10. 1016/j. joi. 2021. 101237
Bornmann, L., & Haunschild, R. (2016). To what extent does the Leiden manifesto also apply to altmetrics?
A discussion of the manifesto against the background of research into altmetrics. Online Information
Review, 40(4), 529–543. https:// doi. org/ 10. 1108/ OIR- 09- 2015- 0314
Bornmann, L., & Marewski, J. N. (2019). Heuristics as conceptual lens for understanding and studying the
usage of bibliometrics in research evaluation. Scientometrics, 120(2), 419–459. https:// doi. org/ 10.
1007/ s11192- 019- 03018-x
Callaway, E. (2016). Beat it, impact factor! Publishing elite turns against impact factor. Nature, 535(7611),
210–211.https:// doi. org/ 10. 1038/ nature. 2016. 20224
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304–1312.https:// doi. org/
10. 1037/ 10109- 028
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. https:// doi. org/ 10.
1037/ 0003- 066X. 49. 12. 997
Cruz-Castro, L., & Sanz-Menendez, L. (2021). What should be rewarded? Gender and evaluation criteria
for tenure and promotion. Journal of Informetrics, 15(3), 101196. https:// doi. org/ 10. 1016/j. joi. 2021.
101196
Czerlinski, J., Gigerenzer, G., & Goldstein, D. G. (1999). How good are simple heuristics? In G. Gigerenzer,
P. M. Todd, & the ABC Research Group, Simple heuristics that make us smart (pp. 97–118). Oxford
University Press.
Dahler-Larsen, P. (2012). The evaluation society. Stanford University Press.
Dahler-Larsen, P. (2018). Theory-based evaluation meets ambiguity: The role of janus variables. American
Journal of Evaluation, 39(1), 6–23. https:// doi. org/ 10. 1177/ 10982 14017 716325
Danziger, K. (1990). Constructing the subject: Historical origins of psychological research. Cambridge
University Press.
Daston, L. (1994). Enlightenment calculations. Critical Inquiry, 21(1), 182–202.https:// www. jstor. org/ sta-
ble/ 13438 91.
Daston, L. (1995). Classical probability in the Enlightenment. Princeton University Press [Second
printing].
de Rijcke, S., Wouters, P. F., Rushforth, A. D., Franssen, T. P., & Hammarfelt, B. (2016). Evaluation prac-
tices and effects of indicator use—A literature review. Research Evaluation, 25(2), 161–169. https://
doi. org/ 10. 1093/ resev al/ rvv038
Desrosières, A. (1998). The politics of large numbers: A history of statistical reasoning. Harvard University
Press.
Douglas, H. (2004). The irreducible complexity of objectivity. Synthese, 138(3), 453–473. https:// doi. org/
10. 1023/B: SYNT. 00000 16451. 18182. 91
Douglas, H. E. (2009). Science, policy, and the value-free idea. University of Pittsburgh Press.
Espeland, W. (2015). Narrating numbers. In R. Rottenburg (Ed.), The world of indicators (pp. 56–75). Cam-
bridge University Press.
Espeland, W. N., Sauder, M., & Espeland, W. (2016). Engines of anxiety: Academic rankings, reputation,
and accountability. Russell Sage Foundation.
European Research Council. (2012). Guide for ERC grant holders. European Research Council (ERC).
Fisher, R. A. (1990a). Statistical methods for research workers (14th edition). In R. A. Fisher (Ed. J.H. Ben-
nett), Statistical methods, experimental design, and scientific inference. Oxford University Press
Fisher, R. A. (1990b). Statistical methods and scientific inference (3rd edition). In R. A. Fisher (Ed. J.H.
Bennett), Statistical methods, experimental design, and scientific inference. Oxford University Press
Forster, M. R. (2000). Key concepts in model selection: Performance and generalizability. Journal of Math-
ematical Psychology, 44, 205–231.https:// doi. org/ 10. 1006/ jmps. 1999. 1284
Gelman, A., & Hennig, C. (2017). Beyond subjective and objective in statistics. Journal of the Royal Sta-
tistical Society: Series A (statistics in Society), 180(4), 967–1033. https:// doi. org/ 10. 1111/ rssa. 12276
Gibson, J. J. (1979/1986). The ecological approach to visual perception. Psychology Press [Original work
published 1979].
Gigerenzer, G. (1987). Probabilistic thinking and the fight against subjectivity. In L. Krüger, G. Gigerenzer,
& M. S. Morgan (Eds.), The probabilistic revolution: Ideas in the sciences (Vol. 2, pp. 11–33). MIT
Press.
Gigerenzer, G. (1991). From tools to theories: A heuristic of discovery in cognitive-psychology. Psychologi-
cal Review, 98(2), 254–267. https:// doi. org/ 10. 1037/ 0033- 295X. 98.2. 254
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis
(Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311–
339). Erlbaum.
Gigerenzer, G. (1996). On narrow norms and vague heuristics: A reply to Kahneman and Tversky (1996).
Psychological Review, 103(3), 592–596. https:// doi. org/ 10. 1037/ 0033- 295x. 103.3. 592
Gigerenzer, G. (2002a). Adaptive thinking: Rationality in the real world. Oxford University Press.
Gigerenzer, G. (2002b). Reckoning with risk: Learning to live with uncertainty. Penguin Books.
Gigerenzer, G. (2004a). Mindless statistics. The Journal of Socio-Economics, 33(5), 587–606. https:// doi.
org/ 10. 1016/j. socec. 2004. 09. 033
Gigerenzer, G. (2004b). Striking a blow for sanity in theories of rationality. In M. Augier, & J. G. March
(Eds.), Models of a man: Essays in memory of Herbert A. Simon (pp. 389–409). MIT Press
Gigerenzer, G. (2014). Risk savvy: How to make good decisions. Viking.
Gigerenzer, G. (2015). On the supposed evidence for libertarian paternalism. Review of Philosophy and
Psychology, 6(3), 361–383. https:// doi. org/ 10. 1007/ s13164- 015- 0248-1
Gigerenzer, G. (2018). Statistical rituals: The replication delusion and how we got there. Advances in
Methods and Practices in Psychological Science, 1(2), 198–218. https:// doi. org/ 10. 1177/ 25152
45918 771329
Gigerenzer, G., & Brighton, H. (2009). Homo heuristicus: Why biased minds make better inferences.
Topics in Cognitive Science, 1, 107–143.https:// doi. org/ 10. 1111/j. 1756- 8765. 2008. 01006.x
Gigerenzer, G., & Marewski, J. N. (2015). Surrogate science: The idol of a universal method for sci-
entific inference. Journal of Management, 41(2), 421–440. https:// doi. org/ 10. 1177/ 01492 06314
547522
Gigerenzer, G., & Murray, D. J. (1987). Cognition as intuitive statistics. Lawrence Erlbaum Associates.
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krueger, L. (1989). The empire of
chance: How probability changed science and everyday life. Cambridge University Press.
Gigerenzer, G., Todd, P. M., & The ABC Research Group. (1999). Simple heuristics that make us smart.
Oxford University Press.
Glänzel, W., & Moed, H. F. (2013). Opinion paper: Thoughts and facts on bibliometric indicators. Scien-
tometrics, 96(1), 381–394. https:// doi. org/ 10. 1007/ s11192- 012- 0898-z
Glasman, J. (2020). Humanitarianism and the quantification of human needs: Minimal humanity.
Routledge.
Goldstein, D. G., & Gigerenzer, G. (2002). Models of ecological rationality: The recognition heuristic.
Psychological Review, 109(1), 75–90.https:// doi. org/ 10. 1037/ 0033- 295x. 109.1. 75
Hacking, I. (1990). The taming of chance. Cambridge University Press.
Hafenbrädl, S., Waeger, D., Marewski, J. N., & Gigerenzer, G. (2016). Applied decision making with
fast-and-frugal heuristics. Journal of Applied Research in Memory and Cognition, 5(2), 215–231.
https:// doi. org/ 10. 1016/j. jarmac. 2016. 04. 011
Hazelkorn, E. (2011). Rankings and the reshaping of higher education: The battle for world-class excel-
lence. Palgrave Macmillan.
Heckman, J. J., & Moktan, S. (2020). Publishing and promotion in economics: The tyranny of the top
five. Journal of Economic Literature, 58(2), 419–470. https:// www. jstor. org/ stable/ 27030 437
Helbing, D., Frey, B. S., Gigerenzer, G., Hafen, E., Hagner, M., Hofsteter, Y., Van Den Hoven, J., Zicari,
R. V., Zwitter, A. (2017). Will democracy survive big data and artificial intelligence? Retrieved
February 8, 2018, from https:// www. scien tific ameri can. com/ artic le/ will- democ racy- survi ve- big-
data- and- artifi cial- intel ligen ce/.
Hicks, D., Wouters, P., Waltman, L., de Rijcke, S., & Rafols, I. (2015). Bibliometrics: The Leiden mani-
festo for research metrics. Nature, 520(7548), 429–431. https:// doi. org/ 10. 1038/ 52042 9a
Himanen, L., Conte, E., Gauffriau, M., Strøm, T., Wolf, B., & Gadd, E. (2024). The SCOPE framework – Implementing the ideals of responsible research assessment [version 2; peer review: 2 approved]. F1000Research, 12(1241). https://doi.org/10.12688/f1000research.140810.2
Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46), 16569–16572. https://doi.org/10.1073/pnas.0507655102
Hobbes, T. (1651). Leviathan. Retrieved March 27, 2024, from https://www.gutenberg.org/files/3207/3207-h/3207-h.htm#link2H_4_0068
Hoffrage, U., Lindsey, S., Hertwig, R., & Gigerenzer, G. (2000). Communicating statistical information. Science, 290(5500), 2261–2262. https://doi.org/10.1126/science.290.5500.2261
Hoffrage, U., & Marewski, J. N. (2015). Unveiling the Lady in Black: Modeling and aiding intuition. Journal of Applied Research in Memory and Cognition, 4(3), 145–163. https://doi.org/10.1016/j.jarmac.2015.08.001
Holton, G., Chang, H., & Jurkowitz, E. (1996). How a scientific discovery is made: A case history. American Scientist, 84(4), 364–375. https://www.jstor.org/stable/29775708
Hug, S. E. (2022). Towards theorizing peer review. Quantitative Science Studies, 3(3), 815–831. https://doi.org/10.1162/qss_a_00195
Hug, S. E. (2024). How do referees integrate evaluation criteria into their overall judgment? Evidence from grant peer review. Scientometrics, 129, 1231–1253. https://doi.org/10.1007/s11192-023-04915-y
Hug, S. E., & Aeschbach, M. (2020). Criteria for assessing grant applications: A systematic review. Palgrave Communications, 6, 37. https://doi.org/10.1057/s41599-020-0412-9
Hvistendahl, M. (2013). China's publication bazaar. Science, 342(6162), 1035–1039. https://doi.org/10.1126/science.342.6162.1035
Kahneman, D., & Tversky, A. (1996). On the reality of cognitive illusions. Psychological Review, 103(3), 582–591. https://doi.org/10.1037/0033-295x.103.3.582
Kahneman, D., Slovic, P., & Tversky, A. (1982). Judgment under uncertainty: Heuristics and biases. Cambridge University Press.
Katsikopoulos, K. V., Marewski, J. N., & Hoffrage, U. (2024). Heuristics for metascience: Simon and Popper. In G. Gigerenzer, S. Mousavi, & R. Viale (Eds.), Elgar companion to Herbert Simon (pp. 300–311). Edward Elgar Publishing.
Katsikopoulos, K. V., Simsek, Ö., Buckmann, M., & Gigerenzer, G. (2020). Classification in the wild: The science and art of transparent decision making. MIT Press.
Katz, J. S., & Hicks, D. (1997). Desktop scientometrics. Scientometrics, 38(1), 141–153. https://doi.org/10.1007/bf02461128
Kleinert, A. (2009). Der messende Luchs. Zwei verbreitete Fehler in der Galilei-Literatur. NTM Zeitschrift für Geschichte der Wissenschaften, Technik und Medizin, 17(2), 199–206. https://doi.org/10.1007/s00048-009-0335-4
Kline, R. B. (2015). The mediation myth. Basic and Applied Social Psychology, 37(4), 202–213. https://doi.org/10.1080/01973533.2015.1049349
Knight, F. H. (1921). Risk, uncertainty and profit. Houghton Mifflin.
Krüger, L., Daston, L. J., & Heidelberger, M. (Eds.). (1987). The probabilistic revolution. Volume 1: Ideas in history. MIT Press.
Kruschke, J. K. (2010). Bayesian data analysis. WIREs Cognitive Science, 1(5), 658–676. https://doi.org/10.1002/wcs.72
Langfeldt, L., Nedeva, M., Sörlin, S., & Thomas, D. A. (2020). Co-existing notions of research quality: A framework to study context-specific understandings of good research. Minerva, 58, 115–137. https://doi.org/10.1007/s11024-019-09385-2
Langfeldt, L., Reymert, I., & Aksnes, D. W. (2021). The role of metrics in peer assessments. Research Evaluation, 30(1), 112–126. https://doi.org/10.1093/reseval/rvaa032
Leibniz, G. W. (1677/1951). Toward a universal characteristic. In P. P. Wiener (Ed.), Leibniz selections (pp. 17–25). Scribner's Sons [Original work published 1677].
Levinthal, D. A., & March, J. G. (1993). The myopia of learning. Strategic Management Journal, 14, 95–112. https://doi.org/10.1002/smj.4250141009
Leydesdorff, L., Wouters, P., & Bornmann, L. (2016). Professional and citizen bibliometrics: Complementarities and ambivalences in the development and use of indicators—A state-of-the-art report. Scientometrics, 109(3), 2129–2150. https://doi.org/10.1007/s11192-016-2150-8
Lopes, L. L. (1991). The rhetoric of irrationality. Theory & Psychology, 1(1), 65–82. https://doi.org/10.1177/0959354391011005
Lopes, L. L. (1992). Three misleading assumptions in the customary rhetoric of the bias literature. Theory & Psychology, 2, 231–236. https://doi.org/10.1177/0959354392022010
Macilwain, C. (2013). Halt the avalanche of performance metrics. Nature, 500(7462), 255. https://doi.org/10.1038/500255a
Manski, C. F. (2011). Choosing treatment policies under ambiguity. Annual Review of Economics, 3(1), 25–49. https://doi.org/10.1146/annurev-economics-061109-080359
Manski, C. F. (2013). Public policy in an uncertain world. Harvard University Press.
March, J. G. (1991). Exploration and exploitation in organizational learning. Organization Science, 2(1), 71–87.
Marewski, J. N., & Bornmann, L. (2018). Opium in science and society: Numbers. Retrieved January 14, 2020, from https://arxiv.org/abs/1804.11210
Marewski, J. N., & Hoffrage, U. (2021). The winds of change: The Sioux, Silicon Valley, society, and simple heuristics. In R. Viale (Ed.), Routledge handbook of bounded rationality (pp. 280–312). Routledge.
Marewski, J. N., & Hoffrage, U. (2024). Heuristics: How simple models of the mind can serve as tools for transparent scientific justification. Manuscript submitted for publication.
Marewski, J. N., Katsikopoulos, K. V., & Guercini, S. (2024). Simon's scissors: Meta-heuristics for decision-makers. Management Decision, 62(13), 283–308. https://doi.org/10.1108/MD-06-2023-1073
Marewski, J. N., & Schooler, L. J. (2011). Cognitive niches: An ecological model of strategy selection. Psychological Review, 118(3), 393–437. https://doi.org/10.1037/a0024143
Martignon, L., & Hoffrage, U. (1999). Why does one-reason decision making work? A case study in ecological rationality. In G. Gigerenzer, P. M. Todd, & The ABC Research Group, Simple heuristics that make us smart (pp. 119–140). Oxford University Press.
Martin, B. R., & Irvine, J. (1983). Assessing basic research—Some partial indicators of scientific progress in radio astronomy. Research Policy, 12(2), 61–90. https://doi.org/10.1016/0048-7333(83)90005-7
Marx, W. (2014). The Shockley-Queisser paper—A notable example of a scientific sleeping beauty. Annalen der Physik, 526(5–6), A41–A45. https://doi.org/10.1002/andp.201400806
Marx, W., & Bornmann, L. (2010). How accurately does Thomas Kuhn's model of paradigm change describe the transition from the static view of the universe to the big bang theory in cosmology? A historical reconstruction and citation analysis. Scientometrics, 84(2), 441–464. https://doi.org/10.1007/s11192-009-0107-x
Marx, W., & Bornmann, L. (2013). The emergence of plate tectonics and the Kuhnian model of paradigm shift: A bibliometric case study based on the Anna Karenina principle. Scientometrics, 94(2), 595–614. https://doi.org/10.1007/s11192-012-0741-6
Marx, W., & Bornmann, L. (2015). On the causes of subject-specific citation rates in Web of Science. Scientometrics, 102(2), 1823–1827. https://doi.org/10.1007/s11192-014-1499-9
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806–834. https://doi.org/10.1037/0022-006X.46.4.806
Merriam-Webster. Statistics. Retrieved October 10, 2023, from https://www.merriam-webster.com/dictionary/statistics
Mischel, W. (2008). The toothbrush problem. Retrieved December 6, 2019, from https://www.psychologicalscience.org/observer/the-toothbrush-problem
Moed, H. F., & Halevi, G. (2015). Multidimensional assessment of scholarly research impact. Journal of the American Society for Information Science and Technology, 66(10), 1988–2002. https://doi.org/10.1002/asi.23314
Mousavi, S., & Gigerenzer, G. (2017). Heuristics are tools for uncertainty. Homo Oeconomicus, 34(4), 361–379. https://doi.org/10.1007/s41412-017-0058-z
Newell, A., Shaw, J. C., & Simon, H. A. (1958). Elements of a theory of human problem solving. Psychological Review, 65, 151–166. https://doi.org/10.1037/h0048495
Newell, A., & Simon, H. A. (1956). The logic theory machine. A complex information processing system. Paper presented at the Symposium on Information Theory, Cambridge, MA, USA. Retrieved April 30, 2021, from https://exhibits.stanford.edu/feigenbaum/catalog/ct530kb5673
Oakes, M. (1990). Statistical inference. Epidemiology Resources Inc.
O'Neil, C. (2016). Weapons of math destruction: How big data increases inequality and threatens democracy. Penguin Random House.
Oreskes, N., & Conway, E. M. (2011). Merchants of doubt. How a handful of scientists obscured the truth on issues from tobacco smoke to global warming. Bloomsbury Press.
Petracca, E. (2021). On the origins and consequences of Simon's modular approach to bounded rationality in economics. The European Journal of the History of Economic Thought, 28, 708–732. https://doi.org/10.1080/09672567.2021.1877760
Pfeffer, J., Salancik, G. R., & Leblebici, H. (1976). Effect of uncertainty on use of social influence in organizational decision-making. Administrative Science Quarterly, 21(2), 227–245. http://www.jstor.com/stable/2392044
Pólya, G. (1945). How to solve it. A new aspect of mathematical method. Princeton University Press [New Princeton Library Edition, 2014].
Porter, T. M. (1992). Quantification and the accounting ideal in science. Social Studies of Science, 22(4), 633–652. https://doi.org/10.1177/030631292022004004
Porter, T. M. (1993). Statistics and the politics of objectivity. Revue De Synthèse, 114(1), 87–101. https://doi.org/10.1007/bf03181156
Porter, T. M. (1995). Trust in numbers: The pursuit of objectivity in science and public life. Princeton University Press.
Porter, T. M. (2015). The flight of the indicator. In R. Rottenburg (Ed.), The world of indicators (pp. 34–55). Cambridge University Press.
Preacher, K. J., & Hayes, A. F. (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior Research Methods, 40(3), 879–891. https://doi.org/10.3758/Brm.40.3.879
REF 2029. (2024). What is the REF? Retrieved August 22, 2024, from https://www.ref.ac.uk/about/what-is-the-ref/
REF. (2014). Research excellence framework. Retrieved August 22, 2024, from https://2014.ref.ac.uk/
Reich, E. S. (2013). Science publishing: The golden club. Nature, 502(7471), 291–293. https://doi.org/10.1038/502291a
Retzer, V., & Jurasinski, G. (2009). Towards objectivity in research evaluation using bibliometric indicators: A protocol for incorporating complexity. Basic and Applied Ecology, 10(5), 393–400. https://doi.org/10.1016/j.baae.2008.09.001
Roberts, S., & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing. Psychological Review, 107(2), 358–367. https://doi.org/10.1037/0033-295X.107.2.358
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44(10), 1276–1284. https://doi.org/10.1037/10109-027
Ruocco, G., Daraio, C., Folli, V., & Leonetti, M. (2017). Bibliometric indicators: The origin of their log-normal distribution and why they are not a reliable proxy for an individual scholar's talent. Palgrave Communications, 3, 17064. https://doi.org/10.1057/palcomms.2017.64
Salancik, G. R., & Pfeffer, J. (1978). Uncertainty, secrecy, and the choice of similar others. Social Psychology, 41(3), 246–255. https://doi.org/10.2307/3033561
Savage, L. J. (1954/1972). The foundations of statistics. Dover Publications [Original work published 1954].
Schatz, G. (2014). The faces of big science. Nature Reviews Molecular Cell Biology, 15(6), 423–426. https://doi.org/10.1038/nrm3807
Schubert, A., & Glänzel, W. (1984). A dynamic look at a class of skew distributions. A model with scientometric applications. Scientometrics, 6, 149–167. https://doi.org/10.1007/BF02016759
Seglen, P. O. (1992). The skewness of science. Journal of the American Society for Information Science, 43(9), 628–638. https://doi.org/10.1002/(SICI)1097-4571(199210)43:9<628::AID-ASI5>3.0.CO;2-0
Severson, K. (2011). Thousands sterilized, a state weighs restitution. Retrieved December 6, 2019, from https://www.nytimes.com/2011/12/10/us/redress-weighed-for-forced-sterilizations-in-north-carolina.html
Shockley, W., & Queisser, H. J. (1961). Detailed balance limit of efficiency of p-n junction solar cells. Journal of Applied Physics, 32(3), 510. https://doi.org/10.1063/1.1736034
Simon, H. A. (1947/1997). Administrative behavior. A study of decision-making processes in administrative organizations (4th ed.). The Free Press [Original work published 1947].
Simon, H. A. (1955a). A behavioral model of rational choice. Quarterly Journal of Economics, 69, 99–118. https://doi.org/10.2307/1884852
Simon, H. A. (1955b). On a class of skew distribution functions. Biometrika, 42, 425–440. https://doi.org/10.1093/biomet/42.3-4.425
Simon, H. A. (1956). Rational choice and the structure of the environment. Psychological Review, 63, 129–138. https://doi.org/10.1037/h0042769
Simon, H. A. (1973). The structure of ill structured problems. Artificial Intelligence, 4, 181–201. https://doi.org/10.1016/0004-3702(73)90011-8
Simon, H. A. (1979). Models of thought. Yale University Press.
Simon, H. A. (1990). Invariants of human behavior. Annual Review of Psychology, 41, 1–19. https://doi.org/10.1146/annurev.ps.41.020190.000245
Simon, H. A. (1992). What is an "explanation" of behavior? Psychological Science, 3(3), 150–161. https://doi.org/10.1111/j.1467-9280.1992.tb00017.x
Simon, H. A. (1996). Models of my life. MIT Press [First published 1991 by Basic Books].
Simon, H. A., & Newell, A. (1958). Heuristic problem solving: The next advance in operations research. Operations Research, 6, 1–10. https://doi.org/10.1287/opre.6.1.1
Smaldino, P. E., & McElreath, R. (2016). The natural selection of bad science. Royal Society Open Science, 3(9), 1–17. https://doi.org/10.1098/rsos.160384
Steinle, F. (2008). Explorieren – Entdecken – Testen. Spektrum der Wissenschaft, 9, 34–41.
Stephens, D. W., & Krebs, J. R. (1986). Foraging theory. Princeton University Press.
Tahamtan, I., & Bornmann, L. (2018). Core elements in the process of citing publications: Conceptual overview of the literature. Journal of Informetrics, 12(1), 203–216. https://doi.org/10.1016/j.joi.2018.01.002
Thaler, R. H., & Sunstein, C. R. (2009). Nudge: Improving decisions about health, wealth, and happiness. Penguin Books.
Thomas, D. A., Nedeva, M., Tirado, M. M., & Jacob, M. (2020). Changing research on research evaluation: A critical literature review to revisit the agenda. Research Evaluation, 29(3), 275–288. https://doi.org/10.1093/reseval/rvaa008
Thonon, F., Boulkedid, R., Delory, T., Rousseau, S., Saghatchian, M., van Harten, W., O'Neill, C., & Alberti, C. (2015). Measuring the outcome of biomedical research: A systematic literature review. PLoS ONE, 10(4), e0122239. https://doi.org/10.1371/journal.pone.0122239
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185(4157), 1124–1131. https://doi.org/10.1126/science.185.4157.1124
van Raan, A. F. J. (2004). Sleeping beauties in science. Scientometrics, 59(3), 467–472. https://doi.org/10.1023/B:SCIE.0000018543.82441.f1
van Raan, A. F. J. (2008). Bibliometric statistical properties of the 100 largest European research universities: Prevalent scaling rules in the science system. Journal of the American Society for Information Science and Technology, 59(3), 461–475. https://doi.org/10.1002/asi.20761
van Raan, A. F. J., van Leeuwen, T. N., & Visser, M. S. (2011). Severe language effect in university rankings: Particularly Germany and France are wronged in citation-based rankings. Scientometrics, 88(2), 495–498. https://doi.org/10.1007/s11192-011-0382-1
Vickers, B. (1992). Francis Bacon and the progress of knowledge. Journal of the History of Ideas, 53(3), 495–518. https://www.jstor.org/stable/2709891
Vindolanda Tablets Online. (2018). Vindolanda tablets online (Tablet 154). Retrieved March 22, 2018, from http://vindolanda.csad.ox.ac.uk/
Vinkler, P. (2010). The evaluation of research by scientometric indicators. Chandos Publishing.
Waltman, L., & van Eck, N. J. (2012). The inconsistency of the h-index. Journal of the American Society for Information Science and Technology, 63(2), 406–415. https://doi.org/10.1002/asi.21678
Waltman, L., & van Eck, N. J. (2016). The need for contextualized scientometric analysis: An opinion paper. In I. Ràfols, J. Molas-Gallart, E. Castro-Martínez, & R. Woolley (Eds.), Proceedings of the 21st International Conference on Science and Technology Indicators (pp. 541–549). Universitat Politècnica de València.
Weingart, P. (2005). Impact of bibliometrics upon the science system: Inadvertent consequences? Scientometrics, 62(1), 117–131. https://doi.org/10.1007/s11192-005-0007-7
Wildgaard, L., Schneider, J. W., & Larsen, B. (2014). A review of the characteristics of 108 author-level bibliometric indicators. Scientometrics, 101(1), 125–158. https://doi.org/10.1007/s11192-014-1423-3
Wilsdon, J., Allen, L., Belfiore, E., Campbell, P., Curry, S., Hill, S., & Johnson, B. (2015). The metric tide: Report of the independent review of the role of metrics in research assessment and management. Higher Education Funding Council for England (HEFCE). https://doi.org/10.13140/RG.2.1.4929.1363
Yates, F. (1990). Foreword. In R. A. Fisher (Ed. J. H. Bennett), Statistical methods, experimental design, and scientific inference. Oxford University Press.
Young, K. (1922). Intelligence tests of certain immigrant groups. The Scientific Monthly, 15(5), 417–434. https://www.jstor.org/stable/6403
Zenker, O. (2015). Failure by the numbers? Settlement statistics as indicators of state performance in South African land restitution. In R. Rottenburg (Ed.), The world of indicators (pp. 102–126). Cambridge University Press.
Ziliak, S., & McCloskey, D. N. (2012). The cult of statistical significance. How the standard error costs us jobs, justice, and lives. The University of Michigan Press.
Ziman, J. (2000). Real science. What it is, and what it means. Cambridge University Press.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.