Scientometrics
https://doi.org/10.1007/s11192-021-03864-8
Reliability ofresearcher capacity estimates andcount data
dispersion: acomparison ofPoisson, negative binomial,
andConway‑Maxwell‑Poisson models
BorisForthmann1 · PhilippDoebler2
Received: 10 August 2020 / Accepted: 5 January 2021
© The Author(s) 2021
Abstract
Item-response models from the psychometric literature have been proposed for the estimation of researcher capacity. Canonical items that can be incorporated in such models to reflect researcher performance are count data (e.g., number of publications, number of citations). Count data can be modeled by Rasch's Poisson counts model, which assumes equidispersion (i.e., mean and variance must coincide). However, the mean can be (a) larger than the variance (i.e., underdispersion) or (b) smaller than the variance (i.e., overdispersion). Ignoring the presence of overdispersion (underdispersion) can cause standard errors to be liberal (conservative) when the Poisson model is used. Indeed, the number of publications and the number of citations are known to display overdispersion. Underdispersion, however, is far less acknowledged in the literature. In the current investigation, the flexible Conway-Maxwell-Poisson count model is used to examine reliability estimates of capacity in relation to various dispersion patterns. It is shown that the reliability of capacity estimates of inventors drops from .84 (Poisson) to .68 (Conway-Maxwell-Poisson) or .69 (negative binomial). Moreover, with some items displaying overdispersion and some items displaying underdispersion, the dispersion pattern in a reanalysis of Mutz and Daniel's (2018b) researcher data was found to be more complex than previous results suggested. To conclude, a careful examination of competing models, including the Conway-Maxwell-Poisson count model, should be undertaken prior to any evaluation and interpretation of capacity reliability. Moreover, this work shows that count data psychometric models are well suited for decisions with a focus on top researchers, because conditional reliability estimates (i.e., reliability depending on the level of capacity) were highest for the best researchers.
Keywords Researcher capacity· Item response models· Rasch Poisson count model·
Conway-Maxwell-Poisson count model· Dispersion· Reliability
Mathematics subject classification 62P25
* Boris Forthmann
boris.forthmann@wwu.de
1 Institute ofPsychology inEducation, University ofMünster, Münster, Germany
2 Department ofStatistics, TU Dortmund University, Dortmund, Germany
JEL classification C18
Introduction
Count data are part of the daily bread and butter of scientometric researchers. Perhaps most prominently, the number of publications and the number of citations are examined. A default statistical model for count data relies on the Poisson distribution. For example, a Poisson model was used to describe the scientific production of mathematicians, physicists, and inventors (Huber 2000; Huber and Wagner-Döbler 2001a, 2001b). However, citation count distributions in particular are known to have a smaller mean than variance (Didegah and Thelwall 2013; Ketzler and Zimmermann 2013). This phenomenon is known as overdispersion, and it violates equidispersion (i.e., mean and variance coincide), an important assumption for valid statistical inference based on the Poisson model (Hilbe 2011). The issue has been acknowledged in numerous scientometric studies (Didegah and Thelwall 2013; Ketzler and Zimmermann 2013; Sun and Xia 2016), with some of them dating back to the early 1980s (Cohen 1981). Underdispersion, however, has received much less attention in the scientometric literature. This is illustrated by a simple search in Google Scholar (https://scholar.google.com/) that reveals 69 hits for the search string "source:scientometrics overdispersion" and 4 hits for the search string "source:scientometrics underdispersion" (retrieved on August 4, 2020). This huge difference can most likely be explained by the perception that conservative inference based on overestimated standard errors (Faddy and Bosch 2001) is less problematic than liberal inference.
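As a minimal illustration of this diagnostic idea (the counts below are simulated for illustration only, not taken from any dataset analyzed in this work), the dispersion index, i.e., the variance-to-mean ratio, directly signals the relevant cases:

```r
# Dispersion index: > 1 indicates overdispersion, < 1 underdispersion, ~1 equidispersion.
set.seed(1)
citations    <- rnbinom(500, mu = 8, size = 1.5)  # simulated overdispersed counts
publications <- rpois(500, lambda = 6)            # simulated equidispersed counts

dispersion_index <- function(y) var(y) / mean(y)
dispersion_index(citations)     # clearly larger than 1
dispersion_index(publications)  # close to 1
```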
Moreover, Forthmann et al. (2019a) have recently shown that such dispersion issues
affect also the measurement precision of individual capacity estimates in item-response the-
ory (IRT) models. In a large simulation study, they demonstrated that reliability estimates
of person capacity derived from the Rasch Poisson count model (RPCM; i.e., an IRT model
relying on the Poisson distribution; Rasch 1960) can be huge overestimates for an overdis-
persed population model. Analogously, RPCM-based reliability was found to be underesti-
mated when data were simulated according to an underdispersed population model. In addi-
tion, also more complex population models with fractions of the items being overdispersed
and underdispersed, respectively, resulted in mis-estimated RPCM-based reliabilities. These
observations are critically important for applications of IRT count data models to quantify
research performance. For example, the RPCM has been proposed as an elegant approach to
measure researcher capacity based on a variety of bibliometric indicators (Mutz and Daniel
2018b). However, analogous to the known problems related to statistical inference, reliabil-
ity for researcher capacity estimates derived from the RPCM can be heavily overestimated
in the presence of overdispersion. Mutz and Daniel (2018b) found an impressive reliability
estimate of 0.96 which might be misestimated. Hence, the aim of the current study is to inves-
tigate potentially misestimated measurement precision of researcher capacities by applying the
RPCM, IRT count models based on the negative binomial distribution, and the more flexible
Conway-Maxwell-Poisson count model in a large dataset of inventors and a reanalysis of Mutz
and Daniel’s dataset.
Estimating researcher capacity byitem response theory models
The measurement of research performance is critically important for many practical purposes such as the allotment of funding, candidate selection for academic jobs, interim evaluations of researchers after starting a new position, and the management of research strategies at the institutional level (Mutz and Daniel 2018b; Sahel 2011; Zhang et al. 2019). Most often, single indicators such as the h index (Hirsch 2005) are used to quantify performance at the level of individual researchers. But a merely deterministic use of bibliometric indicators has shortcomings that can be overcome: probabilistic accounts of bibliometric measures allow, for example, the quantification of measurement precision (Glänzel and Moed 2013). In this vein, it should be noted that stochastic variants of bibliometric indicators such as the h index exist (Burrell 2007). The contrast between deterministic and probabilistic measurement accounts is paralleled in the psychometric literature. IRT models are probabilistic in nature (Lord 1980) and are used to link properties of test items (i.e., item parameters such as difficulty) and the capacity of individuals (i.e., ability to solve the items) with observed behavior (i.e., whether an item has been solved by a person or not). Indeed, an IRT model from the Rasch family was used by Alvarez and Pulgarín (1996) to scale journal impact based on citation and publication counts. However, this approach did not resonate well in the scientometric literature (Glänzel and Moed 2013).

More recently, Mutz and Daniel (2018b) revived the idea of using IRT scaling for the assessment of researcher capacity and highlighted five problems with commonly used approaches to assess researcher capacity [for more details see Mutz and Daniel (2018b) and the references therein]: (a) indicators are observed variables and do not represent an abstract latent variable reflecting a researcher's competency rather than mere observed performance, (b) measurement error (or precision) is not appropriately taken into account, (c) the count data nature of most performance indicators is not appropriately taken into account, (d) multiple indicators can be reduced to rather few underlying latent dimensions (suggesting that looking at many indicators for evaluation purposes might be unnecessary), and (e) indicators are potentially not comparable between different scientific fields. Mutz and Daniel (2018b) suggested using Doebler et al.'s (2014) extension of the RPCM to overcome these problems.
From theRasch Poisson counts model totheConway‑Maxwell‑Poisson
counts model: gaining modeling exibility
In the psychological literature, the RPCM was used to scale many different cognitive abilities such as reading competency (Jansen 1995; Jansen and van Duijn 1992; Rasch 1960), intelligence (Ogasawara 1996), mental speed (Baghaei et al. 2019; Doebler and Holling 2016; Holling et al. 2015), or divergent thinking (Forthmann et al. 2016, 2018). The model was further used to model migraine attacks (Fischer 1987) or sports exercises such as sit-ups (Zhu and Safrit 1993). All these examples have in common with most bibliometric indicators that they provide count data as responses for each of the respective test items (e.g., the number of correctly read words within an allotted test time). In this work, the RPCM and the other IRT models will be introduced and used within a Generalized Linear Mixed Model (GLMM) framework that allows linking the expected value with a log-linear term including a researcher capacity parameter and an item easiness parameter (De Boeck et al. 2011). In addition, a variance function exists for each of the models, i.e., the variance depends on an (item-specific) dispersion parameter and the person- and item-specific conditional mean. For example, in the RPCM the expected value μji of count yji for person j and item i is modeled as a multiplicative function of the capacity parameter εj > 0 and the item easiness parameter σi > 0:

$$\mu_{ji} = \varepsilon_j \sigma_i \qquad (1)$$

Then, the mean in the RPCM is in fact log-linear:

$$\mu_{ji} = \varepsilon_j \sigma_i = \mu(\theta_j, \beta_i) = \exp(\beta_i + \theta_j) \qquad (2)$$

for the log-transformed capacity parameter θj = ln(εj) and item easiness parameter βi = ln(σi). The higher βi, the easier it is to obtain a comparably large item score. The probability to observe yji is Poisson, so the probability mass function is

$$P(Y_{ji} = y_{ji} \mid \varepsilon_j, \sigma_i) = \frac{\mu_{ji}^{\,y_{ji}}}{y_{ji}!}\exp(-\mu_{ji}) \qquad (3)$$

This distributional assumption implies equidispersion, i.e., Var(Yji) = E(Yji). The dispersion parameter φi equals 1 for all items, and the variance function is Vji(μji, φi) = μjiφi = μji for the RPCM.
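To make the multiplicative structure of Eqs. 1–3 concrete, the following sketch simulates counts for a single hypothetical person-item combination with arbitrary parameter values and checks the equidispersion property; mixing over a normal capacity distribution, by contrast, already yields marginal overdispersion even though the conditional model is Poisson:

```r
# RPCM data-generating process (Eqs. 1-3); parameter values are arbitrary illustrations.
set.seed(2)
beta  <- 1.2                          # log item easiness
theta <- 0.5                          # log capacity of one hypothetical researcher
mu    <- exp(beta + theta)            # conditional mean (Eq. 2): exp(1.7) ~ 5.47

y <- rpois(10000, lambda = mu)        # replicate counts for this person-item pair
c(mean = mean(y), variance = var(y))  # both close to 5.47: equidispersion

# Marginally, i.e., mixing over normally distributed capacities, variance > mean:
theta_j <- rnorm(10000, mean = 0, sd = 0.5)
y_marg  <- rpois(10000, lambda = exp(beta + theta_j))
c(mean = mean(y_marg), variance = var(y_marg))
```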
Mutz and Daniel (2018b) used Doebler et al.'s (2014) extension of the RPCM in which an upper asymptote for performance is proposed, intended for speed-of-processing measures in which simple cognitive tasks have to be solved under time constraints. The Doebler et al. (2014) model is coined the Item Characteristic Curve Poisson counts model (ICCPCM), because it borrows the sigmoid shape of item characteristic curves in binary IRT to describe the conditional means. Because of the known dispersion issues with bibliometric indicators (Didegah and Thelwall 2013; Ketzler and Zimmermann 2013; Sun and Xia 2016), Mutz and Daniel (2018b) used a Poisson-Gamma mixture extension of Doebler et al.'s speed model (i.e., a negative binomial model). They performed a thorough simulation study on the parameter recovery of this model and recommended using it only with large sample sizes (N = 400 or higher). Hence, to reduce sample size requirements for the purpose of the current work, we use the comparably less complex negative binomial model (Hung 2012) in which the expected value is modeled following Eq. 2, but Eq. 3 is replaced by a negative binomial assumption. The variance function for this model is Vji(μji, φi) = μji + μ²ji/φi = μji + μ²ji/exp(τi), which allows item-specific overdispersion, but not underdispersion. It should further be noted that this variant of the negative binomial distribution has been coined NB2 in the literature because of the quadratic nature of its variance parameterization (Hilbe 2011). This model will be referred to as the NB2 counts model (NB2CM) in this work. For completeness, we will also consider an NB1 counts model (NB1CM; i.e., a model with a linear variance parameterization; Hilbe 2011) with μji = exp(βi + θj) and Vji(μji, φi) = μji(1 + φi) = μji(1 + exp(τi)). The RPCM results as a border case of the NB1CM: when τi approaches −∞, exp(τi) approaches 0 for all items and hence Vji(μji, φi) goes to μji.
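For orientation, the variance functions of the models introduced so far can be collected side by side (this is only a recap of the formulas stated above, with the dispersion parameter modeled on the log scale as exp(τi)):

$$V_{ji}^{\mathrm{RPCM}} = \mu_{ji}, \qquad V_{ji}^{\mathrm{NB1CM}} = \mu_{ji}\left(1 + \exp(\tau_i)\right), \qquad V_{ji}^{\mathrm{NB2CM}} = \mu_{ji} + \mu_{ji}^{2}/\exp(\tau_i).$$

The Poisson variance is thus recovered from the NB1CM as τi approaches −∞ and from the NB2CM as τi approaches +∞; neither negative binomial variant can produce a variance below the mean.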
The third and final model considered in this work is the Conway-Maxwell-Poisson counts model (CMPCM; Forthmann et al. 2019a). The CMPCM extends the RPCM to allow for both overdispersion and underdispersion at the level of items. Hence, with respect to dispersion modeling, the CMPCM can be considered the most flexible approach examined in this work. Importantly, the CMPCM is based on Huang's (2017) mean parameterization of the Conway-Maxwell-Poisson distribution that leads to a log-linear model
for the conditional mean as in Eq.2. The conditional variance function for the CMPCM,
however, cannot be provided in a simple formula. However, as it is the case for the other
count models above, the variance is a function of the mean and dispersion parameter (see
Huang 2017). Importantly, this parameterization is different from other suggested regres-
sion models based on the Conway-Maxwell-Poisson distribution (e.g., Guikema and Gof-
felt 2008; Sellers and Shmueli 2010), and is a bona fide generalized linear model (GLM;
Huang 2017). Finally, in all four models, RPCM, NB2CM, NB1CM, and CMPCM, it is
assumed that the capacity variable θ follows a marginal normal distribution with mean zero
and standard deviation
𝜎2
𝜃.
This distributional assumption for θ implies that the item easi-
ness parameters βi can be interpreted as expected value for an item on the log-scale when
researchers have average capacity of zero on log-scale. In addition, dispersion parameters
τi reflect error variance and local reliability for the models based on the CMP and negative
binomial distributions (concrete interpretation depends on the respective distributions).
All model parameters are estimated by means of a marginal maximum likelihood (MML)
approach which is common in the IRT literature (e.g., De Boeck etal. 2011).
Forthmann etal. (2019a) found in a simulation study that item easiness parameters and
standard deviation of the latent capacity parameters in a CMPCM were already consist-
ently estimated with sample sizes as small as N = 100. In addition, item-specific dispersion
parameter estimates were found to be only slightly underestimated (but increasing the num-
ber of items seemed to prevent this bias). Hence, sample size requirements for this flexible
IRT model are less demanding as compared to Mutz and Daniel’s (2018b) complex NB2
extension of the ICCPCM (Doebler etal. 2014). This renders the CMPCM a potentially
useful candidate model that needs to be examined in relation to negative binomial models
that are perhaps most often applied in the scientometric literature to account for dispersion
issues (e.g., Didegah and Thelwall 2013; Ketzler and Zimmermann 2013).
Reliability ofresearcher capacity estimates
Reliability has been defined within the framework of classical test theory (e.g., Gulliksen 1950). The basic equation of classical test theory proposes that an observed test score X results as the sum of a true score T and an error term E (i.e., X = T + E). Reliability refers to the ratio of true score variance to observed score variance or, alternatively, one minus the ratio of error variance to observed score variance. Hence, reliability estimates quantify the measurement precision of test scores and have an intuitive metric ranging from zero to one. In the tradition of IRT, however, measurement precision of ability estimates refers to the usage of standard errors and confidence intervals conditional on the ability level, but summary indices of measurement precision that are based on the information available conditional on ability have also been developed for IRT. In this vein, reliability has been coined reliability of person separation, marginal reliability, or empirical reliability as early as the 1980s (Green et al. 1984; Wright and Masters 1982), and it is defined as the ratio of the estimated true ability variance adjusted for measurement error (i.e., error variance as quantified by the average of the ability-specific squared standard errors) to the unadjusted estimated true ability variance. Hence, reliability here displays to some degree a conceptual similarity with reliability as defined within classical test theory (but it should not be confused with it). For example, error variance is constant in classical test theory, but it varies as a function of persons within IRT, which requires averaging across error variances to yield a reliability estimate (Wang 1999). Alternatively, it can be understood as the squared correlation (again, analogous to classical test theory) between the estimated ability parameters and the true ability parameters (Brown and Croudace 2015; Brown 2018). This quantification of reliability is easy to calculate for a variety of available estimators of ability parameters (Brown and Croudace 2015; Brown 2018) and provides an established intuitive metric.
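In symbols, the classical test theory definition described above reads

$$\mathrm{Rel} = \frac{\mathrm{Var}(T)}{\mathrm{Var}(X)} = 1 - \frac{\mathrm{Var}(E)}{\mathrm{Var}(X)},$$

and the empirical reliability used in this work (Eq. 4 in the Method section) follows the same logic, with the average squared standard error of the capacity estimates taking the role of the error variance.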
Notably, the empirical reliability estimates are based on the estimate of the variance of the capacity distribution and on the standard errors of the capacity estimates (more details are provided in the Method section below). Hence, biases in these estimates would directly result in misestimated empirical reliability. In a previous simulation study that focused on the CMPCM (Forthmann et al. 2019a), we found accurate estimates of the ability variance and accurate standard errors even for sample sizes as small as N = 100. Parameter recovery for the CMPCM has not yet been examined for sample sizes smaller than N = 100; hence, for situations in which only smaller samples are available, new simulations should be run to examine potential bias of reliability estimates. Furthermore, whether these simulation findings generalize to the negative binomial models is also an open question and requires attention for situations with smaller sample sizes. As a more general remark, it should be noted that reliability is population specific and invariance across samples is not guaranteed (e.g., samples that are affected by range restriction may not result in an accurate estimate of the capacity variance). Finally, the measurement precision of reliability estimates can be assumed to be a function of the value of reliability, with wider confidence intervals for reliability around the .50s and quite narrow confidence intervals for excellent reliability above .90. Analytically, this is guaranteed by the results of Feldt et al. (1987) and valid for cases in which Cronbach's α and reliability coincide.
Aim ofthecurrent study
The goal of this study is a thorough comparison of various available count data IRT models (i.e., RPCM, NB2CM, NB1CM, and CMPCM) based on two scientometric datasets. All of these models allow modeling of the expected value in a log-linear fashion. The models differ with respect to their capability to model dispersion in the data. The RPCM is the least flexible model, being based on the assumption of equidispersion. The NB2CM and NB1CM allow global or item-specific overdispersion modeling, whereas the CMPCM is the most flexible model, allowing global or item-specific overdispersion and/or underdispersion. Hence, the first goal is to compare these models in terms of their relative fit to the data based on information criteria that also take model parsimony into account. In addition, a series of likelihood ratio tests is used to compare nested models of increasing complexity. The second and main aim of this work is to examine the reliability of the researcher capacity estimates for the best fitting model and the other competing candidate models. Standard errors are known to be liberal (conservative) when data are overdispersed (underdispersed), which biases statistical inference based on the simple Poisson model. However, Forthmann et al. (2019a) demonstrated that these known problems further affect the reliability of capacity estimates. The RPCM was found to overestimate reliability under an overdispersed simulation model, and an underdispersed population model resulted in underestimation of capacity reliability estimates. Hence, this work extends the promising work by Mutz and Daniel (2018b) in three important ways: (a) comparing a greater variety of distributional models, (b) examining less complex models with lower demands in terms of sample size, and (c) providing a detailed check of the inter-relatedness of capacity reliability estimates and whether or how dispersion was taken into account.
Method
Data sources
Patent dataset
The first dataset is a subset of the patent dataset provided by the National Bureau of Economic Research (https://data.nber.org/patents/). The full dataset is described in Hall et al. (2001), and we use the file in which inventors were identified by a disambiguation algorithm (Li et al. 2014). The disambiguated data are openly available at the Harvard Dataverse (https://dataverse.harvard.edu/dataverse/patent). Here we use the same subset of N = 3055 inventors that was used by Forthmann et al. (2019b). To control for issues arising from data truncation, only inventors with careers falling within the years 1980 to 2003 were used. Moreover, inventors were required to have at least one patent in each of the six four-year intervals into which the data were split. In this study, we use the number of patents granted as an indicator of inventive capacity. Hence, the number of patents in each of the six four-year intervals, respectively, was used as one item in this study (i.e., six items in total). It has been argued that productivity within a period of time (annual productivity is perhaps most often used in this regard) is highly relevant as a measure of scientific excellence (Yair and Goldstein 2020).
Mutz andDaniel’s dataset
The second dataset comprises N = 254 German social sciences researchers who were listed in a membership directory of a quantitative methodology division of an unspecified academic society (Mutz and Daniel 2018b). The following six bibliometric indicators measure researcher capacity (Mutz and Daniel 2018b): (a) TOTCIT: total number of citations received (excluding the highest cited paper), (b) SHORTCIT: number of citations received within a 3-year citation window, (c) NUMPUB: total number of published articles, (d) TOP10%: number of publications in the top 10% of the researcher's scientific field, (e) PUBINT: the number of papers published together with international co-authors, and (f) NUMCIT: the number of papers that received at least one citation. The dataset is openly available for reanalysis (Mutz and Daniel 2018a).
Analytical approach
All models were fitted with the statistical software R (R Core Team 2019) by means of the glmmTMB package (Brooks et al. 2017). All R scripts and links for downloading the datasets are provided in the online repository of this work (https://osf.io/em642/). All models were fitted with the same log-linear model to predict the expected value (see Eq. 2) and a normally distributed capacity parameter θ on the log scale. The mean of the capacity parameter distribution was fixed to a value of zero and the variance σ²θ was estimated (see Forthmann et al. 2019a). The glmmTMB package provides empirical Bayes estimates for each θj by means of the maximum a posteriori (MAP) estimator. The NB2CM, NB1CM, and CMPCM were fit in two variants. First, more parsimonious models with only one dispersion parameter for all items were estimated. Then, models with item-specific dispersion parameters were fit. Dispersion parameters in glmmTMB are modeled with a log link. For the NB1CMs, large negative estimates therefore indicate the absence of overdispersion (the Poisson limit; see the variance function above), whereas for the NB2CMs the Poisson limit is instead approached for large positive values, so that zero is not a useful reference point. For the CMPCMs, negative values imply underdispersion, a value of zero implies equidispersion, and positive values imply overdispersion (Forthmann et al. 2019a).
Table 1 Patent data: Model estimation results for the RPCM, NB2CMs, and CMPCMs

| | RPCM | NB2CM, global dispersion | NB2CM, item-specific dispersion | CMPCM, global dispersion | CMPCM, item-specific dispersion |
|---|---|---|---|---|---|
| Fixed effects β (SE_β) | | | | | |
| 1980–1983 | 0.719 (0.015)*** | 0.769 (0.017)*** | 0.760 (0.016)*** | 0.762 (0.018)*** | 0.750 (0.015)*** |
| 1984–1987 | 1.192 (0.013)*** | 1.221 (0.016)*** | 1.218 (0.015)*** | 1.219 (0.016)*** | 1.220 (0.015)*** |
| 1988–1991 | 1.414 (0.013)*** | 1.429 (0.015)*** | 1.428 (0.015)*** | 1.434 (0.015)*** | 1.438 (0.015)*** |
| 1992–1995 | 1.504 (0.012)*** | 1.507 (0.015)*** | 1.511 (0.016)*** | 1.519 (0.015)*** | 1.525 (0.016)*** |
| 1996–1999 | 1.438 (0.013)*** | 1.455 (0.015)*** | 1.461 (0.016)*** | 1.458 (0.015)*** | 1.467 (0.016)*** |
| 2000–2003 | 1.029 (0.014)*** | 1.075 (0.016)*** | 1.081 (0.017)*** | 1.064 (0.017)*** | 1.072 (0.017)*** |
| Random effects | | | | | |
| σ̂²θ | 0.259 | 0.208 | 0.204 | 0.218 | 0.206 |
| Average SE²θ | 0.042 | 0.064 | 0.063 | 0.067 | 0.066 |
| Empirical reliability | 0.837 | 0.694 | 0.689 | 0.692 | 0.682 |
| Dispersion τ (SE_τ) | | | | | |
| Global | 0 | 1.411 (0.022)*** | | 1.015 (0.020)*** | |
| 1980–1983 | | | 1.883 (0.089)*** | | 0.243 (0.048)*** |
| 1984–1987 | | | 1.525 (0.062)*** | | 0.805 (0.051)*** |
| 1988–1991 | | | 1.505 (0.058)*** | | 0.960 (0.050)*** |
| 1992–1995 | | | 1.246 (0.049)*** | | 1.400 (0.058)*** |
| 1996–1999 | | | 1.261 (0.050)*** | | 1.353 (0.058)*** |
| 2000–2003 | | | 1.275 (0.055)*** | | 1.123 (0.062)*** |
| Model comparison | | | | | |
| Δχ²(df) | | 6717.07 (1)*** | 58.90 (5)*** | 4792.74 (1)*** | 231.66 (5)*** |
| AIC | 91,533.90 | 84,818.82 | **84,769.92** | 86,743.15 | 86,521.49 |
| BIC | 91,588.61 | 84,881.36 | **84,871.54** | 86,805.68 | 86,623.11 |
| Akaike weights | 0 | 0 | 1 | 0 | 0 |

N = 3055 (18,330 observations); τ = dispersion parameter (CMPCMs: τ < 0 indicates underdispersion, τ = 0 equidispersion, and τ > 0 overdispersion; NB2CMs: zero is not a useful reference for τ). Likelihood ratio tests: each model with global dispersion is compared with the RPCM, and each model with item-specific dispersion is compared with the same model with global dispersion. Δχ² = likelihood ratio statistic; AIC = Akaike's Information Criterion; BIC = Bayesian Information Criterion. Lower values of the information criteria imply better model fit when model parsimony is also taken into account; the BIC penalizes model complexity more strongly than the AIC. Values in bold indicate the best fitting model based on AIC and BIC, respectively.
*p < .05; **p < .01; ***p < .001
Table 2 Data from Mutz and Daniel (2018a, b): Model estimation results for the RPCM, NB1CMs, and CMPCMs

| | RPCM | NB1CM, global dispersion | NB1CM, item-specific dispersion^a | CMPCM, global dispersion | CMPCM, item-specific dispersion |
|---|---|---|---|---|---|
| Fixed effects β (SE_β) | | | | | |
| TOTCIT | 3.900 (0.130)*** | 4.178 (0.109)*** | 4.688 (0.108)*** | 4.415 (0.117)*** | 3.946 (0.124)*** |
| SHORTCIT | 3.079 (0.130)*** | 3.383 (0.111)*** | 3.868 (0.103)*** | 3.333 (0.119)*** | 3.518 (0.116)*** |
| NUMPUB | 1.542 (0.130)*** | 1.940 (0.117)*** | 2.330 (0.096)*** | 2.169 (0.124)*** | 2.194 (0.101)*** |
| TOP10% | −0.542 (0.135)*** | 0.291 (0.134)* | 0.247 (0.110)* | −0.201 (0.142) | 0.076 (0.116) |
| PUBINT | 0.376 (0.132)** | 0.908 (0.126)*** | 1.164 (0.105)*** | 0.811 (0.132)*** | 0.995 (0.111)*** |
| NUMCIT | 1.298 (0.131)*** | 1.764 (0.118)*** | 2.085 (0.096)*** | 1.835 (0.125)*** | 1.951 (0.100)*** |
| Random effects | | | | | |
| σ̂²θ | 4.198 | 2.824 | 1.868 | 3.169 | 2.478 |
| Average SE²θ | 0.078 | 0.119 | 0.082 | 0.182 | 0.067 |
| Empirical reliability | 0.981 | 0.958 | 0.956 | 0.942 | 0.973 |
| Dispersion τ (SE_τ) | | | | | |
| Global | 0 | 2.208 (0.060)*** | | 3.494 (0.020)*** | |
| TOTCIT | | | 5.106 (0.106)*** | | 8.566 (1.934)*** |
| SHORTCIT | | | 3.717 (0.099)*** | | 5.260 (0.312)*** |
| NUMPUB | | | −25.306 (0.129)*** | | −0.322 (0.116)** |
| TOP10% | | | 0.214 (0.234) | | 1.260 (0.230)*** |
| PUBINT | | | 0.963 (0.161)*** | | 1.695 (0.189)*** |
| NUMCIT | | | −25.306 (0.129)*** | | −2.173 (0.169)*** |
| Model comparison | | | | | |
| Δχ²(df) | | 6102.10 (1)*** | 1483.00 (5)*** | 6932.91 (1)*** | 700.34 (5)*** |
| AIC | 17,579.28 | 11,479.16 | 10,004.21 | 10,648.37 | **9958.03** |
| BIC | 17,616.58 | 11,521.80 | 10,068.16 | 10,691.00 | **10,027.31** |
| Akaike weights | 0 | 0 | 0 | 0 | 1 |

N = 254 (1524 observations); τ = dispersion parameter (CMPCMs: τ < 0 indicates underdispersion, τ = 0 equidispersion, and τ > 0 overdispersion; NB1CMs: zero is not a useful reference for τ). Likelihood ratio tests: each model with global dispersion is compared with the RPCM, and each model with item-specific dispersion is compared with the same model with global dispersion. Δχ² = likelihood ratio statistic; AIC = Akaike's Information Criterion; BIC = Bayesian Information Criterion. Lower values of the information criteria imply better model fit when model parsimony is also taken into account; the BIC penalizes model complexity more strongly than the AIC. Values in bold indicate the best fitting model based on AIC and BIC, respectively.
^a The item-specific dispersion parameter estimates for Item 3 (NUMPUB) and Item 6 (NUMCIT) were constrained to be equal for the NB1CM with item-specific dispersion.
*p < .05; **p < .01; ***p < .001
overdispersion (Forthmann etal. 2019a). The tables in which model results are reported
include a note on these different interpretations of the dispersion parameters to facilitate
interpretation (see Tables 1 and 2). Models with increasing complexity were compared
based on likelihood ratio tests (denoted by Δχ2) and the Akaike information criterion (AIC;
Akaike 1973) and the Bayesian information criterion (Schwarz 1978). The information cri-
teria take also model parsimony into account with the BIC imposing a stronger penalty for
complex models. We also calculated Akaike weights for multi-model inference as imple-
mented in the R package MuMIn (Barton 2019). For the NB2PCMs and the NB1PCMs we
first looked at model comparison statistics and —for the sake of brevity—report here only
the respective better fitting variants of the negative binomial models. Results for the not
reported models can be found in the online repository for this work (https ://osf.io/em642 /).
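The published R scripts are available from the repository linked above. Purely as an illustrative sketch (the data frame d and the columns count, item, and person are hypothetical stand-ins, not the objects of the actual analysis scripts), the model families and dispersion structures described here can be fitted and compared with glmmTMB along the following lines:

```r
library(glmmTMB)

# Long-format data (illustrative): one row per researcher-item combination,
# with d$count = observed count, d$item = indicator, d$person = researcher ID.

# RPCM: Poisson GLMM with fixed item easiness and a random capacity per person.
m_rpcm <- glmmTMB(count ~ 0 + item + (1 | person), family = poisson(), data = d)

# Negative binomial and Conway-Maxwell-Poisson variants with global dispersion.
m_nb1 <- glmmTMB(count ~ 0 + item + (1 | person), family = nbinom1(), data = d)
m_nb2 <- glmmTMB(count ~ 0 + item + (1 | person), family = nbinom2(), data = d)
m_cmp <- glmmTMB(count ~ 0 + item + (1 | person), family = compois(), data = d)

# Item-specific dispersion via the dispersion formula (log link).
m_cmp_item <- glmmTMB(count ~ 0 + item + (1 | person),
                      dispformula = ~ 0 + item, family = compois(), data = d)

# Model comparison: likelihood ratio test, information criteria, Akaike weights.
anova(m_cmp, m_cmp_item)                      # nested models of increasing complexity
AIC(m_rpcm, m_nb1, m_nb2, m_cmp, m_cmp_item)
BIC(m_rpcm, m_nb1, m_nb2, m_cmp, m_cmp_item)
MuMIn::model.sel(m_rpcm, m_nb1, m_nb2, m_cmp, m_cmp_item)  # includes Akaike weights
```

Here 0 + item yields one easiness parameter per item, (1 | person) represents the normally distributed capacity parameter, and dispformula specifies the log-linear model for the dispersion parameters.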
Reliability of the researcher capacity estimates was globally determined based on empirical reliability (Brown and Croudace 2015; Green et al. 1984):

$$\mathrm{Rel}(\theta) = 1 - \frac{\overline{SE_{\theta}^{2}}}{\hat{\sigma}_{\theta}^{2}} \qquad (4)$$

where the numerator is the average squared standard error of the researcher capacity estimates and σ̂²θ is the estimated variance of the researcher capacity distribution. In addition, conditional reliability (i.e., the reliability for a specific capacity level) can be calculated analogously to Eq. 4 (Green et al. 1984):

$$\mathrm{Rel}(\theta_j) = 1 - \frac{SE_{\theta_j}^{2}}{\hat{\sigma}_{\theta}^{2}} \qquad (5)$$

with SE²θj being the squared standard error of the specific capacity estimate θj. We compared conditional reliability estimates to explore the dependence of reliability on the capacity level. Our main aim in this regard was to compare both empirical and conditional reliability estimates between the respective best fitting model and the alternative models, to reveal how model selection might influence the evaluation of reliability and important related interpretations (e.g., deciding that estimates are accurate enough for a high-stakes decision).
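Continuing the illustrative objects of the previous sketch, Eqs. 4 and 5 can be computed from a fitted glmmTMB model roughly as follows (the exact accessor names may differ slightly across package versions):

```r
# Empirical (Eq. 4) and conditional (Eq. 5) reliability from a fitted glmmTMB model.
re        <- as.data.frame(ranef(m_cmp_item, condVar = TRUE))  # MAP estimates + conditional SDs
theta_hat <- re$condval                                        # capacity estimates (log scale)
se2_theta <- re$condsd^2                                       # squared standard errors
var_theta <- VarCorr(m_cmp_item)$cond$person[1, 1]             # estimated capacity variance

rel_empirical   <- 1 - mean(se2_theta) / var_theta             # Eq. 4
rel_conditional <- 1 - se2_theta / var_theta                   # Eq. 5, one value per researcher

# Conditional reliability against the z-standardized capacity estimates (cf. Figs. 1 and 2).
plot(scale(theta_hat), rel_conditional,
     xlab = "z-standardized capacity estimate", ylab = "conditional reliability")
```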
Results
Patent dataset
The parameter estimates of all models can be found in Table 1 [with the exception of the NB1CMs (global dispersion: AIC = 86,680.06, BIC = 86,742.59; item-specific dispersion: AIC = 86,247.58, BIC = 86,349.19), which were found to fit the data less well than the NB2CMs (global dispersion: AIC = 84,818.82, BIC = 84,881.36; item-specific dispersion: AIC = 84,769.92, BIC = 84,871.54)]. The item easiness parameter estimates were highly comparable across all fitted models. These parameters indicate an increase of productivity up to the interval from 1992 to 1995. Then, productivity seemed to decrease slightly up to the final intervals of inventors' careers, but productivity did not fall back to the level of the first career interval. All models that take deviations from the Poisson assumption of equidispersion into account fitted better than the RPCM (as indicated by likelihood ratio tests and information criteria; see Table 1). All models (i.e., those with global and item-specific dispersion parameters) clearly indicated the presence of overdispersion.
The overall best fitting model was the NB2CM with item-specific dispersion parameters. The estimated dispersion parameters indicated that overdispersion was strongest for the first career interval. In addition, the amount of overdispersion was found to decrease monotonically over inventors' careers (see Table 1). It is noteworthy that this pattern of dispersion parameters was not paralleled by the CMPCM with item-specific dispersion. To examine whether this observation was masked by the complex interplay between conditional mean and dispersion, we checked the item-specific dispersion index [Var(Yi)/E(Yi); Bonat et al. 2018; Consul and Famoye 1992] with θj = 0 (i.e., average capacity). Sellers and Shmueli's (2010) approximate variance formula was used for the calculation of the item-specific dispersion indices (which was justified because all ν were < 1). However, the dispersion indices also yielded a different pattern than the NB2CM item-specific dispersion parameters. That is, the findings revealed that the dispersion index increased from 1.18 to 2.58 at the fourth time interval (1992–1995) and then decreased to a value of 1.90 for the last interval (2000–2003).
Furthermore, as expected when overdispersion is present, the RPCM led to a huge overestimation of the empirical reliability of the capacity estimates (0.837) as compared to any of the other models that take overdispersion into account (range of empirical reliability estimates: 0.682–0.694). This overestimation resulted from both an overestimation of the variance of the capacity estimate distribution and an underestimation of the average standard errors of the capacity estimates (see Table 1). While empirical reliability estimates were found to be highly comparable across all models with dispersion parameters (see Table 1), this was not the case for conditional reliability (see Fig. 1). In Fig. 1, conditional reliability estimates are plotted against the z-transformed capacity estimates for all count models. Conditional reliability based on the RPCM was clearly overestimated for the full range of capacity estimates. In addition, conditional reliability estimates for the NB2CMs and CMPCMs were quite comparable up to a capacity of 1.5 SDs above the mean. For values greater than 1.5 SDs above the mean, it is clearly visible in Fig. 1 that the CMPCMs overestimate conditional reliability for highly productive inventors (this interpretation is only feasible in comparison to the NB2CM with item-specific dispersion as the best fitting model for this dataset). The NB2CM with item-specific dispersion also displayed some slightly lower conditional reliabilities in the top capacity range as compared to the NB2CM with global dispersion parameter. Hence, particularly for the most productive inventors in this sample, the choice of the distributional model and the approach to dispersion modeling was crucial for an accurate assessment of the reliability of capacity estimates.

Fig. 1 Bivariate scatterplot of the conditional reliability estimates for the respective researcher capacity estimates (y axis) against the z-standardized capacity estimates (x axis) for the patent data
Mutz andDaniel’s (2018a, b) dataset
All parameter estimates of all fitted models (with the exception of the NB2CMs) can be found in Table 2. The NB2CM with global dispersion parameter (AIC = 10,350.75; BIC = 10,393.38) fitted better than the NB1CM with global dispersion parameter (AIC = 11,479.16; BIC = 11,521.80). However, for this dataset we chose the NB1CMs over the NB2CMs because the NB1CM with item-specific dispersion (AIC = 10,004.21; BIC = 10,068.16) fitted the data better than the NB2CM with item-specific dispersion (whose estimation had many technical problems, so that information criteria could not be calculated) and, thus, provided stronger competition for the overall model comparison procedure. In addition, we had to fix the item-specific dispersion parameters for NUMPUB and NUMCIT to the same value to deal with the generally observed technical problems of the negative binomial models with item-specific dispersion. Notably, these problems were not unexpected, because Mutz and Daniel (2018b) found that the dispersion parameters for these items adhered to the Poisson model.
The absolute values of the item easiness parameter estimates (see Table 2) differed considerably more strongly across the fitted models than in the patent dataset (see Table 1). Both models with a global dispersion parameter displayed overdispersion and fitted better than the RPCM (see Table 2). The order of the dispersion parameter estimates in the two models with item-specific dispersion parameters was highly comparable. However, the CMPCM with item-specific dispersion displayed underdispersion for NUMPUB and NUMCIT, whereas the large negative values for the log-dispersion in the NB1CM for these two items indicated that dispersion was at the lower limit of the parameter space (i.e., these items adhered to the Poisson model; see also Mutz and Daniel 2018b). The CMPCM with item-specific dispersion was unambiguously the best fitting model across all criteria. This observation is crucial because it highlights the inability of negative binomial models to take underdispersion into account. Clearly, the presence of underdispersion here caused technical problems for the estimation of the negative binomial models. In addition, it is important to consider that IRT models deal with conditional distributions that can display underdispersion even when the unconditional distribution of, for example, the number of publications does not display underdispersion.
Empirical reliability estimates for this dataset were much more comparable across the fitted models (range from 0.942 to 0.981; see Table 2) than for the patent data (see Table 1). Again, the highest reliability estimate resulted for the RPCM. However, the best fitting model here produced a reliability estimate of almost the same size (i.e., 0.973). Hence, even in situations in which the data require item-specific dispersion modeling, reliability can appear to be rather accurately estimated by the RPCM, because some items may display overdispersion (here: TOTCIT, SHORTCIT, TOP10%, and PUBINT) and some items may display underdispersion (here: NUMPUB and NUMCIT). Nonetheless, the variance of the capacity distribution was clearly overestimated in the RPCM as compared to the CMPCM with item-specific dispersion, and the average standard error was also slightly overestimated. All other models resulted in higher average standard errors because these models all modeled overdispersion (i.e., the negative binomial models can only model overdispersion, and the CMPCM with global dispersion empirically demonstrated overdispersion).
Figure 2 shows the conditional reliability plot for Mutz and Daniel’s dataset. The
plot clearly shows that both NB1CMs and the CMPCM with global dispersion param-
eter would have led to strong underestimates of conditional reliability across almost
the full range of capacity. In addition, the differences between the best fitting CMPCM
with item-specific dispersion and all other dispersion models decreases with increasing
capacity. The RPCM as compared to the best fitting CMPCM with item-specific disper-
sion tends to overestimate conditional reliability slightly stronger towards the lower tail
of capacity (see Fig.2).
Fig. 2 Bivariate scatterplot of the conditional reliability estimates for the respective researcher capacity
estimates (y axis) against the z-standardized capacity estimates (x axis) for the BQ data (Mutz and Daniel
2018b)
Discussion
Accurate evaluations of researcher capacity are of high practical value in several contexts requiring selection decisions (e.g., for academic jobs, funding allotment, or research awards). The attractiveness of IRT models in this context has been highlighted in recent research (Mutz and Daniel 2018b), and the current work extends this idea in important ways. First, less complex models to predict the expected values were considered to reduce requirements with respect to sample size. Second, a model based on the Conway-Maxwell-Poisson distribution was added to the pool of candidate distributions. This distribution is more flexible than the frequently used negative binomial distribution as it can handle equidispersion, overdispersion, and underdispersion at the item level. Finally, this study focused on the reliability of researcher capacity and the complex interplay between the chosen distributional model and conditional reliability (i.e., measurement precision at specific capacity levels).
This study further amplifies the call for item-specific dispersion modeling because, across the studied datasets, models with item-specific dispersion fitted best. Moreover, it has been demonstrated that CMPCMs are a useful alternative model for bibliometric indicators. Bibliometric indicators are commonly known to have unconditional distributions that very often display overdispersion. However, IRT count data models deal with conditional distributions, and the findings for Mutz and Daniel's dataset convincingly demonstrate that underdispersion at the item level can be present in the data and needs to be taken into account appropriately. Otherwise, technical problems and inaccurate estimation of the reliability of capacity estimates are to be expected. It is therefore highly recommended that a careful comparison of various competing IRT count data models precedes any examination of reliability.
Beyond the reliability concept as examined in this work (i.e., person separation of capacity as a latent variable), one might be interested in the reliability of a single observed indicator (see Allison 1978). Dispersion parameters in the IRT models used in this work were found to be indicator-specific across both studied datasets, and they were estimated jointly with all other model parameters. However, when only a subset of indicators is available, with the others missing at random, the derived model parameters can be used to calculate ability based on the subset, as well as the conditional variance and reliability. This can also be done in the extreme case of a single indicator (e.g., total number of published articles). This is conceptually similar, but mathematically distinct from Allison's (1978) approach, which is based on a conditional Poisson assumption for each indicator and does not require parameter estimates from a whole set of indicators. These reliability considerations are beyond the scope of the current work, and we strongly recommend future research that closely examines varying conceptions of measurement precision.
This study is clearly limited to less complex models than those used by Mutz and Daniel (2018b). They used variants of the ICCPCM (Doebler et al. 2014), including a negative binomial extension of the model, within a Bayesian estimation framework. We decided to use more parsimonious models in this work for several reasons. First, the focus was on the interplay between the reliability of researcher capacity estimates and the distributional models used, which is already quite complex. Hence, we chose simpler models for a better focus on our research question. Second, an ICCPCM extension of the CMPCM (i.e., an ICCCMPCM) seems possible and straightforward in terms of theory, but estimation routines that allow application of such an extension are, to the best of our knowledge, currently not available. For example, Bayesian estimation of the CMPCM, as the basis for an ICCCMPCM, is currently not available in ready-to-use software packages for GLMMs. Hence, should this become an available alternative in the future, potential extensions of the CMPCM such as an ICCCMPCM will deserve a close examination.
Moreover, there are calls in the literature to use a variety of indicators (Moed and Halevi 2015), whereas others consider mere productivity the most important indicator for funding and promotion (Yair and Goldstein 2020). In this work, productivity across the careers of inventors and multiple bibliometric indicators for a sample of social science researchers were studied, but the intention of this work was not to take any position on which approach to measurement (i.e., mere productivity vs. multi-faceted indicators) might be better. In fact, any evaluation of researchers' performance is best guided by the concrete goals and consequences related to the respective selection decision (or other reason for such an evaluation).
Finally, it should be noted that recent research has extended the CMP distribution, especially with the goal of making the NB distribution one of its limit cases (Chakraborty and Ong 2016; Chakraborty and Imoto 2016; Imoto 2014). This is a potential avenue for a unification of the presented models, albeit regression modeling software that includes one of these extensions together with random effects is currently not available. We also caution that none of the mentioned distributions lends itself naturally to a mean parametrization in the sense of Huang (2017), which complicates model interpretation.
Conclusion
The current work has added important points to consider when the evaluation of researchers is based on IRT-based estimates of researcher capacity. A sufficiently large set of alternative models has been examined, with lower requirements with respect to sample size than in previous studies (Mutz and Daniel 2018b). Importantly, the CMPCM introduced in this work balances the needed flexibility and sample size requirements. Most importantly, this study clearly demonstrated that model choice must precede any analysis of the reliability of researcher capacity estimates. Similarly, when researcher capacity standard errors are to be used, say to construct confidence intervals, model choice is equally important. Finally, count data models in general were shown to be well suited for contexts in which decisions about the best researchers are to be made, because all count data models displayed the highest level of conditional reliability at the top ranges of researcher capacity.
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Com-
mons licence, and indicate if changes were made. The images or other third party material in this article
are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly
from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In B. N. Petrov
& F. Csáki (Eds.), 2nd International Symposium on Information Theory 267–281. Akadémiai Kiadó
Allison, P. D. (1978). The reliability of variables measured as the number of events in an interval of time.
Sociological Methodology, 9, 238. https ://doi.org/10.2307/27081 1.
Alvarez, P., & Pulgarín, A. (1996). Application of the Rasch model to measuring the impact of scientific
journals. Publishing Research Quarterly, 12(4), 57–64. https ://doi.org/10.1007/BF026 80575 .
Baghaei, P., Ravand, H., & Nadri, M. (2019). Is the d2 test of attention rasch scalable? analysis with the
rasch poisson counts model. Perceptual and Motor Skills, 126(1), 70–86. https ://doi.org/10.1177/00315
12518 81218 3.
Barton, K. (2019). MuMIn: Multi-Model Inference. R package version 1.43.15. https ://CRAN.R-proje ct.org/
packa ge=MuMIn
Bonat, W. H., Jørgensen, B., Kokonendji, C. C., Hinde, J., & Demétrio, C. G. B. (2018). Extended pois-
son–tweedie: Properties and regression models for count data. Statistical Modelling: An Interna-
tional Journal, 18(1), 24–49. https ://doi.org/10.1177/14710 82X17 71571 8.
Brooks, M, E, Kristensen, K, Benthem, K, J,van, Magnusson A, Berg CW, Nielsen A, Skaug, HJ,
Mächler M, & Bolker, BM (2017) Glmm Balances Speed and Flexibility Among Packages for Zero
inflated Generalized Linear Mixed Modeling. The R Journal 9(2), 378–390 https ://doi.org/10.32614
/RJ-2017-066
Brown, A., & Croudace, T. J. (2015). Scoring and estimating score precision using multidimensional
IRT models. In S. P. Reise & D. A. Revicki (Eds.), Multivariate applications series. Handbook
of item response theory modeling: Applications to typical performance assessment (pp. 307–333).
Routledge/Taylor & Francis Group.
Brown, Anna. (2018). Item Response Theory Approaches to Test Scoring and Evaluating the Score
Accuracy. In P. Irwing, T. Booth, & D. J. Hughes (Eds.), The Wiley Handbook of Psychometric
Testing (pp. 607–638). John Wiley & Sons, Ltd. https://doi.org/https ://doi.org/10.1002/97811 18489
772.ch20
Burrell, Q. L. (2007). Hirsch’s h-index: A stochastic model. Journal of Informetrics, 1(1), 16–25. https ://
doi.org/10.1016/j.joi.2006.07.001.
Chakraborty, S., & Ong, S. H. (2016). A COM-Poisson-type generalization of the negative binomial
distribution. Communications in Statistics - Theory and Methods, 45(14), 4117–4135. https ://doi.
org/10.1080/03610 926.2014.91718 4.
Chakraborty, S., & Imoto, T. (2016). Extended Conway-Maxwell-Poisson distribution and its proper-
ties and applications. Journal of Statistical Distributions and Applications, 3(1), 5. https ://doi.
org/10.1186/s4048 8-016-0044-1.
Cohen, J. E. (1981). Publication rate as a function of laboratory size in three biomedical research institu-
tions. Scientometrics, 3(6), 467–487. https ://doi.org/10.1007/BF020 17438 .
Consul, P. C., & Famoye, F. (1992). Generalized poisson regression model. Communications in Statis-
tics - Theory and Methods, 21(1), 89–109. https ://doi.org/10.1080/03610 92920 88307 66.
De Boeck, P., Bakker, M., Zwitser, R., Nivard, M., Hofman, A., Tuerlinckx, F., & Partchev, I. (2011).
The estimation of item response models with the lmer function from the lme package in R. Journal
of Statistical Software. https ://doi.org/10.18637 /jss.v039.i12.
Didegah, F., & Thelwall, M. (2013). Which factors help authors produce the highest impact research?
Collaboration, journal and document properties. Journal of Informetrics, 7(4), 861–873. https ://doi.
org/10.1016/j.joi.2013.08.006.
Doebler, A., Doebler, P., & Holling, H. (2014). A latent ability model for count data and applica-
tion to processing speed. Applied Psychological Measurement, 38(8), 587–598. https ://doi.
org/10.1177/01466 21614 54351 3.
Doebler, A., & Holling, H. (2016). A processing speed test based on rule-based item generation: An
analysis with the Rasch Poisson Counts model. Learning and Individual Differences, 52, 121–128.
https ://doi.org/10.1016/j.lindi f.2015.01.013.
Faddy, M. J., & Bosch, R. J. (2001). Likelihood-based modeling and analysis of data underdispersed
relative to the poisson distribution. Biometrics, 57(2), 620–624. https ://doi.org/10.1111/j.0006-
341X.2001.00620 .x.
Feldt, L. S., Woodruff, D. J., & Salih, F. A. (1987). Statistical inference for coefficient alpha. Applied
Psychological Measurement, 11(1), 93–103. https ://doi.org/10.1177/01466 21687 01100 107.
Fischer, G. H. (1987). Applying the principles of specific objectivity and of generalizability to the meas-
urement of change. Psychometrika, 52(4), 565–587. https ://doi.org/10.1007/BF022 94820 .
Scientometrics
1 3
Forthmann, B., Çelik, P., Holling, H., Storme, M., & Lubart, T. (2018). Item response modeling of
divergent-thinking tasks: A comparison of Rasch’s Poisson model with a two-dimensional model
extension. International Journal of Creativity and Problem Solving, 28(2), 83–95.
Forthmann, B., Gerwig, A., Holling, H., Çelik, P., Storme, M., & Lubart, T. (2016). The be-creative
effect in divergent thinking: The interplay of instruction and object frequency. Intelligence, 57,
25–32. https ://doi.org/10.1016/j.intel l.2016.03.005.
Forthmann, B., Gühne, D., & Doebler, P. (2019). Revisiting dispersion in count data item response the-
ory models: The Conway–Maxwell–Poisson counts model. British Journal of Mathematical and
Statistical Psychology. https ://doi.org/10.1111/bmsp.12184 .
Forthmann, B., Szardenings, C., & Dumas, D. (2019). Testing equal odds in creativity research. Psychol-
ogy of Aesthetics, Creativity, and the Arts. https ://doi.org/10.1037/aca00 00294 .
Glänzel, W., & Moed, H. F. (2013). Opinion paper: Thoughts and facts on bibliometric indicators. Scien-
tometrics, 96(1), 381–394. https ://doi.org/10.1007/s1119 2-012-0898-z.
Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. D. (1984). Technical guide-
lines for assessing computerized adaptive tests. Journal of Educational Measurement, 21(4), 347–
360. https ://doi.org/10.1111/j.1745-3984.1984.tb010 39.x.
Guikema, S. D., & Goffelt, J. P. (2008). A flexible count data regression model for risk analysis. Risk
Analysis, 28(1), 213–223. https ://doi.org/10.1111/j.1539-6924.2008.01014 .x.
Gulliksen, H. (1950). Theory of Mental Tests. John Wiley & Sons.
Hall, B. H., Jaffe, A. B., & Trajtenberg, M. (2001). The NBER patent citation data file: Lessons, insights
and methodological tools. National Bureau of Economic Research. https ://www.nber.org/paper s/
w8498 .pdf
Hilbe, J. M. (2011). Negative binomial regression (2nd ed). Cambridge University Press.
Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the
National Academy of Sciences, 102(46), 16569–16572. https ://doi.org/10.1073/pnas.05076 55102 .
Holling, H., Böhning, W., & Böhning, D. (2015). The covariate-adjusted frequency plot for the Rasch
Poisson counts model. Thailand Statistician, 13, 67–78.
Huang, A. (2017). Mean-parametrized Conway–Maxwell–Poisson regression models for dispersed counts. Statistical Modelling: An International Journal, 17(6), 359–380. https://doi.org/10.1177/1471082X17697749
Huber, J. C. (2000). A statistical analysis of special cases of creativity. The Journal of Creative Behavior, 34(3), 203–225. https://doi.org/10.1002/j.2162-6057.2000.tb01212.x
Huber, J. C., & Wagner-Döbler, R. (2001). Scientific production: A statistical analysis of authors in mathematical logic. Scientometrics, 50(2), 323–337. https://doi.org/10.1023/A:1010581925357
Huber, J. C., & Wagner-Döbler, R. (2001). Scientific production: A statistical analysis of authors in physics, 1800–1900. Scientometrics, 50(3), 437–453. https://doi.org/10.1023/A:1010558714879
Hung, L.-F. (2012). A negative binomial regression model for accuracy tests. Applied Psychological Measurement, 36(2), 88–103. https://doi.org/10.1177/0146621611429548
Imoto, T. (2014). A generalized Conway–Maxwell–Poisson distribution which includes the negative binomial distribution. Applied Mathematics and Computation, 247, 824–834. https://doi.org/10.1016/j.amc.2014.09.052
Jansen, M. G. H. (1995). The Rasch Poisson counts model for incomplete data: An application of the EM algorithm. Applied Psychological Measurement, 19(3), 291–302. https://doi.org/10.1177/014662169501900307
Jansen, M. G. H., & van Duijn, M. A. J. (1992). Extensions of Rasch’s multiplicative Poisson model. Psychometrika, 57(3), 405–414. https://doi.org/10.1007/BF02295428
Ketzler, R., & Zimmermann, K. F. (2013). A citation-analysis of economic research institutes. Scientometrics, 95(3), 1095–1112. https://doi.org/10.1007/s11192-012-0850-2
Li, G.-C., Lai, R., D’Amour, A., Doolin, D. M., Sun, Y., Torvik, V. I., et al. (2014). Disambiguation and co-authorship networks of the US patent inventor database (1975–2010). Research Policy, 43(6), 941–955. https://doi.org/10.1016/j.respol.2014.01.012
Lord, F. M. (1980). Applications of item response theory to practical testing problems. L. Erlbaum
Associates.
Moed, H. F., & Halevi, G. (2015). Multidimensional assessment of scholarly research impact. Journal of the Association for Information Science and Technology, 66(10), 1988–2002. https://doi.org/10.1002/asi.23314
Mutz, R., & Daniel, H.-D. (2018a). The Bayesian Poisson Rasch model. Data [Text/csv, 0.007 MB]. Research Collection. https://doi.org/10.3929/ETHZ-B-000271425
Mutz, R., & Daniel, H.-D. (2018b). The bibliometric quotient (BQ), or how to measure a researcher’s performance capacity: A Bayesian Poisson Rasch model. Journal of Informetrics, 12(4), 1282–1295. https://doi.org/10.1016/j.joi.2018.10.006
Ogasawara, H. (1996). Rasch’s multiplicative Poisson model with covariates. Psychometrika, 61(1), 73–92. https://doi.org/10.1007/BF02296959
R Core Team. (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. https://www.R-project.org/
Rasch, G. (1960). Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests. Nielsen & Lydiche.
Sahel, J.-A. (2011). Quality versus quantity: Assessing individual research performance. Science Translational Medicine, 3(84), 84cm13. https://doi.org/10.1126/scitranslmed.3002249
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464. https://doi.org/10.1214/aos/1176344136
Sellers, K. F., & Shmueli, G. (2010). A flexible regression model for count data. The Annals of Applied Statistics, 4(2), 943–961. https://doi.org/10.1214/09-AOAS306
Sun, Y., & Xia, B. S. (2016). The scholarly communication of economic knowledge: A citation analysis of Google Scholar. Scientometrics, 109(3), 1965–1978. https://doi.org/10.1007/s11192-016-2140-x
Wang, W.-C. (1999). Direct estimation of correlations among latent traits within IRT framework. Methods
of Psychological Research Online, 4(2), 47–68.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Mesa Press.
Yair, G., & Goldstein, K. (2020). The Annus Mirabilis paper: Years of peak productivity in scientific careers. Scientometrics, 124(2), 887–902. https://doi.org/10.1007/s11192-020-03544-z
Zhang, F., Bai, X., & Lee, I. (2019). Author impact: Evaluations, predictions, and challenges. IEEE Access, 7, 38657–38669. https://doi.org/10.1109/ACCESS.2019.2905955
Zhu, W., & Safrit, M. J. (1993). The calibration of a sit-ups task using the Rasch Poisson counts model. Canadian Journal of Applied Physiology, 18(2), 207–219. https://doi.org/10.1139/h93-017