Flexible Bayesian estimation of incubation times

Oswaldo Gressani¹, Andrea Torneri¹, Niel Hens¹,², Christel Faes¹
¹ Interuniversity Institute for Biostatistics and statistical Bioinformatics (I-BioStat), Data Science Institute, Hasselt University, Hasselt, Belgium.
² Centre for Health Economics Research and Modelling Infectious Diseases, Vaxinfectio, University of Antwerp, Antwerp, Belgium.
Corresponding author: oswaldo.gressani@uhasselt.be
Abstract
Motivation: The incubation period is of paramount importance in infectious disease epi-
demiology as it informs about the transmission potential of a pathogenic organism and helps
to plan public health strategies to keep an epidemic outbreak under control. Estimation of
the incubation period distribution from reported exposure times and symptom onset times
is challenging as the underlying data is coarse.
Methodology: We develop a new Bayesian methodology using Laplacian-P-splines that
provides a semi-parametric estimation of the incubation density based on a Langevinized
Gibbs sampler. A finite mixture density smoother informs a set of parametric distributions
via moment matching and an information criterion arbitrates between competing candidates.
Results: Our method has a natural nest within EpiLPS, a tool originally developed to es-
timate the time-varying reproduction number. Various simulation scenarios accounting for
different levels of data coarseness are considered with encouraging results. Applications to
real data on COVID-19, MERS-CoV and Mpox reveal results that are in alignment with
what has been obtained in recent studies.
Conclusion: The proposed flexible approach is an interesting alternative to classic Bayesian
parametric methods for estimation of the incubation distribution.
Keywords: Incubation period, Laplace approximation, Bayesian P-splines, MCMC.
1 Introduction
Statistical methods and their underlying algorithmic implementation play an essential role in infectious disease modeling as they bridge the gap between observed data and estimates of key epidemiologic quantities. The incubation period, defined as the time between infection and symptom onset (Lessler et al., 2009), is pivotal in gauging the epidemic potential of an infectious disease. Having information about the incubation period distribution is helpful for planning optimal quarantine periods to taper off the spread of a contagious disease (Qin et al., 2020).
medRxiv preprint doi: https://doi.org/10.1101/2023.08.07.23293752; this version posted August 9, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.
Knowledge of incubation times helps in assessing the transmission potential of an infectious disease (Cheng et al., 2021; Basnarkov et al., 2022) as the incubation period can be used to estimate the reproduction number (i.e., the average number of secondary cases generated by an infector in a fully susceptible population). The incubation period is also of direct interest for case definition (Virlogeux et al., 2016) and to measure the effectiveness of contact tracing. Moreover, it contributes to quantifying the size of an epidemic (Backer et al., 2020) and improves the ecological comprehension of adaptation strategies of a parasite (Nishiura, 2007). The centrality of incubation features in epidemic analyses thus calls for solid methodologies that provide accurate and reliable estimates of the incubation distribution to better understand the transmission dynamics of a pathogen and to reach effective interventional public health strategies.
From a statistical point of view, the main obstacle for inferring the distribution of the incubation period lies in the fact that infection times are almost never exactly observed (Chen et al., 2022), while symptom onset times are more easily observed and reported. This incomplete information set-up pushes towards a more challenging inference approach based on coarse data (Reich et al., 2009), where infection times are only known to lie within a finite time interval. In survival analysis, such a data structure is known as interval-censored data and several approaches have been proposed to estimate the survival function and related quantities from coarsened observations. The paper of Peto (1973) is among the first to propose a modeling attempt under an interval-censoring mechanism, where maximum likelihood estimation is carried out through a constrained Newton-Raphson algorithm and applied to a survey on sexual maturity development. A now popular extension considered by Turnbull (1976) consists in using a kind of expectation-maximization algorithm to build a non-parametric estimate of the cumulative distribution function under a more general form of data incompleteness that includes interval-censoring. These two pioneering papers provided a fertile soil for the development of other frequentist approaches (see e.g. Gómez et al., 2004, 2009). Bayesian methods for interval-censored data are more recent as practical implementations had to wait for the arrival of modern machines that facilitated the use of Markov chain Monte Carlo (MCMC) algorithms to extract information from complex posterior distributions. Sinha and Dey (1997) give a comprehensive review of semi-parametric Bayesian methods for survival data characterized by interval-censoring among others, and the work of Calle and Gómez (2001) presents a non-parametric Bayesian estimator of the survival curve using Gibbs sampling under a Dirichlet process prior.
More directly related to infectious disease epidemiology, the work of Reich et al. (2009) proposes frequentist parametric approaches to estimate the incubation period distribution using the accelerated failure time model with applications to influenza A and RSV. Backer et al. (2020) and Miura et al. (2022) use a Bayesian parametric approach to estimate the incubation period of COVID-19 and of Mpox, respectively. Groeneboom (2021) derives a smooth non-parametric estimator of the incubation time distribution by adding a bandwidth parameter that controls the trade-off between noise and bias, and Kreiss and Van Keilegom (2022) propose a semi-parametric method to estimate the incubation period based on Laguerre polynomials.
In this paper, we develop a new Bayesian approach to estimate the incubation period distribution articulated around Laplacian-P-splines (Gressani and Lambert, 2018; Gressani et al., 2022a). Our strategy works in two steps. First, we compute a semi-parametric estimate of the incubation density based on a finite mixture density. The component densities are all modeled by means of penalized cubic B-splines (Eilers and Marx, 1996) but with different data representations. In the particular case of a two-component structure considered here, the first component density is approached through a single interval-censored likelihood, while the second density is approached through a midpoint imputation of the data, i.e. the missing infection time is artificially fixed at the midpoint of the observed incubation interval. Markov chain Monte Carlo with a Langevinized Gibbs sampler is used to construct the flexible semi-parametric incubation density estimate, where the analytically derived gradient and Hessian of the conditional posterior of the spline components
are used to speed up the sampling process. Second, the semi-parametric density estimate of the
incubation period is used to fit popular parametric distributions that are often used to model
the distribution of the incubation period through a moment matching approach and the best
fitting model among the semi-parametric and fully parametric candidates is selected through the
Bayesian information criterion.
The article is organized as follows. Section 2 gives a detailed account of the methodology and presents the Bayesian model to construct the semi-parametric estimate with Laplacian-P-splines. We also show how our estimate based on the imputation approach directly benefits from the negative binomial model of the EpiLPS architecture (Gressani et al., 2022b). The moment matching approach to fit the parametric distributions is also described there, along with a formulation of the information criterion used for model comparison. A complete simulation study is presented in Section 3, where we assess how our methodology performs under varying levels of data coarseness, different target incubation distributions and sample sizes. In Section 4, we apply our method to data on COVID-19, MERS and Mpox and make a comparison with results obtained from recent studies. Finally, Section 5 concludes with a discussion of future research and limitations of our work. A routine reflecting the proposed methodology to estimate the incubation density has been added to the EpiLPS package (Gressani, 2021) and a dedicated repository (https://github.com/oswaldogressani/Incubation) makes it possible to reproduce the results of the manuscript.
2 Methods
2.1 Coarsely observed data
The observed symptom onset time for individual $i$ is denoted by $t_i^{S}$ and the (unobserved) infection time is only known to lie within the closed exposure interval $E_i = [t_i^{E_L}, t_i^{E_R}]$, where $t_i^{E_L}$ and $t_i^{E_R}$ denote the left and right bound, respectively, of the infecting exposure time. Without loss of generality, we work from a continuous time perspective and assume that $0 \le t_i^{E_L} < t_i^{E_R} < t_i^{S}$ and that symptom onset times are finite. The incubation time is thus at least $t_i^{I_L} = t_i^{S} - t_i^{E_R}$ and at most $t_i^{I_R} = t_i^{S} - t_i^{E_L}$, so that the observed data at the resolution of individual $i$ is given by the bounds of the incubation period $D_i = \{t_i^{I_L}, t_i^{I_R}\}$ and the information of an observable set of size $n$ is thus $D = \bigcup_{i=1}^{n} D_i$. Figure 1 gives a graphical illustration of the relation between exposure times, incubation bounds and the symptom onset time for individual $i$.
Figure 1: Relation between exposure times, incubation bounds and the symptom onset time.
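As a minimal sketch of the data structure described above, the incubation bounds can be computed from the exposure window and the symptom onset time as follows. The class name `CoarseCase`, the function name `incubation_bounds` and the example values are hypothetical and only serve illustration; they are not part of the EpiLPS package.

```python
from dataclasses import dataclass


@dataclass
class CoarseCase:
    """One coarsely observed case (hypothetical container, for illustration)."""
    t_el: float  # left bound of the exposure window
    t_er: float  # right bound of the exposure window
    t_s: float   # symptom onset time


def incubation_bounds(case):
    """Return (t_IL, t_IR) = (t_S - t_ER, t_S - t_EL) for one case."""
    # Enforce the ordering assumed in the text: 0 <= t_EL < t_ER < t_S.
    if not (0 <= case.t_el < case.t_er < case.t_s):
        raise ValueError("expected 0 <= t_EL < t_ER < t_S")
    return case.t_s - case.t_er, case.t_s - case.t_el


# Illustrative values only (not taken from any dataset in the paper).
cases = [CoarseCase(0.0, 2.0, 7.5), CoarseCase(1.0, 4.0, 9.0)]
D = [incubation_bounds(c) for c in cases]
```

Each element of `D` is a pair of lower and upper incubation bounds, which is all the information the interval-censored likelihood of Section 2.2.1 requires.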
2.2 Semi-parametric model with Bayesian P-splines
Let the incubation time $I$ be a non-negative continuous random variable with probability density function $\varphi(\cdot)$, hazard function $h(\cdot)$ and survival function $S(\cdot)$. Based on a dataset $D$, we propose to estimate $\varphi(\cdot)$ by a two-component mixture density using a semi-parametric (SP) approach based on P-splines. The candidate density estimator at a given time point $t \ge 0$ is denoted by $\hat\varphi_{SP}(t) = \omega\,\hat\varphi_{IC}(t) + (1-\omega)\,\hat\varphi_{HS}(t)$, with $0 \le \omega \le 1$. The density estimator $\hat\varphi_{IC}(\cdot)$ is based on single interval-censored (IC) data as shown in Figure 1, while $\hat\varphi_{HS}(\cdot)$ is a density estimator resulting from a histogram smoother assuming a midpoint imputation rule for the infection time in the exposure window $E$. The next two sections give a detailed outline of the models underlying the latter two component densities.
2.2.1 Flexible density estimation for single interval-censored data
Following Rosenberg (1995), the (log-)hazard of the incubation period is approximated by a linear combination of cubic B-spline basis functions:
$$\log h(t) = \sum_{k=1}^{K} \theta_k b_k(t), \tag{1}$$
where $b(\cdot) = (b_1(\cdot), \dots, b_K(\cdot))^\top$ is a cubic B-spline basis with equidistant knots on the compact time interval $T = [0, t_u]$ with upper bound $t_u$ and $\theta = (\theta_1, \dots, \theta_K)^\top$ is the $K$-dimensional latent vector of B-spline amplitudes. While zero is a natural lower bound for the incubation period, there is no natural choice for the upper bound $t_u$. An intuitive candidate would be to fix it at the largest observed right bound of the incubation time, i.e. $t_u = \max\{t_1^{I_R}, \dots, t_n^{I_R}\}$. However, the latter choice may restrict the B-spline basis to a domain that covers only a small fraction of the domain of the true underlying incubation density $\varphi(\cdot)$. As such, we follow Eilers and Marx (2021) and advise to pad $t_u$ to a value that is strictly larger than the largest observed incubation bound. We defer the discussion on the guidelines for a smart padding choice to the real data applications in Section 4. Regarding the number $K$ of B-spline basis functions, a default choice in the present setting is $K = 10$, although larger numbers, say $K = 20$ or $K = 30$, may be necessary to capture more flexible patterns (for instance if the underlying incubation density has multiple modes). As noted by Eilers and Marx (2021), there is no harm in choosing a "too large" number $K$, as the penalty will act as a counterforce to the induced flexibility.
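To make the basis in (1) concrete, here is a minimal pure-Python sketch that evaluates $K$ cubic B-splines with equidistant knots at a single time point using the Cox-de Boor recursion. The function name `bspline_basis` is hypothetical, and the knot layout (equidistant knots extended by three intervals on each side of $[0, t_u]$, a common P-spline construction) is an assumption, since the text only states that knots are equidistant.

```python
def bspline_basis(t, t_upper, K, degree=3):
    """Evaluate K B-spline basis functions of the given degree at time t.

    Assumption: equidistant knots with spacing dx, extended by `degree`
    intervals beyond each end of [0, t_upper] so that the basis sums to one
    on the whole domain.
    """
    nseg = K - degree                      # number of segments inside [0, t_upper]
    dx = t_upper / nseg                    # knot spacing
    knots = [(i - degree) * dx for i in range(nseg + 2 * degree + 1)]
    n0 = len(knots) - 1
    # Degree-0 basis: indicator of the half-open knot interval containing t.
    B = [1.0 if knots[i] <= t < knots[i + 1] else 0.0 for i in range(n0)]
    # Cox-de Boor recursion; with uniform knots both denominators equal d * dx.
    for d in range(1, degree + 1):
        B = [
            (t - knots[i]) / (d * dx) * B[i]
            + (knots[i + d + 1] - t) / (d * dx) * B[i + 1]
            for i in range(n0 - d)
        ]
    return B


def log_hazard(t, theta, t_upper):
    """Linear combination of basis functions as in (1)."""
    basis = bspline_basis(t, t_upper, len(theta))
    return sum(th * b for th, b in zip(theta, basis))
```

With a cubic basis, at most four basis functions are non-zero at any time point, which keeps the evaluation of (1) cheap even for larger $K$.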
Using the relation between the survival and hazard functions, we recover:
$$S(t) = \exp\left(-\int_0^t h(s)\,ds\right) \quad \text{and} \quad \tilde S(t) \approx \exp\left(-\sum_{j=1}^{j(t)} \exp\big(\theta^\top b(s_j)\big)\Delta\right). \tag{2}$$
The approximation in (2) is necessary as the integral has no analytic solution. As such, $T$ is partitioned into a large number $J$ (e.g. $J = 300$) of bins of equal width $\Delta$, where $s_j$ denotes the center of the $j$th bin and $j(t) \in \{1, \dots, J\}$ is an index function returning the bin number containing $t$. Following Lang and Brezger (2004), a zero-mean Gaussian prior is imposed on the vector of B-spline amplitudes, $\theta \,|\, \lambda \sim N_{\dim(\theta)}\big(0, (\lambda P)^{-1}\big)$, where $\lambda > 0$ is the penalty parameter related to the spline model and $P = D_r^\top D_r + \varepsilon I_{\dim(\theta)}$ is a square penalty matrix obtained from $r$th order difference matrices $D_r$ of dimension $(\dim(\theta) - r) \times \dim(\theta)$, perturbed by an $\varepsilon$-multiple of the identity matrix (here $\varepsilon = 10^{-6}$) to ensure that $P$ has full rank. The Bayesian model is closed by assuming a non-informative Gamma prior on the penalty parameter, $\lambda \sim G(a_\lambda, b_\lambda)$, with shape $a_\lambda = 10^{-4}$ and
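rate $b_\lambda = 10^{-4}$ (see e.g. Lambert and Eilers, 2005, 2009). The (log-)likelihood of incubation times under single interval-censored data is (Reich et al., 2009):
$$L(\theta; D) = \prod_{i=1}^{n} \left(\int_{t_i^{I_L}}^{t_i^{I_R}} \varphi(t)\,dt\right) = \prod_{i=1}^{n} \Big(S(t_i^{I_L}) - S(t_i^{I_R})\Big) \approx \prod_{i=1}^{n} \left(\exp\left(-\sum_{j=1}^{j(t_i^{I_L})} \exp\big(\theta^\top b(s_j)\big)\Delta\right) - \exp\left(-\sum_{j=1}^{j(t_i^{I_R})} \exp\big(\theta^\top b(s_j)\big)\Delta\right)\right),$$
$$\ell(\theta; D) := \log L(\theta; D) \approx \sum_{i=1}^{n} \log\left(\exp\left(-\sum_{j=1}^{j(t_i^{I_L})} \exp\big(\theta^\top b(s_j)\big)\Delta\right) - \exp\left(-\sum_{j=1}^{j(t_i^{I_R})} \exp\big(\theta^\top b(s_j)\big)\Delta\right)\right). \tag{3}$$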
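The binned survival approximation in (2) and the interval-censored log-likelihood in (3) can be sketched as follows. This is an illustration under stated assumptions rather than the EpiLPS implementation: the hazard is passed as a generic callable (instead of the B-spline form in (1)), and all function names are hypothetical.

```python
import math


def make_bins(t_upper, J=300):
    """Partition [0, t_upper] into J bins; return bin centers s_j and width Delta."""
    delta = t_upper / J
    centers = [(j + 0.5) * delta for j in range(J)]
    return centers, delta


def bin_index(t, delta, J):
    """1-based index j(t) of the bin containing t (t = t_upper goes to bin J)."""
    return min(int(t / delta) + 1, J)


def surv_approx(t, hazard, centers, delta):
    """Approximate S(t) by exp(-sum_{j <= j(t)} h(s_j) * Delta), as in (2)."""
    j_t = bin_index(t, delta, len(centers))
    return math.exp(-sum(hazard(s) for s in centers[:j_t]) * delta)


def loglik_interval_censored(data, hazard, t_upper, J=300):
    """Approximate log-likelihood (3) for incubation bounds (t_IL, t_IR).

    Note: bounds falling in the same bin give log(0); the bins are assumed
    to be much narrower than the observed incubation intervals.
    """
    centers, delta = make_bins(t_upper, J)
    total = 0.0
    for t_il, t_ir in data:
        s_left = surv_approx(t_il, hazard, centers, delta)
        s_right = surv_approx(t_ir, hazard, centers, delta)
        total += math.log(s_left - s_right)
    return total
```

For a constant hazard the exact survival function is available in closed form, which gives a quick check of the discretization error: it is of the order of the hazard times the bin width $\Delta$.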
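From Bayes' theorem, one obtains the (log-)conditional posterior density (for a given value of $\lambda$) of the vector of B-spline coefficients:
$$p(\theta \,|\, \lambda, D) \propto \exp\big(\ell(\theta; D)\big)\, p(\theta \,|\, \lambda) \propto \exp\left(\ell(\theta; D) - \frac{\lambda}{2}\theta^\top P \theta\right),$$
$$\log p(\theta \,|\, \lambda, D) \doteq \ell(\theta; D) - \frac{\lambda}{2}\theta^\top P \theta, \tag{4}$$
where $\propto$ and $\doteq$ are symbols used to denote equality up to a multiplicative and additive constant, respectively. The Laplace approximation to the conditional posterior of the B-spline amplitudes is obtained by fitting a (multivariate) Gaussian density around the (unknown) mode $\theta_M(\lambda)$ of $p(\theta \,|\, \lambda, D)$. A Newton-Raphson algorithm involving the gradient and Hessian matrix of $\log p(\theta \,|\, \lambda, D)$ is used to approximate the modal value, so that at convergence, one recovers the Laplace approximation $\tilde p_G(\theta \,|\, \lambda, D)$ with mean/mode $\theta^*(\lambda) \approx \theta_M(\lambda)$ and variance-covariance matrix $\Sigma^*(\lambda)$, equal to the inverse of the negative Hessian matrix of $\log p(\theta \,|\, \lambda, D)$ evaluated at $\theta^*(\lambda)$. To speed up the mode finding algorithm, we compute the following analytical versions of the gradient and Hessian:
$$\nabla_\theta \log p(\theta \,|\, \lambda, D) = \nabla_\theta \ell(\theta; D) - \lambda P \theta, \tag{5}$$
$$\nabla^2_\theta \log p(\theta \,|\, \lambda, D) = \nabla^2_\theta \ell(\theta; D) - \lambda P. \tag{6}$$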
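The penalty matrix $P = D_r^\top D_r + \varepsilon I$ and the quadratic penalty term $-\frac{\lambda}{2}\theta^\top P \theta$ entering (4) can be sketched in a few lines of pure Python. The function names are hypothetical, and the default order $r = 2$ is an assumption made here for illustration, since the text leaves $r$ generic.

```python
def difference_matrix(K, r):
    """r-th order difference matrix D_r of dimension (K - r) x K."""
    D = [[1.0 if i == j else 0.0 for j in range(K)] for i in range(K)]
    for _ in range(r):
        # Each pass takes first differences of consecutive rows.
        D = [[D[i + 1][j] - D[i][j] for j in range(K)] for i in range(len(D) - 1)]
    return D


def penalty_matrix(K, r=2, eps=1e-6):
    """P = D_r^T D_r + eps * I; the eps-ridge makes P full rank."""
    D = difference_matrix(K, r)
    return [
        [
            sum(D[m][i] * D[m][j] for m in range(len(D))) + (eps if i == j else 0.0)
            for j in range(K)
        ]
        for i in range(K)
    ]


def log_prior_kernel(theta, lam, P):
    """Penalty contribution -(lam / 2) * theta^T P theta to the log posterior (4)."""
    K = len(theta)
    quad = sum(theta[i] * P[i][j] * theta[j] for i in range(K) for j in range(K))
    return -0.5 * lam * quad
```

With $r = 2$, a coefficient vector that is linear in $k$ is annihilated by $D_r$, so only the small $\varepsilon$-ridge contributes to the penalty; this is why the perturbation is needed for $P$ to be invertible.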
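To ease the notation, let us define the following quantities related to the left bound of the incubation period, where $\Delta$ is the bin width from (2): $\psi^L_{ik} := \sum_{j=1}^{j(t_i^{I_L})} h(s_j) b_k(s_j)\Delta$, $\psi^L_{il} := \sum_{j=1}^{j(t_i^{I_L})} h(s_j) b_l(s_j)\Delta$, $\psi^L_{ikl} := \sum_{j=1}^{j(t_i^{I_L})} h(s_j) b_k(s_j) b_l(s_j)\Delta$, and analogously for the right bound $\psi^R_{ik}$, $\psi^R_{il}$ and $\psi^R_{ikl}$. The gradient of the log-likelihood given in (3) is shown to be (see detailed derivations in Appendix S1):
$$\nabla_\theta \ell(\theta; D) = \left(\frac{\partial}{\partial\theta_1}\ell(\theta; D), \dots, \frac{\partial}{\partial\theta_K}\ell(\theta; D)\right)^\top, \text{ where}$$
$$\frac{\partial}{\partial\theta_k}\ell(\theta; D) = \sum_{i=1}^{n} \Big(\tilde S(t_i^{I_L}) - \tilde S(t_i^{I_R})\Big)^{-1}\Big(\tilde S(t_i^{I_R})\psi^R_{ik} - \tilde S(t_i^{I_L})\psi^L_{ik}\Big), \quad \text{for } k = 1, \dots, K. \tag{7}$$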
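The chain-rule step behind the gradient can be sketched from the definitions in (1) and (2) alone. The lines below are a short worked derivation for a single time point $t$ and a single individual $i$, not a substitute for the full derivation in Appendix S1.

```latex
\begin{align*}
\frac{\partial}{\partial\theta_k} h(s_j)
  &= \frac{\partial}{\partial\theta_k} \exp\big(\theta^\top b(s_j)\big)
   = h(s_j)\, b_k(s_j), \\
\frac{\partial}{\partial\theta_k} \tilde S(t)
  &= -\tilde S(t) \sum_{j=1}^{j(t)} h(s_j)\, b_k(s_j)\, \Delta, \\
\frac{\partial}{\partial\theta_k}
  \log\Big(\tilde S(t_i^{I_L}) - \tilde S(t_i^{I_R})\Big)
  &= \frac{\dfrac{\partial}{\partial\theta_k}\tilde S(t_i^{I_L})
         - \dfrac{\partial}{\partial\theta_k}\tilde S(t_i^{I_R})}
          {\tilde S(t_i^{I_L}) - \tilde S(t_i^{I_R})}.
\end{align*}
```

Substituting the second line into the third and summing over $i = 1, \dots, n$ gives the $k$th component of the gradient of the log-likelihood.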
The Hessian matrix of the log-likelihood is (details in Appendix S1):
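$$\nabla^2_\theta \ell(\theta; D) = \frac{\partial^2}{\partial\theta\,\partial\theta^\top}\ell(\theta; D),$$
where the entry at the $k$th row and $l$th column is:
$$\frac{\partial^2}{\partial\theta_k\,\partial\theta_l}\ell(\theta; D) = \sum_{i=1}^{n} \Bigg\{ \Big(\tilde S(t_i^{I_L}) - \tilde S(t_i^{I_R})\Big)^{-1} \Big[\tilde S(t_i^{I_R})\big(\psi^R_{ikl} - \psi^R_{il}\psi^R_{ik}\big) - \tilde S(t_i^{I_L})\big(\psi^L_{ikl} - \psi^L_{il}\psi^L_{ik}\big)\Big] - \Big(\tilde S(t_i^{I_R})\psi^R_{ik} - \tilde S(t_i^{I_L})\psi^L_{ik}\Big)\Big(\tilde S(t_i^{I_R})\psi^R_{il} - \tilde S(t_i^{I_L})\psi^L_{il}\Big)\Big(\tilde S(t_i^{I_L}) - \tilde S(t_i^{I_R})\Big)^{-2} \Bigg\}.$$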
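Using the above gradient and Hessian in conjunction with (5) and (6), an iterative algorithm (e.g. Newton-Raphson) can be used to obtain $\theta^*(\lambda)$ as a proxy for the posterior mode of $p(\theta \,|\, \lambda, D)$. The mode of the Laplace approximation is conditional on the penalty parameter and we therefore need a strategy to calibrate the amount of smoothing. The idea is to use an optimal smoothing approach where the maximum a posteriori value of an approximate version of the marginal posterior of $\lambda$, following from Tierney and Kadane (1986) and Rue et al. (2009), is computed. Mathematically, optimal smoothing means $\lambda_M = \operatorname{argmax}_\lambda \log \tilde p(\lambda \,|\, D)$, with the following (approximate) posterior distribution for the penalty:
$$\tilde p(\lambda \,|\, D) \propto \left.\frac{L(\theta; D)\, p(\theta \,|\, \lambda)\, p(\lambda)}{\tilde p_G(\theta \,|\, \lambda, D)}\right|_{\theta = \theta^*(\lambda)} \propto |\Sigma^*(\lambda)|^{0.5}\, \lambda^{0.5K + a_\lambda - 1} \exp\Big(\ell(\theta^*(\lambda); D) - \lambda\big(0.5\,\theta^{*\top}(\lambda) P \theta^*(\lambda) + b_\lambda\big)\Big). \tag{8}$$
An approximation to $\lambda_M$, denoted by $\lambda^*$, is found by exploring $\log \tilde p(\lambda \,|\, D)$ on a linear grid for $\log_{10}(\lambda)$ and the final resulting Laplace approximation is written by abuse of notation as $\tilde p_G(\theta \,|\, \lambda^*, D) = N_{\dim(\theta)}\big(\theta^*(\lambda^*), \Sigma^*(\lambda^*)\big)$.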
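The grid exploration over $\log_{10}(\lambda)$ can be sketched as below. The function `grid_argmax`, the grid limits and the toy log-posterior are hypothetical placeholders: in the actual method, each evaluation of the log-posterior in (8) requires a Newton-Raphson run to obtain the conditional mode and covariance matrix for the given $\lambda$.

```python
import math


def grid_argmax(log_post, log10_min=-2.0, log10_max=4.0, n_grid=121):
    """Return the grid value of lambda maximizing log_post.

    The grid is linear on the log10(lambda) scale, as described in the text;
    the default limits and grid size are illustrative assumptions.
    """
    step = (log10_max - log10_min) / (n_grid - 1)
    grid = [10.0 ** (log10_min + i * step) for i in range(n_grid)]
    values = [log_post(lam) for lam in grid]
    best = max(range(n_grid), key=values.__getitem__)
    return grid[best]


def toy_log_post(lam, a=3.0, b=2.0):
    """Toy stand-in for log p(lambda | D): a Gamma-type kernel with mode (a - 1) / b."""
    return (a - 1.0) * math.log(lam) - b * lam


lam_star = grid_argmax(toy_log_post)
```

Working on the $\log_{10}$ scale spreads the grid points evenly over several orders of magnitude, which suits a penalty parameter whose plausible values are not known in advance.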
The mean $\theta^*(\lambda^*)$ and variance-covariance matrix $\Sigma^*(\lambda^*)$ are essential quantities to build an efficient MCMC algorithm to sample from the joint posterior of the model parameters $p(\theta, \lambda \,|\, D)$. In fact, we can make use of the Langevinized Gibbs sampler (LGS) developed in Gressani et al. (2022b), where the conditional posterior $p(\theta \,|\, \lambda, D)$ is sampled using a modified Langevin-Hastings algorithm and the conditional posterior of the penalty parameter, $(\lambda \,|\, \theta, D) \sim G\big(0.5K + a_\lambda,\ 0.5\,\theta^\top P \theta + b_\lambda\big)$, is sampled in a Gibbs step. The algorithm benefits from adaptive tuning to reach the optimal acceptance rate of 0.57 (Roberts and Rosenthal, 1998). Given $\{\theta^{(m)} : m = 1, \dots, M\}$, a sample of size $M$ of $K$-dimensional vectors $\theta$ obtained from the LGS algorithm, the point estimate of the $k$th component is taken to be the posterior median of the sample $\{\theta_k^{(m)} : m = 1, \dots, M\}$ and we denote by $\hat\theta$ the point estimate of $\theta$. Plugging the latter in the formulas of the hazard in (1) and the survival in (2), we obtain the point estimates $\hat h(t)$ and $\hat{\tilde S}(t)$ at a given time point $t$. Finally, exploiting the relationship between the density, the hazard and the survival functions, our semi-parametric estimate of the incubation density based on interval-censored data is $\hat\varphi_{IC}(t) = \hat h(t)\,\hat{\tilde S}(t)$ for all $t \ge 0$.
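Two ingredients of the sampler lend themselves to a short illustration: the Gibbs draw of the penalty parameter from its Gamma conditional posterior, and the componentwise posterior median used as point estimate. The sketch below covers only these two steps; the modified Langevin-Hastings update for the spline coefficients is not reproduced here, and the function names are hypothetical.

```python
import random
import statistics


def gibbs_draw_lambda(theta, P, a_lam=1e-4, b_lam=1e-4, rng=random):
    """Draw lambda | theta, D ~ Gamma(shape = 0.5 K + a, rate = 0.5 theta' P theta + b)."""
    K = len(theta)
    quad = sum(theta[i] * P[i][j] * theta[j] for i in range(K) for j in range(K))
    shape = 0.5 * K + a_lam
    rate = 0.5 * quad + b_lam
    # random.gammavariate is parameterized by (shape, scale), hence scale = 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)


def posterior_median(draws):
    """Componentwise median of a list of K-dimensional MCMC draws."""
    K = len(draws[0])
    return [statistics.median(d[k] for d in draws) for k in range(K)]
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, which is convenient when checking a sampler against the known mean (shape divided by rate) of the Gamma conditional.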
2.2.2 Flexible density estimation for midpoint imputation
The second component of the mixture density estimator $\hat\varphi_{HS}(\cdot)$ under the semi-parametric approach is obtained through a midpoint imputation technique. Starting from the incubation bounds in $D$, we construct an artificial dataset $\tilde D = \{\tilde t_i : i = 1, \dots, n\}$, where the infection time of individual $i$ is assumed to be located in the middle of the incubation interval, so that the imputed incubation period is:
$$\tilde t_i = 0.5\,(t_i^{I_L} + t_i^{I_R}) = 0.5\,(t_i^{S} - t_i^{E_R} + t_i^{S} - t_i^{E_L}) = t_i^{S} - 0.5\,(t_i^{E_L} + t_i^{E_R}).$$
Note that $\tilde D$ is seen as a random sample from the incubation density $\varphi(\cdot)$. From ideas in Eilers and Marx (2010), we construct a histogram on a compact domain $\tilde T = [0, \tilde t_u]$ and recommend using an upper bound that is at least equal to $t_u$ in Section 2.2.1, i.e. $\tilde t_u \ge t_u$. $\tilde T$ is partitioned in $L$ bins with midpoint $x_l$ and width $h$ so that the $l$th bin is the half-open interval $B_l = [x_l - h/2, x_l + h/2)$ and the last bin is a closed interval. Typically, the histogram smoother is insensitive to the choice of the binwidth (Eilers and Borgdorff, 2007) and it is advocated to use narrow bins (e.g. $h = 0.05$). Another possibility is to use a binwidth $h$ determined by a preliminary kernel smoother. The number of imputed incubation periods falling in bin $l$ is $y_l = \sum_{i=1}^{n} \mathbb{I}(\tilde t_i \in B_l)$, where $\mathbb{I}(\cdot)$ is the indicator function.
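The midpoint imputation and the construction of the histogram counts can be sketched as follows. Function names and example values are hypothetical; the binning follows the convention in the text, with half-open bins and a closed last bin.

```python
def midpoint_impute(bounds):
    """Imputed incubation times: midpoints of the incubation intervals (t_IL, t_IR)."""
    return [0.5 * (lo + hi) for lo, hi in bounds]


def bin_counts(times, t_upper, h=0.05):
    """Histogram counts y_l on [0, t_upper] with bins of width h.

    Bins are half-open [x_l - h/2, x_l + h/2), except the last one, which is closed.
    Assumes t_upper is (numerically) a multiple of h.
    """
    L = int(round(t_upper / h))
    counts = [0] * L
    for t in times:
        if not 0.0 <= t <= t_upper:
            raise ValueError("imputed time outside [0, t_upper]")
        l = min(int(t / h), L - 1)  # t == t_upper falls in the closed last bin
        counts[l] += 1
    midpoints = [(l + 0.5) * h for l in range(L)]
    return midpoints, counts


# Illustrative incubation bounds (not from the paper's data).
bounds = [(5.5, 7.5), (5.0, 8.0), (1.0, 2.0)]
times = midpoint_impute(bounds)
x, y = bin_counts(times, t_upper=10.0, h=0.5)
```

The pairs `(x, y)` are the inputs that a count smoother would receive; with narrow bins most counts are zero, which the penalized spline model is designed to handle.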
The count variable $y_l$ is assumed to follow a negative binomial distribution $y_l \sim \text{NegBin}(\mu_l, \rho)$ with mean $\mu_l > 0$ and overdispersion parameter $\rho > 0$. Following the footsteps of Section 2.2.1, we impose a cubic B-spline basis on $\tilde T$ and model the log of the mean counts as $\log(\mu_l) = \sum_{k=1}^{K} \theta_k b_k(x_l)$. The advantage of such a formulation is that it allows us to recover exactly the same model as in EpiLPS (Gressani et al., 2022b) to smooth case counts. We thus refer the reader to the latter reference for all the equations related to the Laplacian-P-splines approach leading to an estimate $\hat\theta$ of the vector of B-spline coefficients. The density estimate resulting from histogram smoothing is then given by $\hat\varphi_{HS}(t) = (nh)^{-1} \exp\big(\sum_{k=1}^{K} \hat\theta_k b_k(t)\big)$ for all $t \ge 0$ and, assuming equal weights $\omega = 0.5$, our semi-parametric mixture density estimator for the incubation density $\varphi(t)$ at a given time point $t \ge 0$ is $\hat\varphi_{SP}(t) = 0.5\,\big(\hat\varphi_{IC}(t) + \hat\varphi_{HS}(t)\big)$.
2.3 Parametric fits using moment matching
In some situations it may be advantageous to fit the data by using well-known parametric distributions. Our methodology leaves a door open for this possibility by informing three classic parametric distributions that are usually considered in the estimation of the incubation period, namely the two-parameter lognormal, Gamma and Weibull families. We use $\alpha$ and $\beta$ to generically denote the two parameters of the latter families (see Appendix S2 for the detailed parameterization). The moment matching strategy is illustrated in the following pseudo-algorithm:

Moment matching algorithm to fit parametric distributions.
1: for $m = 1, \dots, M$ do
2:   From the LGS MCMC sample (Section 2.2.1), compute $\hat\varphi_{SP}(t \,|\, \theta^{(m)}) = 0.5\,\big(\hat\varphi_{IC}(t \,|\, \theta^{(m)}) + \hat\varphi_{HS}(t)\big)$.
3:   Obtain (numerically) the first moment and second central moment of $I$ as:
       $\hat E^{(m)}(I) = \int_0^{+\infty} t\, \hat\varphi_{SP}(t \,|\, \theta^{(m)})\, dt$,
       $\hat V^{(m)}(I) = \int_0^{+\infty} \big(t - \hat E^{(m)}(I)\big)^2\, \hat\varphi_{SP}(t \,|\, \theta^{(m)})\, dt$.
4:   Use the above moments to estimate $\alpha^{(m)}$ and $\beta^{(m)}$ of the chosen parametric distributions.
5: end for

The posterior medians of the samples $\{\alpha^{(m)} : m = 1, \dots, M\}$ and $\{\beta^{(m)} : m = 1, \dots, M\}$, denoted by $\hat\alpha$ and $\hat\beta$, respectively, can be used to construct the lognormal density fit $\hat\varphi_{LN}(t)$, the Gamma density fit $\hat\varphi_{G}(t)$ and the Weibull density fit $\hat\varphi_{W}(t)$ to $\varphi(t)$.
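Steps 3 and 4 of the moment matching algorithm can be sketched in pure Python as below. The parameterizations used here (lognormal by log-mean and log-standard deviation, Gamma by shape and rate, Weibull by shape and scale) are assumptions, since the paper's own parameterization is given in Appendix S2; the function names are hypothetical and the Weibull shape is obtained by a simple bisection on the squared coefficient of variation.

```python
import math


def moments_from_density(grid, dens):
    """Mean and variance of a density tabulated on a grid (trapezoidal rule)."""
    def trapz(y):
        return sum(0.5 * (y[i] + y[i + 1]) * (grid[i + 1] - grid[i])
                   for i in range(len(grid) - 1))
    mean = trapz([t * d for t, d in zip(grid, dens)])
    var = trapz([(t - mean) ** 2 * d for t, d in zip(grid, dens)])
    return mean, var


def match_lognormal(mean, var):
    """Return (meanlog, sdlog) matching the given mean and variance."""
    sigma2 = math.log(1.0 + var / mean ** 2)
    return math.log(mean) - 0.5 * sigma2, math.sqrt(sigma2)


def match_gamma(mean, var):
    """Return (shape, rate) matching the given mean and variance."""
    return mean ** 2 / var, mean / var


def match_weibull(mean, var, lo=0.05, hi=50.0, tol=1e-10):
    """Return (shape, scale); the shape solves CV^2(k) = var / mean^2 by bisection.

    Assumes the target CV^2 lies between the values attained at `hi` and `lo`.
    """
    target = var / mean ** 2

    def cv2(k):
        # Squared coefficient of variation of a Weibull with shape k (decreasing in k).
        return math.exp(math.lgamma(1 + 2 / k) - 2 * math.lgamma(1 + 1 / k)) - 1.0

    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if cv2(mid) > target:
            lo = mid   # variability too high: the shape must be larger
        else:
            hi = mid
        if hi - lo < tol:
            break
    shape = 0.5 * (lo + hi)
    scale = mean / math.gamma(1 + 1 / shape)
    return shape, scale
```

In the algorithm, these functions would be called once per MCMC draw, and the posterior medians of the resulting parameter samples would define the parametric fits.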
To choose between the four candidate density estimates $\{\hat\varphi_{SP}(\cdot), \hat\varphi_{LN}(\cdot), \hat\varphi_{G}(\cdot), \hat\varphi_{W}(\cdot)\}$, we use
the Bayesian information criterion (Schwarz, 1978) computed as $\text{BIC}_P = -2\ell(\hat\alpha, \hat\beta; D) + 2\log(n)$ for the parametric fits, i.e. $P \in \{LN, G, W\}$ and $\ell(\hat\alpha, \hat\beta; D) = \sum_{i=1}^{n} \log \int_{t_i^{I_L}}^{t_i^{I_R}} \hat\varphi_P(t)\, dt$. For the semi-parametric fit with P-splines, we use the formula $\text{BIC}_{SP} = -2\ell(\hat\theta; D) + \text{ED}\log(n)$, where $\hat\theta$ is the estimate of $\theta$ obtained from the LGS algorithm and ED is the effective dimension of the model in Section 2.2.1, obtained as $\text{ED} = \text{Tr}\Big(\big(-\nabla^2_\theta \ell(\hat\theta; D) + \hat\lambda P\big)^{-1}\big(-\nabla^2_\theta \ell(\hat\theta; D)\big)\Big)$, where $\hat\lambda$ is the median of the MCMC sample for $\lambda$ in the LGS algorithm and $\text{Tr}(\cdot)$ denotes the trace of a matrix.
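The BIC of a parametric candidate can be sketched from the interval-censored log-likelihood and a closed-form distribution function. The Weibull parameterization by shape and scale is an assumption (the paper's parameterization is in Appendix S2), all function names are hypothetical, and the BIC values passed to the selection helper in the usage note are made-up numbers for illustration only.

```python
import math


def weibull_cdf(t, shape, scale):
    """Weibull distribution function (shape/scale parameterization assumed)."""
    return 1.0 - math.exp(-((t / scale) ** shape)) if t > 0 else 0.0


def loglik_ic(bounds, cdf):
    """Interval-censored log-likelihood: sum of log(F(t_IR) - F(t_IL))."""
    return sum(math.log(cdf(hi) - cdf(lo)) for lo, hi in bounds)


def bic_parametric(bounds, cdf, n_params=2):
    """BIC = -2 * loglik + n_params * log(n) for a two-parameter family."""
    n = len(bounds)
    return -2.0 * loglik_ic(bounds, cdf) + n_params * math.log(n)


def select_by_bic(bics):
    """Return the name of the candidate with the smallest BIC."""
    return min(bics, key=bics.get)
```

For example, `select_by_bic({"LN": 10.2, "G": 9.8, "W": 11.0, "SP": 12.5})` returns `"G"`. For the semi-parametric candidate, the penalty term uses the effective dimension ED instead of the fixed count of two parameters.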
3 Results
To assess the performance of our methodology, we designed various simulation scenarios with different target incubation densities, data coarseness and sample size. For the incubation density, we use:
- The lognormal density reported in Ferretti et al. (2020) with a mean of 5.5 days and standard deviation of 2.1 days.
- The Weibull density from Backer et al. (2020) with a mean of 6.4 days and standard deviation of 2.3 days.
- An artificial bimodal incubation density constructed as a mixture of two Weibulls with a mean of 7.5 days and standard deviation of 4.6 days.
- A Gamma density from Donnelly et al. (2003) with a mean of 3.8 days and standard deviation of 2.9 days.
We assume two levels of data coarseness with average exposure window $E$ equal to one or two days and exposure windows with maximum width of 7 days, reflecting a range that is often observed in practice (see e.g. Yang et al., 2020). For the sample size, we fix $n = 100$ and $n = 200$. From the combination of all these settings, we obtain a total of $4 \times 2 \times 2 = 16$ scenarios. The features on which we assess the performance are the mean and standard deviation of the incubation period and additional quantiles that are typically of particular interest (e.g. the 5th and 95th percentiles and the median). We also make a graphical evaluation of the fits by overlaying the density estimates with the target incubation density. Moreover, we are also interested in the performance of the selection process of our methodology, i.e. how many times our modeling approach selects the correct parametric family that corresponds to the incubation distribution used in the data generating mechanism.
For each scenario, we simulate $S = 1000$ datasets and use $K = 10$ B-spline basis functions for all scenarios, except for the bimodal scenario where $K = 20$ to capture the more flexible density pattern. The number of MCMC iterations for the LGS sampler is fixed at $M = 1000$ and the acceptance rate varied in a close neighborhood of the optimal acceptance rate (57%) in all scenarios. Simulations are implemented on an Intel Xeon W-2255 CPU @3.70GHz with 32 GB of RAM and it takes approximately one hour for each scenario (and a bit more for the case with $K = 20$).
Tables 1-4 summarize the results for selected pointwise features of the incubation density for all scenarios (S1-S16). In general, the bias is relatively small for all features but is more pronounced for the 95th percentile, as fewer data points are observed in that remote region of the domain of the incubation density. In addition, we observe that an increase in the sample size leads to a decrease in the root mean square error
(RMSE). This decrease sometimes comes at the cost of an increase in bias, reflecting the well-known bias-variance trade-off.
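The performance measures reported in the tables can be sketched as follows, assuming the standard definitions of bias (average estimate minus true value) and RMSE across the simulated datasets; the function name and the example estimates are hypothetical.

```python
import math


def performance(estimates, truth):
    """Average, bias and RMSE of a feature estimate across simulated datasets."""
    S = len(estimates)
    avg = sum(estimates) / S
    bias = avg - truth
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / S)
    return avg, bias, rmse


# Illustrative estimates of a mean incubation time (not results from the paper).
avg, bias, rmse = performance([5.4, 5.5, 5.6], truth=5.5)
```

Since the squared RMSE decomposes into squared bias plus the variance of the estimates, a lower RMSE at larger $n$ is compatible with a bias that stays constant or increases slightly.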
From Figures 2-5, we see that in general the estimates provided by our method are able to capture the target incubation densities well. Thanks to the flexibility of our approach, even bimodal densities (Figure 4) are well reconstructed, which would not be feasible with parametric approaches relying on classic families. Moreover, the dashed curves (representing the pointwise median of the estimates across the $S = 1000$ simulated datasets) are in most cases not distinguishable from the target incubation density. Also, the fitted densities appear closer to the target with $n = 200$ as compared to $n = 100$ as more information is available. This is corroborated by the squared Hellinger distance (see the histograms provided in the GitHub repository associated with this working paper) between the target incubation density $\varphi(\cdot)$ and our estimate $\hat\varphi(\cdot)$ computed with the formula:
$$H^2\big(\hat\varphi(t), \varphi(t)\big) = \frac{1}{2}\int \Big(\sqrt{\hat\varphi(t)} - \sqrt{\varphi(t)}\Big)^2 dt.$$
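The squared Hellinger distance can be approximated numerically with a simple trapezoidal rule, as in the sketch below. The function name, the truncation of the integral at a finite upper bound and the number of grid points are assumptions made for illustration.

```python
import math


def squared_hellinger(f, g, upper, n=20000):
    """Approximate H^2(f, g) = 0.5 * integral of (sqrt(f) - sqrt(g))^2 over [0, upper].

    Uses the trapezoidal rule on n equal subintervals; `upper` should be large
    enough for both densities to be negligible beyond it.
    """
    step = upper / n
    total = 0.0
    for i in range(n + 1):
        t = i * step
        weight = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * (math.sqrt(f(t)) - math.sqrt(g(t))) ** 2
    return 0.5 * total * step
```

The distance lies between 0 (identical densities) and 1 (densities with disjoint support). For two exponential densities with rates $a$ and $b$, the closed form $1 - 2\sqrt{ab}/(a + b)$ provides a convenient check of the numerical approximation.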
Finally, Table 5 shows that our method is quite efficient in detecting the true underlying distribution from which data is generated. For the lognormal incubation target, our LPS model selects the lognormal model in more than 73% of cases with $n = 100$ and it goes up to more than 80% of cases with $n = 200$. A correct selection is even made in 95% of cases in the Weibull setting. Interestingly, our methodology never selects any parametric candidate when the underlying truth is a bimodal density. Although this may not be the case for smaller sample sizes, it is still an encouraging sign. For the Gamma case, our model hesitates between a Gamma and a Weibull, but this is not really a problem as the main features of the true underlying Gamma density are still relatively well captured (see Table 4).
Average coarseness: 1 day
                      n = 100 (S1)            n = 200 (S2)
            True Average    Bias    RMSE Average    Bias    RMSE
Mean       5.528   5.477  -0.052   0.208   5.472  -0.056   0.158
SD         2.075   1.993  -0.082   0.195   1.997  -0.078   0.154
q0.05      2.849   2.828  -0.021   0.174   2.837  -0.012   0.131
q0.25      4.052   4.053   0.001   0.171   4.047  -0.005   0.124
q0.50      5.176   5.169  -0.007   0.197   5.156  -0.020   0.147
q0.75      6.612   6.559  -0.053   0.262   6.546  -0.066   0.203
q0.95      9.403   9.170  -0.234   0.542   9.181  -0.222   0.426

Average coarseness: 2 days
                      n = 100 (S3)            n = 200 (S4)
            True Average    Bias    RMSE Average    Bias    RMSE
Mean       5.528   5.431  -0.097   0.225   5.417  -0.111   0.186
SD         2.075   1.942  -0.133   0.222   1.934  -0.141   0.191
q0.05      2.849   2.836  -0.013   0.179   2.847   0.002   0.134
q0.25      4.052   4.044  -0.008   0.173   4.036  -0.016   0.127
q0.50      5.176   5.138  -0.038   0.205   5.118  -0.057   0.158
q0.75      6.612   6.492  -0.120   0.287   6.467  -0.145   0.241
q0.95      9.403   9.022  -0.381   0.624   9.002  -0.401   0.536

Table 1: Performance measures for selected features of the incubation density for two levels of data coarseness with sample size n = 100 and n = 200. Results are for S = 1000 simulated datasets with the lognormal incubation density of Ferretti et al. (2020).
Figure 2: Estimated incubation densities for Scenarios 1-4. The dash-dotted line is the pointwise
median across the S=1000 simulations and the solid black line is the lognormal incubation density
of Ferretti et al. (2020).
Average coarseness: 1 day
                  n = 100 (S5)               n = 200 (S6)
         True     Average  Bias     RMSE     Average  Bias     RMSE
Mean     6.403    6.393    -0.010   0.229    6.392    -0.011   0.169
SD       2.327    2.325    -0.002   0.151    2.327     0.001   0.110
q0.05    2.665    2.690     0.026   0.312    2.666     0.002   0.208
q0.25    4.734    4.726    -0.008   0.262    4.722    -0.012   0.191
q0.50    6.346    6.317    -0.029   0.257    6.327    -0.019   0.188
q0.75    7.995    7.961    -0.034   0.270    7.975    -0.020   0.199
q0.95    10.336   10.342    0.006   0.390    10.334   -0.002   0.285
Average coarseness: 2 days
                  n = 100 (S7)               n = 200 (S8)
         True     Average  Bias     RMSE     Average  Bias     RMSE
Mean     6.403    6.350    -0.053   0.232    6.338    -0.065   0.181
SD       2.327    2.275    -0.051   0.162    2.282    -0.045   0.116
q0.05    2.665    2.712     0.048   0.308    2.666     0.002   0.202
q0.25    4.734    4.722    -0.011   0.254    4.705    -0.029   0.190
q0.50    6.346    6.282    -0.064   0.256    6.282    -0.064   0.198
q0.75    7.995    7.886    -0.109   0.288    7.893    -0.102   0.225
q0.95    10.336   10.202   -0.133   0.434    10.187   -0.149   0.321
Table 2: Performance measures for selected features of the incubation density for two levels of
data coarseness with n= 100 and n= 200. Results are for S= 1000 simulated datasets and the
Weibull incubation density of Backer et al. (2020).
Figure 3: Estimated incubation densities for Scenarios 5-8. The dash-dotted line is the pointwise
median across the S=1000 simulations and the solid black line is the Weibull incubation density
of Backer et al. (2020).
Average coarseness: 1 day
                  n = 100 (S9)               n = 200 (S10)
         True     Average  Bias     RMSE     Average  Bias     RMSE
Mean     7.538    7.533    -0.005   0.468    7.532    -0.006   0.327
SD       4.622    4.593    -0.029   0.143    4.599    -0.023   0.098
q0.05    1.371    1.229    -0.142   0.293    1.280    -0.091   0.197
q0.25    3.050    3.026    -0.024   0.311    3.005    -0.045   0.207
q0.50    7.191    7.333     0.142   1.822    7.238     0.092   1.561
q0.75    12.080   12.027   -0.053   0.315    12.023   -0.057   0.215
q0.95    13.734   13.584   -0.150   0.291    13.594   -0.140   0.225
Average coarseness: 2 days
                  n = 100 (S11)              n = 200 (S12)
         True     Average  Bias     RMSE     Average  Bias     RMSE
Mean     7.538    7.493    -0.045   0.486    7.506    -0.032   0.328
SD       4.622    4.565    -0.057   0.155    4.578    -0.044   0.110
q0.05    1.371    1.228    -0.143   0.297    1.282    -0.089   0.196
q0.25    3.050    3.017    -0.033   0.310    2.996    -0.054   0.210
q0.50    7.191    7.275     0.084   1.940    7.248     0.057   1.627
q0.75    12.080   11.999   -0.081   0.315    12.005   -0.075   0.219
q0.95    13.734   13.376   -0.358   0.445    13.407   -0.327   0.377
Table 3: Performance measures for selected features of the incubation density for two levels of
data coarseness with n= 100 and n= 200. Results are for S= 1000 simulated datasets and an
artificial incubation density constructed as a mixture of two Weibulls.
Figure 4: Estimated incubation densities for Scenarios 9-12. The dash-dotted line is the pointwise
median across the S=1000 simulations and the solid black line is an artificial incubation density
constructed as a mixture of two Weibulls.
Average coarseness: 1 day
                  n = 100 (S13)              n = 200 (S14)
         True     Average  Bias     RMSE     Average  Bias     RMSE
Mean     3.810    3.756    -0.054   0.298    3.738    -0.072   0.205
SD       2.889    2.737    -0.151   0.308    2.745    -0.144   0.237
q0.05    0.561    0.564     0.003   0.134    0.554    -0.008   0.091
q0.25    1.693    1.721     0.028   0.209    1.699     0.006   0.136
q0.50    3.110    3.135     0.025   0.288    3.109    -0.001   0.189
q0.75    5.175    5.125    -0.050   0.414    5.105    -0.070   0.282
q0.95    9.451    9.063    -0.388   0.867    9.073    -0.377   0.649
Average coarseness: 2 days
                  n = 100 (S15)              n = 200 (S16)
         True     Average  Bias     RMSE     Average  Bias     RMSE
Mean     3.810    3.530    -0.280   0.375    3.528    -0.282   0.333
SD       2.889    2.462    -0.426   0.485    2.472    -0.417   0.449
q0.05    0.561    0.561     0.000   0.130    0.560    -0.002   0.089
q0.25    1.693    1.681    -0.012   0.186    1.675    -0.018   0.131
q0.50    3.110    3.008    -0.102   0.270    3.001    -0.109   0.210
q0.75    5.175    4.820    -0.355   0.503    4.818    -0.357   0.438
q0.95    9.451    8.272    -1.179   1.350    8.300    -1.151   1.244
Table 4: Performance measures for selected features of the incubation density for two levels of data
coarseness with sample size n= 100 and n= 200. Results are for S= 1000 simulated datasets
and the Gamma incubation density of Donnelly et al. (2003).
Figure 5: Estimated incubation densities for Scenarios 13-16. The dash-dotted line is the pointwise
median across the S=1000 simulations and the solid black line is the Gamma incubation density
of Donnelly et al. (2003).
                         n = 100                           n = 200
                         SP      LN      G       W         SP      LN      G       W
φ_I Lognormal
  (1 day coarseness)     0%      73%     26.5%   0.5%      0%      81.6%   18.4%   0%
  (2 day coarseness)     0%      74.3%   23.7%   2%        0%      79.4%   20.6%   0%
φ_I Weibull
  (1 day coarseness)     3.9%    0.2%    9.1%    86.8%     1.8%    0%      3.2%    95%
  (2 day coarseness)     4.5%    0.2%    9.4%    85.9%     3.6%    0%      2.6%    93.8%
φ_I Weibmix
  (1 day coarseness)     100%    0%      0%      0%        100%    0%      0%      0%
  (2 day coarseness)     100%    0%      0%      0%        100%    0%      0%      0%
φ_I Gamma
  (1 day coarseness)     2.7%    1.2%    58.1%   38%       0.5%    0.1%    62.8%   36.6%
  (2 day coarseness)     3.9%    0.%     42.6%   52.9%     0.5%    0%      42.8%   56.7%
Table 5: Proportion of selected models across S= 1000 simulations under different scenarios.
4 Applications to real data
This section applies the proposed flexible estimation methodology to publicly available datasets
on reported exposures and symptom onset times. For the analyses, we use K= 20 B-splines and
an MCMC chain of length M= 20,000. A sensible choice for $t_u$ (and hence $\tilde{t}_u$), i.e. the upper
bound on which to fix the B-spline basis, can for instance be based on information from previous
studies on the incubation period for a given pathogen. For example, Virlogeux et al. (2015) report
the 99th percentile and range of the incubation period of human avian influenza A (H7N9), and the
systematic review of Lessler et al. (2009) on incubation periods of acute respiratory viral infections
gives an idea of the range of the incubation period for different diseases. Such empirical knowledge
can help in finding a choice for $t_u$ that supports with high confidence most of the probability mass
of the incubation period distribution.
Another practical aspect worth mentioning is that exposure times and symptom onset times
are in practice reported at a daily time resolution (calendar dates), while our model is in continuous
time. A common strategy to move from discrete to continuous observations is to assume that
exact times are uniformly distributed throughout the day and hence to perturb symptom onset
times and exposure window bounds by a uniform random variable between 0 and 1 (see e.g. Kreiss
and Van Keilegom, 2022).
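The uniform perturbation described above can be sketched as follows. This is a minimal illustration, not the implementation used for the analyses: the function name `jitter_case` is hypothetical, days are coded as integers, and the incubation bounds are assumed to be obtained by subtracting the exposure window bounds from the symptom onset time. Reordering the perturbed window and truncating negative bounds at zero are additional assumptions made here so that the output is always a valid interval.

```python
import random


def jitter_case(expo_left, expo_right, onset, rng):
    """Turn day-resolution records into continuous-time incubation bounds.

    Each reported day is perturbed by an independent Uniform(0, 1) draw.
    Returns (t_left, t_right), the lower and upper incubation bounds.
    """
    e_left = expo_left + rng.random()
    e_right = expo_right + rng.random()
    # Assumption: keep the exposure window ordered after perturbation
    e_left, e_right = min(e_left, e_right), max(e_left, e_right)
    s = onset + rng.random()
    # Assumption: an incubation time cannot be negative
    t_left = max(s - e_right, 0.0)
    t_right = max(s - e_left, 0.0)
    return t_left, t_right


rng = random.Random(2023)  # fixed seed for reproducibility
# Toy case: exposure between day 0 and day 3, symptom onset on day 7
t_left, t_right = jitter_case(0, 3, 7, rng)
print(t_left, t_right)
```

For the toy case, the lower bound always falls between 3 and 5 days and the upper bound between 6 and 8 days, whatever the random draws.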
4.1 COVID-19 infections among travellers from Wuhan
First, we estimate the incubation density based on exposure times and symptom onset
dates of confirmed COVID-19 cases with travel history to Wuhan (Backer et al., 2020). The
analysis considers 25 visitors to Wuhan with a closed exposure window, from which we removed
one individual whose exposure period (20 days) was considerably longer than those of the remaining
observations. Backer et al. (2020) obtained a lognormal fit with a mean incubation period of 4.5
days (CI95%: 3.7-5.6) and a 95th percentile of 8.0 days (CI95%: 6.3-11.8). From a discussion
with the first author of the latter reference, we were informed that a Gamma density with a mean
of 4.6 days (CI95%: 3.8-5.4) and a 95th percentile of 7.4 days (CI95%: 6.2-9.7) fitted equally well.
Our methodology provides a similar fit, namely a Gamma density with mean 4.4 days (CI95%:
4.0-4.8) and a 95th percentile of 7.7 days (CI95%: 7.2-8.5).
4.2 Transmission pair data on COVID-19
Next, we consider a dataset on transmission pairs for COVID-19 from Hart et al. (2021) that was
analyzed (among others) in Xia et al. (2020). The latter reference obtained a Weibull fit for the
incubation density with a mean of 4.9 days (CI95%: 4.4-5.4) and a 95th percentile of 9.9 days
(CI95%: 8.9-11.2). Restricting our analysis to a subset of n= 74 individuals with closed exposure
windows, we obtain a Weibull with a mean of 4.5 days (CI95%: 4.2-4.9) and a 95th percentile of
10.5 days (CI95%: 9.8-11.4).
4.3 Middle East Respiratory Syndrome (MERS)
In a third application, we consider a dataset given in Cauchemez et al. (2014) that reports lower
and upper bounds of the incubation period for seven individual MERS-CoV cases in the United
Kingdom, France, Italy and Tunisia. Based on these data, the latter reference obtains a best fit
to the incubation density that is lognormal with a mean of 5.5 days (CI95%: 3.6-10.2) and a
95th percentile of 10.2 days, extrapolated from the reported standard deviation in the reference
(CI95%: NA). Our approach selects the semi-parametric fit with a mean of 5.3 days (CI95%:
4.5-6.2) and a 95th percentile of 10.1 days (CI95%: 9.2-12.1).
4.4 Mpox
The last application is on a dataset reporting n= 18 confirmed Mpox cases in the Netherlands
(Miura et al., 2022). The latter analysis uses a parametric Bayesian approach similar to Backer
et al. (2020) and the best fitting model is given by a lognormal distribution with a mean incubation
period of 9.0 days (CI95%: 6.6-10.9) and a 95th percentile of 17.3 days (CI95%: 13.0-29.0).
Analyzing the same dataset with our flexible Bayesian approach, we obtain a lognormal fit with
mean incubation period of 8.9 days (CI95%: 7.9-9.9) and a 95th percentile of 16.6 days (CI95%:
14.7-19.1).
[Figure 6 panels: Backer et al. (2020); Hart et al. (2021); Cauchemez et al. (2014); Miura et al. (2022)]
Figure 6: Incubation bounds and estimated probability density functions and cumulative distribution
functions with the proposed flexible Bayesian approach for data on COVID-19, MERS-CoV
and Mpox.
5 Discussion
This article proposes a flexible approach to tackle the challenging problem of estimating the
incubation period distribution based on coarse data. This is done through Bayesian P-splines and
Laplace approximations. The semi-parametric model approximates the incubation density via a
finite mixture density smoother and the latter is used to fit three popular parametric distributions
that are often considered in the estimation of incubation times. The Bayesian information criterion
then arbitrates between the competing density estimators. Our methodology has a natural
place in the EpiLPS ecosystem as density smoothing under imputation methods can be formulated
as a histogram smoothing problem; a problem that has already been addressed by EpiLPS in the
context of smoothing case incidence data to estimate the time-varying reproduction number. The
current methodology also borrows from the existing Langevinized Gibbs sampler in EpiLPS and
thus incubation estimation as proposed here has a natural nest in the EpiLPS package.
The main strength of our work is that it makes it possible to go beyond the classic parametric
Bayesian approaches that are often considered in the literature. A clear disadvantage of such approaches
is that they may miss important features of the incubation period distribution if the latter turns
out to have a more flexible structure than what is proposed by parametric models. In addition,
the methodology developed here does not close the door to classic parametric models as the latter
can still be good candidates to get information on the incubation period. Another advantage is
that the algorithms underlying our work are available in the EpiLPS package. As such, they can
be of direct practical use for the scientific community and public health officers to analyze real
datasets. Note also that the routines underlying EpiLPS are efficient as computationally
expensive parts are treated with C++, so that results are typically obtained in a few seconds.
The main limitation of this work is that in some rare cases, our approach may select a flexible fit
to the incubation density while in reality it should have chosen a candidate among the parametric
models. As suggested by the simulation study, this occurs more frequently with a small sample
size (fewer than 5% of cases with n= 100) than with larger sample sizes. For instance, with n= 200,
this mismatch happened in at most 3.6% of cases. Note however that even with the small
sample sizes considered in the real data applications, our estimates for the mean and standard
deviation of the incubation period seem to be well aligned with results from previous studies.
From here, there are several research paths to explore. We can for instance enrich the current
model by not only considering a two-component mixture in the semi-parametric approach, but
rather a multi-component mixture with a multiple imputation approach. Furthermore, if one has
good reasons to believe that infection times are more likely to occur at a location other than the
midpoint of the exposure window, this can easily be adjusted in our model. Note also that here,
we gave equal weights ω= 0.5 to the single-interval censored data and midpoint imputed data
methods. If prior knowledge is available on more specific locations of infection times, weights can
be adjusted accordingly. Another interesting research perspective is to handle estimation of the
generation interval (GI), i.e. the time difference between the infection event of a primary case
(infector) and of a secondary case (infectee). As we now have a flexible method for estimating
the incubation distribution, we could work under a convolution setting to propose an estimator
for the GI.
Acknowledgments
We thank Jantien Backer and Jacco Wallinga from the National Institute for Public Health and
the Environment (RIVM) for discussing their results on the incubation period estimation for
COVID-19 based on confirmed cases with Wuhan travel history.
Data availability
Simulation results and real data applications in this paper can be fully reproduced with the code
available on the GitHub repository (https://github.com/oswaldogressani/Incubation) based
on the EpiLPS package version 1.2.0.
Funding
This work was supported by the ESCAPE project (101095619) and the VERDI project (101045989),
funded by the European Union. Views and opinions expressed are however those of the authors
only and do not necessarily reflect those of the European Union or the Health and Digital Executive
Agency (HADEA). Neither the European Union nor the granting authority can be held
responsible for them.
Competing interest
The authors have declared that there are no competing interests.
Appendix S1
Let us define the following quantities related to the left bound of the incubation period:
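$$
\psi^{L}_{ik} := \sum_{j=1}^{j(t_i^{I_L})} h(s_j)\, b_k(s_j)\,\Delta; \qquad
\psi^{L}_{il} := \sum_{j=1}^{j(t_i^{I_L})} h(s_j)\, b_l(s_j)\,\Delta; \qquad
\psi^{L}_{ikl} := \sum_{j=1}^{j(t_i^{I_L})} h(s_j)\, b_k(s_j)\, b_l(s_j)\,\Delta,
$$
where $h(s_j) = \exp(\theta^{\top} b(s_j))$. Analogously define the same quantities for the right bound $\psi^{R}_{ik}$, $\psi^{R}_{il}$ and $\psi^{R}_{ikl}$.
Gradient
Recall that the (approximated) log-likelihood is:
$$
\begin{aligned}
\ell(\theta; D) &\approx \sum_{i=1}^{n} \log\left( \exp\left(-\sum_{j=1}^{j(t_i^{I_L})} \exp(\theta^{\top} b(s_j))\,\Delta\right) - \exp\left(-\sum_{j=1}^{j(t_i^{I_R})} \exp(\theta^{\top} b(s_j))\,\Delta\right) \right) \\
&= \sum_{i=1}^{n} \log\left( \tilde{S}(t_i^{I_L}) - \tilde{S}(t_i^{I_R}) \right).
\end{aligned}
$$
$$
\frac{\partial}{\partial \theta_k} \ell(\theta; D) = \sum_{i=1}^{n} \left( \tilde{S}(t_i^{I_L}) - \tilde{S}(t_i^{I_R}) \right)^{-1} \left( \frac{\partial}{\partial \theta_k} \tilde{S}(t_i^{I_L}) - \frac{\partial}{\partial \theta_k} \tilde{S}(t_i^{I_R}) \right).
$$
Note that:
$$
\begin{aligned}
\frac{\partial}{\partial \theta_k} \tilde{S}(t_i^{I_L}) &= \exp\left(-\sum_{j=1}^{j(t_i^{I_L})} h(s_j)\,\Delta\right) \left(-\sum_{j=1}^{j(t_i^{I_L})} h(s_j)\, b_k(s_j)\,\Delta\right) \\
&= -\tilde{S}(t_i^{I_L})\, \psi^{L}_{ik}, \\
\frac{\partial}{\partial \theta_k} \tilde{S}(t_i^{I_R}) &= -\tilde{S}(t_i^{I_R})\, \psi^{R}_{ik}.
\end{aligned}
$$
It follows that the kth entry of $\nabla_{\theta}\, \ell(\theta; D)$ is:
$$
\frac{\partial}{\partial \theta_k} \ell(\theta; D) = \sum_{i=1}^{n} \left( \tilde{S}(t_i^{I_L}) - \tilde{S}(t_i^{I_R}) \right)^{-1} \left( \tilde{S}(t_i^{I_R})\, \psi^{R}_{ik} - \tilde{S}(t_i^{I_L})\, \psi^{L}_{ik} \right).
$$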
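The gradient expression above can be checked numerically against central finite differences of the log-likelihood. The sketch below is only an illustration under stated assumptions: Gaussian bumps stand in for the B-spline basis (the formula holds for any basis), the grid, the toy interval-censored data and the value of θ are made up, and j(t) is taken to be the number of bins up to and including the one containing t. None of the names below come from the EpiLPS package.

```python
import math

K, T_UPPER, N_BINS = 5, 10.0, 200
DELTA = T_UPPER / N_BINS                               # bin width
GRID = [(j + 0.5) * DELTA for j in range(N_BINS)]      # bin midpoints s_j
CENTERS = [T_UPPER * k / (K - 1) for k in range(K)]
# Stand-in basis b(s_j): Gaussian bumps, NOT the B-splines of the paper
B = [[math.exp(-0.5 * ((s - c) / 2.0) ** 2) for c in CENTERS] for s in GRID]


def surv_and_psi(theta, t):
    """Return S~(t) and the vector psi_k = sum_j h(s_j) b_k(s_j) Delta."""
    n_j = min(int(t / DELTA) + 1, N_BINS)              # assumed definition of j(t)
    cum, psi = 0.0, [0.0] * K
    for j in range(n_j):
        h = math.exp(sum(th * b for th, b in zip(theta, B[j])))
        cum += h * DELTA
        for k in range(K):
            psi[k] += h * B[j][k] * DELTA
    return math.exp(-cum), psi


def loglik(theta, data):
    """Approximated log-likelihood: sum_i log(S~(t_IL) - S~(t_IR))."""
    return sum(
        math.log(surv_and_psi(theta, tl)[0] - surv_and_psi(theta, tr)[0])
        for tl, tr in data
    )


def gradient(theta, data):
    """Analytic gradient: sum_i (S_R psi_R - S_L psi_L) / (S_L - S_R)."""
    grad = [0.0] * K
    for tl, tr in data:
        s_l, psi_l = surv_and_psi(theta, tl)
        s_r, psi_r = surv_and_psi(theta, tr)
        for k in range(K):
            grad[k] += (s_r * psi_r[k] - s_l * psi_l[k]) / (s_l - s_r)
    return grad


def finite_difference(theta, data, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    out = []
    for k in range(K):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        out.append((loglik(up, data) - loglik(down, data)) / (2 * eps))
    return out


data = [(1.2, 3.4), (2.0, 5.5), (4.1, 6.3)]            # toy (t_IL, t_IR) pairs
theta = [-2.0, -1.5, -1.0, -0.5, 0.0]
analytic = gradient(theta, data)
numeric = finite_difference(theta, data)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
print("largest analytic vs numeric gap:", max_gap)
```

A small gap indicates that the signs and the ordering of the left-bound and right-bound terms in the analytic expression are consistent with the log-likelihood.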
Hessian
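Let us define:
$$
\begin{aligned}
\gamma_i(\theta) &:= \tilde{S}(t_i^{I_R})\, \psi^{R}_{ik} - \tilde{S}(t_i^{I_L})\, \psi^{L}_{ik}, \\
\eta_i(\theta) &:= \tilde{S}(t_i^{I_L}) - \tilde{S}(t_i^{I_R}),
\end{aligned}
$$
so that the kth entry of $\nabla_{\theta}\, \ell(\theta; D)$ is rewritten compactly as:
$$
\frac{\partial}{\partial \theta_k} \ell(\theta; D) = \sum_{i=1}^{n} \frac{\gamma_i(\theta)}{\eta_i(\theta)}.
$$
Deriving the above expression again with respect to the lth B-spline component gives:
$$
\begin{aligned}
\frac{\partial^2}{\partial \theta_k \partial \theta_l} \ell(\theta; D)
&= \sum_{i=1}^{n} \eta_i(\theta)^{-2} \left( \frac{\partial \gamma_i(\theta)}{\partial \theta_l}\, \eta_i(\theta) - \gamma_i(\theta)\, \frac{\partial \eta_i(\theta)}{\partial \theta_l} \right) \\
&= \sum_{i=1}^{n} \left( \eta_i(\theta)^{-1}\, \frac{\partial \gamma_i(\theta)}{\partial \theta_l} - \gamma_i(\theta)\, \frac{\partial \eta_i(\theta)}{\partial \theta_l}\, \eta_i(\theta)^{-2} \right).
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial \gamma_i(\theta)}{\partial \theta_l}
&= \frac{\partial}{\partial \theta_l} \left( \tilde{S}(t_i^{I_R})\, \psi^{R}_{ik} - \tilde{S}(t_i^{I_L})\, \psi^{L}_{ik} \right) \\
&= \left( \frac{\partial \tilde{S}(t_i^{I_R})}{\partial \theta_l}\, \psi^{R}_{ik} + \tilde{S}(t_i^{I_R})\, \frac{\partial \psi^{R}_{ik}}{\partial \theta_l} \right) - \left( \frac{\partial \tilde{S}(t_i^{I_L})}{\partial \theta_l}\, \psi^{L}_{ik} + \tilde{S}(t_i^{I_L})\, \frac{\partial \psi^{L}_{ik}}{\partial \theta_l} \right) \\
&= \left( -\tilde{S}(t_i^{I_R})\, \psi^{R}_{il}\, \psi^{R}_{ik} + \tilde{S}(t_i^{I_R})\, \psi^{R}_{ikl} \right) - \left( -\tilde{S}(t_i^{I_L})\, \psi^{L}_{il}\, \psi^{L}_{ik} + \tilde{S}(t_i^{I_L})\, \psi^{L}_{ikl} \right) \\
&= \tilde{S}(t_i^{I_R}) \left( \psi^{R}_{ikl} - \psi^{R}_{il}\, \psi^{R}_{ik} \right) - \tilde{S}(t_i^{I_L}) \left( \psi^{L}_{ikl} - \psi^{L}_{il}\, \psi^{L}_{ik} \right).
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial \eta_i(\theta)}{\partial \theta_l}
&= \frac{\partial}{\partial \theta_l} \left( \tilde{S}(t_i^{I_L}) - \tilde{S}(t_i^{I_R}) \right) \\
&= \frac{\partial \tilde{S}(t_i^{I_L})}{\partial \theta_l} - \frac{\partial \tilde{S}(t_i^{I_R})}{\partial \theta_l} \\
&= -\tilde{S}(t_i^{I_L})\, \psi^{L}_{il} - \left( -\tilde{S}(t_i^{I_R})\, \psi^{R}_{il} \right) \\
&= \tilde{S}(t_i^{I_R})\, \psi^{R}_{il} - \tilde{S}(t_i^{I_L})\, \psi^{L}_{il}.
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial^2}{\partial \theta_k \partial \theta_l} \ell(\theta; D) = \sum_{i=1}^{n} \Bigg\{
& \left( \tilde{S}(t_i^{I_L}) - \tilde{S}(t_i^{I_R}) \right)^{-1} \left\{ \tilde{S}(t_i^{I_R}) \left( \psi^{R}_{ikl} - \psi^{R}_{il}\, \psi^{R}_{ik} \right) - \tilde{S}(t_i^{I_L}) \left( \psi^{L}_{ikl} - \psi^{L}_{il}\, \psi^{L}_{ik} \right) \right\} \\
& - \left( \tilde{S}(t_i^{I_R})\, \psi^{R}_{ik} - \tilde{S}(t_i^{I_L})\, \psi^{L}_{ik} \right) \left( \tilde{S}(t_i^{I_R})\, \psi^{R}_{il} - \tilde{S}(t_i^{I_L})\, \psi^{L}_{il} \right) \left( \tilde{S}(t_i^{I_L}) - \tilde{S}(t_i^{I_R}) \right)^{-2} \Bigg\}.
\end{aligned}
$$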
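The Hessian expression above can be verified in the same spirit, by differencing the analytic gradient numerically. As before, this is a hedged sketch and not the EpiLPS implementation: Gaussian bumps replace the B-spline basis, the toy data and θ are invented, j(t) is assumed to count the bins up to and including the one containing t, and the ψ quantities are taken to include the bin width Δ. The index `m` in the code plays the role of l in the formulas.

```python
import math

K, T_UPPER, N_BINS = 3, 10.0, 100
DELTA = T_UPPER / N_BINS
GRID = [(j + 0.5) * DELTA for j in range(N_BINS)]      # bin midpoints s_j
CENTERS = [0.0, 5.0, 10.0]
# Stand-in basis: Gaussian bumps, NOT the B-splines of the paper
B = [[math.exp(-0.5 * ((s - c) / 2.5) ** 2) for c in CENTERS] for s in GRID]


def pieces(theta, t):
    """Return S~(t), the vector psi_k and the matrix psi_km at time t."""
    n_j = min(int(t / DELTA) + 1, N_BINS)              # assumed definition of j(t)
    cum, psi = 0.0, [0.0] * K
    psi2 = [[0.0] * K for _ in range(K)]
    for j in range(n_j):
        h = math.exp(sum(th * b for th, b in zip(theta, B[j])))
        cum += h * DELTA
        for k in range(K):
            psi[k] += h * B[j][k] * DELTA
            for m in range(K):
                psi2[k][m] += h * B[j][k] * B[j][m] * DELTA
    return math.exp(-cum), psi, psi2


def gradient(theta, data):
    """Analytic gradient: sum_i gamma_i / eta_i."""
    grad = [0.0] * K
    for tl, tr in data:
        s_l, p_l, _ = pieces(theta, tl)
        s_r, p_r, _ = pieces(theta, tr)
        for k in range(K):
            grad[k] += (s_r * p_r[k] - s_l * p_l[k]) / (s_l - s_r)
    return grad


def hessian(theta, data):
    """Analytic Hessian assembled from gamma_i, eta_i and their derivatives."""
    hess = [[0.0] * K for _ in range(K)]
    for tl, tr in data:
        s_l, p_l, q_l = pieces(theta, tl)
        s_r, p_r, q_r = pieces(theta, tr)
        eta = s_l - s_r
        for k in range(K):
            gamma = s_r * p_r[k] - s_l * p_l[k]
            for m in range(K):
                d_gamma = (s_r * (q_r[k][m] - p_r[m] * p_r[k])
                           - s_l * (q_l[k][m] - p_l[m] * p_l[k]))
                d_eta = s_r * p_r[m] - s_l * p_l[m]
                hess[k][m] += d_gamma / eta - gamma * d_eta / eta ** 2
    return hess


def numeric_hessian(theta, data, eps=1e-5):
    """Central finite differences of the analytic gradient."""
    num = [[0.0] * K for _ in range(K)]
    for m in range(K):
        up, down = list(theta), list(theta)
        up[m] += eps
        down[m] -= eps
        g_up, g_down = gradient(up, data), gradient(down, data)
        for k in range(K):
            num[k][m] = (g_up[k] - g_down[k]) / (2 * eps)
    return num


data = [(1.2, 3.4), (2.0, 5.5), (4.1, 6.3)]            # toy (t_IL, t_IR) pairs
theta = [-2.0, -1.0, 0.0]
H = hessian(theta, data)
H_num = numeric_hessian(theta, data)
max_gap = max(abs(H[k][m] - H_num[k][m]) for k in range(K) for m in range(K))
print("largest analytic vs numeric gap:", max_gap)
```

Since the derivative of η with respect to the lth component coincides with γ evaluated at index l, the second term of the Hessian is a product of two gradient-type factors, which makes the symmetry of the matrix in (k, l) explicit.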
Appendix S2
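Lognormal distribution
Notation: $X \sim \text{LogNorm}(\alpha, \beta^2)$
Parameters: $\alpha \in \mathbb{R}$ location; $\beta > 0$ scale
Density function: $p(x) = \frac{1}{x\sqrt{2\pi\beta^2}} \exp\left(-\frac{1}{2}\left(\frac{\log(x) - \alpha}{\beta}\right)^2\right)$
Support: $x > 0$
1st moment (Mean): $E(X) = \exp\left(\alpha + \frac{\beta^2}{2}\right)$
2nd central moment (Variance): $V(X) = \exp\left(2\alpha + \beta^2\right)\left(\exp(\beta^2) - 1\right)$
Moment matching: Root finding algorithm
Gamma distribution
Notation: $X \sim G(\alpha, \beta)$
Parameters: $\alpha > 0$ shape; $\beta > 0$ rate
Density function: $p(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} \exp(-\beta x)$
Support: $x > 0$
1st moment (Mean): $E(X) = \frac{\alpha}{\beta}$
2nd central moment (Variance): $V(X) = \frac{\alpha}{\beta^2}$
Moment matching: Analytically available
Weibull distribution
Notation: $X \sim \text{Weibull}(\alpha, \beta)$
Parameters: $\alpha > 0$ shape; $\beta > 0$ scale
Density function: $p(x) = \frac{\alpha}{\beta^{\alpha}}\, x^{\alpha - 1} \exp\left(-\left(\frac{x}{\beta}\right)^{\alpha}\right)$
Support: $x > 0$
1st moment (Mean): $E(X) = \beta\, \Gamma\left(1 + \frac{1}{\alpha}\right)$
2nd central moment (Variance): $V(X) = \beta^2 \left( \Gamma\left(1 + \frac{2}{\alpha}\right) - \Gamma\left(1 + \frac{1}{\alpha}\right)^2 \right)$
Moment matching: Root finding algorithm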
Table 6: Description of the parametric distributions used in the moment matching approach.
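The moment formulas in Table 6 can be inverted to map a target mean and variance to the parameters of each candidate, as in the following minimal sketch. It is not the routine used in EpiLPS: the function names are hypothetical, the Weibull shape is found with a plain bisection on the squared coefficient of variation (standing in for whatever root finding algorithm the package uses), and the lognormal parameters are obtained here by directly inverting the mean and variance formulas rather than by root finding. The bracket [0.1, 50] for the Weibull shape is an assumption.

```python
import math


def match_gamma(mean, var):
    """Shape and rate from E(X) = a/b and V(X) = a/b^2."""
    return mean ** 2 / var, mean / var


def match_lognormal(mean, var):
    """Location and scale from the lognormal mean and variance formulas."""
    beta2 = math.log(1.0 + var / mean ** 2)
    return math.log(mean) - beta2 / 2.0, math.sqrt(beta2)


def weibull_cv2(shape):
    """Squared coefficient of variation of a Weibull (depends on shape only)."""
    log_ratio = math.lgamma(1 + 2 / shape) - 2 * math.lgamma(1 + 1 / shape)
    return math.exp(log_ratio) - 1.0


def match_weibull(mean, var, lo=0.1, hi=50.0, tol=1e-12):
    """Shape by bisection on the squared CV, then scale from the mean."""
    target = var / mean ** 2
    # weibull_cv2 decreases as the shape grows, so bisection brackets the root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weibull_cv2(mid) > target:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    scale = mean / math.gamma(1 + 1 / shape)
    return shape, scale


# Illustrative targets: a mean of 5 days and a standard deviation of 2 days
target_mean, target_var = 5.0, 2.0 ** 2
print("Gamma (shape, rate):        ", match_gamma(target_mean, target_var))
print("Lognormal (location, scale):", match_lognormal(target_mean, target_var))
print("Weibull (shape, scale):     ", match_weibull(target_mean, target_var))
```

All three candidates then share the same first two moments, so any difference between them lies in the shape of the density, which is what an information criterion is left to arbitrate.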
References
Backer, J. A., Klinkenberg, D., and Wallinga, J. (2020). Incubation period of 2019 novel
coronavirus (2019-nCoV) infections among travellers from Wuhan, China, 20–28 January 2020.
Eurosurveillance, 25(5):2000062.
Basnarkov, L., Tomovski, I., and Avram, F. (2022). Estimation of the basic reproduction number
of COVID-19 from the incubation period distribution. The European Physical Journal Special
Topics, 231(18):3741–3748.
Calle, M. L. and Gómez, G. (2001). Nonparametric Bayesian estimation from interval-censored
data using Monte Carlo methods. Journal of Statistical Planning and Inference, 98(1-2):73–87.
Cauchemez, S., Fraser, C., Van Kerkhove, M. D., Donnelly, C. A., Riley, S., Rambaut, A., Enouf,
V., van der Werf, S., and Ferguson, N. M. (2014). Middle East respiratory syndrome coronavirus:
quantification of the extent of the epidemic, surveillance biases, and transmissibility. The Lancet
Infectious Diseases, 14(1):50–56.
Chen, D., Lau, Y.-C., Xu, X.-K., Wang, L., Du, Z., Tsang, T. K., Wu, P., Lau, E. H., Wallinga,
J., Cowling, B. J., et al. (2022). Inferring time-varying generation time, serial interval, and
incubation period distributions for COVID-19. Nature Communications, 13(1):7727.
Cheng, C., Zhang, D., Dang, D., Geng, J., Zhu, P., Yuan, M., Liang, R., Yang, H., Jin, Y., Xie,
J., et al. (2021). The incubation period of COVID-19: a global meta-analysis of 53 studies and
a Chinese observation study of 11 545 patients. Infectious Diseases of Poverty, 10(05):1–13.
Donnelly, C. A., Ghani, A. C., Leung, G. M., Hedley, A. J., Fraser, C., Riley, S., Abu-Raddad,
L. J., Ho, L.-M., Thach, T.-Q., Chau, P., et al. (2003). Epidemiological determinants of spread of
causal agent of severe acute respiratory syndrome in Hong Kong. The Lancet, 361(9371):1761–1766.
Eilers, P. H. C. and Borgdorff, M. (2007). Non-parametric log-concave mixtures. Computational
Statistics & Data Analysis, 51(11):5444–5451.
Eilers, P. H. C. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical
Science, 11(2):89–121.
Eilers, P. H. C. and Marx, B. D. (2010). Splines, knots, and penalties. Wiley Interdisciplinary
Reviews: Computational Statistics, 2(6):637–653.
Eilers, P. H. C. and Marx, B. D. (2021). Practical smoothing: The joys of P-splines. Cambridge
University Press.
Ferretti, L., Wymant, C., Kendall, M., Zhao, L., Nurtay, A., Abeler-Dörner, L., Parker, M.,
Bonsall, D., and Fraser, C. (2020). Quantifying SARS-CoV-2 transmission suggests epidemic
control with digital contact tracing. Science, 368(6491):eabb6936.
Gómez, G., Calle, M. L., and Oller, R. (2004). Frequentist and Bayesian approaches for interval-
censored data. Statistical Papers, 45:139–173.
Gómez, G., Calle, M. L., Oller, R., and Langohr, K. (2009). Tutorial on methods for interval-
censored data and their implementation in R. Statistical Modelling, 9(4):259–297.
Gressani, O. (2021). EpiLPS: a fast and flexible Bayesian tool for estimation of the time-varying
reproduction number. [Computer Software].
Gressani, O., Faes, C., and Hens, N. (2022a). Laplacian-P-splines for Bayesian inference in the
mixture cure model. Statistics in Medicine, 41(14):2602–2626.
Gressani, O. and Lambert, P. (2018). Fast Bayesian inference using Laplace approximations in
a flexible promotion time cure model based on P-splines. Computational Statistics & Data
Analysis, 124:151–167.
Gressani, O., Wallinga, J., Althaus, C. L., Hens, N., and Faes, C. (2022b). EpiLPS: A fast
and flexible Bayesian tool for estimation of the time-varying reproduction number. PLOS
Computational Biology, 18(10):e1010618.
Groeneboom, P. (2021). Estimation of the incubation time distribution for COVID-19. Statistica
Neerlandica, 75(2):161–179.
Hart, W. S., Maini, P. K., and Thompson, R. N. (2021). High infectiousness immediately
before COVID-19 symptom onset highlights the importance of continued contact tracing. Elife,
10:e65534.
Kreiss, A. and Van Keilegom, I. (2022). Semi-parametric estimation of incubation and generation
times by means of Laguerre polynomials. Journal of Nonparametric Statistics, 34(3):570–606.
Lambert, P. and Eilers, P. H. C. (2005). Bayesian proportional hazards model with time-
varying regression coefficients: a penalized Poisson regression approach. Statistics in Medicine,
24(24):3977–3989.
Lambert, P. and Eilers, P. H. C. (2009). Bayesian density estimation from grouped continuous
data. Computational Statistics & Data Analysis, 53(4):1388–1399.
Lang, S. and Brezger, A. (2004). Bayesian P-splines. Journal of Computational and Graphical
Statistics, 13(1):183–212.
Lessler, J., Reich, N. G., Brookmeyer, R., Perl, T. M., Nelson, K. E., and Cummings, D. A.
(2009). Incubation periods of acute respiratory viral infections: a systematic review. The
Lancet Infectious Diseases, 9(5):291–300.
Miura, F., van Ewijk, C. E., Backer, J. A., Xiridou, M., Franz, E., de Coul, E. O., Brandwagt, D.,
van Cleef, B., van Rijckevorsel, G., Swaan, C., et al. (2022). Estimated incubation period for
monkeypox cases confirmed in the Netherlands, May 2022. Eurosurveillance, 27(24):2200448.
Nishiura, H. (2007). Early efforts in modeling the incubation period of infectious diseases with an
acute course of illness. Emerging Themes in Epidemiology, 4:1–12.
Peto, R. (1973). Experimental survival curves for interval-censored data. Journal of the Royal
Statistical Society: Series C, 22(1):86–91.
Qin, J., You, C., Lin, Q., Hu, T., Yu, S., and Zhou, X.-H. (2020). Estimation of incubation period
distribution of COVID-19 using disease onset forward time: a novel cross-sectional and forward
follow-up study. Science Advances, 6(33):eabc1202.
Reich, N. G., Lessler, J., Cummings, D. A., and Brookmeyer, R. (2009). Estimating incubation
period distributions with coarse data. Statistics in Medicine, 28(22):2769–2784.
Roberts, G. O. and Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin
diffusions. Journal of the Royal Statistical Society: Series B, 60(1):255–268.
Rosenberg, P. S. (1995). Hazard function estimation using B-splines. Biometrics, 51(3):874–887.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian
models by using integrated nested Laplace approximations. Journal of the Royal Statistical
Society Series B, 71(2):319–392.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464.
Sinha, D. and Dey, D. K. (1997). Semiparametric Bayesian analysis of survival data. Journal of
the American Statistical Association, 92(439):1195–1212.
Tierney, L. and Kadane, J. B. (1986). Accurate approximations for posterior moments and
marginal densities. Journal of the American Statistical Association, 81(393):82–86.
Turnbull, B. W. (1976). The empirical distribution function with arbitrarily grouped, censored
and truncated data. Journal of the Royal Statistical Society: Series B, 38(3):290–295.
Virlogeux, V., Fang, V. J., Park, M., Wu, J. T., and Cowling, B. J. (2016). Comparison of
incubation period distribution of human infections with MERS-CoV in South Korea and Saudi
Arabia. Scientific Reports, 6(1):35839.
Virlogeux, V., Li, M., Tsang, T. K., Feng, L., Fang, V. J., Jiang, H., Wu, P., Zheng, J., Lau,
E. H., Cao, Y., et al. (2015). Estimating the distribution of the incubation periods of human
avian influenza A (H7N9) virus infections. American Journal of Epidemiology, 182(8):723–729.
Xia, W., Liao, J., Li, C., Li, Y., Qian, X., Sun, X., Xu, H., Mahai, G., Zhao, X., Shi, L., et al.
(2020). Transmission of corona virus disease 2019 during the incubation period may lead to a
quarantine loophole. MedRxiv.
Yang, L., Dai, J., Zhao, J., Wang, Y., Deng, P., and Wang, J. (2020). Estimation of incubation
period and serial interval of COVID-19: analysis of 178 cases and 131 transmission chains in
Hubei province, China. Epidemiology & Infection, 148.
22
. CC-BY-NC-ND 4.0 International licenseIt is made available under a
is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted August 9, 2023. ; https://doi.org/10.1101/2023.08.07.23293752doi: medRxiv preprint
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
The generation time distribution, reflecting the time between successive infections in transmission chains, is a key epidemiological parameter for describing COVID-19 transmission dynamics. However, because exact infection times are rarely known, it is often approximated by the serial interval distribution. This approximation holds under the assumption that infectors and infectees share the same incubation period distribution, which may not always be true. We estimated incubation period and serial interval distributions using 629 transmission pairs reconstructed by investigating 2989 confirmed cases in China in January-February 2020, and developed an inferential framework to estimate the generation time distribution that accounts for variation over time due to changes in epidemiology, sampling biases and public health and social measures. We identified substantial reductions over time in the serial interval and generation time distributions. Our proposed method provides more reliable estimation of the temporal variation in the generation time distribution, improving assessment of transmission dynamics. The generation time (interval between successive infections in a transmission chain) is an important parameter for epidemiological modeling. Here, the authors develop a framework for estimating this parameter and how it changes over time and apply it to data from China in the first months of the pandemic.
Article
Full-text available
In infectious disease epidemiology, the instantaneous reproduction number R t is a time-varying parameter defined as the average number of secondary infections generated by an infected individual at time t. It is therefore a crucial epidemiological statistic that assists public health decision makers in the management of an epidemic. We present a new Bayesian tool (EpiLPS) for robust estimation of the time-varying reproduction number. The proposed methodology smooths the epidemic curve and allows to obtain (approximate) point estimates and credible intervals of R t by employing the renewal equation, using Bayesian P-splines coupled with Laplace approximations of the conditional posterior of the spline vector. Two alternative approaches for inference are presented: (1) an approach based on a maximum a posteriori argument for the model hyperparameters, delivering estimates of R t in only a few seconds; and (2) an approach based on a Markov chain Monte Carlo (MCMC) scheme with underlying Langevin dynamics for efficient sampling of the posterior target distribution. Case counts per unit of time are assumed to follow a negative binomial distribution to account for potential overdispersion in the data that would not be captured by a classic Poisson model. Furthermore, after smoothing the epidemic curve, a “plug-in’’ estimate of the reproduction number can be obtained from the renewal equation yielding a closed form expression of R t as a function of the spline parameters. The approach is extremely fast and free of arbitrary smoothing assumptions. EpiLPS is applied on data of SARS-CoV-1 in Hong-Kong (2003), influenza A H1N1 (2009) in the USA and on the SARS-CoV-2 pandemic (2020-2021) for Belgium, Portugal, Denmark and France.
Article
In May 2022, monkeypox outbreaks were reported in countries not endemic for monkeypox. We estimated the monkeypox incubation period, using reported exposure and symptom-onset times for 18 cases detected and confirmed in the Netherlands up to 31 May 2022. Mean incubation period was 8.5 days (5th–95th percentiles: 4.2–17.3), underpinning the current recommendation to monitor or isolate/quarantine case contacts for 21 days. However, as the incubation period may differ between different transmission routes, further epidemiological investigations are needed.
Article
The mixture cure model for analyzing survival data is characterized by the assumption that the population under study is divided into a group of subjects who will experience the event of interest over some finite time horizon and another group of cured subjects who will never experience the event irrespective of the duration of follow‐up. When using the Bayesian paradigm for inference in survival models with a cure fraction, it is common practice to rely on Markov chain Monte Carlo (MCMC) methods to sample from posterior distributions. Although computationally feasible, the iterative nature of MCMC often implies long sampling times to explore the target space with chains that may suffer from slow convergence and poor mixing. Furthermore, extra efforts have to be invested in diagnostic checks to monitor the reliability of the generated posterior samples. A sampling‐free strategy for fast and flexible Bayesian inference in the mixture cure model is suggested in this article by combining Laplace approximations and penalized B‐splines. A logistic regression model is assumed for the cure proportion and a Cox proportional hazards model with a P‐spline approximated baseline hazard is used to specify the conditional survival function of susceptible subjects. Laplace approximations to the posterior conditional latent vector are based on analytical formulas for the gradient and Hessian of the log‐likelihood, resulting in a substantial speed‐up in approximating posterior distributions. The spline specification yields smooth estimates of survival curves, and functions of latent variables, together with their associated credible intervals, are estimated in seconds. A fully stochastic algorithm based on a Metropolis‐Langevin‐within‐Gibbs sampler is also suggested as an alternative to the proposed Laplacian‐P‐splines mixture cure (LPSMC) methodology. The statistical performance and computational efficiency of LPSMC are assessed in a simulation study. Results show that LPSMC is an appealing alternative to MCMC for approximate Bayesian inference in standard mixture cure models. Finally, the novel LPSMC approach is illustrated on three applications involving real survival data.
Article
In epidemics, many quantities of interest, such as the reproduction number, depend on the incubation period (time from infection to symptom onset) and/or the generation time (time until a new person is infected from another infected person). Therefore, estimation of the distribution of these two quantities is of distinct interest. However, this is a challenging problem since it is normally not possible to obtain precise observations of these two variables. Instead, at the beginning of a pandemic, it is possible to observe for transmission pairs the time of symptom onset for both people as well as a window for infection of the first person (e.g. because of travel to a risk area). In this paper we suggest a simple semi-parametric sieve-estimation method based on Laguerre polynomials for estimation of these distributions. We provide detailed theory for consistency and illustrate the finite sample performance for small datasets via a simulation study.
Article
Background: The incubation period is a crucial epidemiological index for understanding the spread of the emerging coronavirus disease 2019 (COVID-19). In this study, we aimed to describe the incubation period of COVID-19 globally and in the mainland of China. Methods: Studies published from December 1, 2019 to May 26, 2021 were searched in the CNKI, Wanfang, PubMed, and Embase databases. A random-effects model was used to pool the mean incubation period. Meta-regression was used to explore the sources of heterogeneity. Meanwhile, we collected data on 11 545 patients in the mainland of China outside Hubei from January 19, 2020 to September 21, 2020. The incubation period was fitted with a log-normal model using the coarseDataTools package. Results: A total of 3235 articles were retrieved, 53 of which were included in the meta-analysis. The pooled mean incubation period of COVID-19 was 6.0 days (95% confidence interval [CI] 5.6–6.5) globally, 6.5 days (95% CI 6.1–6.9) in the mainland of China, and 4.6 days (95% CI 4.1–5.1) outside the mainland of China (P = 0.006). The incubation period varied with age (P = 0.005). Meanwhile, in the 11 545 patients, the mean incubation period was 7.1 days (95% CI 7.0–7.2), which was similar to the finding in our meta-analysis. Conclusions: For COVID-19, the mean incubation period was 6.0 days globally but near 7.0 days in the mainland of China, which will help identify the time of infection and make disease control decisions. Furthermore, attention should also be paid to the region- or age-specific incubation period.
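As an illustration of how a random-effects model pools study-level means, the sketch below uses the DerSimonian-Laird estimator of the between-study variance. The abstract does not say which estimator was used, so this choice is an assumption, as are the function name `pool_random_effects` and the three made-up study values, which are not data from the meta-analysis.

```python
import numpy as np

def pool_random_effects(means, ses):
    """Random-effects pooled mean (DerSimonian-Laird), illustrative sketch.

    means : study-level mean incubation periods
    ses   : their standard errors
    Returns (pooled mean, 95% Wald interval, between-study variance tau^2).
    """
    y = np.asarray(means, dtype=float)
    v = np.asarray(ses, dtype=float) ** 2
    w = 1.0 / v                                   # fixed-effect weights
    fixed = np.sum(w * y) / np.sum(w)             # fixed-effect pooled mean
    q = np.sum(w * (y - fixed) ** 2)              # Cochran's Q statistic
    df = len(y) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                 # method-of-moments tau^2
    w_star = 1.0 / (v + tau2)                     # random-effects weights
    pooled = np.sum(w_star * y) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), tau2

# Hypothetical inputs for three studies (days), for illustration only.
pooled, ci, tau2 = pool_random_effects([5.2, 6.4, 7.0], [0.4, 0.3, 0.2])
print(round(pooled, 2), tuple(round(b, 2) for b in ci), round(tau2, 3))
```

When the studies disagree by more than their standard errors allow, tau^2 is positive and the weights become more equal across studies, which widens the interval relative to a fixed-effect analysis.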
Article
Background: Understanding changes in infectiousness during SARS-CoV-2 infections is critical to assess the effectiveness of public health measures such as contact tracing. Methods: Here, we develop a novel mechanistic approach to infer the infectiousness profile of SARS-CoV-2 infected individuals using data from known infector-infectee pairs. We compare estimates of key epidemiological quantities generated using our mechanistic method with analogous estimates generated using previous approaches. Results: The mechanistic method provides an improved fit to data from SARS-CoV-2 infector-infectee pairs compared to commonly used approaches. Our best-fitting model indicates a high proportion of presymptomatic transmissions, with many transmissions occurring shortly before the infector develops symptoms. Conclusions: High infectiousness immediately prior to symptom onset highlights the importance of continued contact tracing until effective vaccines have been distributed widely, even if contacts from a short time window before symptom onset alone are traced. Funding: Engineering and Physical Sciences Research Council (EPSRC).
Article
We consider smooth nonparametric estimation of the incubation time distribution of COVID‐19, in connection with the investigation by researchers from the National Institute for Public Health and the Environment (Dutch: RIVM) of 88 travelers from Wuhan: Backer et al. (2020). The advantages of the smooth nonparametric approach over the parametric approach of Backer et al. (2020), which uses three parametric distributions (Weibull, log‐normal and gamma), are discussed. It is shown that the typical rate of convergence of the smooth estimate of the density is n^{2/7} in a continuous version of the model, where n is the sample size. The (non‐smoothed) nonparametric maximum likelihood estimator (MLE) itself is computed by the iterative convex minorant algorithm (Groeneboom and Jongbloed (2014)). All computations are available as R scripts in Groeneboom (2020a).
Book
This is a practical guide to P-splines, a simple, flexible and powerful tool for smoothing. P-splines combine regression on B-splines with simple, discrete, roughness penalties. They were introduced by the authors in 1996 and have been used in many diverse applications. The regression basis makes it straightforward to handle non-normal data, like in generalized linear models. The authors demonstrate optimal smoothing, using mixed model technology and Bayesian estimation, in addition to classical tools like cross-validation and AIC, covering theory and applications with code in R. Going far beyond simple smoothing, they also show how to use P-splines for regression on signals, varying-coefficient models, quantile and expectile smoothing, and composite links for grouped data. Penalties are the crucial elements of P-splines; with proper modifications they can handle periodic and circular data as well as shape constraints. Combining penalties with tensor products of B-splines extends these attractive properties to multiple dimensions. An appendix offers a systematic comparison to other smoothers.
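The core idea described here, regression on B-splines with a discrete roughness penalty, can be sketched in a few lines. The book's own code is in R; the Python version below is only an illustrative translation under stated assumptions: equally spaced knots, a basis built from differences of truncated power functions, a fixed penalty weight `lam` (no optimal smoothing), and hypothetical function names `bspline_basis` and `pspline_fit`.

```python
import numpy as np
from math import factorial

def bspline_basis(x, xl, xr, nseg=20, deg=3):
    """Equally spaced B-spline basis on [xl, xr] via truncated power differences."""
    x = np.asarray(x, dtype=float)
    dx = (xr - xl) / nseg
    knots = xl + dx * np.arange(-deg, nseg + deg + 1)
    # Truncated power functions (x - t)_+^deg, one column per knot.
    P = np.maximum(x[:, None] - knots[None, :], 0.0) ** deg
    # Differences of order deg + 1 turn truncated powers into B-splines.
    D = np.diff(np.eye(len(knots)), n=deg + 1, axis=0) / (factorial(deg) * dx ** deg)
    return (-1) ** (deg + 1) * P @ D.T

def pspline_fit(x, y, nseg=20, deg=3, lam=1.0, order=2):
    """Penalized least squares: minimise |y - B a|^2 + lam * |D_order a|^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    B = bspline_basis(x, x.min(), x.max(), nseg, deg)
    D = np.diff(np.eye(B.shape[1]), n=order, axis=0)  # difference penalty matrix
    coef = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)
    return B @ coef, coef

# Noisy sine curve smoothed with a second-order difference penalty.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
fitted, coef = pspline_fit(x, y, lam=1.0)
```

A second-order penalty leaves straight lines unpenalized, so exactly linear data are reproduced for any value of `lam`, and increasing `lam` pulls the fit toward a linear trend.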