
A Large Scale Spatio-temporal Binomial Regression Model for Estimating Seroprevalence Trends


Abstract

This paper develops a large-scale Bayesian spatio-temporal binomial regression model for the purpose of investigating regional trends in antibody prevalence to Borrelia burgdorferi, the causative agent of Lyme disease. The proposed model uses Gaussian predictive processes to estimate the spatially varying trends and a conditional autoregressive model to account for spatio-temporal dependence. Careful consideration is made to develop a novel framework that is scalable to large spatio-temporal data. The proposed model is used to analyze approximately 16 million Borrelia burgdorferi test results collected on dogs located throughout the conterminous United States over a sixty month period. This analysis identifies several regions of increasing canine risk. Specifically, this analysis reveals evidence that Lyme disease is getting worse in some endemic regions and that it could potentially be spreading to other non-endemic areas. Further, given the zoonotic nature of this vector-borne disease, this analysis could potentially reveal areas of increasing human risk.
Stella Watson Self1, Christopher McMahan1, D. Andrew Brown1,
Robert Lund1, Jenna Gettings2, and Michael Yabsley2,3
1Department of Mathematical Sciences
Clemson University
Clemson, SC 29634-0975
2Southeastern Cooperative Wildlife Disease Study
Department of Population Health
The University of Georgia
Athens, GA 30602
3Warnell School of Forestry and Natural Resources
The University of Georgia
Athens, GA 30602
mcmaha2@clemson.edu
Keywords: Borrelia burgdorferi, CAR model, chromatic sampling, Gaussian predictive processes, Lyme disease
1. Introduction
Lyme disease is a vector-borne disease that affects both humans and several other mammalian
species, with domestic dogs being particularly sensitive to infection (Little et al.,
2010). Disease occurs as a result of infection by Borrelia burgdorferi, a spirochetal bacterium
that is transmitted by ticks. Incidence of disease in humans is considered to be emerging,
with a growing number of high incidence counties (Adams, 2017). Humans and dogs are infected
by the same vectors (Little et al., 2010), and so, unsurprisingly, the risks of exposure
for both are closely related. In fact, dogs are often considered to be sentinels for the regional
risk of Lyme disease in humans (Mead et al., 2011).
Dogs are tested regularly for exposure to B. burgdorferi as part of their annual wellness
examinations. Commonly, veterinarians use a serologic test that detects antibodies against
the C6 peptide that is present in the blood of infected animals. The presence of C6 is
indicative of an intermediate or late-term infection, and is often detectable 3 to 6 weeks
after exposure (Wagner et al., 2012). Among dogs that are infected, only approximately
5% develop any clinical signs of Lyme disease (Levy and Magnarelli, 1992). This practice of
routine testing provides a unique opportunity to measure the seroprevalence of B. burgdorferi
within a relatively healthy canine population visiting veterinary clinics.
Monitoring seroprevalence is useful for many reasons, despite the low incidence of disease.
Directly, it provides an estimate for the risk of exposure within a region, allowing veterinarians
to make accurate preventative care and testing recommendations. Indirectly, the
seroprevalence of B. burgdorferi can identify the approximate range of the Ixodes spp. tick
vectors. This is especially valuable because Ixodes spp. are capable of transmitting several
other pathogens, including Anaplasma, Ehrlichia muris eauclairensis, and Babesia microti
(Nelder et al., 2016), several of which are zoonotic. The shared vector allows dogs to serve as
sentinels for human risk. Therefore, modeling trends in canine seroprevalence should inform
changing risk of exposure to B. burgdorferi in humans.
The goal of this paper is to identify US regions that are experiencing an increase in canine
seroprevalence of B. burgdorferi, and by proxy identify regions where the risk of human
exposure could also be increasing. The data analyzed here contain 16,571,562 serologic test
results for B. burgdorferi conducted on domestic dogs in the conterminous United States (US)
from January 2012 - December 2016, aggregated by county and month. Figure 1 displays
the raw prevalence estimates after aggregating over all sixty months in the study; i.e., the
proportion of positive tests. There are 3,109 distinct US counties and county-equivalent
regions in the conterminous US, not all of which report test data every month. Data were
reported from 69,876 county-month pairs. As our goal is to determine where seroprevalence
is increasing, our model must have a temporal trend component that is spatially varying. To
facilitate more reliable inference, the strong positive spatio-temporal dependence of the tests
is also taken into account. The size of this data set and its large spatio-temporal support
motivate some of our methodological choices.
Gaussian processes (GPs) are popular geostatistical modeling tools due to their flexibility
and ability to quantify uncertainty in nonparametric regressions (O’Hagan, 1978; Neal, 1998).
Overviews of GP modeling can be found in Cressie (1993), Rasmussen and Williams (2006),
Cressie and Wikle (2011), and Gelfand and Schliep (2016). Banerjee et al. (2015) discuss
Bayesian aspects of GPs. Objective prior specification for GP models is studied in Berger
et al. (2001). GPs have become standard tools in a wide variety of applications, including
oceanography (Jona-Lasinio et al., 2012), water quality analysis (Zhang and El-Shaarawi,
2009), image classification (Morales-Álvarez et al., 2017), neuroimaging (Lazar, 2008), and
computer experiments (Santner et al., 2003). GPs are also used to model disease prevalence,
including dengue fever (Johnson et al., 2017), malaria (Andrade-Pacheco et al., 2015), and
influenza (Senanayake et al., 2016). Gelfand et al. (2003) used GPs to allow linear model
coefficients to vary smoothly over space, an approach used here to localize regional trends
in B. burgdorferi seroprevalence.
Gaussian process modifications and algorithms for analyzing big spatial data sets have
received significant attention in the recent literature, including fixed rank kriging (Cressie
and Johannesson, 2008) and LatticeKrig (Nychka et al., 2015) approaches. Both methods
employ basis function expansions of spatial random effects to reduce the dimension of the
associated covariance matrices. Katzfuss (2017) took a similar approach by applying basis
functions to a succession of refined resolutions. Spatial partitioning (e.g., Sang et al., 2011;
Heaton et al., 2017a) can be used to split regions into smaller, more manageable sub-regions
with computation being accelerated via a conditional independence assumption. Covariance
tapering (Furrer et al., 2006) uses a covariance function with compact support to induce
sparsity. Nearest neighbor processes (Datta et al., 2016) achieve computational efficiency by
conditioning on a subset of nearby observations. A similar idea was used by Gramacy and
Apley (2015) to find the largest number of neighbors computationally feasible for predic-
tion, optimally chosen by minimizing prediction variance. Heaton et al. (2017b) provide an
overview and comparison of these procedures and others. The approach used here involves
Gaussian predictive processes (GPPs) (Banerjee et al., 2008), which are discussed further in
Section 2.
The most common approach for modeling spatially dependent areal data involves Gaus-
sian Markov random fields (GMRFs; Rue and Held, 2005), with Gaussian conditional autore-
gressive (CAR) models (Banerjee et al., 2015) being particularly popular. As special cases of
Markov random fields (Besag, 1974), GMRFs are collections of jointly distributed Gaussian
random variables satisfying a Markov dependence structure quantified through a precision
matrix. GMRFs are extended to flexible degrees of smoothness in Brezger et al. (2007) and
Yue and Speckman (2010). Brown et al. (2017a) adjust the CAR precision matrix to build a
unified model for independent and dependent cases and study neighborhood structures other
than those based on physical adjacency. GMRF and GP connections are explored in Rue
and Tjelmeland (2002), Song et al. (2008), and Lindgren et al. (2011). CAR models are now
standard in disease mapping problems (e.g., Waller et al., 1997).
To achieve our goals, we develop a large scale spatio-temporal binomial regression model
that has both GPP and CAR components. The former is used to capture spatially varying
trends by treating the trend coefficient as a non-parametric surface over the spatial domain
of interest, while the latter accounts for spatio-temporal correlation. Through data augmen-
tation steps and the use of a novel sampling strategy, we establish a modeling framework that
is computationally scalable to large non-Gaussian spatio-temporal data sets. In particular,
straightforward Gibbs sampling is facilitated via a data augmentation step involving latent
Pólya-Gamma random variables. To avoid computationally expensive matrix calculations,
we use a chromatic sampling strategy in the Gibbs sampler. Our proposed methodology easily
handles missing data. The finite sample properties of our proposed approach are studied
via simulation before our B. burgdorferi seroprevalence analysis.
The remainder of this paper is organized as follows: Section 2 describes the model and
our GPP and CAR structures. Section 3 discusses model fitting procedures, emphasizing
computational tractability with large scale spatio-temporal data. Section 4 presents a simu-
lation study supporting our proposed approach, and Section 5 analyzes the canine serology
data described above. We offer concluding remarks in Section 6.
2. Modeling Methods
Let $Y_{st}$ denote the number of cases (e.g., positive B. burgdorferi tests) observed in $n_{st}$ tests
in region $s$ at time $t$, for $s = 1, \ldots, S$ and $t = 1, \ldots, T$. We let $\mathbf{Y}_s = (Y_{s1}, \ldots, Y_{sT})'$,
$\mathbf{Y} = (\mathbf{Y}_1', \ldots, \mathbf{Y}_S')' \in \mathbb{R}^{ST}$, $\mathbf{n}_s = (n_{s1}, \ldots, n_{sT})'$, and $\mathbf{n} = (\mathbf{n}_1', \ldots, \mathbf{n}_S')' \in \mathbb{N}^{ST}$. In addition
to the disease surveillance data, covariates $Z_{stq}$ and $X_{stp}$, for $q = 1, \ldots, Q$ and $p = 1, \ldots, P$,
are assumed to be available. The $Z_{stq}$ are covariates whose associated effects are constant
over the study area, while the $X_{stp}$ are covariates whose associated effects vary by region.
To relate the observed test data to the available covariates, a Bayesian generalized
linear mixed model (McCullagh and Nelder, 1989; Diggle et al., 1998; Banerjee et al.,
2015) is adopted. The general model for our data is a binomial regression:
$Y_{st} \mid n_{st}, p_{st} \sim \text{Binomial}(n_{st}, p_{st})$ with
$$\nu_{st} := g^{-1}(p_{st}) = \mathbf{Z}_{st}'\boldsymbol{\delta} + \mathbf{X}_{st}'\boldsymbol{\beta}(\boldsymbol{\ell}_s) + \xi_{st}; \quad s = 1, \ldots, S; \; t = 1, \ldots, T, \tag{1}$$
where $g: \mathbb{R} \to (0,1)$ is a known link function (e.g., logistic) relating the linear predictor $\nu_{st}$
to the prevalence $p_{st}$, $\mathbf{Z}_{st} = (1, Z_{st1}, \ldots, Z_{stQ})' \in \mathbb{R}^{Q+1}$, $\mathbf{X}_{st} = (X_{st1}, \ldots, X_{stP})' \in \mathbb{R}^{P}$,
$\boldsymbol{\delta} = (\delta_0, \ldots, \delta_Q)'$ are global regression coefficients, $\boldsymbol{\beta}(\cdot) = (\beta_1(\cdot), \ldots, \beta_P(\cdot))'$ are spatially varying
regression coefficients, $\boldsymbol{\ell}_s = (\ell_{s1}, \ell_{s2})'$ is a vector of spatial coordinates (e.g., latitude and
longitude) that identify the centroid of region $s$, and $\xi_{st}$ is a spatio-temporal random effect.
Following Gelfand et al. (2003), the spatially varying regression coefficients are regarded as
unknown surfaces over the study region. To model these unknown surfaces while maintaining
computational tractability, we use GPPs.
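A Gaussian process is a stochastic process whose finite dimensional distributions are
multivariate normal. A GP $\beta_p(\cdot) \mid \boldsymbol{\theta}_p \sim \mathcal{GP}(\mu_p(\cdot), C(\cdot,\cdot;\boldsymbol{\theta}_p))$ is uniquely determined by
its mean and covariance, $\mu_p(\boldsymbol{\ell}_s) := E[\beta_p(\boldsymbol{\ell}_s)]$ and
$C(\boldsymbol{\ell}_s, \boldsymbol{\ell}_{s'}; \boldsymbol{\theta}_p) := \text{Cov}(\beta_p(\boldsymbol{\ell}_s), \beta_p(\boldsymbol{\ell}_{s'})) = \sigma_p^2 \rho_p(\boldsymbol{\ell}_s, \boldsymbol{\ell}_{s'}; \boldsymbol{\theta}_p)$,
where $\rho_p(\cdot,\cdot;\boldsymbol{\theta}_p)$ is a correlation function depending on the parameter vector
$\boldsymbol{\theta}_p$. For smoothing and interpolation, it is often sufficient to take a constant mean
(Bayarri et al., 2007). In our case, we a priori posit that $\mu_p(\cdot) \equiv 0$ for all $p$. Thus,
$\boldsymbol{\beta}_p = (\beta_p(\boldsymbol{\ell}_1), \ldots, \beta_p(\boldsymbol{\ell}_S))'$, $S \in \mathbb{N}$, follows a multivariate normal distribution with mean $\mathbf{0}$
and covariance matrix $\mathbf{C}_p = \sigma_p^2 \mathbf{R}_p$, where $(\mathbf{R}_p)_{ss'} = \rho_p(\boldsymbol{\ell}_s, \boldsymbol{\ell}_{s'}; \boldsymbol{\theta}_p)$. In general, the covariance
matrix inversions and factorizations associated with estimating posterior GPs are $O(S^3)$ in
computational complexity and, in an MCMC context, these operations are repeated thousands
of times. Thus, as $S$ grows large, GPs quickly become computationally prohibitive.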
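To reduce the dimension of the problem, the Gaussian predictive process (GPP; Banerjee
et al., 2008) considers a “parent” process based on a strategically chosen set of knots, and then
interpolates the process to the points of interest via kriging. Let $\{\boldsymbol{\ell}_1^*, \ldots, \boldsymbol{\ell}^*_{S_p^*}\}$ denote the knot
set with $S_p^* \ll S$. Define $\boldsymbol{\beta}_p^* = (\beta_p(\boldsymbol{\ell}_1^*), \ldots, \beta_p(\boldsymbol{\ell}^*_{S_p^*}))'$ and note that
$\boldsymbol{\beta}_p^* \mid \sigma_p^2, \boldsymbol{\theta}_p \overset{ind}{\sim} N(\mathbf{0}, \mathbf{C}_p^*)$, for
all $p$, where $\mathbf{C}_p^* = \sigma_p^2 \mathbf{R}_p^*$ and $(\mathbf{R}_p^*)_{ss'} = \rho_p(\boldsymbol{\ell}_s^*, \boldsymbol{\ell}_{s'}^*; \boldsymbol{\theta}_p)$. The joint distribution of $\boldsymbol{\beta}_p$ and $\boldsymbol{\beta}_p^*$ is
again multivariate normal:
$$\begin{pmatrix} \boldsymbol{\beta}_p \\ \boldsymbol{\beta}_p^* \end{pmatrix} \,\Bigg|\, \sigma_p^2, \boldsymbol{\theta}_p \sim N\left( \mathbf{0},\; \sigma_p^2 \begin{pmatrix} \mathbf{R}_p & \tilde{\mathbf{R}}_p^* \\ \tilde{\mathbf{R}}_p^{*\prime} & \mathbf{R}_p^* \end{pmatrix} \right), \tag{2}$$
where $\tilde{\mathbf{R}}_p^*$ is an $S \times S_p^*$ matrix with the $(s, s')$th element being $\rho_p(\boldsymbol{\ell}_s, \boldsymbol{\ell}_{s'}^*; \boldsymbol{\theta}_p)$. Exploiting this
relationship, the Gaussian predictive process simply replaces $\boldsymbol{\beta}_p$ with
$\tilde{\boldsymbol{\beta}}_p := E(\boldsymbol{\beta}_p \mid \boldsymbol{\beta}_p^*; \boldsymbol{\theta}_p) = \tilde{\mathbf{R}}_p^* (\mathbf{R}_p^*)^{-1} \boldsymbol{\beta}_p^*$.
When $S_p^*$ is not large, $(\mathbf{R}_p^*)^{-1}$ can be quickly computed. For more on GPPs,
see Banerjee et al. (2008).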
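As an illustration of the interpolation step, the following minimal sketch computes a predictive-process surface from knot values. It assumes NumPy is available and, purely for illustration, borrows the correlation function $\theta^{d^2}$ and the values $\theta = 0.6$, $\sigma^2 = 1.5$ from the simulation study in Section 4. The function names `power_corr` and `predictive_process`, the 3 x 3 knot grid, and the 50 random locations are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def power_corr(a, b, theta):
    """Correlation theta ** (squared Euclidean distance) between rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return theta ** d2

def predictive_process(locs, knots, beta_star, theta):
    """Return R_tilde @ inv(R_star) @ beta_star via a linear solve (no explicit inverse)."""
    R_star = power_corr(knots, knots, theta)   # knot-by-knot correlation, S* x S*
    R_tilde = power_corr(locs, knots, theta)   # location-by-knot correlation, S x S*
    return R_tilde @ np.linalg.solve(R_star, beta_star)

rng = np.random.default_rng(0)
gx = np.linspace(0.0, 4.0, 3)
knots = np.array([(x, y) for x in gx for y in gx])   # hypothetical 3 x 3 knot grid
locs = rng.uniform(0.0, 4.0, size=(50, 2))           # hypothetical region centroids
theta, sigma2 = 0.6, 1.5                             # illustrative values only

# Draw the parent process at the knots, then interpolate to the locations.
R_star = power_corr(knots, knots, theta)
beta_star = rng.multivariate_normal(np.zeros(len(knots)), sigma2 * R_star)
beta_tilde = predictive_process(locs, knots, beta_star, theta)
```

Only the small knot-by-knot system is ever solved, which is where the computational saving over a full GP comes from. Evaluating the interpolator at the knots themselves returns the knot values.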
Fully specifying a GPP requires selecting knot locations. Banerjee et al. (2008) discuss
several methods of knot selection, including placing them on a regular grid, selecting them
at random from the observation locations, and methods which place more knots in areas
with more observations. Finley et al. (2009) suggest choosing knot locations to minimize the
conditional variance at observed data locations; Guhaniyogi et al. (2011) propose an adaptive
knot selection strategy where the knot locations are treated as a point process. Following
Eidsvik et al. (2012), our knots are chosen via K-means clustering with $S_p^*$ clusters; i.e.,
using K-means clustering we partition the $S$ counties into $S_p^*$ clusters based on the $\boldsymbol{\ell}_s$. The
knot locations are subsequently taken to be the centroids of the $S_p^*$ clusters. For further
details on K-means clustering see Hartigan and Wong (1979).
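The knot selection step can be sketched as follows. This is a plain-Python version of Lloyd's algorithm, which is a simpler K-means variant than the Hartigan and Wong (1979) procedure cited above and may yield different clusters. The function name `kmeans_knots`, the lattice of coordinates, and the choice of nine knots are hypothetical.

```python
import random

def kmeans_knots(coords, n_knots, n_iter=50, seed=0):
    """Lloyd's algorithm; returns cluster centroids to be used as knot locations."""
    rng = random.Random(seed)
    centers = rng.sample(coords, n_knots)  # initialise at randomly chosen locations
    for _ in range(n_iter):
        clusters = [[] for _ in range(n_knots)]
        for x, y in coords:
            # Assign each location to its nearest current centre.
            k = min(range(n_knots),
                    key=lambda j: (x - centers[j][0]) ** 2 + (y - centers[j][1]) ** 2)
            clusters[k].append((x, y))
        new_centers = []
        for k, pts in enumerate(clusters):
            if pts:
                new_centers.append((sum(p[0] for p in pts) / len(pts),
                                    sum(p[1] for p in pts) / len(pts)))
            else:
                new_centers.append(centers[k])  # keep the centre of an empty cluster
        if new_centers == centers:
            break  # assignments have stabilised
        centers = new_centers
    return centers

# Hypothetical centroids on a 13 x 13 lattice, reduced to 9 knots.
coords = [(float(i), float(j)) for i in range(13) for j in range(13)]
knots = kmeans_knots(coords, n_knots=9)
```

Because the centroids are averages of the assigned locations, the knots concentrate where the regions themselves are dense.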
We account for the spatio-temporal dependence that exists in the data by allowing a
spatial GMRF to evolve over time. One way of doing so is presented in Waller et al. (1997),
who allow the variance of the CAR model to depend on time. However, we use a first-order
vector autoregression with errors following a GMRF; i.e.,
$$\boldsymbol{\xi}_t = \zeta \boldsymbol{\xi}_{t-1} + \boldsymbol{\phi}_t, \tag{3}$$
where $\boldsymbol{\xi}_t = (\xi_{1t}, \ldots, \xi_{St})'$, $\zeta \in (-1, 1)$ is a temporal correlation parameter, and we assume
$\boldsymbol{\xi}_0 = \mathbf{0}$ without loss of generality. We take the $\boldsymbol{\phi}_t$ to be independent and identically distributed
as a proper intrinsically autoregressive model (Besag and Kooperberg, 1995); i.e.,
$\boldsymbol{\phi}_t \sim N(\mathbf{0}, \tau^2 (\mathbf{D} - \omega \mathbf{W})^{-1})$, where $\tau^2 > 0$ and $\omega \in (0, 1)$ is a so-called ‘propriety parameter’ that
ensures the precision matrix is non-singular (Banerjee et al., 2015). The neighborhood matrix
$\mathbf{W} \in \mathbb{R}^{S \times S}$ is such that $(\mathbf{W})_{ss'}$ is equal to 1 if and only if location $s$ is adjacent to location
$s'$, $s \neq s'$, and 0 otherwise, and $\mathbf{D} = \text{diag}\{\sum_{j=1}^{S} (\mathbf{W})_{sj},\ s = 1, \ldots, S\}$. To avoid confounding
with the intercept, we impose the standard sum-to-zero constraint (i.e., $\sum_{t=1}^{T} \sum_{s=1}^{S} \xi_{st} = 0$).
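To make the structure of (3) concrete, the sketch below simulates the random effects on a small regular grid. It assumes NumPy, uses rook (shared-edge) adjacency on a hypothetical 4 x 4 grid, borrows the values $\zeta = 0.9$, $\tau^2 = 0.005$, and $\omega = 0.9$ from Section 4, and imposes the sum-to-zero constraint by simple centering after simulation. The centering shortcut and the function names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def grid_adjacency(n_rows, n_cols):
    """Binary neighborhood matrix W for rook (shared-edge) adjacency on a grid."""
    S = n_rows * n_cols
    W = np.zeros((S, S))
    for r in range(n_rows):
        for c in range(n_cols):
            s = r * n_cols + c
            if c + 1 < n_cols:                       # neighbor to the right
                W[s, s + 1] = W[s + 1, s] = 1.0
            if r + 1 < n_rows:                       # neighbor below
                W[s, s + n_cols] = W[s + n_cols, s] = 1.0
    return W

def simulate_car_ar1(W, T, zeta, tau2, omega, rng):
    """Draw xi_1, ..., xi_T with xi_t = zeta * xi_{t-1} + phi_t and xi_0 = 0."""
    S = W.shape[0]
    D = np.diag(W.sum(axis=1))
    Q = (D - omega * W) / tau2          # precision matrix of phi_t
    L = np.linalg.cholesky(Q)           # Q = L L'
    xi = np.zeros((T + 1, S))           # row 0 holds xi_0 = 0
    for t in range(1, T + 1):
        z = rng.standard_normal(S)
        phi = np.linalg.solve(L.T, z)   # Cov(phi) = (L L')^{-1} = tau2 (D - omega W)^{-1}
        xi[t] = zeta * xi[t - 1] + phi
    xi = xi[1:]
    return xi - xi.mean()               # centre so that all effects sum to zero

rng = np.random.default_rng(42)
W = grid_adjacency(4, 4)
xi = simulate_car_ar1(W, T=60, zeta=0.9, tau2=0.005, omega=0.9, rng=rng)
```

The dense Cholesky factorization used here is only practical for small $S$; it is shown to clarify the model, not as a scalable sampler.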
We complete the proposed model by specifying prior distributions on the regression coefficients
and the variance and correlation parameters. In the absence of strong prior information,
the hyperparameters are chosen so that the prior distributions are vague. A Gaussian
prior is taken on the global regression coefficients and inverse Gamma (IG) priors on the
variance components for conditional conjugacy. Likewise, a truncated Gaussian prior whose
support is confined to $(-1, 1)$ is specified for $\zeta$, again for conditional conjugacy. We take a
Beta($\alpha_\omega, \upsilon_\omega$) prior on $\omega$ and concentrate it close to one, since previous empirical work has
shown that $\omega \approx 1$ is necessary to induce noticeable spatial association (Banerjee et al., 2015).
These specifications lead to the following hierarchy:
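$$
\begin{aligned}
Y_{st} \mid n_{st}, \nu_{st} &\overset{indep.}{\sim} \text{Binomial}(n_{st}, p_{st} = g(\nu_{st})), && s = 1, \ldots, S;\ t = 1, \ldots, T; \\
\boldsymbol{\beta}_p^* \mid \sigma_p^2, \boldsymbol{\theta}_p &\overset{indep.}{\sim} N(\mathbf{0}, \sigma_p^2 \mathbf{R}_p^*(\boldsymbol{\theta}_p)), && p = 1, \ldots, P; \\
\sigma_p^2 &\overset{indep.}{\sim} \text{IG}(\alpha_{\sigma_p^2}, \upsilon_{\sigma_p^2}), && p = 1, \ldots, P; \\
\boldsymbol{\theta}_p &\overset{i.i.d.}{\sim} \pi(\boldsymbol{\theta}_p), && p = 1, \ldots, P; \\
\boldsymbol{\delta} &\sim N(\mathbf{0}, \sigma_\delta^2 \mathbf{I}), && \sigma_\delta^2 > 0; \\
\boldsymbol{\xi}_t \mid \boldsymbol{\xi}_{t-1}, \tau^2, \omega, \zeta &\sim N(\zeta \boldsymbol{\xi}_{t-1}, \tau^2 (\mathbf{D} - \omega \mathbf{W})^{-1}), && t = 1, \ldots, T; \\
\tau^2 &\sim \text{IG}(\alpha_{\tau^2}, \upsilon_{\tau^2}), && \alpha_{\tau^2}, \upsilon_{\tau^2} > 0; \\
\omega &\sim \text{Beta}(\alpha_\omega, \upsilon_\omega), && \alpha_\omega, \upsilon_\omega > 0; \\
\zeta &\sim \text{Truncated Normal}(0, \sigma_\zeta^2, -1, 1), && \sigma_\zeta^2 > 0,
\end{aligned}
\tag{4}
$$
where $\nu_{st} = \mathbf{Z}_{st}'\boldsymbol{\delta} + \mathbf{X}_{st}'\tilde{\boldsymbol{\beta}}(\boldsymbol{\ell}_s) + \xi_{st}$, $\tilde{\boldsymbol{\beta}}(\boldsymbol{\ell}_s) = (\tilde{\beta}_1(\boldsymbol{\ell}_s), \ldots, \tilde{\beta}_P(\boldsymbol{\ell}_s))'$, and each coefficient in $\tilde{\boldsymbol{\beta}}(\boldsymbol{\ell}_s)$
is obtained from the $P$ predictive processes via $\tilde{\boldsymbol{\beta}}_p = \tilde{\mathbf{R}}_p^* (\mathbf{R}_p^*)^{-1} \boldsymbol{\beta}_p^*$, and $\boldsymbol{\xi}_0 = \mathbf{0}$. Appropriate
(identical) priors for $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_P$ depend on the selected correlation function in the GPP model.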
3. Posterior Sampling
3.1 Data Augmentation
We assume conditional independence given the covariate effects and spatio-temporal effects
and observe that $\mathbf{Y}$ depends on the regression coefficients and random effects only through
$\boldsymbol{\nu} = (\nu_{11}, \ldots, \nu_{1T}, \nu_{21}, \ldots, \nu_{ST})'$. Hence, the likelihood can be expressed as
$$f(\mathbf{Y} \mid \boldsymbol{\nu}) \propto \prod_{t=1}^{T} \prod_{s=1}^{S} g(\nu_{st})^{Y_{st}} \{1 - g(\nu_{st})\}^{n_{st} - Y_{st}}. \tag{5}$$
To develop a posterior sampling algorithm, we take g(·) to be the logistic link function.
Other link functions are possible and can be implemented following Albert and Chib (1993)
or Gamerman (1997). Metropolis-Hastings steps (Metropolis et al., 1953; Hastings, 1970) can
be used either componentwise or in blocks, but such samplers can be difficult to tune in high
dimensions. To facilitate the derivation of a Gibbs sampler for the regression coefficients and
spatio-temporal random effects, we use a data augmentation scheme that leads to sampling
these parameters from Gaussian full conditional distributions.
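Our data augmentation approach follows that of Polson et al. (2013). This scheme relies
on the fact that $\exp(\nu)^a \{1 + \exp(\nu)\}^{-b} = 2^{-b} \exp(\kappa\nu) \int_0^\infty \exp(-\psi\nu^2/2)\, p(\psi \mid b, 0)\, d\psi$, where
$a \in \mathbb{R}$, $b \in \mathbb{R}^+$, $\kappa = a - b/2$, and $p(\cdot \mid b, 0)$ is the probability density function of a Pólya-Gamma
random variable with parameters $b$ and 0. Thus, under the logistic link, (5) can be
written as
$$f(\mathbf{Y} \mid \boldsymbol{\nu}) \propto \prod_{t=1}^{T} \prod_{s=1}^{S} \exp(\kappa_{st}\nu_{st}) \int_0^\infty \exp(-\psi_{st}\nu_{st}^2/2)\, p(\psi_{st} \mid n_{st}, 0)\, d\psi_{st} \propto \prod_{t=1}^{T} \prod_{s=1}^{S} \int_0^\infty f(Y_{st}, \psi_{st} \mid \nu_{st})\, d\psi_{st},$$
where $\kappa_{st} = Y_{st} - n_{st}/2$. By introducing the $\psi_{st}$ as latent random variables to be sampled
via MCMC, we obtain
$$f(\mathbf{Y}, \boldsymbol{\psi} \mid \boldsymbol{\nu}) \propto \exp(-\boldsymbol{\nu}'\mathbf{D}_\psi\boldsymbol{\nu}/2 + \boldsymbol{\kappa}'\boldsymbol{\nu}) \prod_{t=1}^{T} \prod_{s=1}^{S} p(\psi_{st} \mid n_{st}, 0),$$
where $\boldsymbol{\psi} = (\psi_{11}, \ldots, \psi_{1T}, \psi_{21}, \ldots, \psi_{ST})'$, $\mathbf{D}_\psi = \text{diag}(\boldsymbol{\psi})$, and $\boldsymbol{\kappa} = (\kappa_{11}, \ldots, \kappa_{1T}, \kappa_{21}, \ldots, \kappa_{ST})'$.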
We see, then, that data augmentation yields a Gaussian density in $\boldsymbol{\nu}$ up to a normalizing
constant. Consequently, the full conditional distributions for most of the parameters are of
a known form and are easy to sample from; i.e., the full conditional distribution of $\psi_{st}$ is
Pólya-Gamma, $\boldsymbol{\beta}_p^*$ is multivariate normal, $\boldsymbol{\delta}$ is multivariate normal, $\sigma_p^2$ is inverse gamma,
$\tau^2$ is inverse gamma, and $\zeta$ is truncated normal. The Supplementary Material provides the
complete set of full conditional distributions.
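For readers unfamiliar with Pólya-Gamma variables, the following sketch draws approximate PG$(b, c)$ variates from the truncated infinite-sum-of-gammas representation in Polson et al. (2013). Under that paper's results the augmentation update would take the form $\psi_{st} \mid \nu_{st} \sim \text{PG}(n_{st}, \nu_{st})$; the exact form used here should be checked against the Supplementary Material. The truncation level `n_terms` and the function names are hypothetical, and exact samplers exist that avoid truncation altogether.

```python
import math
import random

def sample_pg(b, c, rng, n_terms=200):
    """Approximate PG(b, c) draw: truncated sum of independent Gamma(b, 1) variables."""
    total = 0.0
    for k in range(1, n_terms + 1):
        g = rng.gammavariate(b, 1.0)  # shape b, scale 1
        total += g / ((k - 0.5) ** 2 + c * c / (4.0 * math.pi ** 2))
    return total / (2.0 * math.pi ** 2)

def pg_mean(b, c):
    """Theoretical mean of PG(b, c): b / (2c) * tanh(c / 2), with limit b / 4 at c = 0."""
    return b / 4.0 if c == 0 else b / (2.0 * c) * math.tanh(c / 2.0)

rng = random.Random(1)
draws = [sample_pg(1.0, 0.0, rng) for _ in range(2000)]
sample_mean = sum(draws) / len(draws)  # should be close to pg_mean(1.0, 0.0)
```

Truncating the sum biases the draws slightly downward, so this version is best treated as a check on intuition, not as a production sampler.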
Given the data augmentation, a posterior sampling algorithm involving Gibbs steps for
the aforementioned parameters can be constructed in the usual manner. Metropolis-Hastings
steps are used to sample $\boldsymbol{\theta}_p$ and $\omega$. Under the considered data augmentation scheme, the
full conditional distribution of $\boldsymbol{\xi}_t$ is multivariate normal. However, sampling this parameter
is computationally expensive due to its high dimension. To facilitate more efficient repeated
updates of $\boldsymbol{\xi}_t$, we employ chromatic sampling, which is described next.
3.2 Chromatic Sampling
The full conditional distributions of $\boldsymbol{\xi}_t$, $t = 1, \ldots, T$, are each multivariate normal. Block
sampling from these full conditionals is reasonable when the number of spatial units is
relatively small (Furrer and Sain, 2010), but becomes unwieldy as $S$ increases due to the associated
Cholesky factorizations and memory requirements. As an alternative, we propose to
use so-called chromatic sampling (Gonzalez et al., 2011; Brown et al., 2017b). The chromatic
sampler exploits the Markov structure of the CAR model to parallelize single-site updates,
thereby avoiding time-consuming matrix calculations such as Cholesky factorizations. Under
chromatic sampling, the computing time scales approximately linearly in $S$ and $T$.
Let $\{\mathcal{A}_1, \ldots, \mathcal{A}_K\}$ be a partition in which $\mathcal{A}_k$ is an index set identifying a collection of
spatial regions that are not adjacent to one another; i.e., for all $s, s' \in \mathcal{A}_k$, $W_{ss'} = 0$. A
greedy algorithm for finding such a partition on an irregular lattice is given by Brown et al.
(2017b). For a vector $\mathbf{a} = (a_1, \ldots, a_S)'$ and an index set $\mathcal{C}$, define $\mathbf{a}(\mathcal{C}) := (a_s : s \in \mathcal{C})'$.
The Markov property of the CAR model implies that the elements of $\boldsymbol{\xi}_t(\mathcal{A}_k)$, given $\boldsymbol{\xi}_t(\mathcal{A}_k^c)$,
are conditionally independent. Therefore, by conditioning on $\boldsymbol{\xi}_t(\mathcal{A}_k^c)$, the elements of $\boldsymbol{\xi}_t(\mathcal{A}_k)$
can be sampled from their univariate full conditional distributions in parallel (or through
‘vectorized’ calculations). This approach can handle an extremely large number of spatial
regions (e.g., $S > 100{,}000$) when they are sparsely connected. For further details, see Brown
et al. (2017b), who compare block sampling to chromatic sampling for GMRFs.
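The partition into non-adjacent index sets can be illustrated with a generic greedy graph coloring. This is a textbook heuristic written for illustration and is not necessarily the algorithm of Brown et al. (2017b); the function names and the 4 x 4 rook-adjacency example are hypothetical.

```python
def greedy_coloring(neighbors):
    """neighbors: dict mapping each region to the set of its adjacent regions.
    Returns a dict mapping each region to a color index."""
    color = {}
    # Visit regions with many neighbors first, a common greedy heuristic.
    for s in sorted(neighbors, key=lambda r: -len(neighbors[r])):
        used = {color[r] for r in neighbors[s] if r in color}
        k = 0
        while k in used:   # smallest color not used by an already colored neighbor
            k += 1
        color[s] = k
    return color

def color_classes(color):
    """Group regions by color; each group plays the role of one index set A_k."""
    classes = {}
    for s, k in color.items():
        classes.setdefault(k, []).append(s)
    return [sorted(classes[k]) for k in sorted(classes)]

# Hypothetical 4 x 4 grid with rook (shared-edge) adjacency.
n = 4
neighbors = {(r, c): set() for r in range(n) for c in range(n)}
for (r, c) in list(neighbors):
    for dr, dc in ((0, 1), (1, 0)):
        nb = (r + dr, c + dc)
        if nb in neighbors:
            neighbors[(r, c)].add(nb)
            neighbors[nb].add((r, c))

color = greedy_coloring(neighbors)
classes = color_classes(color)
```

Within one Gibbs sweep, the sets in `classes` would be visited in turn, with every region in a set updated simultaneously from its univariate full conditional given the current values in all other sets.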
3.3 A Note on Missing Data
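In our application, data are not reported at all location-time pairs. To capture this effect, let
$\mathcal{R}$ be the set of ordered pairs $(s, t)$ for which data are available. The augmented likelihood
is then
$$f(\mathbf{Y}(\mathcal{R}), \boldsymbol{\psi}(\mathcal{R}) \mid \boldsymbol{\nu}(\mathcal{R})) \propto \exp\left(-\boldsymbol{\nu}(\mathcal{R})'\mathbf{D}_{\psi(\mathcal{R})}\boldsymbol{\nu}(\mathcal{R})/2 + \boldsymbol{\kappa}(\mathcal{R})'\boldsymbol{\nu}(\mathcal{R})\right) \prod_{(s,t) \in \mathcal{R}} p(\psi_{st} \mid n_{st}, 0),$$
where $\boldsymbol{\nu}(\mathcal{R}) = \mathbf{Z}(\mathcal{R})\boldsymbol{\delta} + \mathbf{X}(\mathcal{R})\tilde{\mathbf{b}} + \mathbf{I}(\mathcal{R})\boldsymbol{\xi}$ and we use the convention that $\mathbf{A}(\mathcal{R})$ is the
matrix formed by retaining the rows of $\mathbf{A}$ whose indices are in $\mathcal{R}$. Here $\mathbf{Z} = (\mathbf{Z}_1' \cdots \mathbf{Z}_S')' \in \mathbb{R}^{ST \times (Q+1)}$
with $\mathbf{Z}_s = (\mathbf{Z}_{s1} \cdots \mathbf{Z}_{sT})'$. Similarly, $\mathbf{X} = \bigoplus_{s=1}^{S} \mathbf{X}_s \in \mathbb{R}^{ST \times SP}$ with
$\mathbf{X}_s = (\mathbf{X}_{s1}, \ldots, \mathbf{X}_{sT})'$, $\mathbf{I}$ is the identity matrix, and $\tilde{\mathbf{b}} = (\tilde{\boldsymbol{\beta}}'(\boldsymbol{\ell}_1), \ldots, \tilde{\boldsymbol{\beta}}'(\boldsymbol{\ell}_S))' \in \mathbb{R}^{SP}$. Since $\boldsymbol{\xi} \in \mathbb{R}^{ST}$
is the vector of spatial random effects over all locations within the study region for all
time points, we obtain a well-defined full conditional distribution for $\boldsymbol{\xi}$, provided that the
prior on $\boldsymbol{\xi}$ is proper. This representation of the joint density allows the model to be extended
to the entire study region by imputing the missing effects via posterior realizations.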
4. A Simulation Study
In this section, we use simulation to study how well the proposed method estimates model
coefficients and how GPP knot selection influences results. Data are generated on a regularly
spaced $13 \times 13$ grid over 60 time points by drawing $Y_{st} \mid n_{st}, p_{st} \overset{indep.}{\sim} \text{Binomial}(n_{st}, p_{st})$, where
$$g^{-1}(p_{st}) = \delta_0 + \tilde{\beta}_1(\boldsymbol{\ell}_s)\, t/60 + \xi_{st}, \quad s = 1, \ldots, 13^2;\ t = 1, \ldots, 60,$$
and $g(\cdot)$ is the logistic link. The test counts $n_{st}$ are randomly sampled from a discrete uniform
distribution ranging from 100 to 200. The random effects $\xi_{st}$ are generated from the CAR
model defined in Section 2, with $\zeta = 0.9$, $\tau^2 = 0.005$, $\omega = 0.9$, and the neighborhood matrix
$\mathbf{W}$ set so that two areas are neighbors if and only if they share a common edge or corner.
The true intercept is $\delta_0 = 1$ and the surface $\tilde{\beta}_1(\cdot)$ at each study location is generated from
a GPP model. In particular, a realization of the parent process is first simulated on a $5 \times 5$
grid of equally spaced knots. The parent process has $\mu_1(\boldsymbol{\ell}_s^*) \equiv 1$ and $\rho(\boldsymbol{\ell}_s^*, \boldsymbol{\ell}_{s'}^*; \theta_1) = \theta_1^{d_{ss'}^2}$,
where $d_{ss'}$ is the Euclidean distance between $\boldsymbol{\ell}_s^*$ and $\boldsymbol{\ell}_{s'}^*$, $\theta_1 = 0.6$ and $\sigma_1^2 = 1.5$. The resulting
true surface $\tilde{\beta}_1(\cdot)$ is depicted in Figure 2. Using this surface, 500 independent data sets are
generated from the assumed data generating model.
We fit our model to each of the 500 data sets using three separate knot set configurations.
The first configuration uses the same knots as those used to generate the true surface,
representing an ideal situation. The other two configurations take $4 \times 4$ and $7 \times 7$ grids
of equally spaced knots. For the model (4) priors, we take $\alpha_{\sigma_1^2} = \upsilon_{\sigma_1^2} = \alpha_{\tau^2} = \upsilon_{\tau^2} = 2$,
$\sigma_\delta^2 = 1000$, $\alpha_\omega = 900$, $\upsilon_\omega = 100$, and $\sigma_\zeta^2 = 10$. In the GPP, the correlation function is taken
to be $\rho(\boldsymbol{\ell}_s, \boldsymbol{\ell}_{s'}; \theta_1) = \theta_1^{d_{ss'}^2}$, the same as the true GPP, and we specify a Uniform(0,1) prior
on $\theta_1$. For each data set, we retain 5,000 MCMC iterates after a burn-in of 5,000 samples.
Convergence of the chains was assessed via trace plots.
Figure 3 displays a summary of the simulation results for the temporal trend parameter
$\tilde{\beta}_1(\cdot)$. This summary includes a spatial depiction of the arithmetic average of the 500 point
estimates, as well as empirical bias and mean squared error, where for each simulated data
set a point estimate of $\tilde{\beta}_1(\cdot)$ was obtained as the mean of the 5,000 retained MCMC iterates.
The results suggest that our model estimates the spatially varying regression coefficient well;
i.e., the mean estimates show little bias. The variability of the estimators tends to increase
near the region’s edges. This boundary effect is expected and is common to non-parametric
smoothers. Further, little difference between the estimates obtained under the three different
knot configurations is seen, demonstrating that the methods can recover the true coefficient
surface across the entire study region (assuming the model is correct up to choice of knots).
5. Lyme Analysis
5.1 Background
Our data consist of 16,571,562 test results from domestic dogs living throughout the con-
terminous United States from January 2012 - December 2016. The data were provided by
IDEXX Laboratories, Inc. to the Companion Animal Parasite Council (CAPC), who made
them available online at https://www.capcvet.org. The data are aggregated by month
and county, resulting in 69,876 county-month pairs reporting data.
In general, the spatial distribution of a vector-borne disease is strongly influenced by the
environment and the vector’s hosts, leading to correlated data (Legendre, 1993). Indeed, a
strong spatial correlation is seen in these data, as indicated by Figure 1 and a Moran’s I
statistic of 0.378 (p-value $\approx$ 0). Such data are also positively temporally correlated. Figure
4 displays raw county-level seroprevalence estimates over all months in the respective years
of 2012 and 2016. These figures suggest regions where a significant increase in seroprevalence
is expected, including western Pennsylvania, Virginia, West Virginia, Minnesota, and Iowa.
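For reference, Moran's I with binary adjacency weights can be computed as in the sketch below. The text does not state which weights or which significance test produced the reported value, so the binary weights, the function name `morans_i`, and the four-region chain example are assumptions for illustration only.

```python
def morans_i(x, neighbors):
    """Moran's I with binary weights.
    x: dict region -> value; neighbors: dict region -> list of adjacent regions."""
    n = len(x)
    mean = sum(x.values()) / n
    dev = {s: v - mean for s, v in x.items()}
    # Cross-products over all ordered pairs of adjacent regions.
    num = sum(dev[s] * dev[r] for s in x for r in neighbors[s])
    w_sum = sum(len(neighbors[s]) for s in x)   # total weight
    den = sum(d * d for d in dev.values())
    return (n / w_sum) * num / den

# Hypothetical chain of four regions: 0 - 1 - 2 - 3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
smooth = morans_i({0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}, chain)         # positive value
alternating = morans_i({0: 1.0, 1: -1.0, 2: 1.0, 3: -1.0}, chain)  # negative value
```

Values near zero indicate little spatial association, so a statistic of 0.378 over more than three thousand counties reflects substantial positive dependence.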
5.2 Model Building and Seasonality
Given the seasonality of Ixodes spp. activity, seasonality could manifest itself in B. burgdorferi
seroprevalence. To investigate this, the model
$$\nu_{st} = \delta_0 + \tilde{\beta}_1(\boldsymbol{\ell}_s) I_1(t) + \tilde{\beta}_2(\boldsymbol{\ell}_s) I_2(t) + \tilde{\beta}_3(\boldsymbol{\ell}_s) I_3(t) + \tilde{\beta}_4(\boldsymbol{\ell}_s)\, t + \xi_{st} \tag{6}$$
is posited, where $t$ denotes time (rescaled to the unit interval) and $I_p(t)$ is a seasonal indicator,
for $p = 1, 2, 3$. Seasons are defined as follows: winter (December-February), spring (March-May),
summer (June-August), and fall (September-November), where winter is regarded as the
baseline. This model allows for spatially varying seasonal effects and spatially varying trend
effects. While covariates such as county level temperatures and precipitations are available,
these are not used in the regression since our goal is to quantify any trends, not determine
specific drivers of any trends.
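The covariates entering (6) can be sketched as follows. The assignment of $I_1$, $I_2$, $I_3$ to spring, summer, and fall, and the rescaling of time as $t/60$ (borrowed from Section 4), are assumptions; the text only fixes winter as the baseline. The function names are hypothetical.

```python
def season_indicators(month):
    """month: calendar month 1..12. Returns (I1, I2, I3) for spring, summer, fall.
    Winter (December-February) is the baseline, so all three are zero."""
    spring = 1 if month in (3, 4, 5) else 0
    summer = 1 if month in (6, 7, 8) else 0
    fall = 1 if month in (9, 10, 11) else 0
    return spring, summer, fall

def design_row(t_index, n_months=60, start_month=1):
    """t_index runs 1..n_months, with t_index = 1 falling in start_month.
    Returns (I1, I2, I3, rescaled time)."""
    month = (start_month - 1 + t_index - 1) % 12 + 1
    return season_indicators(month) + (t_index / n_months,)

# One row per study month, January 2012 through December 2016.
rows = [design_row(t) for t in range(1, 61)]
```

Each tuple in `rows` would multiply the four spatially varying coefficients at a given county to form the non-intercept, non-random part of the linear predictor.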
The model in (6) was fit with the prior specifications and correlation function described
in Section 4. Two specifications for the GPP model are considered, using 50 and 100 knots,
respectively. In both cases, knot placement for all GPP models is done by K-means clustering
as described in Section 2. For sampling, 30,000 MCMC iterates are generated, with the
last 10,000 retained for inference. Convergence of the MCMC chains was assessed using
trace plots. We stress the computational scalability of this approach. This model consists of
four a priori independent coefficient surfaces, each with 3109 spatial locations, and 186,540
spatio-temporal random effects.
Two primary findings arise. First, there are no appreciable differences between the es-
timates using 50 and 100 knots. As both specifications are computationally feasible, all
subsequent analyses use 100 knots. Second, there is evidence of seasonality in the location
parameters, but these appear constant across space. Thus, the simpler model
$$\nu_{st} = \delta_0 + \delta_1 I_1(t) + \delta_2 I_2(t) + \delta_3 I_3(t) + \tilde{\beta}_1(\boldsymbol{\ell}_s)\, t + \xi_{st} \tag{7}$$
was fit. Credible intervals at level 95% indicate that the model can be further reduced to
$$\nu_{st} = \delta_0 + \delta_1 I_1^*(t) + \tilde{\beta}_1(\boldsymbol{\ell}_s)\, t + \xi_{st}, \tag{8}$$
where $I_1^*(t)$ is a seasonal indicator that equals one if $t$ is between March and November,
and zero otherwise. Approximate 95% credible intervals for $\delta_0$ and $\delta_1$ are $[-3.95, -3.82]$ and
$[-0.20, -0.10]$, respectively.
For further insight, the model in (8) is compared against the nonseasonal model
$$\nu_{st} = \delta_0 + \tilde{\beta}_1(\boldsymbol{\ell}_s)\, t + \xi_{st}. \tag{9}$$
For this model, an approximate 95% credible interval for $\delta_0$ is $[-4.08, -4.03]$. Figure 5
displays the estimates of $\tilde{\beta}_1(\cdot)$ from both models. Very similar large-scale patterns in the
estimated trends are seen; hence, while seasonality exists in the location parameters, its
effect on trends seems negligible.
The spatial trend $\tilde{\beta}_1(\cdot)$ is a large-scale regional trend with low spatial frequency, and
is not intended to explain local (county-level) trends. While regional trends are useful for
estimating behavior in areas reporting little data, it may be desirable to separate local
heterogeneity in the trends and provide a county level assessment. Our proposed modeling
framework can accomplish this. Specifically, let $\upsilon_s^{(g)}$, for posterior sample realization
$g = 1, \ldots, G$, be the slope estimate obtained at county $s$ by fitting a simple linear regression to
$\{(t, \nu_{st}^{(g)}) : t = 1, \ldots, T\}$. Then $\upsilon_s^{(g)}$ can be regarded as a Monte Carlo realization of the
county level trend. Using the $\upsilon_s^{(g)}$ as a random sample, point estimation and inference for
county level trends proceed in the usual way.
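The county-level trend computation can be sketched as below for a single county. The function names, the three toy posterior draws, and the crude order-statistic interval are hypothetical; with a realistic number of retained draws, any standard quantile routine would serve for the equal-tailed interval.

```python
def ols_slope(t, y):
    """Least-squares slope of y regressed on t (simple linear regression)."""
    n = len(t)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    return sxy / sxx

def county_trend_summary(nu_draws, t, level=0.95):
    """nu_draws: list of G sequences, each holding the length-T linear predictor
    of one county for one posterior draw. Returns (posterior mean slope, interval)."""
    slopes = sorted(ols_slope(t, nu) for nu in nu_draws)
    G = len(slopes)
    lower = slopes[int((1 - level) / 2 * (G - 1))]   # crude equal-tailed bounds
    upper = slopes[int((1 + level) / 2 * (G - 1))]
    return sum(slopes) / G, (lower, upper)

# Toy example: three hypothetical posterior draws that are exactly linear in t.
t = list(range(1, 61))
nu_draws = [[a + b * ti for ti in t]
            for a, b in [(-4.0, 0.01), (-4.1, 0.02), (-3.9, 0.03)]]
mean_slope, (lower, upper) = county_trend_summary(nu_draws, t)
```

A county would then be flagged as having a significantly positive local trend when the lower bound of its interval exceeds zero.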
5.3 Results
Figure 5 displays the estimated posterior mean of the regional trend $\tilde{\beta}_1$. The regional rate of
change in B. burgdorferi seroprevalence between January 2012 and December 2016 is positive
in all states that are currently recognized as having high human Lyme disease incidence
(Centers for Disease Control and Prevention, 2017), including portions of the Northeast
and the Upper Midwest. The regional rate of increase varies spatially, with high incidence
regions generally exhibiting the greatest changes. These regions include Maine, south to
West Virginia and Virginia, and northern parts of Minnesota and Wisconsin.
Figure 6 displays estimated posterior means of the county-level trend υs. Significantly
increasing local trends are seen in much of the Northeast, extending southwards through
West Virginia and Virginia and into North Carolina and Tennessee. This conclusion is not
surprising as this region entails localities where Lyme disease has been reported in increasing
incidence. Also seen are increasing local trends in parts of northwestern Minnesota, northern
Wisconsin, and southeastern Iowa. In the Great Lakes region, increasing trends are
observed in eastern Ohio, Indiana, and western Michigan. Of note, in much of eastern New
England, which is the region where Lyme disease first emerged in people, the prevalence appears
to be remaining stable, albeit high. Figure 7 graphically depicts counties where local trends
are significantly positive, using approximate 95% equal-tailed credible intervals to assess
significance. This graphic further supports the above statements.
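The significance assessment just described can be illustrated as follows. The sketch assumes
that a Monte Carlo sample of county-level slopes is available as a (G, S) array; the function
name flag_increasing and the simulated draws are hypothetical and do not reproduce the
paper's results.

```python
import numpy as np

def flag_increasing(slope_draws, level=0.95):
    """Equal-tailed credible interval per county, plus a flag for intervals entirely above zero.

    slope_draws : array of shape (G, S), G posterior draws for each of S counties.
    """
    alpha = 1.0 - level
    lower, upper = np.quantile(slope_draws, [alpha / 2, 1 - alpha / 2], axis=0)
    return lower > 0, lower, upper

# Toy usage with simulated draws: county 0 has a clearly positive trend, county 1 has none.
rng = np.random.default_rng(1)
draws = rng.normal(loc=[0.5, 0.0], scale=0.1, size=(4000, 2))
increasing, lower, upper = flag_increasing(draws)
```

A county is flagged only when the lower endpoint of its interval exceeds zero, which matches
the "significantly positive" criterion in the text.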
6. Discussion
This paper develops a computationally feasible binomial regression model for a large spatio-
temporal data set that can identify localized trends in canine seroprevalence. Our novel
approach combines several recent advances in large-scale spatial modeling and MCMC sam-
pling. The end product is a flexible, scalable methodology for modern spatio-temporal data.
Our approach was used to identify regions of the U.S. experiencing increasing canine risk
for B. burgdorferi infection. Since human and canine risk are similar, these regions are likely
experiencing increasing human exposure as well. And while human Lyme disease data may be
private and in many regions scarce due to lack of testing, our canine seroprevalence data con-
sist of over 16 million spatio-temporally referenced test results. The size of the domain (3109
spatial locations and 60 time points) creates computational challenges. While monthly and
county-level aggregation reduces the size of the response vector from 16,581,562 test results
to 69,876 county-month pairs, a binomial response in an MCMC context typically requires
sampling via Metropolis-Hastings steps, which can be difficult to tune in extremely high
dimensions (over 180,000, in our case). Under the logistic link, a recently proposed Pólya-Gamma
data augmentation (Polson et al., 2013) was used to facilitate direct Gibbs sampling on full conditional
distributions. Gaussian predictive processes were used to model smoothly varying, high-
dimensional coefficients through a low-dimensional representation. Local spatio-temporal
heterogeneity was modeled by random effects following a time-varying Gaussian CAR distri-
bution. Chromatic sampling was used on GMRFs to construct an efficient MCMC algorithm.
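To make the idea of chromatic sampling concrete, the sketch below colors the sites of a graph
so that no two neighbors share a color and then updates all sites of one color simultaneously.
It is an illustration under stated assumptions rather than the sampler used in this paper: the
prior is a generic proper CAR model with precision tau * (D - rho * W), the graph is a small
regular lattice standing in for the county adjacency structure, and all function names and
parameter values are hypothetical.

```python
import numpy as np

def lattice_adjacency(nrow, ncol):
    """Binary adjacency matrix for a rook-neighbor lattice (stand-in for a county map)."""
    n = nrow * ncol
    W = np.zeros((n, n))
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            if c + 1 < ncol:
                W[i, i + 1] = W[i + 1, i] = 1.0
            if r + 1 < nrow:
                W[i, i + ncol] = W[i + ncol, i] = 1.0
    return W

def greedy_coloring(W):
    """Give each site the smallest color not already used by one of its neighbors."""
    n = W.shape[0]
    colors = np.full(n, -1)
    for i in range(n):
        used = set(colors[W[i] > 0].tolist())
        c = 0
        while c in used:
            c += 1
        colors[i] = c
    return colors

def chromatic_sweep(x, W, colors, tau, rho, rng):
    """One Gibbs sweep for x ~ N(0, [tau * (D - rho * W)]^{-1}), one color at a time."""
    d = W.sum(axis=1)  # number of neighbors; assumes every site has at least one
    for c in np.unique(colors):
        idx = np.flatnonzero(colors == c)
        # Same-colored sites share no edges, so their full conditionals are independent
        # normals that depend only on sites of other colors.
        mean = rho * (W[idx] @ x) / d[idx]
        sd = 1.0 / np.sqrt(tau * d[idx])
        x[idx] = mean + sd * rng.standard_normal(idx.size)
    return x

# Toy usage: 100 sweeps on a 6 x 8 lattice.
rng = np.random.default_rng(2)
W = lattice_adjacency(6, 8)
colors = greedy_coloring(W)
x = np.zeros(W.shape[0])
for _ in range(100):
    x = chromatic_sweep(x, W, colors, tau=1.0, rho=0.9, rng=rng)
```

Each sweep needs only as many vectorized updates as there are colors, rather than one update
per site, and it avoids factorizing the full precision matrix as a joint block update would.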
The motivation for this study is the rise in reported Lyme disease cases in the United
States (Adams, 2017) and, in particular, rising incidence in states not traditionally considered
to be endemic, such as West Virginia. Our results suggest that 1) canine seroprevalence is
rising in conjunction with reports of human cases (Kugeler et al., 2015; Hendricks and Mark-
Carew, 2017; Centers for Disease Control and Prevention, 2017), 2) rates are increasing
most in areas where the pathogen has recently encroached, and 3) seroprevalence in dogs is
rising outside of the states considered to be high incidence for humans (Centers for Disease
Control and Prevention, 2017) (suggesting that risk may be increasing for humans in those
areas). Several studies have recognized increasing risk in low incidence areas, as measured by
human disease incidence, tick density, and presence of the pathogen. These areas include
Illinois (Herrmann et al., 2014), Iowa (Lingren et al., 2005), North Dakota (Russart et al.,
2014), Ohio (Wang et al., 2014), and Michigan (Lantos et al., 2017). We also observe
significant increases in canine seroprevalence in several states that have not yet reported
significant human incidence. Given the proximity of these areas to recognized high-incidence
states, it is reasonable to propose that canine seroprevalence is more sensitive to changes in
risk of exposure and thus may be used as an early warning system for changes in human
risk.
Examining local trends, as opposed to regional effects, shows that some adjacent counties
are exhibiting trends in opposite directions. To fully understand this heterogeneity, further
ecological analyses are needed. Possible factors to consider include the presence of urban
centers, degree of forestation or other habitat factors, tick populations, reservoir presence
and densities, vaccination, and preventative medication use in dogs. The latter are likely
driven by socioeconomic factors whereas other factors are related to climate or changing
habitats. Areas with significantly positive trends include the Appalachian mountains from
upstate New York to North Carolina, the Upper Midwest, and Iowa. The West Virginia,
western Pennsylvania, and eastern Ohio regions can be viewed as a leading edge of rising
seroprevalence in Lyme’s westward expansion. This is supported by evidence of increased
reports of ticks in these regions (Eisen et al., 2016).
Our approach makes several simplifying assumptions. We treat the link function in the
GLM as known, and it might be poorly specified. As this can induce bias in the estimates
of the covariate effects (Neuhaus, 1999), relaxing this assumption could be fruitful. We also
assumed that the spatially varying coefficients follow independent and identically distributed
Gaussian processes. A more flexible approach would allow these coefficients to be correlated
through a multivariate GP (Ver Hoef and Barry, 1998), but these are more difficult to use
and challenges remain in their development (e.g., Fricker et al., 2013). The observed
seroprevalences suggest that the smoothness of the random effects may change by region, in which
case a heteroskedastic GP might be more appropriate (Binois et al., 2016). Further, GMRFs
are known to oversmooth salient features (Smith and Fahrmeir, 2007) and do not directly
correspond to any valid covariance function in a GP. However, approximating GPs with
GMRFs via stochastic PDEs to maintain computational feasibility (Lindgren et al., 2011)
could be promising for our application. In addition to statistical challenges, future
applications of our model include human Lyme disease data and heartworm disease, ehrlichiosis,
and anaplasmosis in canines. The ecological, entomological, and environmental implications
of the canine Lyme seroprevalence analysis presented in this work are the subject of ongoing
research.
7. Supplementary Material
The supplementary material for this article includes the full conditional distributions required
to develop the proposed posterior sampling procedure.
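For orientation, the identity that underlies Pólya-Gamma augmentation (Polson et al., 2013) is
sketched below for a single binomial observation. The notation is generic and is an assumption
of this sketch: y successes in n trials, linear predictor \psi, and latent variable \omega. It
is not a transcription of the full conditionals in the supplementary material.

```latex
% Binomial likelihood contribution under the logistic link, rewritten as a
% scale mixture of Gaussians in the linear predictor \psi:
\frac{(e^{\psi})^{y}}{(1 + e^{\psi})^{n}}
  = 2^{-n} e^{\kappa \psi} \int_0^{\infty} e^{-\omega \psi^{2}/2} \, p(\omega) \, d\omega,
  \qquad \kappa = y - \frac{n}{2},
% where p(\omega) is the density of a PG(n, 0) random variable. Consequently,
\omega \mid \psi \sim \mathrm{PG}(n, \psi),
  \qquad
p(\psi \mid \omega, y) \propto p(\psi)
  \exp\left\{ -\frac{\omega}{2} \left( \psi - \frac{\kappa}{\omega} \right)^{2} \right\}.
```

Conditional on \omega, the likelihood is Gaussian in \psi, so Gaussian priors on the components
of the linear predictor lead to Gaussian full conditionals that can be sampled directly.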
Acknowledgements:
The authors thank IDEXX Laboratories, Inc. for their data contribution. This mate-
rial is based upon work partially supported by the National Science Foundation Grants
DMS-1127914, DMS-1407480, CMMI-1563435, and EEC-1744497, and National Institutes
of Health Grant R01 AI121351. JRG is supported by The Boehringer Ingelheim Vetmedica-
CAPC Infectious Disease Postdoctoral Fellowship.
References
Adams, D. A. (2017). Summary of notifiable infectious diseases and conditions- United
States, 2015. Morbidity and Mortality Weekly Report, 64.
Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response
data. Journal of the American Statistical Association, 88(422):669–679.
Andrade-Pacheco, R., Mubangizi, M., Quinn, J., and Lawrence, N. (2015). Monitoring
short term changes in infectious disease in Uganda with Gaussian processes. In Douzal-
Chouakria, A., Vilar, J. A., and Marteau, P.-F., editors, Advanced Analysis and Learning
on Temporal Data, pages 95–110. Springer.
Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2015). Hierarchical Modeling and Analysis
for Spatial Data. Chapman and Hall/CRC, Boca Raton, 2nd edition.
Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). Gaussian predictive
process models for large spatial datasets. Journal of the Royal Statistical Society: Series
B (Methodological), 70(4):825–848.
Bayarri, M. J., Berger, J. O., Paulo, R., Sacks, J., Cafeo, J. A., Cavendish, J., Lin, C.-
H., and Tu, J. (2007). A framework for validation of computer models. Technometrics,
49(2):138–154.
Berger, J. O., de Oliveira, V., and Sanso, B. (2001). Objective Bayesian analysis of spatially
correlated data. Journal of the American Statistical Association, 96(456):1361–1374.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal
of the Royal Statistical Society: Series B (Methodological), 36(2):192–236.
Besag, J. and Kooperberg, C. (1995). On conditional and intrinsic autoregressions.
Biometrika, 82:733–746.
Binois, M., Gramacy, R. B., and Ludkovski, M. (2016). Practical heteroskedastic Gaussian
process modeling for large simulation experiments. arXiv preprint arXiv:1611.05902.
Brezger, A., Fahrmeir, L., and Hennerfeind, A. (2007). Adaptive Gaussian Markov random
fields with applications in human brain mapping. Journal of the Royal Statistical Society:
Series C (Applied Statistics), 56(3):327–345.
Brown, D. A., Datta, G. S., and Lazar, N. A. (2017a). A Bayesian generalized CAR model
for correlated signal detection. Statistica Sinica, 27:1125–1153.
Brown, D. A., McMahan, C. S., and Watson, S. C. (2017b). Sampling strategies for fast
updating of Gaussian Markov random fields. arXiv preprint arXiv:1702.05518.
Centers for Disease Control and Prevention (2017). Data and statistics: Lyme disease.
Cressie, N. (1993). Statistics for Spatial Data. Wiley Press.
Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial datasets.
Journal of the Royal Statistical Society: Series B (Methodological), 70:209–226.
Cressie, N. and Wikle, C. K. (2011). Statistics for Spatio-temporal Data. Wiley Press.
Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016). Hierarchical nearest-
neighbor Gaussian process models for large geostatistical datasets. Journal of the American
Statistical Association, 111:800–812.
Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998). Model-based geostatistics. Journal of
the Royal Statistical Society: Series C (Applied Statistics), 47(3):299–350.
Eidsvik, J., Finley, A. O., Banerjee, S., and Rue, H. (2012). Approximate Bayesian inference
for large spatial datasets using predictive process models. Computational Statistics and
Data Analysis, 56(6):1362–1380.
Eisen, R. J., Eisen, L., and Beard, C. B. (2016). County-scale distribution of Ixodes scapularis
and Ixodes pacificus (Acari : Ixodidae) in the continental United States. Journal of Medical
Entomology, 53(January):349–386.
Finley, A. O., Sang, H., Banerjee, S., and Gelfand, A. E. (2009). Improving the perfor-
mance of predictive process modeling for large datasets. Computational Statistics and
Data Analysis, 53(8):2873–2884.
Fricker, T. E., Oakley, J. E., and Urban, N. M. (2013). Multivariate Gaussian process
emulators with nonseparable covariance structures. Technometrics, 55(1):47–56.
Furrer, R., Genton, M. G., and Nychka, D. (2006). Covariance tapering for interpolation of
large spatial datasets. Journal of Computational and Graphical Statistics, 15:502–523.
Furrer, R. and Sain, S. R. (2010). spam: A sparse matrix R package with emphasis on MCMC
methods for Gaussian Markov random fields. Journal of Statistical Software, 36(10):1–25.
Gamerman, D. (1997). Sampling from the posterior distribution in generalized linear mixed
models. Statistics and Computing, 7(1):57–68.
Gelfand, A. E., Kim, H. J., Sirmans, C. F., and Banerjee, S. (2003). Spatial modeling with
spatially varying coefficient processes. Journal of the American Statistical Association,
98(462):387–396.
Gelfand, A. E. and Schliep, E. M. (2016). Spatial statistics and Gaussian processes: A
beautiful marriage. Spatial Statistics, 18(Part A):86–104.
Gonzalez, J., Low, Y., Gretton, A., and Guestrin, C. (2011). Parallel Gibbs sampling:
From colored fields to thin junction trees. In Gordon, G., Dunson, D., and Dudík, M.,
editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence
and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 324–332,
Fort Lauderdale, FL, USA. PMLR.
Gramacy, R. and Apley, D. (2015). Local Gaussian process approximation for large computer
experiments. Journal of Computational and Graphical Statistics, 24:561–578.
Guhaniyogi, R., Finley, A. O., Banerjee, S., and Gelfand, A. E. (2011). Adaptive Gaussian
predictive process models for large spatial datasets. Environmetrics, 22(8):997–1007.
Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm.
Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108.
Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their appli-
cation. Biometrika, 57:97–109.
Heaton, M. J., Christensen, W. F., and Terres, M. A. (2017a). Nonstationary Gaussian
process models using spatial hierarchical clustering from finite differences. Technometrics,
59:93–101.
Heaton, M. J., Datta, A., Finley, A., Furrer, R., Guhaniyogi, R., Gerber, R., Gramacy, R.,
et al. (2017b). Methods for analyzing large spatial data: A review and comparison. arXiv
preprint arXiv:1710.05013.
Hendricks, B. and Mark-Carew, M. (2017). Using exploratory data analysis to identify
and predict patterns of human Lyme disease case clustering within a multistate region,
2010–2014. Spatial and Spatio-temporal Epidemiology, 20:35–43.
Herrmann, J. A., Dahm, N. M., Ruiz, M. O., and Brown, W. M. (2014). Temporal and spatial
distribution of tick-borne disease cases among humans and canines in Illinois (2000–2009).
Environmental Health Insights, 8(Supplement 2):15.
Johnson, L. R., Gramacy, R. B., Cohen, J., Mordecai, E. A., Murdock, C., Rohr, J., Ryan,
S. J., Stewart-Ibarra, A. M., and Weikel, D. (2017). Phenomenological forecasting of
disease incidence using heteroskedastic Gaussian processes: A dengue case study. Annals
of Applied Statistics, In Press.
Jona-Lasinio, G., Gelfand, A., and Jona-Lasinio, M. (2012). Spatial analysis for wave direc-
tion data using wrapped Gaussian processes. Annals of Applied Statistics, 6(4):1478–1498.
Katzfuss, M. (2017). A multi-resolution approximation for massive spatial datasets. Journal
of the American Statistical Association, 112:201–214.
Kugeler, K. J., Farley, G. M., Forrester, J. D., and Mead, P. S. (2015). Geographic distribu-
tion and expansion of human Lyme disease, United States. Emerging Infectious Diseases,
21(8):1455–1457.
Lantos, P. M., Tsao, J., Nigrovic, L. E., Auwaerter, P. G., Fowler, V. G., Ruffin, F., Foster,
E., and Hickling, G. (2017). Geographic expansion of Lyme disease in Michigan, 2000–
2014. In Open Forum Infectious Diseases, volume 4. Oxford University Press.
Lazar, N. A. (2008). The Statistical Analysis of Functional MRI Data. Springer Science +
Business Media, LLC, New York.
Legendre, P. (1993). Spatial autocorrelation: Trouble or new paradigm? Ecology,
74(6):1659–1673.
Levy, S. and Magnarelli, L. (1992). Relationship between development of antibodies to
Borrelia burgdorferi in dogs and the subsequent development of limb/joint borreliosis.
Journal of the American Veterinary Medical Association, 200(3):344–347.
Lindgren, F., Rue, H., and Lindström, J. (2011). An explicit link between Gaussian fields
and Gaussian Markov random fields: The stochastic partial differential equation approach.
Journal of the Royal Statistical Society: Series B (Methodological), 73(4):423–498.
Lingren, M., Rowley, W., Thompson, C., and Gilchrist, M. (2005). Geographic distribution
of ticks (Acari: Ixodidae) in Iowa with emphasis on Ixodes scapularis and their infection
with Borrelia burgdorferi. Vector-Borne and Zoonotic Diseases, 5(3):219–226.
Little, S. E., Heise, S. R., Blagburn, B. L., Callister, S. M., and Mead, P. S. (2010). Lyme
borreliosis in dogs and humans in the USA. Trends in Parasitology, 26(4):213–218.
Mead, P., Goel, R., and Kugeler, K. (2011). Canine serology as adjunct to human Lyme
disease surveillance. Emerging Infectious Diseases, 17(9):1710.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953).
Equation of state calculations by fast computing machines. Journal of Chemical Physics,
21:1087–1091.
Morales-Álvarez, P., Pérez-Suay, A., Molina, R., and Camps-Valls, G. (2017). Remote sensing
image classification with large-scale Gaussian processes. IEEE Transactions on Geoscience
and Remote Sensing, PP(99):1–12.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman and
Hall/CRC, London, 2nd edition.
Neal, R. M. (1998). Regression and classification using Gaussian process priors. In Bernardo,
J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M., editors, Bayesian Statistics 6.
Oxford University Press, New York.
Nelder, M. P., Russell, C. B., Sheehan, N. J., Sander, B., Moore, S., Li, Y., Johnson, S.,
Patel, S. N., and Sider, D. (2016). Human pathogens associated with the blacklegged tick
Ixodes scapularis: A systematic review. Parasites and Vectors, 9(1):265.
Neuhaus, J. M. (1999). Bias and efficiency loss due to misclassified responses in binary
regression. Biometrika, 86(4):843–855.
Nychka, D., Bandyopadhyay, S., Hammerling, D., Lindgren, F., and Sain, S. (2015). A
multiresolution Gaussian process model for the analysis of large spatial datasets. Journal
of Computational and Graphical Statistics, 24:579–599.
O’Hagan, A. (1978). Curve fitting and optimal design for prediction. Journal of the Royal
Statistical Society: Series B (Methodological), 40:1–42.
Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models
using Pólya-Gamma latent variables. Journal of the American Statistical Association,
108(504):1339–1349.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning.
The MIT Press.
Rue, H. and Held, L. (2005). Gaussian Markov random fields: theory and applications.
Chapman and Hall/CRC.
Rue, H. and Tjelmeland, H. (2002). Fitting Gaussian Markov random fields to Gaussian fields.
Scandinavian Journal of Statistics, 29:31–49.
Russart, N. M., Dougherty, M. W., and Vaughan, J. A. (2014). Survey of ticks (Acari:
Ixodidae) and tick-borne pathogens in North Dakota. Journal of Medical Entomology,
51(5):1087–1090.
Sang, H., Jun, M., and Huang, J. Z. (2011). Covariance approximation for large multivariate
spatial datasets with an application to multiple climate model errors. Annals of Applied
Statistics, 5(4):2519–2548.
Santner, T. J., Williams, B. J., and Notz, W. I. (2003). The Design and Analysis of Computer
Experiments. Springer-Verlag, New York.
Senanayake, R., O’Callaghan, S., and Ramos, F. (2016). Predicting spatio-temporal propa-
gation of seasonal influenza using variational Gaussian process regression. In Proceedings
of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 3901–3907.
AAAI Press.
Smith, M. and Fahrmeir, L. (2007). Spatial Bayesian variable selection with application to
functional magnetic resonance imaging. Journal of the American Statistical Association,
102(478):417–431.
Song, H.-R., Fuentes, M., and Ghosh, S. (2008). A comparative study of Gaussian geo-
statistical models and Gaussian Markov random fields. Journal of Multivariate Analysis,
99:1681–1697.
Ver Hoef, J. M. and Barry, R. P. (1998). Constructing and fitting models for cokriging and
multivariable spatial prediction. Journal of Statistical Planning and Inference, 69:275–294.
Wagner, B., Freer, H., Rollins, A., Garcia-Tapia, D., Erb, H. N., Earnhart, C., Marconi, R.,
and Meeus, P. (2012). Antibodies to Borrelia burgdorferi OspA, OspC, OspF, and C6
antigens as markers for early and late infection in dogs. Clinical and Vaccine Immunology,
19(4):527–535.
Waller, L. A., Carlin, B. P., Xia, H., and Gelfand, A. E. (1997). Hierarchical spatio-temporal
mapping of disease rates. Journal of the American Statistical Association, 92(438):607–
617.
Wang, P., Glowacki, M. N., Hoet, A. E., Needham, G. R., Smith, K. A., Gary, R. E., and
Li, X. (2014). Emergence of Ixodes scapularis and Borrelia burgdorferi, the Lyme disease
vector and agent, in Ohio. Frontiers in Cellular and Infection Microbiology, 4.
Yue, Y. and Speckman, P. L. (2010). Nonstationary spatial Gaussian Markov random fields.
Journal of Computational and Graphical Statistics, 19(1):96–116.
Zhang, H. and El-Shaarawi, A. (2009). On spatial skew-Gaussian processes and applications.
Environmetrics, 21(1):33–47.
Figure 1: Observed seroprevalence of B. burgdorferi, aggregated over January 2012 to De-
cember 2016. White counties are those that did not report any test results.
Figure 2: The true $\tilde{\beta}_1$ surface used to generate 500 independent data sets in the simulation
example.
Figure 3: Summary of the posterior estimates of $\tilde{\beta}_1$ obtained in the simulation example.
Presented results include the sample mean of the posterior estimates (top row), empirical
bias (middle row), and empirical mean squared error (bottom row). From left to right the
columns correspond to the use of a 4 × 4, 5 × 5, and 7 × 7 grid of knots.
Figure 4: Raw reported canine seroprevalences in 2012 (top) and 2016 (bottom). White
counties did not report any tests.
Figure 5: Estimate of the regional trend $\tilde{\beta}_1$ from the seasonal model (8) (top) and
nonseasonal model (9) (bottom) used to analyze the seroprevalence data.
Figure 6: County-level trends. The top graphic displays the posterior mean estimate of $\upsilon_s$
from model (8), and the bottom from model (9). Each map legend has ten classes, ranging from
-3.88 to 2.77 in one panel and from -3.90 to 2.86 in the other.
Figure 7: Counties where $\upsilon_s$ was significantly positive at the 95% confidence level. The top
graphic corresponds to model (8), and bottom to model (9). Legend categories: Not Significantly
Increasing; Significantly Increasing.
... Given the close association between human and canine vector-borne diseases (14)(15)(16), the rising number of cases observed in humans and the documented expansion of the vector range raise the question as to whether canine vectorborne disease is increasing as well. Recently, a Bayesian spatio-temporal binomial regression model was developed to describe the temporal trends in these seroprevalence data (17). In that study, the canine seroprevalence of B. burgdorferi was found to be increasing. ...
... Such a covariance function allows the strength of the correlation between two observations to decrease as the distance between them increases (see Figure 2). For additional details, including information regarding the selection of the knot locations, see the Appendix and (17). For more on Gaussian predictive processes, see (24). ...
... First, the data under study presents with county-month pairs for which no tests are reported; e.g., see Figures 1A,B. The spatio-temporal structure in the selected model makes it robust to this sort of missing data, as was demonstrated in (17). Second, to implement the proposed model, a custom MCMC sampling routine was developed and coded in R. ...
Article
Full-text available
In 2019, in the United States, over 220,000 and 350,000 dogs tested positive for exposure to Anaplasma spp. and Borrelia burgdorferi, respectively. To evaluate regional and local temporal trends of pathogen exposure we used a Bayesian spatio-temporal binomial regression model, analyzing serologic test results for these pathogens from January 2013 to December 2019. Regional trends were not static over time, but rather increased within and beyond the borders of historically endemic regions. Increased seroprevalence was observed as far as North Carolina and North Dakota for both pathogens. Local trends were estimated to evaluate the heterogeneity of underlying changes. A large cluster of counties with increased B. burgdorferi seroprevalence centered around West Virginia, while a similar cluster of counties with increased Anaplasma spp. seroprevalence centered around Pennsylvania and extended well into Maine. In the Midwest, only a small number of counties experienced an increase in seroprevalence; instead, most counties had a decrease in seroprevalence for both pathogens. These trends will help guide veterinarians and pet owners in adopting the appropriate preventative care practices for their area. Additionally, B. burgdorferi and A. phagocytophilum cause disease in humans. Dogs are valuable sentinels for some vector-borne pathogens, and these trends may help public health providers better understand the risk of exposure for humans.
... , n, such as a sum or average of individuals in the area. For instance, Self et al. (2018) investigate regional trends of occurrence of Lyme disease, where the data are the number of positive disease cases observed in each county in the United Brown et al. (2014), who consider functional magnetic resonance imaging data in which each y i quantifies the neuronal changes associated with an experiment observed in the i th three-dimensional pixel in a brain image, where the goal is to identify those areas exhibiting statistically significant changes. Waller et al. (1997) estimate spatially-varying risks of developing lung cancer using reported deaths in each county of the state of Ohio. ...
... If x satisfies this property, then x is said to be a Markov random field (MRF). MRFs are useful tools in a variety of challenging applications, including disease mapping (Waller et al., 1997;Self et al., 2018), medical imaging (Higdon, 1998;Brown et al., 2014), and gene microarray analysis (Xiao et al., 2009;Brown et al., 2017a). Even autoregressive time series models are instances of Markov random fields; though this work is primarily motivated by models for spatially-indexed data in which there is no clear direction of influence. ...
Preprint
Gaussian Markov random fields (GMRFs) are popular for modeling dependence in large areal datasets due to their ease of interpretation and computational convenience afforded by the sparse precision matrices needed for random variable generation. Typically in Bayesian computation, GMRFs are updated jointly in a block Gibbs sampler or componentwise in a single-site sampler via the full conditional distributions. The former approach can speed convergence by updating correlated variables all at once, while the latter avoids solving large matrices. We consider a sampling approach in which the underlying graph can be cut so that conditionally independent sites are updated simultaneously. This algorithm allows a practitioner to parallelize updates of subsets of locations or to take advantage of `vectorized' calculations in a high-level language such as R. Through both simulated and real data, we demonstrate computational savings that can be achieved versus both single-site and block updating, regardless of whether the data are on a regular or an irregular lattice. The approach provides a good compromise between statistical and computational efficiency and is accessible to statisticians without expertise in numerical analysis or advanced computing.
... The geographical distribution of tick species is changing, and thus the risk of co-infections is changing. This is reflected in the changes in the distribution and prevalence of canine VBP, highlighting the need for contemporary data [5,8,[20][21][22][23]. In the Southeastern United States, co-infections between D. immitis and Ehrlichia spp. ...
... Additional studies are needed to investigate the possible transmission [24,25,35]. Continued studies on the distribution of VBP are warranted, as changes in the distribution and density of vectors and their associated pathogens have been noted in recent years, which may be related to several factors such as climate or habitat changes [5,8,[20][21][22][23]36]. Additionally, novel vectors (e.g., Asian longhorned tick, Haemaphysalis longicornis) have been introduced into the United States, and this tick may alter the native pathogen transmission dynamics [37][38][39][40][41]. Consistent with previous studies, dogs that were ≥ 1 year of age had an increased risk of being positive [7,15]. ...
Article
Full-text available
Background Vector-borne infections pose significant health risks to humans, domestic animals, and wildlife. Domestic dogs (Canis lupus familiaris) in the United States may be infected with and serve as sentinel hosts for several zoonotic vector-borne pathogens. In this study, we analyzed the geographical distribution, risk factors, and co-infections associated with infection with Ehrlichia spp., Anaplasma spp., Borrelia burgdorferi, and Dirofilaria immitis in shelter dogs in the Eastern United States. Methods From 2016 to 2020, blood samples from 3750 shelter dogs from 19 states were examined with IDEXX SNAP® 4Dx® Plus tests to determine the seroprevalence of infection with tick-borne pathogens and infection with D. immitis. We assessed the impact of factors including age, sex, intact status, breed group, and location on infection using logistic regression. Results The overall seroprevalence of D. immitis was 11.2% (n = 419/3750), the seroprevalence of Anaplasma spp. was 2.4% (n = 90/3750), the seroprevalence of Ehrlichia spp. was 8.0% (n = 299/3750), and the seroprevalence of B. burgdorferi was 8.9% (n = 332/3750). Regional variation in seroprevalence was noted: D. immitis (17.4%, n = 355/2036) and Ehrlichia spp. (10.7%, n = 217/2036) were highest in the Southeast while seroprevalence for B. burgdorferi (19.3%, n = 143/740) and Anaplasma spp. (5.7%, n = 42/740) were highest in the Northeast. Overall, 4.8% (n = 179/3750) of dogs had co-infections, the most common of which were D. immitis/Ehrlichia spp. (1.6%, n = 59/3750), B. burgdorferi/Anaplasma spp. (1.5%, n = 55/3750), and B. burgdorferi/Ehrlichia spp. (1.2%, n = 46/3750). Risk factors significantly influenced infection across the evaluated pathogens were location and breed group. All evaluated risk factors were significant for the seroprevalence of D. immitis antigens. 
Conclusions Our results demonstrate a regionally variable risk of infection with vector-borne pathogens in shelter dogs throughout the Eastern United States, likely due to varying distributions of vectors. However, as many vectors are undergoing range expansions or other changes in distribution associated with climate and landscape change, continued vector-borne pathogen surveillance is important for maintaining reliable risk assessment. Graphical Abstract
... The geographical distribution of tick species is changing, and thus the risk of co-infections is changing. This is re ected in the changes in the distribution and prevalence of canine VBP, highlighting the need for contemporary data [5,8,[20][21][22][23]. In the southeastern United States, co-infections between D. immitis and Ehrlichia spp. ...
... In addition, we noted a higher prevalence of B. burgorferi infection in dogs from Virginia compared to past studies [11] which corresponds with reported changes in the distribution of this pathogen in dogs and people and its vector in Virginia [8, [31][32][33][34]. Continued studies on the distribution of VBP are warranted as changes in the distribution and density of vectors and their associated pathogens has been noted in recent years which may be related to several factors such as climate or habitat changes [5,8,[20][21][22][23]35]. Additionally, novel vectors (e.g., Asian longhorned tick, Haemaphysalis longicornis) have been introduced into the United States and this tick may alter the native pathogen transmission dynamics [36][37][38]. ...
Preprint
Background: Vector-borne infections pose significant health risks to humans, domestic animals, and wildlife. Domestic dogs (Canis lupus familiaris) in the United States may be infected with and serve as sentinel hosts for several zoonotic vector-borne pathogens. In this study, we analyzed the geographic distribution, risk factors, and co-infections associated with infection with Ehrlichia spp., Anaplasma spp., Borrelia burgdorferi, and Dirofilaria immitis in shelter dogs in the eastern United States.
Methods: From 2016–2020, blood samples from 3,750 shelter dogs from 19 states were examined with IDEXX SNAP® 4Dx® Plus tests to determine prevalence of infection with tick-borne pathogens and infection with D. immitis. We assessed the impact of factors including age, sex, intact status, breed group, and location on infection using logistic regression.
Results: Regional variation in detection prevalence was noted: D. immitis (17.4%, n = 355/2,036) and Ehrlichia spp. (10.7%, n = 217/2,036) were highest in the Southeast while prevalence for B. burgdorferi (19.3%, n = 143/740) and Anaplasma spp. (5.7%, n = 42/740) were highest in the Northeast. Overall, 4.8% (n = 179/3,750) of dogs had co-infections, the most common of which were D. immitis/Ehrlichia spp. (1.6%, n = 59/3,750), B. burgdorferi/Anaplasma spp. (1.5%, n = 55/3,750), and B. burgdorferi/Ehrlichia spp. (1.2%, n = 46/3,750). The risk factors that significantly influenced infection across the evaluated pathogens were location and breed group. All evaluated risk factors were significant for the prevalence of D. immitis antigens.
Conclusions: Our results demonstrate a regionally variable risk of infection with vector-borne pathogens in shelter dogs throughout the eastern United States, likely due to varying distributions of vectors. However, as many vectors are undergoing range expansions or other changes in distribution associated with climate and landscape change, continued vector-borne pathogen surveillance is important for maintaining reliable risk assessment.
... While Lyme disease is more debilitating in humans, dogs and other domestic animals can also contract Borrelia infections (Self et al., 2018). In fact, canines can act as sentinels of Lyme disease; testing for exposure to B. burgdorferi in dogs can be done during annual wellness visit examinations by veterinarians (Self et al., 2018). Veterinarians can then act as stewards of public health by reporting seroprevalence and cases of Lyme Borreliosis to the local public health department, helping to effectively track the spread of Ixodes ticks and B. burgdorferi as climate change advances. ...
Article
PICO question: Do wild coyotes in the US that are in an urban habitat compared to a rural habitat have a higher prevalence of Borrelia burgdorferi seroconversion?
Clinical bottom line
Category of research question: Prevalence
The number and type of study designs reviewed: Two papers, both utilising a cross-sectional study design
Strength of evidence: Zero
Outcomes reported: The relevant studies provide very limited to no evidence towards answering this PICO question. In one, while the absolute percentage of Borrelia-antibody-positive canines (including dogs in addition to coyotes) is higher in metropolitan areas, the effect was not found to be statistically significant, possibly due to their small sample sizes. In the second study, prevalence of antibodies against Borrelia was compared between different rural habitats, but no urban coyotes were tested as a comparison and thus the PICO question cannot be evaluated.
Conclusion: There is a knowledge gap concerning the prevalence of Borrelia in coyotes and how it differs between urban and rural environments. Wild coyotes could be used as a sentinel species of Lyme disease activity and to assess potential for domestic pet and human infections, which would inform clinical differential diagnoses as well as testing and vaccination recommendations. More studies are needed before this PICO question can be answered in a confident manner.
How to apply this evidence in practice: The application of evidence into practice should take into account multiple factors, not limited to: individual clinical expertise, patient’s circumstances and owners’ values, country, location or clinic where you work, the individual case in front of you, the availability of therapies and resources. Knowledge Summaries are a resource to help reinforce or inform decision making. They do not override the responsibility or judgement of the practitioner to do what is best for the animal in their care.
... In the United States, Lyme disease is an important vector-borne disease, accounting for 82% of tick-borne associated bacterial infections [19], and has spurred increased interest in other pathogenic Borrelia species (e.g., B. miyamotoi) [20]. Although several studies have investigated the prevalence of Borrelia spp. in I. scapularis, few studies have looked at infection in possible avifauna hosts [21][22][23][24][25]. Pennsylvania ranks as one of the leading states for the number of human Lyme disease cases and the number of dogs testing positive for antibodies to B. burgdorferi [19,26,27] (www.capcvet.org/maps). Since 1992, Lyme disease has been an increasing disease concern in Pennsylvania as evidenced by the increase in B. burgdorferi seroprevalence in dogs, reported cases of Lyme disease in humans, and geographic distribution of cases throughout the Commonwealth [7,8,27,28]. ...
Article
The Borrelia genus contains two major clades, the Lyme borreliosis group, which includes the causative agents of Lyme disease/borreliosis (B. burgdorferi sensu stricto and other related B. burgdorferi sensu lato genospecies), and the relapsing fever borreliosis group (B. hermsii, B. turicatae, and B. parkeri). Other unclassified reptile- and echidna-associated Borrelia spp. (i.e., B. turcica and ‘Candidatus Borrelia tachyglossi’, respectively) do not belong in either of these two groups. In North America, Borrelia spp. from both of the major clades are important pathogens of veterinary and public health concern. Lyme disease is of particular interest because the incidence in the northeastern United States continues to increase in both dogs and humans. Birds have a potentially important role in the ecology of Borrelia species because they are hosts for numerous tick vectors and competent hosts for various Borrelia spp. Our goal was to investigate the prevalence of Borrelia spp. in four free-living species of upland game birds in Pennsylvania, USA, including wild turkey (Meleagris gallopavo), ruffed grouse (Bonasa umbellus), ring-necked pheasants (Phasianus colchicus), and American woodcock (Scolopax minor). We tested 205 tissue samples (bone marrow and/or spleen samples) from 169 individuals for Borrelia using a flagellin gene (flab) nested PCR, which amplifies all Borrelia species. We detected Borrelia DNA in 12% (24/205) of samples; the highest prevalence was in wild turkeys (16%; 5/31), followed by ruffed grouse (13%; 16/126) and American woodcock (3%; 1/35). All pheasants (n = 13) were negative. We sequenced amplicons from all positive game birds and all were B. burgdorferi sensu stricto. Our results support previous work indicating that certain species of upland game birds are commonly infected with Borrelia species, but unlike previous studies, we did not find any relapsing fever borreliae.
... This mapping effort has been ongoing in all 50 states since 2012, and has been recently expanded to include Canada. These data have allowed for descriptive studies on the prevalence of exposure [1,2], analysis of the temporal trends in prevalence [21][22][23][24][25], predictive models that forecast the expected prevalence for the upcoming year [26][27][28][29][30], and studies that examine the relationship between Lyme disease in humans and canine exposure to B. burgdorferi [31]. However, common discussions regarding these data and studies revolve around the assumption that the dogs are likely in the care of a veterinarian, owned, and provided some protection for exposure to fleas, ticks, and mosquitoes. ...
Article
Domestic dogs are susceptible to numerous vector-borne pathogens that are of significant importance for their health. In addition to being of veterinary importance, many of these pathogens are zoonotic and thus may pose a risk to human health. In the USA, owned dogs are commonly screened for exposure to or infection with several canine vector-borne pathogens. Although the screening data are widely available to show areas where infections are being diagnosed, testing of owned dogs is expected to underestimate the actual prevalence in dogs that have no access to veterinary care. The goal of this study was to measure the association between the widely available data from a perceived low-risk population with temporally and spatially collected data from shelter-housed dog populations. These data were then used to extrapolate the prevalence in dogs that generally lack veterinary care. The focus pathogens included Dirofilaria immitis, Ehrlichia spp., Anaplasma spp., and Borrelia burgdorferi. There was a linear association between the prevalence of selected vector-borne pathogens in shelter-housed and owned dog populations and, generally, the data suggested that prevalence of heartworm (D. immitis) infection and seroprevalence of Ehrlichia spp. and B. burgdorferi are higher in shelter-housed dogs, regardless of their location, compared with the owned population. The seroprevalence of Anaplasma spp. was predicted to be higher in areas that have very low to low seroprevalence, but unexpectedly, in areas of higher seroprevalence within the owned population, the seroprevalence was expected to be lower in the shelter-housed dog population. If shelters and veterinarians make decisions to not screen dogs based on the known seroprevalence of the owned group, they are likely underestimating the risk of exposure. This is especially true for heartworm. 
With this new estimate of the seroprevalence in shelter-housed dogs throughout the USA, shelters and veterinarians can make evidence-based informed decisions on whether testing and screening for these pathogens is appropriate for their local dog population. This work represents an important step in understanding the relationships in the seroprevalences of vector-borne pathogens between shelter-housed and owned dogs, and provides valuable data on the risk of vector-borne diseases in dogs.
... As recently shown by Soliman et al. (2019), the DFFN models tend to deliver more competitive predictive performance than RF and BR. Remarkably, while spatio-temporal epidemiology of (re)emerging infectious diseases has been extensively studied in the environmetrics community before (see, for instance, Nobre, Schmidt, & Lopes, 2005; Loh, 2011; Self et al., 2018, and references therein) and despite the increasing popularity of DL tools (McDermott & Wikle, 2019) in the spatio-temporal environmetrics applications, utility of DL approaches in infectious epidemiology remains largely unexplored. ...
Preprint
As per the records of the World Health Organization, the first formally reported incidence of Zika virus occurred in Brazil in May 2015. The disease then rapidly spread to other countries in the Americas and East Asia, affecting more than 1,000,000 people. Zika virus is primarily transmitted through bites of infected mosquitoes of the species Aedes (Aedes aegypti and Aedes albopictus). The abundance of mosquitoes and, as a result, the prevalence of Zika virus infections are common in areas which have high precipitation, high temperature, and high population density. Nonlinear spatio-temporal dependency of such data and lack of historical public health records make prediction of the virus spread particularly challenging. In this article, we enhance Zika forecasting by introducing the concepts of topological data analysis and, specifically, persistent homology of atmospheric variables, into the virus spread modeling. The topological summaries allow for capturing higher order dependencies among atmospheric variables that otherwise might be unassessable via conventional spatio-temporal modeling approaches based on geographical proximity assessed via Euclidean distance. We introduce a new concept of cumulative Betti numbers and then integrate the cumulative Betti numbers as topological descriptors into three predictive machine learning models: random forest, generalized boosted regression, and deep neural network. Furthermore, to better quantify various sources of uncertainty, we combine the resulting individual model forecasts into an ensemble of the Zika spread predictions using Bayesian model averaging. The proposed methodology is illustrated in application to forecasting of the Zika space-time spread in Brazil in the year 2018.
Article
As per the records of the World Health Organization, the first formally reported incidence of Zika virus occurred in Brazil in May 2015. The disease then rapidly spread to other countries in the Americas and East Asia, affecting more than 1,000,000 people. Zika virus is primarily transmitted through bites of infected mosquitoes of the species Aedes (Aedes aegypti and Aedes albopictus). The abundance of mosquitoes and, as a result, the prevalence of Zika virus infections are common in areas which have high precipitation, high temperature, and high population density. Nonlinear spatio‐temporal dependency of such data and lack of historical public health records make prediction of the virus spread particularly challenging. In this article, we enhance Zika forecasting by introducing the concepts of topological data analysis and, specifically, persistent homology of atmospheric variables, into the virus spread modeling. The topological summaries allow for capturing higher order dependencies among atmospheric variables that otherwise might be unassessable via conventional spatio‐temporal modeling approaches based on geographical proximity assessed via Euclidean distance. We introduce a new concept of cumulative Betti numbers and then integrate the cumulative Betti numbers as topological descriptors into three predictive machine learning models: random forest, generalized boosted regression, and deep neural network. Furthermore, to better quantify various sources of uncertainty, we combine the resulting individual model forecasts into an ensemble of the Zika spread predictions using Bayesian model averaging. The proposed methodology is illustrated in application to forecasting of the Zika space‐time spread in Brazil in the year 2018.
Article
We develop a Spatial Effect Detection Regression (SEDR) model to capture the nonlinear and irregular effects of high-dimensional spatio-temporal predictors on a scalar outcome. Specifically, we assume that both the component and the coefficient functions in the SEDR are unknown smooth functions of location and time. This allows us to leverage spatially and temporally correlated information, transforming the curse of dimensionality into a blessing, as confirmed by our theoretical and numerical results. Moreover, we introduce a set of 0–1 regression coefficients to automatically identify the boundaries of the spatial effect, implemented via a novel penalty. A simple iterative algorithm, with explicit forms at each update step, is developed, and we demonstrate that it converges from the initial values given in the paper. Furthermore, we establish the convergence rate and selection consistency of the proposed estimator under various scenarios involving dimensionality and the effect space. Through simulation studies, we thoroughly evaluate the superior performance of our method in terms of bias and empirical efficiency. Finally, we apply the method to analyse and forecast data from environmental monitoring and Alzheimer’s Disease Neuroimaging Initiative study, revealing interesting findings and achieving smaller out-of-sample prediction errors compared to existing methods.
Article
In 2015 the US federal government sponsored a dengue forecasting competition using historical case data from Iquitos, Peru and San Juan, Puerto Rico. Competitors were evaluated on several aspects of out-of-sample forecasts including the targets of peak week, peak incidence during that week, and total season incidence across each of several seasons. Our team was one of the winners of that competition, outperforming other teams in multiple targets/locales. In this paper we report on our methodology, a large component of which, surprisingly, ignores the known biology of epidemics at large—for example, relationships between dengue transmission and environmental factors—and instead relies on flexible nonparametric nonlinear Gaussian process (GP) regression fits that “memorize” the trajectories of past seasons, and then “match” the dynamics of the unfolding season to past ones in real-time. Our phenomenological approach has advantages in situations where disease dynamics are less well understood, or where measurements and forecasts of ancillary covariates like precipitation are unavailable, and/or where the strength of association with cases are as yet unknown. In particular, we show that the GP approach generally outperforms a more classical generalized linear (autoregressive) model (GLM) that we developed to utilize abundant covariate information. We illustrate variations of our method(s) on the two benchmark locales alongside a full summary of results submitted by other contest competitors.
Article
Current remote sensing image classification problems have to deal with an unprecedented amount of heterogeneous and complex data sources. Upcoming missions will soon provide large data streams that will make land cover/use classification difficult. Machine learning classifiers can help at this, and many methods are currently available. A popular kernel classifier is the Gaussian process classifier (GPC), since it approaches the classification problem with a solid probabilistic treatment, thus yielding confidence intervals for the predictions as well as very competitive results to state-of-the-art neural networks and support vector machines. However, its computational cost is prohibitive for large scale applications, and constitutes the main obstacle precluding wide adoption. This paper tackles this problem by introducing two novel efficient methodologies for Gaussian Process (GP) classification. We first include the standard random Fourier features approximation into GPC, which largely decreases its computational cost and permits large scale remote sensing image classification. In addition, we propose a model which avoids randomly sampling a number of Fourier frequencies, and alternatively learns the optimal ones within a variational Bayes approach. The performance of the proposed methods is illustrated in complex problems of cloud detection from multispectral imagery and infrared sounding data. Excellent empirical results support the proposal in both computational cost and accuracy.
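The random Fourier features approximation mentioned in the abstract above can be illustrated with a minimal sketch. This is not the cited authors' implementation: it assumes a squared-exponential (RBF) kernel with a single lengthscale, and the names `rbf_kernel`, `make_rff`, and `dot` are hypothetical helpers introduced only for illustration. The idea is to replace the kernel with an inner product of a finite number of random cosine features, so that downstream computations scale with the number of features rather than the number of observations.

```python
import math
import random


def rbf_kernel(x, y, lengthscale=1.0):
    """Exact RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


def make_rff(dim, n_features, lengthscale=1.0, seed=0):
    """Build a random feature map z such that z(x) . z(y) approximates rbf_kernel(x, y).

    Frequencies are drawn from N(0, 1 / lengthscale^2), the spectral density of
    the RBF kernel, and phases are drawn uniformly from [0, 2 * pi).
    """
    rng = random.Random(seed)
    freqs = [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(dim)]
             for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return feature_map


def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


# Compare the exact kernel with its random-feature approximation at two points.
phi = make_rff(dim=2, n_features=5000, seed=1)
x, y = [0.3, -0.2], [0.1, 0.4]
exact = rbf_kernel(x, y)
approx = dot(phi(x), phi(y))
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The approximation error shrinks roughly as one over the square root of the number of features, so the feature count trades accuracy against cost.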
Conference Paper
Understanding and predicting how influenza propagates is vital to reduce its impact. In this paper we develop a nonparametric model based on Gaussian process (GP) regression to capture the complex spatial and temporal dependencies present in the data. A stochastic variational inference approach was adopted to address scalability. Rather than modeling the problem as a time series as in many studies, we capture the space-time dependencies by combining different kernels. A kernel averaging technique which converts spatially-diffused point processes to an area process is proposed to model geographical distribution. Additionally, to accurately model the variable behavior of the time-series, the GP kernel is further modified to account for non-stationarity and seasonality. Experimental results on two datasets of statewide US weekly flu-counts consisting of 19,698 and 89,474 data points, ranging over several years, illustrate the robustness of the model as a tool for further epidemiological investigations.
Article
Understanding and predicting how influenza propagates is vital to reduce its impact. In this paper we develop a nonparametric model based on Gaussian process (GP) regression to capture the complex spatial and temporal dependencies present in the data. A stochastic variational inference approach was adopted to address scalability. Rather than modeling the problem as a time-series as in many studies, we capture the space-time dependencies by combining different kernels. A kernel averaging technique which converts spatially-diffused point processes to an area process is proposed to model geographical distribution. Additionally, to accurately model the variable behavior of the time-series, the GP kernel is further modified to account for non-stationarity and seasonality. Experimental results on two datasets of state-wide US weekly flu-counts consisting of 19,698 and 89,474 data points, ranging over several years, illustrate the robustness of the model as a tool for further epidemiological investigations.
Book
Gaussian Markov Random Field (GMRF) models are most widely used in spatial statistics, a very active area of research in which few up-to-date reference works are available. This is the first book on the subject that provides a unified framework of GMRFs with particular emphasis on the computational aspects. This book includes extensive case studies.
Article
The Gaussian process is an indispensable tool for spatial data analysts. The onset of the "big data" era, however, has led to the traditional Gaussian process being computationally infeasible for modern spatial data. As such, various alternatives to the full Gaussian process that are more amenable to handling big spatial data have been proposed. These modern methods often exploit low rank structures and/or multi-core and multi-threaded computing environments to facilitate computation. This study provides, first, an introductory overview of several methods for analyzing large spatial data. Second, this study describes the results of a predictive competition among the described methods as implemented by different groups with strong expertise in the methodology. Specifically, each research group was provided with two training datasets (one simulated and one observed) along with a set of prediction locations. Each group then wrote their own implementation of their method to produce predictions at the given locations, and each implementation was subsequently run on a common computing environment. The methods were then compared in terms of various predictive diagnostics. Supplementary materials regarding implementation details of the methods and code are available for this article online.
Article
The Summary of Notifiable Infectious Diseases and Conditions - United States, 2015 (hereafter referred to as the summary) contains the official statistics, in tabular and graphical form, for the reported occurrence of nationally notifiable infectious diseases and conditions in the United States for 2015. Unless otherwise noted, data are final totals for 2015 reported as of June 30, 2016. These statistics are collected and compiled from reports sent by U.S. state and territories, New York City, and District of Columbia health departments to the National Notifiable Diseases Surveillance System (NNDSS), which is operated by CDC in collaboration with the Council of State and Territorial Epidemiologists (CSTE). This summary is available at https://www.cdc.gov/MMWR/MMWR_nd/index.html. This site also includes summary publications from previous years.