Bayesian hierarchical modelling approaches for combining information from multiple data sources to produce annual estimates of national immunization coverage

C. Edson Utazi
Warren C. Jochem
WorldPop, School of Geography and Environmental Science, University of Southamp-
ton, UK
Marta Gacic-Dobo
World Health Organization, Geneva, Switzerland
Padraic Murphy
United Nations Children’s Fund, New York, USA
Sujit K. Sahu
Mathematical Sciences, University of Southampton, UK
M. Carolina Danovaro-Holliday
World Health Organization, Geneva, Switzerland
Andrew J. Tatem
WorldPop, School of Geography and Environmental Science, University of Southamp-
ton, United Kingdom
Summary. Estimates of national immunization coverage are crucial for guiding pol-
icy and decision-making in national immunization programs and setting the global
immunization agenda. WHO and UNICEF estimates of national immunization cover-
age (WUENIC) are produced annually for various vaccine-dose combinations and all
WHO Member States using information from multiple data sources and a determin-
istic computational logic approach. This approach, however, is incapable of charac-
terizing the uncertainties inherent in coverage measurement and estimation. It also
provides no statistically principled way of exploiting and accounting for the interde-
pendence in immunization coverage data collected for multiple vaccines, countries
and time points. Here, we develop Bayesian hierarchical modeling approaches for
producing accurate estimates of national immunization coverage and their associated
uncertainties. We propose and explore two candidate models: a balanced data
single likelihood (BDSL) model and an irregular data multiple likelihood (IDML) model,
both of which differ in their handling of missing data and characterization of the
uncertainties associated with the multiple input data sources. We provide a simulation
study that demonstrates a high degree of accuracy of the estimates produced by the
proposed models, and which also shows that the IDML model is the better model. We
apply the methodology to produce coverage estimates for select vaccine-dose
combinations for the period 2000-2019. A contributed R package imcover implementing
the No-U-Turn Sampler (NUTS) in the Stan programming language enhances the
utility and reproducibility of the methodology.
Address for correspondence: WorldPop, School of Geography and Environmental Science,
University of Southampton, UK
arXiv:2211.14919v1 [stat.ME] 27 Nov 2022
1. Introduction
Accurate estimates of immunization coverage are required at the global, regional, national
and subnational levels to inform policies and guide interventions aimed at improving cov-
erage levels and accelerating progress towards disease elimination and eradication (Bur-
ton et al., 2009; World Health Organization, 2020; Gavi, The Vaccine Alliance, 2020).
In particular, globally comparable estimates of national immunization coverage (ENIC)
are crucial for profiling countries and understanding where attention should be focused
to strengthen global immunization service delivery. Immunization coverage is also an
important indicator for measuring and monitoring progress towards the goals and targets
set out in global policy frameworks such as the Sustainable Development Goals (SDGs)
(United Nations General Assembly, 2015) and the Immunization Agenda 2030 (IA2030)
(World Health Organization, 2020). Since 1998, the World Health Organization (WHO)
and the United Nations International Children’s Emergency Fund (UNICEF) have jointly
published estimates of national immunization coverage annually for 195 countries and
territories and different vaccine-dose combinations (Burton et al., 2012, 2009; Danovaro-
Holliday et al., 2021).
WHO and UNICEF estimates of national immunization coverage (WUENIC) are pro-
duced through integrating information from multiple data sources using a computational
logic approach (Burton et al., 2009, 2012). Fundamentally, the approach uses determin-
istic ad hoc estimation rules to supplement administrative and official ENIC reported to
WHO and UNICEF (World Health Organization and United Nations Children’s Fund,
2022) with estimates obtained via nationally representative household surveys. The ap-
proach is implemented on a country-by-country basis, enabling the incorporation of expert
knowledge and country-specific adjustments (e.g., vaccine stockouts, conflict and other
data quality assessments). Although the approach encourages reproducibility, representing
an improvement over informal, manual estimation procedures, it is incapable of charac-
terizing the uncertainties inherent in the multiple input data sets and those associated with
the output coverage estimates. Also, the approach provides no statistically principled way
of accounting for and exploiting the different sources of variation or dependence that may
exist in immunization coverage data collected for multiple vaccines, countries and years.
The entire time series produced by the approach is usually updated when new data become
available; however, the approach is unsuited for producing estimates of immunization
coverage for future time points, which may be useful to guide program planning. To
address these challenges, WHO and UNICEF made a call in 2019 for alternative approaches
for producing ENIC.
Bayesian hierarchical modeling (BHM) approaches (Cressie and Wikle, 2011; Sahu,
2022) offer a robust, flexible framework to combine information from multiple data sources
to estimate a phenomenon of interest, whilst accounting for the full range of uncertainty
present in these data sources. BHM approaches have been widely applied in different the-
matic areas ranging from environmental studies to health and medicine (see, e.g., Sahu,
2022). In immunization coverage estimation, the use of BHM approaches has mostly
focused on producing estimates of coverage at a high resolution (e.g., 1 km or 5 km grid
squares covering an area of interest) and for subnational administrative areas (e.g., the
district level), where the intermediate levels of the model relate to characterizing the spatial
and spatiotemporal variation in the data using geostatistical and conditional autoregres-
sive models (Local Burden of Disease Vaccine Coverage Collaborators, 2021; Utazi et al.,
2020b, 2021). Similar applications also exist in the estimation of other health and develop-
ment indicators (e.g., Sahu et al., 2006; Burstein et al., 2019; Giorgi et al., 2021). Whilst
estimates of immunization coverage produced using these approaches are extremely valu-
able for uncovering the spatial heterogeneities in coverage that often exist within countries,
estimates of national immunization coverage and other indicators still remain the bench-
mark for policy and decision making, resource allocation and monitoring and evaluating
progress at the global level.
Here, we develop novel BHM approaches for producing ENIC as an alternative or
a complement to the WUENIC computational logic approach. Similar to the WUENIC
approach, our methodology utilizes multiple input data and enables the incorporation of
expert opinions and judgments, which can be implemented before or during model-fitting.
However, it also crucially provides a mechanism to leverage various sources of
dependence in the input data to improve the estimation of coverage and associated uncertainties.
We propose and explore two candidate models, namely a balanced data single likelihood
(BDSL) model and an irregular data multiple likelihood (IDML) model, which differ
in their handling of missing data and characterization of the uncertainties arising from
the multiple input data sources. The proposed models are implemented for each of the six
WHO regions, but can also be implemented at the country level, although regional models
provide a richer framework to exploit inter-country variation. The methodology is
supported by the development and GitHub publication of a contributed R package imcover
to enhance its utility and to encourage reproducibility.
In what follows, we present and explore the data used in this work. We also outline
the steps taken to process the data, including a recall-bias adjustment for survey data as in
the WUENIC approach (Burton et al., 2009, 2012; Brown et al., 2015). We then proceed
to present the proposed methodology and its implementation in a Bayesian framework us-
ing the Stan package in R (Stan Development Team, 2020). A simulation study assessing
both in-sample and out-of-sample predictive performance of the methodology is presented
in Section 5. Modelled estimates of coverage and associated uncertainties are presented
and discussed in Section 6, as well as comparisons with corresponding WUENIC
estimates (2020 revision published in 2021). In Section 7, a description of the
accompanying imcover software package is provided. We conclude with a discussion on both the
methodology and the modelled outputs and outline directions for future work.
2. Data description and processing
We assembled publicly available, aggregate data on national immunization coverage from
the WHO Immunization Data Portal (accessed on March 2, 2022). The data portal provides
access to information from the WHO/UNICEF Joint Reporting Form on Immunization (JRF)
(World Health Organization and United Nations Children’s Fund, 2022). We collect data
from three main sources of information on national-level immunization coverage:
(i) reported administrative coverage data (admin);
(ii) country-reported official coverage estimates (official);
(iii) household surveys of vaccination coverage (survey).
Administrative coverage data are estimates of vaccination coverage which countries report
annually to WHO and UNICEF through the JRF. Official estimates are coverage reports
which have been independently reviewed by countries against other datasets and repre-
sent their assessment of the most likely coverage. In most cases, official estimates are the
same as administrative estimates, but in some cases, are based on surveys or “corrections”
for known inaccuracies in the admin estimates, e.g., incomplete reporting. The survey
data come from nationally representative household surveys used by WUENIC. There
are three main survey sources: the Expanded Programme on Immunization (EPI) cluster
survey (World Health Organization, 2018) and surveys using previous WHO recommenda-
tions for vaccination coverage surveys (Danovaro-Holliday et al., 2018); the Demographic
and Health Surveys (ICF International, 2022); and the Multiple Indicator Cluster Survey
(MICS) (United Nations Children’s Fund, 2022). For each of these sources (admin, official,
and survey), we obtain annual data on five vaccines for the period 2000-2019:
diphtheria-tetanus-pertussis-containing vaccine doses 1 and 3 (DTP1 and DTP3),
measles-containing vaccine doses 1 and 2 (MCV1 and MCV2), and pneumococcal conjugate
vaccine dose 3 (PCV3).
In addition to the coverage information, we obtain mid-year estimates of countries’
population from the UN Population Division 2019 revision (United Nations, Department of
Economic and Social Affairs, Population Division, 2019). These population data serve as
denominators to estimate the percentage of the population covered by administered vaccine
doses. Therefore, the age cohorts of the population data correspond to the target population
to receive the selected vaccines. In the case of DTP1, DTP3, MCV1, and PCV3, this is
surviving infants (i.e., under 1 year old, even when MCV1 is recommended in the second
year of life in some countries), while the target age for MCV2 depends on the national
immunization schedule (World Health Organization, 2022). Finally, we collect information
on the year of vaccine introduction (yovi) from the WHO repository. Since MCV2 and
PCV3 vaccines were not used across all countries in this period, we restrict our analyses to
periods when vaccines were fully rolled out.
All WHO Member States spread across six WHO regions (namely, African Region
(AFR), Region of the Americas (AMR), Eastern Mediterranean Region (EMR), European
region (EUR), South-East Asian Region (SEAR) and Western Pacific Region (WPR)) and
for which data were available are included in this work (see supplementary Figure 2). After
obtaining the raw datasets described above, a multi-step data cleaning and harmonisation
process was implemented to create analysis-ready data for model fitting. In the first stage
of processing, standard vaccine-source-specific input data files were created for the study
period from the raw data by extracting the relevant data required for the analysis. This was
followed by a recall-bias adjustment step implemented for DTP3 and PCV3 using survey
data (Brown et al., 2015). This is detailed in the supplementary file. Further, to ensure that
modelled estimates of DTP1 and DTP3 were consistent, i.e. that DTP1 is greater than or
equal to DTP3 for all country-vaccine-time combinations, we opted to model DTP1 and
the ratio of DTP3 to DTP1, i.e. DTP3/DTP1 (later on in the modelling step, we converted
these ratios to DTP3 estimates by multiplying corresponding DTP1 and DTP3/DTP1
estimates). To calculate the ratios, we first preserved the differential between DTP1 and
DTP3 where estimates of the former were greater than 100% by calculating the ratios using
the original input data. We then rounded down those estimates greater than 100% to 99.9%
(this is necessary for the logit-transformation step in Section 3) and adjusted the
corresponding DTP3 estimates using the ratios obtained previously. Finally, for each country,
vaccine, time and data source, we scaled the coverage data to the unit interval and
logit-transformed these as a final pre-modelling step. During this step, estimates of other
vaccines greater than 1 were set to 0.999 and, to avoid undefined values in the
transformation, any estimates equal to zero were adjusted to 0.001. The result of these
processing steps is a collection of country-specific immunization coverage estimates from
three major sources of information. The three sources of information (admin, official and
survey) form three time series, although with different levels of variation over time,
different observation time points and different levels of completeness. The challenge of the
proposed modelling approach outlined in Section 3 is to draw information from these
different sources to estimate the true, but unobserved, national immunisation coverage.
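The processing steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the imcover implementation: the function names are hypothetical, DTP1 inputs are assumed to be positive, and proportions exactly equal to 1 are treated like values above 1 (the text only mentions values greater than 1).

```python
import numpy as np

LOWER, UPPER = 0.001, 0.999  # replacement values quoted in the text


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))


def harmonise_dtp(dtp1_pct, dtp3_pct):
    """Cap DTP1 above 100% at 99.9% and rescale DTP3 with the original ratio.

    Hypothetical helper; assumes every DTP1 value is positive.
    Returns (dtp1, dtp3, ratio) with coverage still in percent.
    """
    dtp1 = np.asarray(dtp1_pct, dtype=float)
    dtp3 = np.asarray(dtp3_pct, dtype=float)
    ratio = dtp3 / dtp1                        # DTP3/DTP1 from the original inputs
    dtp1_capped = np.where(dtp1 > 100.0, 99.9, dtp1)
    dtp3_adjusted = ratio * dtp1_capped        # unchanged where DTP1 was not capped
    return dtp1_capped, dtp3_adjusted, ratio


def to_logit_scale(coverage_pct):
    """Scale percentages to the unit interval, fix boundary values, then logit."""
    p = np.asarray(coverage_pct, dtype=float) / 100.0
    p = np.where(p >= 1.0, UPPER, p)   # assumption: exactly 1 is handled like > 1
    p = np.where(p <= 0.0, LOWER, p)   # zeros would give an undefined logit
    return logit(p)
```

For example, `harmonise_dtp([105.0], [100.0])` caps DTP1 at 99.9 and rescales DTP3 to 99.9 × 100/105, so the original differential between the two doses is preserved.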
Summaries of the processed data are presented in Table 1 at the global level and Figure
1 for each WHO region. For illustrations of the patterns of missingness in these data, see
supplementary Figure 3 and Figure 2. In all, about 46% of the data were from administra-
tive sources, while 47% and 7% were from official sources and surveys, respectively. 186
countries had administrative data, 194 countries had official data, while 131 had survey
data for at least one of the five vaccines at any time point during the study period. When
considering data from all sources at the global level, Table 1 shows that PCV3 had the
smallest number of data points, the most variability and the lowest coverage due to be-
ing only universally recommended by WHO in 2009. DTP1 had the highest coverage and
lowest variation, whereas DTP3 and MCV1 had very similar distributions.
Table 1: Summary of processed national immunization coverage data for the period
2000-2019 for all WHO Member States and data sources. Shown in the second column are
numbers of countries with input data and (in parentheses) numbers of non-missing data points.

Vaccine/               No. of countries   Summary statistics (%)
data source            (data points)      Min.    Q1      Med.    Mean    Std. dev.   Q3      Max.
DTP1                   189 (6787)         10.00   89.00   96.00   91.43   11.57       99.00   99.99
DTP3                   189 (6763)          0.92   81.00   91.00   85.75   15.16       96.00   99.98
MCV1                   194 (7577)          2.10   81.00   92.00   86.47   14.88       97.00   99.99
MCV2                   176 (4191)          1.00   78.00   91.52   84.03   18.70       96.97   99.99
PCV3                   148 (2144)          1.00   76.00   89.00   79.97   23.26       95.00   99.99
Admin                  186 (12567)         0.92   83.36   93.00   87.21   15.79       97.40   99.99
Official               194 (12872)         1.00   84.00   93.00   87.44   15.27       97.20   99.99
Survey                 131 (2023)          2.10   67.95   83.90   78.03   19.35       93.00   99.90
All vaccines/sources   194 (27462)         0.92   82.20   92.53   86.64   16.02       97.00   99.99

Further, Figure 1 shows that within each region, coverage data obtained from surveys
were more likely to be lower on average than data obtained from official and administrative
sources (except in the case of PCV3 in SEAR). We also observe that administrative and
official data have very similar distributions as expected. We note that the processing steps
outlined here have been developed in conjunction with the WHO/UNICEF immunization
coverage working group. However, these can be improved upon to allow, for example,
the exclusion of coverage data deemed implausible for estimation based on expert evalua-
tion (unlikely zero estimates of coverage are excluded from the current analysis, although
such estimates could sometimes reflect a vaccine stock-out). The methodology proposed
in Section 3 can produce coverage estimates for all desired cases using available input
data, although with greater uncertainty where input data are missing, hence offering the
flexibility to use the most accurate input data for coverage estimation.
3. The proposed methodology
Our aim is to develop BHM approaches for producing estimates of national immunization
coverage and associated uncertainties from multiple data sources. Here, we propose two
candidate models termed balanced data single likelihood (BDSL) model and irregular data
multiple likelihood (IDML) model. As the name implies, the BDSL model uses a single
likelihood to capture the variability in the data and is considered here as a suitable alterna-
tive against which to compare the IDML model which induces considerable flexibility in
the estimation of the variability in the data through using separate probability distributions.
In general, let $\tilde{p}^{(k)}_{ijt}$ denote the $k$th type estimate of vaccination coverage
(proportion) for the $i$th country ($i = 1, \ldots, C$), $j$th vaccine ($j = 1, \ldots, V$) and
year $t$ ($t = 1, \ldots, T$), where for
(i) $k = a$, $\tilde{p}^{(a)}_{ijt}$ is the administrative estimate,
(ii) $k = o$, $\tilde{p}^{(o)}_{ijt}$ is the official estimate,
(iii) $k = s$, $\tilde{p}^{(s)}_{ijt}$ is the survey estimate.
Figure 2 provides a plot of these estimates in percentage forms.
Fig. 1. Distribution of processed national immunization coverage data by WHO region and
data source.
The three versions of the estimates $\tilde{p}^{(k)}_{ijt}$ are available to us, but the counts and the
denominators corresponding to these estimated proportions are not always available for
modelling. Hence we are not able to use the binomial distribution for modelling the true
vaccination coverage $p_{ijt}$. Instead, we treat the logit-transformed estimates,
$$ y^{(k)}_{ijt} = \mathrm{logit}\left(\tilde{p}^{(k)}_{ijt}\right) = \log\left(\frac{\tilde{p}^{(k)}_{ijt}}{1 - \tilde{p}^{(k)}_{ijt}}\right), $$
as observed data varying over the real line. Note that the logit transformation, used here, is
natural for transforming proportions to the real line. As is well known, possible alternatives
to the logit transformation are the inverse cumulative distribution function (cdf)
transformations such as the probit, i.e., $\tilde{p}_{ijt} = \Phi(y_{ijt})$ where $\Phi(\cdot)$ is the cdf of the
standard normal distribution. In this work, however, we only adopt the logit transformation
since it is able to accommodate more extreme values than the probit transformation. Hence
$0 < \tilde{p}_{ijt} < 1$ for all $i$, $j$ and $t$.
The transformed estimates are then assumed to follow normal-distribution-based
linear models. Indeed, see supplementary Figure 4 where the histograms of the logit and
probit-transformed estimates (panels (b) and (c)) show better bell-shaped curves compared
to the histogram of the proportions (panel (a)) which is negatively skewed as expected. We
also observe deviant peaks in the right tails of the histograms in panels (b) and (c), which
are due to the high frequency of proportions close to 1 in the data.
Before we introduce the models we note that although we model on the logit-transformed
scale, we obtain and report the model based predictions on the original scale of 0 to 100%.
Sampling based Bayesian computation methods also allow us to obtain the uncertainties
of the predictions on the original scale. Details to obtain these predictions are provided in
Section 4.2.
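One generic way to move sampling-based output from the logit scale back to the 0 to 100% scale is sketched below: each posterior draw is back-transformed first and the draws are summarised afterwards. This is an illustration only; the function names are hypothetical, the draws are synthetic rather than model output, and the procedure the paper actually uses is the one given in Section 4.2.

```python
import numpy as np


def inv_logit(x):
    """Inverse of the logit transformation."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))


def summarise_percent(mu_draws, probs=(2.5, 50.0, 97.5)):
    """Summarise posterior draws on the percentage scale.

    mu_draws holds draws on the logit scale, with draws along axis 0.
    Returns the requested percentiles of coverage in percent.
    """
    coverage_pct = 100.0 * inv_logit(mu_draws)   # transform every draw first
    return np.percentile(coverage_pct, probs, axis=0)


# Synthetic draws for illustration only (not output from the models).
rng = np.random.default_rng(1)
draws = rng.normal(loc=2.0, scale=0.3, size=4000)
lower, median, upper = summarise_percent(draws)
```

Because the back-transformation is applied draw by draw, the resulting interval is automatically confined to the 0 to 100% range and is generally asymmetric around the median.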
3.1. Balanced data single likelihood (BDSL) model
We assume that all three estimates $y^{(a)}_{ijt}$, $y^{(o)}_{ijt}$ and $y^{(s)}_{ijt}$ aim to estimate the true
mean $\mu_{ijt}$ but each of these three has its own bias. In this model, these biases are
captured using a source-specific random effect, $\nu^{(k)}$. The BDSL model attempts to model
the observed $y^{(k)}_{ijt}$ as:
$$ y^{(k)}_{ijt} = \mu_{ijt} + \nu^{(k)} + e^{(k)}_{ijt}, $$
$$ \mu_{ijt} = \lambda + \beta_i + \alpha_j + \gamma_t + \phi_{it} + \delta_{jt} + \psi_{ij} + \omega_{ijt}, \qquad (1) $$
where $\mu_{ijt}$ is the true bias-corrected mean, $e^{(k)}_{ijt}$ is an error term assumed to follow the
$N(0, \sigma^2)$ distribution independently and identically for all values of $i$, $j$ and $t$, and
$\lambda$ is the overall mean. Thus, the source-specific term, $\nu^{(k)}$, captures the bias of coverage
estimates from data source $k$ relative to $\lambda$, and is modelled as
$\nu = (\nu^{(a)}, \nu^{(o)}, \nu^{(s)})' \sim N(0, \sigma_\nu^2 I)$. Further, in equation (1), $\beta_i$ is a country
level effect, $\alpha_j$ is a vaccine effect and $\gamma_t$ is a temporal effect for $t = 1, \ldots, T$, where
$T$ denotes all time points being considered in the analysis. These terms capture overall
variation in the data emanating from the different attributes. Additionally, $\phi_{it}$, $\delta_{jt}$ and
$\psi_{ij}$ are country-time, vaccine-time and country-vaccine interaction terms modelling trends
that are specific to each country ($\phi_{it}$) and vaccine ($\delta_{jt}$), and random variation between
each country and vaccine ($\psi_{ij}$). Also, $\omega_{ijt}$ is a country-vaccine-time interaction that
captures trends specific to each country-vaccine combination. Specification of these random
effects is deferred to Section 3.3 below.
We note that the source-specific random effect, $\nu^{(k)}$, is not included in the shared mean
$\mu_{ijt}$ in (1), as it only serves to estimate the biases of the data sources relative to $\mu_{ijt}$
and is hence undesirable in the shared mean. However, we note that model (1) does not
offer much flexibility to penalize the contribution of each data source to the shared mean
since this is only achieved via the parameters $\nu^{(a)}$, $\nu^{(o)}$ and $\nu^{(s)}$, which are the same for
all $i$, $j$ and $t$, and considering that $e^{(k)}_{ijt}$ is modelled as iid irrespective of $k$. Further
discussions on this are provided in Section 3.2.
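To make the structure of model (1) concrete, the following sketch simulates observations for a single country-vaccine series on the logit scale. It is an illustration under assumed values: the function name and all numbers are hypothetical, and the shared mean is simply supplied rather than built from the random effects. It shows the two features noted above: each source offset $\nu^{(k)}$ is constant over time, and a single $\sigma$ governs the noise of all three sources.

```python
import numpy as np


def simulate_bdsl(mu, nu, sigma, rng):
    """Simulate y[k, t] = mu[t] + nu[k] + e[k, t] with e ~ N(0, sigma^2) iid.

    mu    : shared mean on the logit scale, length T
    nu    : source-specific offsets (admin, official, survey), length 3
    sigma : one error standard deviation shared by all sources
    """
    mu = np.asarray(mu, dtype=float)
    nu = np.asarray(nu, dtype=float)
    noise = rng.normal(0.0, sigma, size=(nu.size, mu.size))
    return mu[None, :] + nu[:, None] + noise


# Hypothetical values for illustration only.
rng = np.random.default_rng(0)
mu = np.linspace(1.0, 2.0, 20)          # 20 years of slowly rising coverage
nu = np.array([0.3, 0.25, -0.2])        # assumed offsets for admin, official, survey
y = simulate_bdsl(mu, nu, sigma=0.1, rng=rng)
```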
3.2. Irregular data multiple likelihood (IDML) model
The balanced model, as given in (1), is defined for all possible combinations of the indices
$i$, $j$, $k$ and $t$. Thus in reality, we need $3CVT$ data values where the factor 3 comes from
the three possible values of $k$, viz. admin, official and survey. However, for MCV1 for
example, we only have 65.1% of these data values. Hence the remaining 34.9% must be treated
as missing in our Bayesian modelling.
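The quoted percentage can be checked against the counts in Table 1. The short calculation below assumes that the balanced MCV1 data set covers the 194 countries with MCV1 data, 20 years (2000-2019) and three sources; these inputs reproduce the 65.1% figure, although the text does not state explicitly that this is how it was computed.

```python
def observed_fraction(n_observed, n_countries, n_vaccines, n_years, n_sources=3):
    """Share of the balanced 3 x C x V x T data array that is actually observed."""
    return n_observed / (n_sources * n_countries * n_vaccines * n_years)


# MCV1 counts taken from Table 1: 194 countries and 7577 non-missing data points.
mcv1_observed = observed_fraction(7577, n_countries=194, n_vaccines=1, n_years=20)
mcv1_missing = 1.0 - mcv1_observed
print(f"observed: {mcv1_observed:.1%}, missing: {mcv1_missing:.1%}")
```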
The observed time points, denoted by the index $t$ in $\tilde{p}^{(k)}_{ijt}$, as seen on the horizontal
axis in Figure 2, are very much misaligned for the three types of estimates. In this figure,
the survey and official estimates have been produced only for a few selected years and not
for all the years. This presents a difficult problem in modelling based on balanced regular
time series, as for the BDSL model in Section 3.1, since the data $y^{(k)}_{ijt}$ are not available
for all regularly spaced values of $t$ from 2000 to 2019. The problem is caused by the
presence of a large percentage of missing data. Indeed, if we were to use regular time series
models, then we would have 34.9% of missing data for MCV1 from the three data series,
$y^{(k)}_{ijt}$, $k = a, o$ and $s$. A suitable multiple imputation scheme would be necessary to
properly handle this large percentage of missing data.
Our proposed novel solution to this missing data problem comes through a multiple
likelihood approach based on three different indices $t_1$, $t_2$ and $t_3$ for admin, official
and survey data, respectively, as illustrated in Figure 2. Thus, we model $y^{(a)}_{ijt_1}$, $y^{(o)}_{ijt_2}$
and $y^{(s)}_{ijt_3}$, where the time indices $t_1$, $t_2$ and $t_3$ depend on the country $i$ and vaccine
type $j$ combination. For example, in Figure 2 $t_1$ (for admin estimates) takes the values
1, 6, 7, . . ., implying that the administrative estimates are missing for time points 2, 3, 4,
and 5, and so on. In our modelling development, we simply write down the likelihood
contributions based on the data from the observed time points. The combined likelihood
function then captures all the information contained in the observed data for the underlying
model.
The IDML model is given by:
$$ y^{(a)}_{ijt_1} = \lambda^{(a)} + \mu_{ijt_1} + \epsilon_{ijt_1}, \quad t_1 = 1, \ldots, T_1, $$
$$ y^{(o)}_{ijt_2} = \lambda^{(o)} + \mu_{ijt_2} + \epsilon_{ijt_2}, \quad t_2 = 1, \ldots, T_2, $$
$$ y^{(s)}_{ijt_3} = \lambda^{(s)} + \mu_{ijt_3} + \epsilon_{ijt_3}, \quad t_3 = 1, \ldots, T_3, \qquad (2) $$
$$ \epsilon_{ijt_1} \sim N(0, \sigma_1^2), \quad \epsilon_{ijt_2} \sim N(0, \sigma_2^2), \quad \epsilon_{ijt_3} \sim N(0, \sigma_3^2), $$
where $\epsilon_{ijt_1}$, $\epsilon_{ijt_2}$ and $\epsilon_{ijt_3}$ are error terms and $\sigma_k^2$ for $k = 1, 2$ and 3 are
source-specific error variance parameters. Also, $\lambda^{(a)}$, $\lambda^{(o)}$ and $\lambda^{(s)}$ are source-specific
intercept terms capturing the bias of coverage estimates from data source $k$ relative to the
true bias-corrected mean, $\mu_{ijt}$. The time indices $t_1$, $t_2$ and $t_3$ are possibly unequally
spaced, denoting only the time points for which data are available from a given data source.
Similarly, $T_1$, $T_2$ and $T_3$ are the total numbers of times data are available from the
respective data sources. We note that if, for example, there are no survey data for country
$i$ and vaccine $j$, then $T_3 = 0$.
Fig. 2. Data illustration for the irregular data multiple likelihood model using estimates of
MCV1 coverage for Nigeria.
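The idea of writing likelihood contributions only at observed time points can be sketched as follows for one country-vaccine pair. This is a minimal illustration, assuming independent Gaussian errors and given parameter values; the function names and the dictionary layout are hypothetical and the code is not taken from imcover, where the model is fitted in Stan.

```python
import numpy as np


def normal_logpdf(y, mean, sd):
    """Log-density of N(mean, sd^2) evaluated at y."""
    return -0.5 * np.log(2.0 * np.pi * sd**2) - 0.5 * ((y - mean) / sd) ** 2


def idml_loglik(mu, obs, lam, sigma):
    """Combined Gaussian log-likelihood over the observed time points only.

    mu    : shared mean on the logit scale for all T time points
    obs   : dict mapping source -> (observed time indices, 0-based; y values)
    lam   : dict mapping source -> source-specific intercept
    sigma : dict mapping source -> source-specific error standard deviation
    """
    mu = np.asarray(mu, dtype=float)
    total = 0.0
    for source, (t_idx, y) in obs.items():
        t_idx = np.asarray(t_idx, dtype=int)
        if t_idx.size == 0:          # no data from this source (T_k = 0)
            continue
        mean = lam[source] + mu[t_idx]
        total += normal_logpdf(np.asarray(y, dtype=float), mean, sigma[source]).sum()
    return total
```

Missing time points never enter the sum, so no imputed values are needed; a source with no data at all simply contributes nothing.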
The shared mean for model (2) is given by
$$ \mu_{ijt} = \beta_i + \alpha_j + \gamma_t + \phi_{it} + \delta_{jt} + \psi_{ij} + \omega_{ijt}, \quad t = 1, \ldots, T. \qquad (3) $$
Hence, the IDML model also brings out the novel feature that to estimate $\mu_{ijt}$, we are
directly able to combine information from all the available sources for that specific $i$
(country), $j$ (vaccine) and $t$ (time), with appropriate relative weighting as estimated by the
variance components $\sigma_1^2$, $\sigma_2^2$ and $\sigma_3^2$. Another likely advantage is that unlike the BDSL
model, the IDML model does not need to estimate the input data for the missing cases, e.g.
the admin estimate $y^{(a)}_{ijt}$ for MCV1 in Nigeria in 2006. Thus, the IDML model yields a lower
dimensional parameter space, which is advantageous in MCMC based Bayesian computation.
We note that the shared mean in (3) does not include the source-specific intercept terms
$\lambda^{(a)}$, $\lambda^{(o)}$ and $\lambda^{(s)}$. This is because these terms play a similar role as $\nu^{(k)}$ in the BDSL
model in accounting for the biases arising from the various data sources and are also not
desirable in the true, bias-corrected mean. However, unlike $\nu^{(k)}$ which is modelled as
a random effect, these terms are modelled as fixed effects. Also, we note that unlike (1)
which includes an overall intercept term $\lambda$, the shared mean for the IDML model in (3) does
not include an intercept term, which is a direct consequence of the different approaches
adopted in accounting for the biases arising from the data sources in both models.
In model (2), the data sources are weighted by their respective variance parameters
$\sigma_1^2$, $\sigma_2^2$ and $\sigma_3^2$, which in turn determine their contributions to the overall mean
function in equation (3). The influence of each data source in the model can thus be
controlled by adjusting the values of these parameters either directly or through the prior
distributions placed on them. For example, to increase the influence of survey estimates in
the model, an informative prior that encourages smaller values relative to $\sigma_1^2$ and $\sigma_2^2$
could be placed on $\sigma_3^2$. In contrast, the BDSL specification does not provide a mechanism
for this adjustment since it assumes that $\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma^2$. We investigated assigning
separate prior distributions to $\nu^{(a)}$, $\nu^{(o)}$ and $\nu^{(s)}$, but this did not lead to any meaningful
changes in the modelled estimates obtained from the shared mean in (1). Hence, the IDML
specification has more flexibility in terms of handling contributions from the data sources
to the likelihood than the BDSL model.
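The weighting role of the variance parameters can be illustrated with a simplified calculation. The sketch below considers a single country, vaccine and year, treats the intercepts and standard deviations as known, and ignores the hierarchical prior on the shared mean (equivalent to a flat prior). Under these assumptions, which are ours and not the paper's, the estimate of the shared mean is the inverse-variance weighted average of the bias-corrected observations. The function name and numbers are hypothetical.

```python
import numpy as np


def precision_weighted_mean(y, lam, sigma):
    """Inverse-variance weighted estimate of the shared mean at one time point.

    y     : observations from the available sources (logit scale)
    lam   : matching source-specific intercepts (assumed known here)
    sigma : matching source-specific error standard deviations
    Returns the estimate and the normalised weight of each source.
    """
    y, lam, sigma = (np.asarray(a, dtype=float) for a in (y, lam, sigma))
    precision = 1.0 / sigma**2
    estimate = np.sum(precision * (y - lam)) / np.sum(precision)
    return estimate, precision / np.sum(precision)


# Hypothetical values: admin, official, survey. A small survey variance
# pulls the estimate towards the survey value.
estimate, weights = precision_weighted_mean(
    y=[1.0, 2.0, 3.0], lam=[0.0, 0.0, 0.0], sigma=[1.0, 1.0, 0.1]
)
```

With these numbers the survey receives a weight of 100/102, which is the mechanism by which a prior favouring a small $\sigma_3^2$ increases the influence of survey data. In the full model the random effects in (3) also contribute, so the actual weighting is not this simple.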
It is straightforward to derive a country-level model from equations (1) and (2) by
dropping the $i$ subscript and excluding the terms $\beta_i$, $\phi_{it}$, $\psi_{ij}$ and $\omega_{ijt}$ from the model.
Although such models can be implemented by individual countries, these are of less interest
in the current work as they do not allow borrowing strength across countries.
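Written out, the country-level versions obtained by this reduction would take the following form. This is our own expansion of the stated rule (drop the subscript $i$ and the four country-related terms), using the same symbols as equations (1) to (3); it is not reproduced from the paper.

```latex
% Country-level BDSL model, from equation (1)
y^{(k)}_{jt} = \mu_{jt} + \nu^{(k)} + e^{(k)}_{jt}, \qquad
\mu_{jt} = \lambda + \alpha_j + \gamma_t + \delta_{jt}.

% Country-level IDML model, from equations (2) and (3)
y^{(a)}_{jt_1} = \lambda^{(a)} + \mu_{jt_1} + \epsilon_{jt_1}, \qquad
y^{(o)}_{jt_2} = \lambda^{(o)} + \mu_{jt_2} + \epsilon_{jt_2}, \qquad
y^{(s)}_{jt_3} = \lambda^{(s)} + \mu_{jt_3} + \epsilon_{jt_3},
\qquad
\mu_{jt} = \alpha_j + \gamma_t + \delta_{jt}.
```

Only the vaccine effect, the temporal effect and their interaction remain in the shared mean, which makes explicit why no information is shared between countries.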
3.3. Specification of the random effects
We assume that the main effects $\beta_i$ and $\alpha_j$ are each iid normal random variables with
mean zero and variances $\sigma_\beta^2$ and $\sigma_\alpha^2$, respectively. We note that it suffices to model
country-level variation as random as previous work (Utazi et al., 2020a) showed a lack of
significant spatial dependence in the data. We assume a first-order autoregressive (AR(1))
model for the temporal effect $\gamma_t$. That is,
$$ \gamma_t \sim N(\rho_\gamma \gamma_{t-1}, \sigma_\gamma^2), $$
with $\gamma_1 \sim N(0, \sigma_\gamma^2/(1 - \rho_\gamma^2))$. The country-time interaction effect is modelled as
$$ \phi \sim N\left(0, \sigma_\phi^2 [Q_\phi(\rho_\phi)]^{-}\right), $$
where $\phi = (\phi_{11}, \ldots, \phi_{1T}, \ldots, \phi_{C1}, \ldots, \phi_{CT})'$ and $Q_\phi(\rho_\phi)$ is a $CT \times CT$
structured matrix (Clayton, 1996; Knorr-Held, 2000) specifying the nature of interdependence
between the elements of $\phi$, $\sigma_\phi^2$ is an unknown precision parameter, $\rho_\phi$ is an
autoregressive parameter and $[\cdot]^{-}$ denotes the Moore-Penrose generalized inverse of a matrix.
Following Clayton (1996), $Q_\phi(\rho_\phi)$ can be factorised as the Kronecker product of the
structure matrices of the interacting main effects $\beta_i$ and $\gamma_t$. Given the nature of these
terms, here we assume a Type II interaction (Knorr-Held, 2000) such that
$Q_\phi(\rho_\phi) = I_C \otimes R_\gamma$, where $I_C$ is an identity matrix and $R_\gamma$ is the neighbourhood
structure of an AR(1) process. This assumption implies that the temporal parameters for
each country $\phi_{i,1:T} = (\phi_{i1}, \ldots, \phi_{iT})$ follow an AR(1) process independent of all other
countries. In other words, this two-way interaction term captures temporal trends that are
different from country to country and do not have any spatial structure. Similarly, the
vaccine-time interaction is modelled as
where δ= (δ11, . . . , δ1T, . . . , δV1, . . . , δV T )0,σ2
δis also an unknown precision param-
eter and ρδis an autoregressive parameter. We also assume a Type II interaction for this
parameter such that Qδ(ρδ) = IVRγ. Here again, the structure of Qδimplies that
the temporal pattern for vaccine jrepresented by δj,1:T= (δj1, . . . , δjT )is independent
of those of other vaccines. Next, ψij represents the interaction of the iid terms βiand αj,
hence we assume a Type I interaction (Knorr-Held, 2000) for it, such that
ψN(0, σ2
ψICV ),
where ψ= (ψ11, . . . , ψ1V, . . . , ψC1, . . . , δC V )0σ2
ψis a variance parameter. This interac-
tion term models additional unstructured country-vaccine variation. Lastly, the three-way
interaction term, ωijt , models temporal trends specific to each country-vaccine combina-
tion. We assume that
where ω= (ω111, . . . , ωC V T )0and all the parameters are as defined previously. We also
assume a Type-II-like interaction for ωsuch that Qω(ρω) = ICIVRγ=ICV Rγ,
which implies that the temporal structure given to country iand vaccine j, represented by
ωij,1:T= (ωij1, . . . , ωijT ), is independent of those of other country-vaccine combina-
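The Kronecker structure of a Type II interaction can be illustrated with a minimal sketch. The tridiagonal form used for the AR(1) structure matrix below (unit innovation variance, stationary start) is an assumption made for illustration, as are the helper names `ar1_structure`, `identity` and `kron`; none of them come from the imcover package.

```python
def ar1_structure(T, rho):
    """Tridiagonal precision structure of a stationary AR(1) process
    with unit innovation variance (assumed form, for illustration)."""
    R = [[0.0] * T for _ in range(T)]
    for t in range(T):
        # End points carry 1 on the diagonal, interior points 1 + rho^2.
        R[t][t] = 1.0 if t in (0, T - 1) else 1.0 + rho ** 2
        if t + 1 < T:
            R[t][t + 1] = R[t + 1][t] = -rho
    return R


def identity(n):
    """n x n identity matrix as a list of lists."""
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def kron(A, B):
    """Kronecker product of two dense matrices stored as lists of lists."""
    p, q = len(B), len(B[0])
    out = [[0.0] * (len(A[0]) * q) for _ in range(len(A) * p)]
    for i, row_a in enumerate(A):
        for j, a in enumerate(row_a):
            for k in range(p):
                for l in range(q):
                    out[i * p + k][j * q + l] = a * B[k][l]
    return out


# Hypothetical small example: C = 2 countries, T = 3 time points.
C, T, rho = 2, 3, 0.5
Q_phi = kron(identity(C), ar1_structure(T, rho))
```

The result is block diagonal, with one AR(1) block per country and zeros between blocks, which is the sense in which each country's temporal effects evolve independently of the others.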
4. Bayesian inference and computation
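In this section, we describe details of implementation of the proposed models in a Bayesian
setting. Let $y$ denote all observed data. Also, let
$$y^{(a)} = (y^{(a)}_{111}, \dots, y^{(a)}_{CVT_1})', \quad y^{(o)} = (y^{(o)}_{111}, \dots, y^{(o)}_{CVT_2})' \quad \text{and} \quad y^{(s)} = (y^{(s)}_{111}, \dots, y^{(s)}_{CVT_3})',$$
where $T_1 = T_2 = T_3 = T$ in model (1). For the BDSL model, let $\eta^B = (\lambda, \beta, \alpha, \gamma, \phi, \delta, \psi, \omega, \nu)$
denote a latent field comprising the intercept term and the joint distribution
of all the parameters in the mean model $\mu = (\mu_{111}, \dots, \mu_{CVT})$ (i.e., the random effects)
given in equation (1), $\theta^B_1 = \sigma^2$ denote the variance of the Gaussian observations, and
$$\theta^B_2 = (\sigma^2_\beta, \sigma^2_\alpha, \rho_\gamma, \sigma^2_\gamma, \rho_\phi, \sigma^2_\phi, \rho_\delta, \sigma^2_\delta, \sigma^2_\psi, \rho_\omega, \sigma^2_\omega, \sigma^2_\nu)'$$
denote the hyperparameters of the latent field $\eta^B$, i.e. the variances and autocorrelation
parameters of the random effects. Similarly, for the IDML model, let
$$\eta^I = (\lambda^{(a)}, \lambda^{(o)}, \lambda^{(s)}, \beta, \alpha, \gamma, \phi, \delta, \psi, \omega),$$
$$\theta^I_1 = (\sigma^2_1, \sigma^2_2, \sigma^2_3)' \quad \text{and} \quad \theta^I_2 = (\sigma^2_\beta, \sigma^2_\alpha, \rho_\gamma, \sigma^2_\gamma, \rho_\phi, \sigma^2_\phi, \rho_\delta, \sigma^2_\delta, \sigma^2_\psi, \rho_\omega, \sigma^2_\omega)'.$$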
We complete our Bayesian specification by placing appropriate prior distributions on
the parameters as follows. For the BDSL model, we assume the following prior distributions:
$$\sigma \sim \text{Cauchy}(0, 2)\, I(\sigma > 0),$$
For the IDML model, the prior distributions were:
$$\lambda^{(a)} \sim N(0, v_1); \quad \lambda^{(o)} \sim N(0, v_2); \quad \lambda^{(s)} \sim N(0, v_3);$$
$$\sigma_3 \sim \text{Cauchy}(0, 0.2)\, I(0 < \sigma_3 < 0.4).$$
These prior distributions were chosen based on trial runs, during which we set $v_1 = v_2 = v_3 = 0.25$.
The highly informative truncated Half-Cauchy prior on $\sigma_3$ was chosen to
attribute greater likelihood to survey estimates in the model subject to expert belief and to
adjust for the higher proportions of missingness in this data source. For all other parameters
in both models, we used default uniform priors available in Stan (Stan Development Team, 2020).
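Letting $\theta^B = (\theta^B_1, \theta^B_2)'$, the joint posterior distribution of the BDSL model can be
written as:
$$\begin{aligned}
\pi(\theta^B, \eta^B \mid y) &\propto p(y \mid \eta^B, \theta^B_1) \times p(\eta^B \mid \theta^B_2) \times p(\theta^B) \\
&= p(y^{(a)} \mid \eta^B, \sigma^2) \times p(y^{(o)} \mid \eta^B, \sigma^2) \times p(y^{(s)} \mid \eta^B, \sigma^2) \times p(\beta \mid \theta^B_2) \times p(\alpha \mid \theta^B_2) \times p(\gamma \mid \theta^B_2) \\
&\quad \times p(\phi \mid \theta^B_2) \times p(\delta \mid \theta^B_2) \times p(\psi \mid \theta^B_2) \times p(\omega \mid \theta^B_2) \times p(\nu \mid \theta^B_2) \times p(\theta^B) \\
&\propto \prod_{i=1}^{C} \prod_{j=1}^{V} \prod_{t=1}^{T} \prod_{k=1}^{3} \sigma^{-1} \exp\left\{-\frac{1}{2\sigma^2}\left(y^{(k)}_{ijt} - \mu_{ijt} - \nu^{(k)}\right)^2\right\} \\
&\quad \times \prod_{i=1}^{C} \sigma_\beta^{-1} \exp\left\{-\frac{\beta_i^2}{2\sigma^2_\beta}\right\} \times \prod_{j=1}^{V} \sigma_\alpha^{-1} \exp\left\{-\frac{\alpha_j^2}{2\sigma^2_\alpha}\right\} \\
&\quad \times \sigma_\gamma^{-1} (1 - \rho^2_\gamma)^{1/2} \exp\left\{-\frac{(1 - \rho^2_\gamma)\gamma_1^2}{2\sigma^2_\gamma}\right\} \times \prod_{t=2}^{T} \sigma_\gamma^{-1} \exp\left\{-\frac{(\gamma_t - \rho_\gamma \gamma_{t-1})^2}{2\sigma^2_\gamma}\right\} \\
&\quad \times \left|\sigma^2_\phi [Q_\phi(\rho_\phi)]^{-}\right|^{-1/2} \exp\left\{-\frac{1}{2\sigma^2_\phi}\, \phi' Q_\phi(\rho_\phi)\, \phi\right\} \\
&\quad \times \left|\sigma^2_\delta [Q_\delta(\rho_\delta)]^{-}\right|^{-1/2} \exp\left\{-\frac{1}{2\sigma^2_\delta}\, \delta' Q_\delta(\rho_\delta)\, \delta\right\} \\
&\quad \times \sigma_\psi^{-CV} \exp\left\{-\frac{1}{2\sigma^2_\psi}\, \psi'\psi\right\} \\
&\quad \times \left|\sigma^2_\omega [Q_\omega(\rho_\omega)]^{-}\right|^{-1/2} \exp\left\{-\frac{1}{2\sigma^2_\omega}\, \omega' Q_\omega(\rho_\omega)\, \omega\right\} \\
&\quad \times \sigma_\nu^{-3} \exp\left\{-\frac{1}{2\sigma^2_\nu}\, \nu'\nu\right\} \times p(\theta^B), \qquad (5)
\end{aligned}$$
where $p(\theta^B)$ denotes the joint prior distribution on the parameters. Given that we assumed
that the parameters are a priori independent, $p(\theta^B)$ simply represents the product of the
prior distributions assigned to them.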
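Similarly, letting $\theta^I = (\theta^I_1, \theta^I_2)'$, the joint posterior distribution of the IDML model
can be written as:
$$\begin{aligned}
\pi(\theta^I, \eta^I \mid y) &\propto \prod_{i=1}^{C} \prod_{j=1}^{V} \Bigg[ \prod_{t=1}^{T_1} \sigma_1^{-1} \exp\left\{-\frac{1}{2\sigma^2_1}\left(y^{(a)}_{ijt} - \lambda^{(a)} - \mu_{ijt}\right)^2\right\} \\
&\quad \times \prod_{t=1}^{T_2} \sigma_2^{-1} \exp\left\{-\frac{1}{2\sigma^2_2}\left(y^{(o)}_{ijt} - \lambda^{(o)} - \mu_{ijt}\right)^2\right\} \\
&\quad \times \prod_{t=1}^{T_3} \sigma_3^{-1} \exp\left\{-\frac{1}{2\sigma^2_3}\left(y^{(s)}_{ijt} - \lambda^{(s)} - \mu_{ijt}\right)^2\right\} \Bigg] \\
&\quad \times p(\beta, \alpha, \gamma, \phi, \delta, \psi, \omega \mid \theta^I_2) \times p(\theta^I), \qquad (6)
\end{aligned}$$
where $p(\beta, \alpha, \gamma, \phi, \delta, \psi, \omega \mid \theta^I_2) = p(\beta \mid \theta^I_2) \times p(\alpha \mid \theta^I_2) \times p(\gamma \mid \theta^I_2) \times p(\phi \mid \theta^I_2) \times p(\delta \mid \theta^I_2) \times p(\psi \mid \theta^I_2) \times p(\omega \mid \theta^I_2)$,
which are all the same as corresponding expressions provided in equation (5).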
The goal of inference in equations (5) and (6) is to estimate the posterior distributions
of the components of $\eta^B$, $\eta^I$, $\theta^B$ and $\theta^I$, which are in turn used to obtain the
underlying, smoothed coverage estimates $\mu_{ijt}$, $i \in \{1, \dots, C\}$, $j \in \{1, \dots, V\}$ and
$t \in \{1, \dots, T\}$, as given in equations (1) and (3).
We fitted both models by running Markov chain Monte Carlo (MCMC) using the
NUTS (No-U-Turn Sampler) algorithm (Hoffman and Gelman, 2014) within the Stan package
in R (Stan Development Team, 2020). We implemented four chains, each of which was
run for 4,000 iterations including a burn-in of 2,000 iterations. We assessed convergence
using the MCMC convergence statistic, $\hat{R}$, which we ensured was below 1.05 (Vehtari
et al., 2021) for each parameter in the models. We also provide an R package for
implementing the proposed models, further details of which are provided in Section 7.
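To make the convergence check concrete, the following is a minimal sketch of a basic split-$\hat{R}$ computation for one scalar parameter. It is a simplified stand-in, not the rank-normalized version of Vehtari et al. (2021) and not the code used by Stan; the function name `split_rhat` and the simulated chains are hypothetical.

```python
import random
import statistics


def split_rhat(chains):
    """Basic split-R-hat for one scalar parameter.

    chains: list of equal-length lists of post-warm-up draws.
    Each chain is split in half so within-chain drift is also detected.
    """
    halves = []
    for chain in chains:
        half = len(chain) // 2
        halves.append(chain[:half])
        halves.append(chain[half:2 * half])
    n = len(halves[0])
    half_means = [statistics.fmean(h) for h in halves]
    # W: mean within-sequence variance; B: between-sequence variance.
    W = statistics.fmean([statistics.variance(h) for h in halves])
    B = n * statistics.variance(half_means)
    var_plus = (n - 1) / n * W + B / n
    return (var_plus / W) ** 0.5


# Hypothetical example: four well-mixed chains versus four chains
# centred on different values.
rng = random.Random(1)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
stuck = [[rng.gauss(float(c), 1.0) for _ in range(1000)] for c in range(4)]
rhat_mixed = split_rhat(mixed)
rhat_stuck = split_rhat(stuck)
```

Under the threshold quoted above, the well-mixed chains would pass (value near 1) while the chains centred on different values would be flagged.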
4.1. Smoothed overall estimates
Note that the mean models in (1) and (3) are well defined for both the BDSL and IDML
models as discussed above. We use the posterior distribution of $\mu_{ijt}$ to produce our
model-based vaccination coverage estimates. After eliminating the source-specific bias
components, $\nu^{(k)}$ and $\lambda^{(k)}$ in our notation, we estimate coverage as follows. We
suppose that the inverse logit-transform
$$p_{ijt} = \frac{1}{1 + \exp(-\mu_{ijt})} \qquad (7)$$
is the source-free and true bias-corrected vaccination coverage proportion.
The posterior distribution of $p_{ijt}$ given all the data is summarised to provide source-free
estimates of vaccination coverage. The posterior distribution of $p_{ijt}$ is easy to calculate
using MCMC sampling. For example, we obtain MCMC samples $\theta^{(\ell)}$, $\ell = 1, \dots, L$, for
all the unknown parameters and random effects and missing data collectively denoted by
$\theta$. Using $\theta^{(\ell)}$ we evaluate $\mu^{(\ell)}_{ijt}$ and subsequently $p^{(\ell)}_{ijt}$ for $\ell = 1, \dots, L$. These samples are
then used to estimate the true $p_{ijt}$ along with the uncertainty estimates. Note that these
uncertainty estimates are obtained exactly for vaccination coverage at the original scale.
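The summarisation step can be sketched as follows, assuming the posterior draws of $\mu_{ijt}$ for a single country-vaccine-time combination are available as a plain list. The function names and the choice of posterior mean with a 95% equal-tailed interval are illustrative assumptions, not the imcover interface.

```python
import math


def inv_logit(mu):
    """Numerically stable inverse logit, 1 / (1 + exp(-mu))."""
    if mu >= 0:
        return 1.0 / (1.0 + math.exp(-mu))
    e = math.exp(mu)
    return e / (1.0 + e)


def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def summarise_coverage(mu_draws):
    """Transform each draw to the coverage scale, then summarise.

    Returns (posterior mean, 2.5% quantile, 97.5% quantile) of p.
    """
    p_draws = [inv_logit(m) for m in mu_draws]
    mean = sum(p_draws) / len(p_draws)
    return mean, quantile(p_draws, 0.025), quantile(p_draws, 0.975)


# Hypothetical draws of mu for one country-vaccine-time combination.
mean_p, lower_p, upper_p = summarise_coverage([-1.0, 0.0, 1.0])
```

Transforming each draw before summarising is what yields interval estimates directly on the coverage scale, rather than back-transforming a summary computed on the logit scale.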
4.2. Prediction and aggregation to the regional level
In addition to parameter estimation, it is often the goal in Bayesian analysis to estimate
missing observations or to predict future observations. Typically, Bayesian in-sample and
out-of-sample prediction (the latter is also known as forecasting) are both based on the
posterior predictive distribution. For example, the one-step-ahead prediction at any time
point $t$ can be obtained by evaluating the conditional distribution of $p_{ijt+1}$ given all the
data $y$. According to (7) we have:
$$p_{ijt+1} = \frac{1}{1 + \exp(-\mu_{ijt+1})},$$
where, for example, for the BDSL model,
$$\mu_{ijt+1} = \lambda + \beta_i + \alpha_j + \gamma_{t+1} + \phi_{it+1} + \delta_{jt+1} + \psi_{ij} + \omega_{ijt+1}$$
from (1). Hence to predict $p_{ijt+1}$ we also need the values of the time-advanced parameters
$\gamma_{t+1}$, $\phi_{it+1}$, $\delta_{jt+1}$ and $\omega_{ijt+1}$ at time $t$. For a given $t$ within the modelled time period $T$, these
parameters are already sampled within the implemented MCMC scheme. For $t \geq T$ we
use the assumed model dynamics, e.g. (4), to sample these parameters. That is, we set
$$\gamma^{(\ell)}_{t+1} \sim N\left(\rho^{(\ell)}_\gamma \gamma^{(\ell)}_t, \sigma^{2(\ell)}_\gamma\right)$$
if $\gamma_{t+1}$ has not been sampled already. The other dynamic parameters are treated similarly.
Hence to estimate (or to predict if $t > T$) $p_{ijt}$, using either of the proposed models, we
evaluate the posterior distribution of $p_{ijt}$ given $y$ by drawing samples $p^{(\ell)}_{ijt}$ for $\ell = 1, \dots, L$
for a large number of MCMC replicates $L$.
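The forward-sampling step for an AR(1) term can be sketched as below. This is a minimal illustration under the assumption that, for each posterior draw, the last in-sample value of the term and the matching draws of its autoregressive and standard deviation parameters are available as plain lists; the function name `forecast_ar1` is hypothetical.

```python
import random


def forecast_ar1(last_values, rhos, sigmas, steps, rng):
    """Propagate an AR(1) term beyond the modelled period, draw by draw.

    last_values[l]: value of the term at time T in posterior draw l
    rhos[l], sigmas[l]: autoregressive parameter and innovation standard
    deviation in the same draw. Returns one path of length `steps` per draw.
    """
    paths = []
    for value, rho, sigma in zip(last_values, rhos, sigmas):
        path = []
        for _ in range(steps):
            # gamma_{t+1} ~ N(rho * gamma_t, sigma^2), using this draw's parameters.
            value = rng.gauss(rho * value, sigma)
            path.append(value)
        paths.append(path)
    return paths


# Hypothetical example: three posterior draws, two steps ahead.
rng = random.Random(42)
paths = forecast_ar1([0.2, -0.1, 0.4], [0.5, 0.6, 0.4], [0.1, 0.1, 0.2], 2, rng)
```

Pairing each simulated path with the parameters of its own posterior draw carries parameter uncertainty into the forecasts, which is one reason out-of-sample intervals tend to be wider than in-sample ones.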
In Stan, these out-of-sample predictions can be computed post model-fitting. As with
in-sample estimation, the final predictions can be obtained by summarizing the inverse logit
of the posterior samples of $\mu^{(\ell)}_{ij(t+1)}$. We note that by default, in-sample estimates of $\mu_{ijt}$
are estimated for desired in-sample country-vaccine-time combinations even when no data
are observed for these cases, since $\mu_{ijt}$ is estimated for all $i \in \{1, \dots, C\}$, $j \in \{1, \dots, V\}$,
$t \in \{1, \dots, T\}$. As explained earlier, in-sample estimates of $\mu_{ijt}$ are processed post-model-fitting
using year of vaccine introduction (yovi) data to obtain modelled estimates for
desired country-vaccine-time combinations.
Further, we obtain modelled estimates of immunization coverage for each WHO region
as population-weighted averages taken over all the countries falling within the region. That
is, for region $R_r$ ($r = 1, \dots, 6$), vaccine $j$, time $t$ and posterior sample $\ell$,
$$p^{(\ell)}_{rjt}(R) = \sum_{i \in R_r} p^{(\ell)}_{ijt} \times q^r_i, \qquad (8)$$
where $q^r_i$ is the proportion of surviving infants or target population for MCV2 of region
$R_r$ living in country $i$. It is straightforward to compute equation (8) using the posterior
distributions of $p_{ijt}$ under each model.
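A minimal sketch of the population-weighted aggregation in equation (8) follows, applied draw by draw for a fixed vaccine and time point. The dictionary layout, country codes and weights are hypothetical and chosen only to illustrate the calculation.

```python
def regional_coverage(p_draws, weights):
    """Population-weighted regional coverage, one value per posterior draw.

    p_draws: dict mapping country code -> list of L posterior draws of p
    weights: dict mapping country code -> share of the regional target
             population living in that country (shares must sum to 1)
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    n_draws = len(next(iter(p_draws.values())))
    regional = []
    for draw in range(n_draws):
        regional.append(sum(weights[c] * p_draws[c][draw] for c in p_draws))
    return regional


# Hypothetical region with two countries and two posterior draws.
draws = {"AAA": [0.8, 0.9], "BBB": [0.4, 0.5]}
shares = {"AAA": 0.25, "BBB": 0.75}
region_draws = regional_coverage(draws, shares)
```

Aggregating within each posterior draw, and only then summarising, gives regional uncertainty intervals that reflect the joint posterior rather than an average of country-level intervals.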
4.3. Model comparison, evaluation and validation
To choose between the proposed models in our application, we considered the Watanabe-
Akaike information criterion (WAIC) (Watanabe, 2013). The WAIC is a fully Bayesian
criterion that is based on the log of the predictive density for each data point, hence it
assesses the ability of the fitted model to predict the input data. Accessible discussions
regarding WAIC are provided by Gelman et al. (2014) and Sahu (2022).
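For illustration, WAIC can be computed from a matrix of pointwise log-likelihood values evaluated at each posterior draw. The sketch below follows the variance-based penalty discussed by Gelman et al. (2014) and returns a goodness-of-fit term, a penalty and their combination; labelling $-2 \times$ lppd as the goodness-of-fit term is an assumption about how such a decomposition might be reported, and the function names are hypothetical.

```python
import math
import statistics


def log_mean_exp(values):
    """Stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic(log_lik):
    """WAIC from pointwise log-likelihoods.

    log_lik[l][n]: log-likelihood of data point n under posterior draw l
    (at least two draws are needed for the variance-based penalty).
    Returns (gof, penalty, waic) with waic = gof + 2 * penalty.
    """
    n_points = len(log_lik[0])
    lppd = 0.0
    penalty = 0.0
    for n in range(n_points):
        column = [draw[n] for draw in log_lik]
        lppd += log_mean_exp(column)              # log pointwise predictive density
        penalty += statistics.variance(column)    # effective number of parameters
    gof = -2.0 * lppd
    return gof, penalty, gof + 2.0 * penalty


# Hypothetical log-likelihood values: two draws, two data points.
gof, penalty, waic_value = waic([[-1.0, -2.0], [-3.0, -2.0]])
```

Smaller values indicate better expected predictive performance; the penalty grows with the posterior variability of the pointwise log-likelihood, so more flexible models pay a larger penalty.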
To further evaluate the ability of the proposed models to predict the in-sample and
out-of-sample data (i.e. $p_{ijt}$) in a simulation experiment (due to lack of true values of $p_{ijt}$ in
our application - see Section 5), we computed the following metrics:
$$\text{Average bias: } \text{AvBias} = \frac{1}{m} \sum_{k=1}^{m} (\hat{p}_k - p_k),$$
$$\text{Root mean squared error: } \text{RMSE} = \sqrt{\frac{1}{m} \sum_{k=1}^{m} (\hat{p}_k - p_k)^2},$$
$$\text{Mean absolute error: } \text{MAE} = \frac{1}{m} \sum_{k=1}^{m} |\hat{p}_k - p_k|,$$
$$\text{95\% coverage: } \text{95\% coverage} = 100 \times \frac{1}{m} \sum_{k=1}^{m} I(\hat{p}^l_k \leq p_k \leq \hat{p}^u_k),$$
and the Pearson's correlation between observed and predicted values. Here, $m$ denotes the number of
values of $p_{(\cdot)}$ used for validation (across all vaccines, countries and time points), $\hat{p}_k$ and
$p_k$ are the predicted (i.e. the posterior means) and observed values, $\hat{p}^l_k$ and $\hat{p}^u_k$ are the lower
and upper limits of the 95% credible intervals of the predictions and $I(\cdot)$ is an indicator
function. The actual coverage of the 95% prediction intervals assesses the accuracy of the
uncertainty estimates associated with the predictions, while all the other metrics evaluate
the accuracy of the point estimates. The closer the 95% coverage rates are to the nominal
value of 95%, the better the predictions. Similarly, the closer the RMSE, MAE and AvBias
(in absolute value) are to zero, the better the predictions. Correlations close to 1 indicate
strong predictive power.
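These validation metrics can be computed with a short function such as the sketch below, which takes the bias as predicted minus observed so that negative values indicate under-prediction. The function name, argument layout and toy values are illustrative assumptions.

```python
import math


def validation_metrics(pred, obs, lower, upper):
    """AvBias, RMSE, MAE, 95% interval coverage (in %) and Pearson correlation.

    pred: posterior means; obs: true or observed values;
    lower, upper: limits of the 95% credible intervals.
    """
    m = len(obs)
    errors = [p - o for p, o in zip(pred, obs)]
    av_bias = sum(errors) / m
    rmse = math.sqrt(sum(e * e for e in errors) / m)
    mae = sum(abs(e) for e in errors) / m
    covered = sum(1 for lo, o, hi in zip(lower, obs, upper) if lo <= o <= hi)
    mean_p, mean_o = sum(pred) / m, sum(obs) / m
    cross = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    ss_p = sum((p - mean_p) ** 2 for p in pred)
    ss_o = sum((o - mean_o) ** 2 for o in obs)
    return {
        "AvBias": av_bias,
        "RMSE": rmse,
        "MAE": mae,
        "coverage95": 100.0 * covered / m,
        "correlation": cross / math.sqrt(ss_p * ss_o),
    }


# Hypothetical toy values on the proportion scale.
metrics = validation_metrics(
    pred=[0.5, 0.7, 0.9],
    obs=[0.6, 0.6, 0.9],
    lower=[0.4, 0.65, 0.8],
    upper=[0.65, 0.8, 0.95],
)
```

In the toy example the two errors cancel, giving an average bias of zero with a non-zero MAE, which illustrates why AvBias is read alongside the RMSE and MAE rather than on its own.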
5. Simulation study
We conducted a simulation study to examine the predictive performance of the proposed
models with respect to in-sample and out-of-sample estimation of $p_{ijt}$. In the study, we set
$C = 20$, $T = 20$ and $V = 5$, mimicking a moderately-sized WHO region. We then used
the following true parameter values to generate values of $\mu_{ijt}$, as described in equations
(1) and (3) for the BDSL and IDML models, respectively: $\sigma^2_\beta = 1$, $\sigma^2_\alpha = 1$, $\rho_\gamma = 0.5$,
$\sigma^2_\gamma = 1$, $\rho_\phi = 0.3$, $\sigma^2_\phi = 0.25$, $\rho_\delta = 0.4$, $\sigma^2_\delta = 0.64$, $\sigma^2_\psi = 1$, $\rho_\omega = 0.7$, $\sigma^2_\omega = 0.64$.
Additionally, for the IDML model, we set $\lambda^{(a)} = 0.07$, $\lambda^{(o)} = 0.02$, $\lambda^{(s)} = 0.05$. For the
BDSL model, we set $\sigma^2 = 1$ and $\lambda$ equal to the mean of the corresponding parameters in
the IDML model, i.e. $\lambda = 0.05$.
Owing to the key role that the parameters $\sigma^2_1$, $\sigma^2_2$, $\sigma^2_3$ and $\sigma^2_\nu$ play in capturing the
amount of residual variability or bias attributable to the different data sources in the
proposed models, we examined the effect of their varying values on the estimation of $p_{ijt}$
(equation (7)) by considering the following scenarios.
Scenario 1: Variance of $\nu^{(k)}$ (i.e. $\sigma^2_\nu$) in BDSL model set equal to the average of the
conditional variances of admin ($\sigma^2_1$), official ($\sigma^2_2$) and survey estimates ($\sigma^2_3$) in IDML model
($\sigma^2_1 = 1$, $\sigma^2_2 = 0.64$, $\sigma^2_3 = 0.16$ and $\sigma^2_\nu = 0.6$).
Scenario 2: Large differences between the conditional variances of admin, official and
survey estimates and larger variance for $\nu^{(k)}$ ($\sigma^2_1 = 9$, $\sigma^2_2 = 4$, $\sigma^2_3 = 0.25$ and $\sigma^2_\nu = 4$).
Scenario 3: Same conditional variances for admin, official and survey estimates and
smaller variance for $\nu^{(k)}$ ($\sigma^2_1 = 1$, $\sigma^2_2 = 1$, $\sigma^2_3 = 1$ and $\sigma^2_\nu = 0.1$).
These true parameter values were chosen to encourage, as much as possible, an even
distribution of values of $p_{ijt}$ on the unit interval. Adding the other components of
equations (1) and (2) to the simulated values of $\mu_{ijt}$, we obtained the simulated admin, official
and survey estimates for all values of $C$, $T$ and $V$. To mimic the patterns of missingness
in MCV2 and PCV3 in our application (see, e.g., supplementary Figure 3), we randomly
selected $t = 10$ or $t = 15$ as the starting points for the observations for the last two
vaccines. Further, we deleted 15% of each of the simulated admin and official data and 20% of
the simulated survey data to reflect the overall pattern of missing values in our application.
In all, for each model, the simulated data had a total of 3864 observations, 68% of which
were either admin or official data, while the remaining 32% were survey data.
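The two missingness mechanisms described above (a late starting point for newer vaccines and random deletion of a fixed share of observations) can be sketched as follows. Applying the deletion within a single series, rather than across the whole simulated data set, is a simplifying assumption, and the function name `apply_missingness` is hypothetical.

```python
import random


def apply_missingness(series, start, drop_fraction, rng):
    """Blank out early time points, then delete a share of the rest at random.

    series: list of T simulated values for one country-vaccine-source
    start: 1-based first time point with observations (e.g. 10 or 15)
    drop_fraction: share of the remaining observations to delete
    Missing values are represented by None.
    """
    observed = [v if t >= start else None for t, v in enumerate(series, start=1)]
    candidates = [idx for idx, v in enumerate(observed) if v is not None]
    n_drop = round(drop_fraction * len(candidates))
    for idx in rng.sample(candidates, n_drop):
        observed[idx] = None
    return observed


# Hypothetical survey series with T = 20, first observed at t = 11, 20% deleted.
rng = random.Random(7)
survey_series = [0.5 + 0.01 * t for t in range(20)]
survey_observed = apply_missingness(survey_series, start=11, drop_fraction=0.20, rng=rng)
```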
We placed the same prior distributions as before on both models, except that for the
IDML model, we placed a Half-Cauchy(0,2) prior on σ3, making it the same as the priors
on σ1and σ2, since the goal here is to compare both models under similar conditions (the
effect of the prior specification on σ3on the performance of the IDML model is examined
further in Section 6). We set the starting point for the one- and two-step ahead predictions
at t= 11, meaning that we would use the first 10 observations as base years and then make
predictions for the remaining ten time points, i.e. t= 11,...,20.
In Table 2, we report the results of the study showing the validation statistics computed
using the true and modelled estimates of pijt under each model. Also, in supplemen-
tary Figure 5, we show examples of the simulated data and modelled estimates for five
countries and three vaccines under scenario 1. For in-sample prediction, the IDML model
generally outperformed the BDSL model across the three scenarios, yielding more
accurate point and uncertainty estimates. In particular, we observe that the BDSL model
had relatively large AvBias estimates under Scenarios 1 and 2, demonstrating that it
under-predicted the true values of $p_{ijt}$ in those cases (see, e.g., supplementary Figure 5). This
under-prediction also resulted in very poor 95% coverage values under Scenario 2. This
suggests that the BDSL model may not be well-suited for in-sample prediction when there
is a considerable amount of variation (or bias) arising from the different data sources.
For out-of-sample prediction, the RMSE, MAE and correlation estimates are worse,
as expected. Mixed results were, however, obtained when considering AvBias and 95%
coverage. We note that the high values of the actual 95% coverage for both models are not
surprising since out-of-sample predictions are often made with wider uncertainty intervals
and the goal here was to predict the true values of $p_{ijt}$ and not the random observations.
We rather focus on using the other metrics to evaluate the out-of-sample performance of
the models. Again, the IDML model generally yielded more accurate estimates of $p_{ijt}$
than the BDSL model according to the RMSE, MAE and correlation, although it tended
to produce relatively more biased estimates, especially under scenarios 2 and 3. Unlike in
in-sample prediction, the performance of the BDSL model appears to be more stable and
more comparable to that of the IDML model across the three scenarios in out-of-sample
prediction. In all, these results show that the IDML model outperformed the BDSL model
and is, therefore, better suited for both in- and out-of-sample prediction with regard to the
estimation of the true, underlying coverage estimates, $p_{ijt}$.

Table 2: Simulation study: Validation statistics for in-sample, one-step-ahead and
two-step-ahead predictions. The better result is shown in bold in each case.

Validation        Scenario 1         Scenario 2         Scenario 3
metrics           BDSL     IDML      BDSL     IDML      BDSL     IDML
In-sample prediction
AvBias            -8.81    0.79      -9.25    0.51      -1.00    1.12
RMSE              11.22    3.47      11.95    3.87      5.74     4.98
MAE               9.09     1.73      9.50     1.87      3.78     2.92
95% coverage      79.00    98.50     28.20    98.40     74.10    98.00
Correlation       0.98     0.99      0.97     0.99      0.98     0.99
One-step-ahead prediction
AvBias            -3.53    -2.11     -1.58    -2.54     -0.21    -1.82
RMSE              21.40    18.60     19.40    18.55     20.74    19.10
MAE               18.54    15.82     15.89    15.74     17.29    16.20
95% coverage      99.00    99.70     98.70    99.80     98.20    99.70
Correlation       0.74     0.81      0.79     0.81      0.75     0.80
Two-step-ahead prediction
AvBias            -3.33    -2.78     -2.20    -3.19     -1.23    -2.49
RMSE              21.38    19.42     20.10    19.37     21.34    19.87
MAE               18.36    16.70     16.68    16.64     18.01    17.00
95% coverage      98.79    99.74     98.58    99.79     98.47    99.74
Correlation       0.74     0.79      0.77     0.79      0.73     0.78
6. Results
Here, we present and discuss the results of the application of the proposed methodology
to produce modelled estimates of national immunization coverage for all WHO Member States.
6.1. Model choice, validation and parameter estimates
We first fitted both the BDSL and IDML models to the national immunization coverage
data to further examine their performance and suitability for the data. With the IDML
model, we considered two cases - one in which we placed a Half-Cauchy(0,2) prior on
$\sigma_3$ to depict an unrestricted scenario, and the other where we placed a truncated
Half-Cauchy(0,0.2) prior on $\sigma_3$ to improve the contribution of survey data to the likelihood.
For the BDSL model, we used the same priors described previously.

Table 3: WAIC statistics for the balanced data single likelihood (BDSL) and irregular data
multiple likelihood (IDML) models for each WHO region. The better result is shown in
bold in each case.

           BDSL                            IDML
Region     GOF       penalty   WAIC       GOF       penalty   WAIC
Half-Cauchy(0,2) prior placed on σ3 in the IDML model
AFR        21524.8   1053.9    23632.6    20905.1   1543.5    23992.1
AMR        8024.3    1314.8    10653.9    1370.3    2242.1    5854.5
EMR        6905.6    733.0     8371.6     5699.9    1093.4    7886.7
EUR        7990.8    1453.4    10897.6    5416.9    2325.0    10066.9
SEAR       5094.2    189.4     5473.0     4892.6    269.6     5431.8
WPR        8656.1    850.7     10357.5    7276.0    1302.8    9881.6
Truncated Half-Cauchy(0,0.2) prior placed on σ3 in the IDML model
AFR        21524.8   1053.9    23632.6    22530.8   1420.1    25371.0
AMR        8024.3    1314.8    10653.9    6163.8    2599.3    11362.4
EMR        6905.6    733.0     8371.6     7019.2    1113.2    9245.6
EUR        7990.8    1453.4    10897.6    6779.1    2438.1    11655.3
SEAR       5094.2    189.4     5473.0     4963.7    357.8     5679.3
WPR        8656.1    850.7     10357.5    9141.6    1237.8    11617.2
In Table 3, we report the WAIC statistics for these models, which reveal an interesting
pattern. With a less restrictive Half-Cauchy(0,2) prior on $\sigma_3$, the IDML model clearly
outperformed the BDSL model in all cases except in the AFR region. When examining the
contributions of the penalty and goodness-of-fit (GOF) terms to the calculated WAIC, we
observe that although the BDSL model has smaller penalties as expected, since it includes
fewer parameters, the IDML model consistently provided better fits according to
the GOF statistics. However, the use of a truncated Half-Cauchy(0,0.2) prior on $\sigma_3$ in the
IDML model, though deliberate, resulted in poorer fits to the data since this prior biases the
modelled estimates towards survey data. Hence, the BDSL model yielded smaller WAIC
statistics in this case. These results further demonstrate the flexibility of the IDML model
to adjust the modelled estimates based on expert opinions. All modelled outputs and results
presented in the remainder of this work are, therefore, based on the IDML model only.
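As a reading aid, the tabulated WAIC values are numerically consistent with the decomposition $\text{WAIC} = \text{GOF} + 2 \times \text{penalty}$; this is an observation from the reported numbers rather than a definition stated in the text. For the AFR region with the Half-Cauchy(0,2) prior, for example,
$$\text{BDSL: } 21524.8 + 2 \times 1053.9 = 23632.6, \qquad \text{IDML: } 20905.1 + 2 \times 1543.5 = 23992.1,$$
so the better fit of the IDML model in AFR (a GOF that is lower by $619.7$) is outweighed by its larger penalty contribution ($2 \times (1543.5 - 1053.9) = 979.2$).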
In Table 4, we report estimates of parameters of the model for the AFR region. Parameter
estimates for other regions are presented in supplementary Tables 1 - 5. We observe that
estimates of the source-specific intercept terms, $\hat{\lambda}^{(a)}$ and $\hat{\lambda}^{(o)}$, are consistently positive
while those of ˆ
λ(s)are consistently negative in all the regions. However, only ˆ
significant in some of the regions. This further demonstrates that survey data, where
available, tend on average to have lower values than admin and official data. Estimates of the
residual standard deviation for survey data are all close to the upper bound of the prior
placed on $\sigma_3$, demonstrating the strong effect of the prior in the fitted models. However,
$\hat{\sigma}_1$ and $\hat{\sigma}_2$ are considerably higher than $\hat{\sigma}_3$ in all the regions except AMR and EUR. In both
regions, the estimates of these parameters are very close, indicating that survey estimates
are, on average, more likely to have greater or similar variation as other data sources in
both regions (see e.g., Figure 1).

Table 4: Posterior estimates of the parameters of the irregular data multiple likelihood
(IDML) model for the AFR region.

Parameter              Mean      Std. dev.  2.5%      50%       97.5%
$\lambda^{(a)}$        0.4922    0.2868     -0.0728   0.4887    1.0557
$\lambda^{(o)}$        0.3661    0.2870     -0.2023   0.3656    0.9286
$\lambda^{(s)}$        -0.421    0.2869     -1.0068   -0.4421   0.1195
$\hat{\sigma}_1$       1.3951    0.0209     1.3546    1.3947    1.4364
$\hat{\sigma}_2$       1.1700    0.0182     1.1352    1.1698    1.2059
$\hat{\sigma}_3$       0.3992    0.0008     0.3970    0.3994    0.4000
$\hat{\sigma}_\beta$   0.8900    0.1091     0.6982    0.8803    1.1283
$\hat{\sigma}_\alpha$  1.5378    1.1932     0.0950    1.3068    4.5089
$\hat{\rho}_\gamma$    0.5084    0.5363     -0.7621   0.7546    0.9963
$\hat{\sigma}_\gamma$  0.0786    0.0591     0.0060    0.0673    0.2061
$\hat{\rho}_\phi$      0.4089    0.0714     0.2682    0.4113    0.5445
$\hat{\sigma}_\phi$    0.5260    0.0235     0.4801    0.5259    0.5714
$\hat{\rho}_\delta$    0.9712    0.0255     0.8992    0.9791    0.9963
$\hat{\sigma}_\delta$  0.2303    0.0356     0.1629    0.2297    0.3029
$\hat{\sigma}_\psi$    0.3726    0.0569     0.2498    0.3753    0.4774
$\hat{\rho}_\omega$    0.7106    0.0563     0.5961    0.7131    0.8104
$\hat{\sigma}_\omega$  0.3823    0.0299     0.3249    0.3819    0.4428
When considering the main effects - $\beta_i$, $\alpha_j$ and $\gamma_t$ - these results indicate that in the
AFR region, the vaccine random effect, $\alpha_j$, accounted for much (74.8%) of the total
variation ($\hat{\sigma}^2_\beta + \hat{\sigma}^2_\alpha + \hat{\sigma}^2_\gamma$) explained by these terms. For other regions, the vaccine random
effect accounted for between 79.1% and 99.6% of the total variation explained by the main
effects. This demonstrates substantial variation in coverage levels between the vaccines
across all the regions. There is also considerable variation in coverage levels among coun-
tries within each region, as the estimates of σβshow. However, the temporal main effect,
γt, explains very little variation in the data in each region (except for the AFR region),
which is likely due to the significant effect of temporally correlated interaction terms in the
models. Similarly, when considering the estimated variances of the interaction terms, the
country-time interaction, φit, explained the most variation in the data compared to other in-
teraction terms in the AFR region. This was also the case in the SEAR region. For AMR,
EMR and WPR regions, the most variation was explained by the country-vaccine-time
interaction ωijt , whereas for the EUR region, this was explained by the country-vaccine
interaction, $\psi_{ij}$. In all the regions, the vaccine-time interaction, $\delta_{jt}$, explained the least
variation in the data compared to other interaction terms. These results indicate the
presence of strong within-country trends in the data. Estimates of the autoregressive parameter
for $\gamma_t$, $\hat{\rho}_\gamma$, are not significant across all the regions, further indicating the limited contribution
of this global term in the fitted models (although its inclusion supports the structure of the
interaction terms). However, the autocorrelation parameters of all the time-varying inter-
action terms are estimated to be significantly positive in all the regions, suggesting strong
positive temporal trends which are tied to other sources of variation (country and vaccine)
in the data.
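As a worked check of the 74.8% share reported for the AFR region, the calculation can be reproduced from the posterior means in Table 4, under the assumption that the share is computed from the squared posterior means of the standard deviations:
$$\hat{\sigma}^2_\beta = 0.8900^2 \approx 0.7921, \qquad \hat{\sigma}^2_\alpha = 1.5378^2 \approx 2.3648, \qquad \hat{\sigma}^2_\gamma = 0.0786^2 \approx 0.0062,$$
$$\frac{\hat{\sigma}^2_\alpha}{\hat{\sigma}^2_\beta + \hat{\sigma}^2_\alpha + \hat{\sigma}^2_\gamma} \approx \frac{2.3648}{0.7921 + 2.3648 + 0.0062} = \frac{2.3648}{3.1631} \approx 0.748.$$
On the same basis, the country effect accounts for about 25.0% and the temporal main effect for about 0.2% of the variation explained by the main effects in AFR.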
6.2. Modelled estimates of national immunization coverage and comparisons
with WUENIC estimates
Fig. 3. Modelled estimates of immunization coverage (a) and corresponding uncertainty
estimates (b) for the EMR region. Predictions for 2021 and 2022 are shown on the right-
hand side of the black dotted vertical lines.
In Figure 3 and supplementary Figures 6 - 10, we present plots of modelled estimates
of coverage and associated uncertainties for EMR and other WHO regions, respectively.
Time series plots of the modelled estimates overlaid with the input admin, official and
survey data, as well as corresponding WUENIC estimates (2020 revision published in
2021) are also shown in supplementary Figure 11. In general, we observe that the patterns
in immunization coverage are similar for DTP1, DTP3 and MCV1 because these vaccines
were introduced much earlier (since the 1970s) in the study countries and therefore exhibit
more stable trends than MCV2 and PCV3. Both newer vaccines tend to have
different patterns since these are mostly driven by the length of time since introduction
and the speed of uptake. We also note that the fitted models produced plausible estimates
of coverage relative to the input data and adjust well to survey data in some cases for the
example countries as shown in supplementary Figure 11. For EMR, coverage appears high
and more stable in many countries, although some countries had lower coverage at the
beginning of the study period (e.g. AFG and DJI), but made substantial progress which
appears to stagnate in later years. The uncertainty estimates are low in most cases but
generally higher for country-vaccine-time combinations for which input data were scanty
or unavailable (e.g., DJI-MCV2 in Figure 3) and for out-of-sample predictions. In general,
the predictions for 2020 and 2021 show changes in coverage, but not substantially from the
preceding years. We note that these predictions did not take into account any disruptions
to routine immunization caused by the pandemic. Hence, they represent a counterfactual
non-pandemic scenario.
In Figure 4 and supplementary Figures 12 - 18, we show comparisons between WUENIC
and the modelled estimates for the period 2000 to 2019. There is generally a good level
of correspondence between both estimates across the regions despite the differences in the
methodologies used to produce these. The best correlations seem to have occurred in AFR,
EMR and SEAR regions, but the most differences also occurred in AFR where the mod-
elled estimates tend to be higher than WUENIC in some countries for DTP1, DTP3 and
MCV1. At the global level, we obtained a median difference of -1.90% with an interquar-
tile range of 5.44% (supplementary Table 6) between WUENIC and the modelled esti-
mates. The highest median difference (in absolute value) was observed in AMR (-4.62%)
whilst the largest interquartile range was observed in AFR (10.35%). Overall, these results
strongly indicate that the modelled estimates are close to WUENIC.
Furthermore, the trends in the regional estimates of immunization coverage are dis-
played in supplementary Figures 19 (a) and (b). For the AFR region, for example, all
five vaccines except MCV2 showed increasing trends which appear to level off or regress
towards the end of the study period. Similar or different patterns were observed in other
regions. The uncertainties associated with these regional estimates show robust estimation
in most cases.
Fig. 4. Plots of WUENIC versus modelled estimates by WHO regions and vaccine type.
The blue lines illustrate a perfect agreement between both estimates while the light red
lines are simple least square fits to the estimates within each region.
7. Software
We developed the imcover package in the R statistical programming language to
support the proposed Bayesian modeling approaches for immunization coverage. imcover
is an interface to the Stan programming language implementing the No-U-Turn Sampler
(NUTS). The package includes functionality to replicate the analyses described in Section 3,
including both the BDSL and IDML models. The package is designed to support reanalysis
of the WUENIC data with functionality to retrieve input data from the WHO website,
process the files, fit models and produce coverage estimates for required country- and
regional-vaccine-time combinations. The software is available from https://wpgp. and is described in more detail along with a full processing
script in Supplemental materials (Sections 2 and 4).
8. Discussion
We have laid out a new methodology for producing ENIC, as an alternative or a comple-
ment to the WUENIC approach. Our methodology is based on a Bayesian hierarchical
model which accounts for the full range of uncertainties present in the input data as well
as those associated with the modelled estimates. The methodology is implemented in a
user-friendly R package imcover. We investigated two candidate models and concluded
that the irregular data multiple likelihood (IDML) model performed better both in terms of
model fit and predictive performance. Regarding computing time, the IDML model takes
an average of 1.6 hours to run 4,000 iterations, which includes a burn-in of 2,000 iterations,
on a high specification computer, for each WHO region.
The work presented here is an improvement over previous work (Utazi et al., 2020a),
which utilized some of the rules implemented in the WUENIC computational logic ap-
proach (e.g., denominator adjustment for admin estimates and choosing survey data when
the differences between survey estimates and denominator-adjusted admin estimates were
>10%) to process and harmonize the multiple input data. The processed data was then
modelled using a BHM similar to the models developed in the current work, but only
accounting for (random) spatial, temporal and vaccine-related variations and their inter-
actions, to obtain smoothed coverage estimates and associated uncertainties. Whilst this
approach produced coverage estimates that were similar to WUENIC, the ad hoc method
adopted in combining information from the multiple input data prior to model-fitting meant
that the full range of variability in the data was not properly accounted for. Our method-
ology also offers significant improvements over the WUENIC approach. It provides a
mechanism to: estimate the uncertainties associated with the modelled estimates; borrow
strength across countries, vaccine and time to improve coverage estimation; and predict
immunization coverage for future time points. Also, additional informed beliefs could be
introduced in a more methodical manner in the modelling stage through prior specifica-
tions on the parameters of the model. Another model-based approach for producing ENIC
was developed by Lim et al. (2008), but this was based predominantly on survey data and
was implemented in a non-Bayesian framework which does not allow the incorporation
of prior beliefs into the modelling process. Also, Galles et al. (2021) utilized data from
official sources and surveys to produce ENIC using a model-based approach, but their
methodology was based on fitting spatio-temporal Gaussian process regression models in
multiple steps: first, for bias-adjustment of official data using survey data, and then joint
modelling of bias-adjusted official and survey data.
The quality of the modelled estimates produced in our work is largely dependent upon
the quality and degree of completeness of the input data. First, as we noted in Section
2, all three input data sets were not simultaneously available for desired country-time-
vaccine combinations, with survey estimates being the most incomplete data source and
unequally distributed over time and by WHO region. There were also instances where
no data were available from all three data sources. As we observed in Section 6, the
proposed methodology is more likely to produce more robust and more precise estimates
when more (accurate) input data are used for model-fitting. Secondly, the different input
data sets have their inherent biases (e.g., admin estimates being greater than 100%) which are
likely the result of inaccurate denominator and/or numerator estimates, large differences
between consecutive coverage estimates (in time) and recall bias associated with survey
data for multi-dose vaccines (Cutts et al., 2016). A complete overview and analysis of
data quality issues associated with these data sources are provided in Rau et al. (2022) and
Stashko et al. (2019). Although we implemented some ad hoc measures to correct these
biases wherever possible, e.g., recall-bias adjustment for survey estimates of DTP3 and
PCV3 and rounding down of administrative estimates greater than 100% whilst preserving
the differential between multi-dose vaccines, they are better addressed at the point of data
collection and summarization. Hence, efforts should be intensified within countries to
improve the quality of data collected via these sources as has been recommended by global
advisory bodies (Scobie et al., 2020).
Wherever possible, the data processing steps presented here could be much improved to
deal with any remaining data quality issues prior to model-fitting, as we have highlighted
previously. It is possible to ‘switch off’ an entire data source for a given country-vaccine
combination if it is deemed unfit for model-fitting, and then utilize input data from other
sources for coverage estimation. Making such decisions on a case-by-case basis, though
arduous, could lead to better quality modelled estimates that reflect the peculiarities of
individual countries. There is, however, a need to strike a balance between how much
data are available for model-fitting and the quality of modelled estimates that is desired.
Additional data processing steps could also include expertly identifying and excluding any
implausible outlying observations (e.g., some coverage estimates <1% included in the
current analysis) that could bias the modelled estimates.
The modelling approaches outlined here are subject to other limitations. Currently,
our methodology does not utilize any covariate information for coverage estimation. The
inclusion of highly informative covariates such as access to a health facility, antenatal care
attendance and female literacy rates (see, e.g., Utazi et al., 2022; Galles et al., 2021) in
the models could help improve the accuracy and precision of the estimates. This will
be fairly straightforward to implement, but we anticipate that it will greatly simplify the
structures of some of the terms used to account for residual variation in models (1) and (2),
or make these redundant. An effective model selection criterion will, therefore, be needed
to determine the best model parameterization in this setting. Furthermore, our modelled
estimates had wide uncertainties in some cases, particularly where input data were scarce.
The amount of uncertainty present in the modelled estimates could be further controlled
by adjusting the priors on λ(a), λ(o) and λ(s) in the IDML model.
Future work will focus on extending the proposed methodology to subnational coverage
estimation, which is highly relevant to the current global health agenda (United Nations
General Assembly, 2015; World Health Organization, 2020). Whilst this will present ad-
ditional challenges (see, e.g., Brown, 2018), we anticipate that subnational data will have a
richer spatial structure unlike country-level data, which can be accounted for using condi-
tional autoregressive priors (Lee, 2011). Furthermore, our predictions for 2020 and 2021 -
which represent a counterfactual non-pandemic scenario - can be used to evaluate the im-
pact of the Covid-19 pandemic on routine immunization coverage. Lastly, we will extend
our analyses to include other vaccines not used for model development.
In conclusion, our work holds a lot of promise as it adds to increased efforts to boost the
quality of immunization coverage estimates available for global health policy and decision-
making.
Acknowledgements
This work was funded by WHO (Grant numbers 2020/1077452-0 and 2021/1103498-0),
and in part by the Bill and Melinda Gates Foundation (Grant number INV-003287), and
carried out at WorldPop, University of Southampton, United Kingdom. The authors grate-
fully acknowledge the WHO-UNICEF immunization coverage working group for their
valuable inputs and feedback during model and software development. The authors also
acknowledge the use of the IRIDIS High Performance Computing Facility, and associated
support services at the University of Southampton, in the completion of this work.
References
Brown, D., Burton, A. and Gacic-Dobo, M. (2015) An examination of a recall bias
adjustment applied to survey-based coverage estimates for multi-dose vaccines.
Brown, D. W. (2018) Definition and use of “valid” district level vaccination coverage to
monitor Global Vaccine Action Plan (GVAP) achievement: evidence for revisiting the
district indicator. Journal of Global Health, 8, 020404.
Burstein, R., Henry, N., Collison, M. et al. (2019) Mapping 123 million neonatal, infant
and child deaths between 2000 and 2017. Nature, 574, 353–358.
Burton, A., Kowalski, R., Gacic-Dobo, M., Karimov, R. and Brown, D. (2012) A formal
representation of the WHO and UNICEF estimates of national immunization coverage: A
computational logic approach. PLOS ONE, 7, 1–12. URL:
Burton, A., Monasch, R., Lautenbach, B., Gacic-Dobo, M., Neill, M., Karimov, R.,
Wolfson, L., Jones, G. and Birmingham, M. (2009) WHO and UNICEF estimates of national
infant immunization coverage: methods and processes. Bulletin of the World Health
Organization, 87, 535–41.
Clayton, D. G. (1996) Generalized linear mixed models. In Markov Chain Monte Carlo
in Practice (eds. W. R. Gilks, S. Richardson and D. J. Spiegelhalter), 275–301. London:
Chapman & Hall.
Cressie, N. A. C. and Wikle, C. K. (2011) Statistics for Spatio-Temporal Data. New York:
John Wiley & Sons.
Cutts, F. T., Claquin, P., Danovaro-Holliday, M. C. and Rhoda, D. A. (2016) Monitoring
vaccination coverage: Defining the role of surveys. Vaccine, 34, 4103–4109. URL:
Danovaro-Holliday, M., Gacic-Dobo, M., Diallo, M., Murphy, P. and Brown, D. (2021)
Compliance of WHO and UNICEF estimates of national immunization coverage
(WUENIC) with Guidelines for Accurate and Transparent Health Estimates Reporting
(GATHER) criteria [version 1; peer review: 2 approved]. Gates Open Research, 5.
Danovaro-Holliday, M. C., Dansereau, E., Rhoda, D. A., Brown, D. W., Cutts, F. T. and
Gacic-Dobo, M. (2018) Collecting and using reliable vaccination coverage survey
estimates: Summary and recommendations from the “meeting to share lessons learnt
from the roll-out of the updated WHO vaccination coverage cluster survey reference
manual and to set an operational research agenda around vaccination coverage
surveys”, Geneva, 18–21 April 2017. Vaccine, 36, 5150–5159. URL: https://www.
Galles, N. C., Liu, P. Y., Updike, R. L., Fullman, N., Nguyen, J., Rolfe, S., Sbarra, A. N.,
... and Mosser, J. F. (2021) Measuring routine childhood vaccination coverage in 204
countries and territories, 1980–2019: a systematic analysis for the Global Burden of
Disease Study 2020, release 1. The Lancet, 398, 503–521. URL: https://www.
Gavi, The Vaccine Alliance (2020) Gavi strategy 5.0, 2021–2025. URL: https://www.
Gelman, A., Hwang, J. and Vehtari, A. (2014) Understanding predictive information
criteria for Bayesian models. Statistics and Computing, 24, 997–1016.
Giorgi, E., Fronterrè, C., Macharia, P. M., Alegana, V. A., Snow, R. W. and Diggle, P. J.
(2021) Model building and assessment of the impact of covariates for disease prevalence
mapping in low-resource settings: to explain and to predict. Journal of The Royal
Society Interface, 18, 20210104. URL:
Hoffman, M. D. and Gelman, A. (2014) The No-U-Turn sampler: Adaptively setting path
lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 1593–
ICF International (2022) Demographic and health surveys, Calverton, Maryland, U.S.A.
Knorr-Held, L. (2000) Bayesian modelling of inseparable space-time variation in disease
risk. Statistics in Medicine, 19, 2555–67.
Lee, D. (2011) A comparison of conditional autoregressive models used in
Bayesian disease mapping. Spatial and Spatio-temporal Epidemiology, 2, 79–89. URL:
Lim, S. S., Stein, D. B., Charrow, A. and Murray, C. J. L. (2008) Tracking progress towards
universal childhood immunisation and the impact of global initiatives: a systematic
analysis of three-dose diphtheria, tetanus, and pertussis immunisation coverage. The Lancet,
372, 2031–2046.
Local Burden of Disease Vaccine Coverage Collaborators (2021) Mapping routine measles
vaccination in low- and middle-income countries. Nature, 589, 415–419.
Rau, C., Lüdecke, D., Dumolard, L. B., Grevendonk, J., Wiernik, B. M., Kobbe, R.,
Gacic-Dobo, M. and Danovaro-Holliday, M. C. (2022) Data quality of reported child
immunization coverage in 194 countries between 2000 and 2019. PLOS Global Public Health,
2, 1–18. URL:
Sahu, S. K. (2022) Bayesian Modeling of Spatio Temporal Data with R. Boca Raton:
Chapman and Hall, 1st edn. URL:
Sahu, S. K., Gelfand, A. E. and Holland, D. M. (2006) Spatio-temporal modeling of fine
particulate matter. Journal of Agricultural, Biological, and Environmental Statistics, 11,
Scobie, H. M., Edelstein, M., Nicol, E., Morice, A., Rahimi, N., MacDonald, N. E.,
Danovaro-Holliday, M. C. and Jawad, J. (2020) Improving the quality and
use of immunization and surveillance data: Summary report of the working group
of the Strategic Advisory Group of Experts on Immunization. Vaccine, 38, 7183–7197. URL:
Stan Development Team (2015) Stan modeling language: Users guide and reference
manual. Columbia University, Columbia, New York. URL:
Stan Development Team (2020) RStan: the R interface to Stan. URL: R package
version 2.21.2.
Stashko, L. A., Gacic-Dobo, M., Dumolard, L. B. and Danovaro-Holliday, M. C. (2019)
Assessing the quality and accuracy of national immunization program reported target
population estimates from 2000 to 2016. PLOS ONE, 14, 1–13. URL: https://
United Nations Children’s Fund (2022) Multiple indicator cluster survey. URL: https:
United Nations, Department of Economic and Social Affairs, Population Division (2019)
World population prospects 2019. URL:
United Nations General Assembly (2015) Transforming our world: The 2030 agenda
for sustainable development. A/RES/70/1, resolution adopted by the General Assembly
on September 25, 2015. URL:
Utazi, C. E., Nilsen, K., Pannell, O., Dotse-Gborgbortsi, W. and Tatem, A. J. (2021)
District-level estimation of vaccination coverage: Discrete vs continuous spatial models.
Statistics in Medicine, 40, 2197–2211. URL: https://onlinelibrary.wiley.
Utazi, C. E., Pannell, O., Aheto, J. M. K., Wigley, A., Tejedor-Garavito, N., Wunderlich,
J., Hagedorn, B., Hogan, D. and Tatem, A. J. (2022) Assessing the characteristics of
un- and under-vaccinated children in low- and middle-income countries: A multi-level
cross-sectional study. PLOS Global Public Health, 2, 1–13. URL: https://doi.
Utazi, C. E., Sahu, S. K. and Tatem, A. J. (2020a) Bayesian time series regression methods
for estimating national immunization coverage. Tech. rep., WorldPop, University of
Southampton, Southampton, UK.
Utazi, C. E., Wagai, J., Pannell, O., Cutts, F. T., Rhoda, D. A., Ferrari, M. J., Dieng, B.,
Oteri, J., Danovaro-Holliday, M. C., Adeniran, A. and Tatem, A. J. (2020b) Geospatial
variation in measles vaccine coverage through routine and campaign strategies in Nigeria:
Analysis of recent household surveys. Vaccine, 38, 3062–3071. URL: https://www.
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B. and Bürkner, P.-C. (2021)
Rank-Normalization, Folding, and Localization: An Improved R̂ for Assessing Convergence
of MCMC (with Discussion). Bayesian Analysis, 16, 667–718. URL: https://
Watanabe, S. (2013) A widely applicable Bayesian information criterion. J. Mach. Learn.
Res., 14, 867–897.
World Health Organization (2018) World Health Organization vaccination coverage
cluster surveys: reference manual. URL:
World Health Organization (2020) Immunization agenda 2030: a global strategy to leave
no one behind. URL:
World Health Organization (2022) Immunization schedule. URL:
World Health Organization and United Nations Children’s Fund (2022)
WHO/UNICEF joint reporting process. URL: https://www.who.
Bayesian hierarchical modelling approaches for combining
information from multiple data sources to produce annual
estimates of national immunization coverage
Supplementary information
C. Edson Utazi, Warren C. Jochem, Marta Gacic-Dobo, Padraic Murphy, Sujit
K. Sahu, M. Carolina Danovaro-Holliday and Andrew J. Tatem
This document accompanies the main paper. It contains additional information,
including additional tables and figures referenced in the paper.
1. Recall bias adjustment and processing of survey data
For survey data, we implemented an additional data cleaning step to ensure consis-
tency in the entries in each column, particularly where these were character vari-
ables. Each coverage survey estimate was then linked to a ’birth cohort year’ which
we used as the reference year for the estimate in this work. The birth cohort year
was determined using the period of data collection and the age of the birth cohort
that the survey estimate relates to, as in the WUENIC methodology.
Similar to the WUENIC approach, we applied a recall-bias adjustment to DTP3 and
PCV3 survey estimates. Estimates based on vaccination cards only or vaccination
cards and recall were used for the adjustment. In the pre-cleaned input data file,
these estimates were labelled as: “Card” and “Card or History”, respectively, in
the column for evidence of vaccination. For country-vaccine-year combinations with
multiple estimates labelled as “crude” or “valid”, the “valid” estimates were retained
in the analysis as these were considered more accurate. The formula used for the
adjustment is:
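\[
\mathrm{VD3}_{(\text{card+history})} = \mathrm{VD3}_{(\text{card only})} \times \frac{\mathrm{VD1}_{(\text{card+history})}}{\mathrm{VD1}_{(\text{card only})}} \qquad (1)
\]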
where VD3 denotes the third dose of DTP or PCV vaccine and VD1, the first
dose. We note that for each vaccine, the adjustment was applied only when all
the data needed to compute equation (1) were available. After the adjustment, the
original “Card or History” survey estimates of DTP3 and PCV3 were replaced with
corresponding bias-adjusted estimates for further processing.
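The adjustment in equation (1) can be sketched in a few lines of R. This is a minimal illustration, not the imcover implementation: the function name `adjust_recall_bias`, its argument names and the numbers used are all hypothetical, and inputs are assumed to be coverage percentages.

```r
# Recall-bias adjustment of a third-dose survey estimate, following equation (1).
# Function and argument names are hypothetical; values are coverage percentages.
adjust_recall_bias <- function(vd3_card, vd1_card, vd1_card_history) {
  inputs <- c(vd3_card, vd1_card, vd1_card_history)
  # The adjustment is applied only when every input is available
  # (and the denominator is usable)
  if (anyNA(inputs) || vd1_card <= 0) {
    return(NA_real_)
  }
  vd3_card * vd1_card_history / vd1_card
}

# Made-up example: card-only DTP3 = 60%, card-only DTP1 = 75%,
# card-or-history DTP1 = 90%
adjusted_dtp3 <- adjust_recall_bias(60, 75, 90)
print(adjusted_dtp3)
```

Returning `NA` when an input is missing mirrors the statement that the adjustment was applied only when all the data needed for equation (1) were available.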
arXiv:2211.14919v1 [stat.ME] 27 Nov 2022
For a given vaccine, country and year, if one survey estimate was available, it
was accepted if the sample size was greater than 300 or if the estimate was labelled
“valid”. Otherwise, the estimate was not accepted. Where multiple estimates were
available for the same vaccine, country and year, “Card or History” estimates were
prioritized over “Card”-only estimates, and either of these was accepted if the
corresponding sample size was greater than 300 or if the evidence of vaccination was
based on valid doses. We note that for DTP3 and PCV3, only the bias-adjusted
estimates were considered when available. When multiple estimates were available
from the preceding step (perhaps from different surveys or the same survey) for
the same vaccine, country and year, the estimate with the largest sample size was
accepted. If the sample sizes were missing, the first valid estimate or the first
estimate available was chosen, in the given sequence. The resulting survey estimates
were used in the rest of the analyses.
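One possible reading of these selection rules is sketched below in R for a single vaccine-country-year group. The function name `select_survey_estimate` and the column names (`evidence`, `validity`, `sample_size`, `coverage`) are assumptions made for illustration, and the order in which the acceptance and prioritisation rules are applied is our interpretation of the text rather than the imcover code.

```r
# Select one survey estimate for ONE vaccine-country-year group.
# Hypothetical columns: evidence ("Card or History" / "Card"),
# validity ("valid" / "crude"), sample_size (may be NA), coverage (%).
select_survey_estimate <- function(est) {
  # Accept an estimate if its sample size exceeds 300 or it is labelled "valid"
  ok <- (!is.na(est$sample_size) & est$sample_size > 300) |
    est$validity %in% "valid"
  est <- est[ok, , drop = FALSE]
  if (nrow(est) == 0) return(est)

  # Prefer "Card or History" over "Card"-only estimates
  if (any(est$evidence %in% "Card or History")) {
    est <- est[est$evidence %in% "Card or History", , drop = FALSE]
  }
  if (nrow(est) == 1) return(est)

  if (all(is.na(est$sample_size))) {
    # No sample sizes: first valid estimate, else first available
    first_valid <- which(est$validity %in% "valid")
    idx <- if (length(first_valid) > 0) first_valid[1] else 1L
  } else {
    idx <- which.max(est$sample_size)  # which.max() skips missing values
  }
  est[idx, , drop = FALSE]
}

# Made-up records for illustration
surveys <- data.frame(
  evidence    = c("Card", "Card or History", "Card or History"),
  validity    = c("crude", "crude", "crude"),
  sample_size = c(1200, 450, 800),
  coverage    = c(61, 70, 74)
)
chosen <- select_survey_estimate(surveys)
print(chosen)
```

In this example all three records pass the sample-size rule, the two “Card or History” records are preferred, and the one with the larger sample size is kept.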
2. Software
In order to support the reproducibility and replication of the immunisation coverage
modelling methods described in this report, we developed a set of tools in the R
programming language (?). The imcover package provides functions for assembling
the common sources of immunisation coverage and for fitting the balanced data single
likelihood (BDSL) and irregular data multiple likelihood (IDML) models described
in Section 3 using full Bayesian inference with Stan (?). The latest version of imcover
can be installed from GitHub by typing the command within the R console:
    devtools::install_github("wpgp/imcover")
In order to properly install the package, a C++ compiler is required. Internally,
imcover relies on Stan code which is translated into C++ and compiled, allow-
ing for faster computations. On a Windows PC, the Rtools program provides the
necessary compiler. This is available from
windows/Rtools/ for R version 4.0 or later. On Mac OS X, users should follow
the instructions to install Xcode. Users are advised to check the Stan installation
guide for further information on the necessary system set-up and compilers.
A typical workflow using imcover to produce national- and regional-level immu-
nisation coverage estimates is illustrated in Figure 1. The user should first download
the time series of reported coverage data, which may come from multiple sources
(i.e. administrative, official, and survey estimates). Second, these datasets are
processed in several steps to filter records, remove duplicates and correct possible
reporting biases, and are then assembled into a single dataset. This stage creates datasets of
format ic.df within R. This format is an extension of the common data frame
and enables some of the specialised processing steps by imcover. Third, the model
is fitted against the assembled coverage dataset. The user has the option at this
stage to specify additional parameters and prior choices for the statistical model.
Fourth, after the model has been fitted, a model object is returned. This object’s
class extends the model objects from rstan from which parameter estimates and
other results can be extracted in R. In the fifth stage, post-processing is done on
the model results. These functions provide support to produce summaries of the
estimates, predictions forward in time, standardised visualisations, and population-
weighted regional aggregations of immunisation coverage estimates. The details of
this workflow are illustrated below with a worked example of coverage data for the
WHO AFRO region. Further information on the R package can be found within
the documentation, including a long-form vignette, see help(imcover).
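As an illustration of the population-weighted regional aggregation mentioned above, the following base R sketch computes a weighted mean of country-level estimates. The data are made up and the code does not call imcover; in a Bayesian workflow one would plausibly apply the same weights to each posterior draw, which is what the second part assumes.

```r
# Hypothetical country-level coverage estimates (%) and target populations
est <- data.frame(
  country    = c("A", "B", "C"),
  coverage   = c(90, 60, 75),
  population = c(1000, 3000, 1000)
)

# Regional coverage as a population-weighted mean
regional <- weighted.mean(est$coverage, w = est$population)

# Applied draw by draw, the same weights give a regional posterior sample.
# Two made-up draws (rows) for the three countries (columns):
draws <- rbind(c(88, 58, 74), c(92, 62, 76))
w <- est$population / sum(est$population)
regional_draws <- as.vector(draws %*% w)

print(regional)
print(regional_draws)
```

Summaries such as credible intervals for the region can then be taken from `regional_draws` rather than from the country-level intervals.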
Supplementary Figure 1: Overview of steps supported by the imcover
Workflow example
In the following sections we provide a worked example to produce time series of
modelled estimates of immunisation coverage using the model-based approach im-
plemented in R using imcover. After loading the package, we first obtain the data
from the WHO Immunisation Data Portal (
An internet connection is required for this step.
    # load the package within the R environment
    library(imcover)

    # download administrative and official records
    cov <- download_coverage()

    # download survey records
    svy <- download_survey()
Data downloading is handled by two functions: download_coverage and
download_survey. These handle administrative/official estimates and survey datasets,
respectively. By default, the downloaded files are stored as temporary documents in the
user’s R temporary directory; however, the functions provide the option for users
to save the downloaded files to a user-specified location and load them later from a
local file path. In this way, a user can also load their own source of immunisation
coverage data and process it into a standardised format for modeling.
Data processing and formatting
As part of the download function, a series of checks and cleaning steps are applied
by default. The goal of these checks is to identify the core attributes in the input
data necessary to construct an immunisation coverage dataset. Specifically, these
attributes include a country, time, vaccine identifier, and percent of the population
covered by that vaccine. In the absence of a reported coverage percentage, the
number of doses administered and target population can be used to estimate cov-
erage. Identifying the core attributes allows the user to harmonise multiple source
files into a unified data format for modelling. Administrative and official coverage
estimates are processed similarly. Within download_survey, the household survey
datasets require some more specialised processing. For instance, multi-dose vaccine
reports can have reporting and recall biases. Note that all the processing steps can
be carried out using separate functions available in imcover if advanced users want
more control over pre-processing.
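The fallback described above, where coverage is derived from doses administered and the target population when no percentage is reported, can be sketched as follows. The function `estimate_coverage` and the column names are hypothetical and are not taken from imcover.

```r
# Fill in missing coverage (%) from doses administered and target population.
# Function and column names are hypothetical.
estimate_coverage <- function(coverage, doses, target) {
  computed <- 100 * doses / target
  # Keep the reported percentage where present; otherwise use the computed one
  ifelse(is.na(coverage), computed, coverage)
}

# Made-up records: reported value, computable value, and insufficient data
records <- data.frame(
  coverage = c(85, NA, NA),
  doses    = c(NA, 4500, 300),
  target   = c(NA, 5000, NA)
)
records$coverage <- estimate_coverage(records$coverage, records$doses, records$target)
print(records)
```

Records for which neither a percentage nor both counts are available remain missing and can be dropped later.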
Within the imcover package, processed coverage datasets are stored in ic.df
format, or an “immunisation coverage data frame”. This format extends the com-
mon data frame object of R where observations are rows and attributes are stored
in columns. ic.df objects support all standard methods and ways of working with
data frames in R. This includes selecting records and columns by indices or column
names, merging data frames, appending records, renaming, etc. The advantage
of the ic.df format over a standard data frame is that it includes information to
identify the columns containing core coverage information as well as notes on data
pre-processing that has been done. These allow users to combine disparate sources
of information on immunisation coverage into a harmonised analysis dataset without
having to adjust for missing or differently named columns.
    # note the type of object created
    class(cov)
    #> [1] "ic.df"      "data.frame"
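To illustrate the general idea of a data frame subclass that records which columns hold the core coverage attributes, here is a minimal S3 sketch. It is not the ic.df implementation: the class name `cov_df`, the `roles` and `notes` attributes and the helper functions are all hypothetical.

```r
# Hypothetical constructor for a data frame subclass carrying column roles.
# This is NOT the imcover implementation of ic.df.
new_cov_df <- function(x, roles, notes = character()) {
  stopifnot(is.data.frame(x), all(roles %in% names(x)))
  attr(x, "roles") <- roles  # maps core attributes to column names
  attr(x, "notes") <- notes  # record of pre-processing steps applied
  class(x) <- c("cov_df", "data.frame")
  x
}

# Look up the coverage column through its role rather than its name
coverage_values <- function(x) x[[attr(x, "roles")[["coverage"]]]]

raw <- data.frame(
  iso3    = c("AGO", "BEN"),
  yr      = c(2019, 2019),
  antigen = c("DTP1", "DTP1"),
  pct     = c(73, 84)
)
cd <- new_cov_df(
  raw,
  roles = c(country = "iso3", time = "yr", vaccine = "antigen", coverage = "pct")
)

print(class(cd))
print(coverage_values(cd))
high <- cd[cd$pct > 80, ]  # ordinary data frame indexing still applies
```

Because the role lookup does not depend on the column being called `coverage`, two sources with differently named columns can be handled by the same downstream code.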
The data files available from the WHO website require some additional cleaning
before analysis. Notice that the data objects created by imcover can work with all
standard R commands.
    # Further cleaning of immunization records
    # drop some record categories (PAB, HPV, WUENIC)
    cov <- cov[cov$coverage_category %in% c("ADMIN", "OFFICIAL"), ]
    cov$coverage_category <- tolower(cov$coverage_category)  # clean-up

    # create a combined dataset
    dat <- rbind(cov, svy)

    # remove records with missing coverage values
    dat <- dat[!is.na(dat$coverage), ]

    # mismatch in vaccine names between coverage and survey
    dat[dat$antigen == "DTPCV1", "antigen"] <- "DTP1"
    dat[dat$antigen == "DTPCV2", "antigen"] <- "DTP2"
    dat[dat$antigen == "DTPCV3", "antigen"] <- "DTP3"

    # subset records
    dat <- ic_filter(dat,
                     vaccine = c("DTP1", "DTP3", "MCV1", "MCV2", "PCV3"),
                     time = 2000:2020)
In preparation for the statistical modelling we carry out several additional pre-
processing steps. Firstly, some records show inconsistencies in the levels of coverage
between multi-dose vaccines. To maintain consistency, so that coverage of later
doses cannot exceed that of earlier doses, we model the ratio of third-dose to first-dose coverage.
In this example, we only adjust DTP1 and DTP3, but other multi-dose vaccines
could be processed in a similar manner.
    # adjustment - use ratio for DTP3
    dat <- ic_ratio(dat, numerator = "DTP3", denominator = "DTP1")
The ic.df object will now store a note that this processing step has been carried
out so that the ratio is correctly back-transformed and that coverage estimates and
predictions are adjusted appropriately.
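The reasoning behind the ratio approach can be illustrated with a small base R sketch. It assumes, for illustration only, that the ratio is estimated on the logit scale, so that any back-transformed value lies in (0, 1); the vectors and the "model output" below are made up and do not come from imcover.

```r
# Made-up coverage values (%) for three country-years
dtp1 <- c(92, 80, 65)
dtp3 <- c(85, 76, 50)

# The quantity that replaces DTP3 in the data: third dose relative to first dose
ratio <- dtp3 / dtp1

# Stand-in for model output (an assumption): estimates on the logit scale.
# Here they simply reproduce the observed values.
dtp1_hat      <- dtp1
logit_ratio   <- qlogis(ratio)
ratio_hat     <- plogis(logit_ratio)  # always strictly between 0 and 1

# Back-transformation: DTP3 is recovered as ratio x DTP1
dtp3_hat <- ratio_hat * dtp1_hat
print(dtp3_hat)
```

Since `ratio_hat` is below 1 by construction, the back-transformed third-dose coverage can never exceed the first-dose coverage it is multiplied by.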
Secondly, we need to force coverage estimates to lie between 0% and 100% so
that we can model the data with a logit transformation.
    # maintain coverage between 0-100%
    dat <- ic_adjust(dat, coverage_adj = TRUE)
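The need for this step follows from the logit transformation itself, which is infinite at exactly 0% and 100% and undefined beyond those bounds. The sketch below shows one simple way to bound the data before transforming; the function `clamp_coverage` and the offset `eps = 0.1` are assumptions for illustration and may differ from what ic_adjust does.

```r
# Bound coverage percentages strictly inside (0, 100).
# The offset eps is an illustrative assumption.
clamp_coverage <- function(x, eps = 0.1) {
  pmin(pmax(x, eps), 100 - eps)
}

cov_pct <- c(0, 45, 100, 112)  # made-up values, including one above 100%
adj <- clamp_coverage(cov_pct)

# Logit transform of the bounded proportions, and its inverse
logit_cov <- qlogis(adj / 100)
back <- 100 * plogis(logit_cov)

print(adj)
print(logit_cov)
```

Without the bounding step, `qlogis(1)` would return `Inf` and a value such as 112% would produce `NaN`, neither of which can be used in model fitting.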
Fitting models