Modeling and Regionalization of China's PM2.5 Using Spatial-Functional Mixture Models

Decai Liang, Haozhe Zhang, Xiaohui Chang, and Hui Huang
Abstract
Severe air pollution affects billions of people around the world, particularly in de-
veloping countries such as China. Effective emission control policies rely primarily on a
proper assessment of air pollutants and accurate spatial clustering outcomes. Unfortu-
nately, emission patterns are difficult to observe as they are highly confounded by many
meteorological and geographical factors. In this study, we propose a novel approach
for modeling and clustering PM2.5 concentrations across China. We model observed
concentrations from monitoring stations as spatially dependent functional data and
assume latent emission processes originate from a functional mixture model with each
component as a spatio-temporal process. Cluster memberships of monitoring stations
are modeled as a Markov random field, in which confounding effects are controlled
through energy functions. The superior performance of our approach is demonstrated
using extensive simulation studies. Our method is effective in dividing China and
the Beijing-Tianjin-Hebei region into several regions based on PM2.5 concentrations,
suggesting that separate local emission control policies are needed.
Keywords: Latent emission process; Model-based clustering; Markov random field; Environ-
mental policies.
Decai Liang is a Ph.D. candidate at the School of Mathematical Science and Center for Statistical
Science, Peking University, Beijing, P.R. China, 100871 (Email: liangdecai@pku.edu.cn). Haozhe Zhang
is a Data & Applied Scientist at Microsoft Corporation, Redmond, WA 98052 (Email: haozhe.zhang@microsoft.com).
Xiaohui Chang is an assistant professor of Business Analytics at the College of
Business, Oregon State University, Corvallis, OR 97331 (Email: xiaohui.chang@oregonstate.edu). Hui
Huang is a professor of Statistics at the School of Mathematics, Sun Yat-sen University, Guangzhou, P.R.
China, 510275 (Email: huangh89@mail.sysu.edu.cn). For correspondence, please contact Hui Huang.
1 Introduction
Among all air pollutants, fine particulate matters with aerodynamic diameters less than 2.5
µm, also known as PM2.5, are generally regarded as the most health-damaging because they
easily penetrate the lung barrier and directly enter into the circulatory system. Numerous
studies have shown that chronic exposure to high concentrations of PM2.5 contributes to the
risk of developing cardiovascular and respiratory diseases and lung cancers (Pope et al., 2002;
Hoek et al., 2013; Lelieveld et al., 2015). Global Burden of Disease estimated that long-term
exposure to PM2.5 caused 4.2 million deaths worldwide in 2015, making it the fifth-ranked
global risk factor that year (Cohen et al., 2017).
Due to the rapid industrialization and urbanization in recent decades, many areas of
China have experienced the most chronic and severe air pollution in the world with the
highest PM2.5 levels (van Donkelaar et al., 2010). In the first quarter of 2013, extremely
severe smog affected more than 800 million people in China. About 70% of the days in
January registered daily average PM2.5 concentrations that exceeded 75 µg/m³ in numerous
cities (Huang et al., 2014), more than seven times the World Health Organization’s (WHO)
recommended level of 10 µg/m³. In response to the consistently poor air quality, the Chinese
government directed massive efforts to assess air quality and evaluate the health impacts of
air pollution for the entire country. For instance, real-time high-quality air pollutant con-
centration measurements have been collected from a large national monitoring network since
2013. This dataset quickly became one of the key pillars for the development of environ-
mental policies and emission control strategies (Zhang et al., 2017). Unfortunately, the mea-
surements may not provide an accurate depiction of the true characteristics of air pollutant
emission, as the distribution and transmission patterns of PM2.5 are highly confounded by
factors including meteorological conditions, topography, local emissions, secondary aerosols,
and regional transportation (Liang et al., 2015). These uncertainties, along with the large
variability of PM2.5 in space, bring challenges for assessing and monitoring PM2.5 in China.
Thus, to carry out the “coordinated inter-regional prevention and control efforts” initiated
by the Chinese government (Li, 2015), a comprehensive statistical methodology that exam-
ines underlying emission patterns and incorporates spatio-temporal variations is urgently
needed.
In this work, we propose a novel approach for modeling China’s PM2.5 data collected
from the national monitoring network. The observed PM2.5 concentration at each station is
modeled as functional data (Ramsay and Bernard, 2005). The unobserved true emission is
assumed to be a latent process that employs a functional finite mixture model to account for
spatial heteroscedasticity. Each component of the mixture model is a spatio-temporal process
with a temporal functional principal component (FPC) expansion and spatially correlated
FPC scores from different stations. Under our framework, stations are clustered into various
regions based on their emission patterns. The cluster memberships (weights of components)
follow a Markov random field (MRF) model, and topological factors are also exploited to
define the similarity measures between stations. This approach makes inferences in both
space and time while performing a model-based clustering for PM2.5 emission.
In environmental science, cluster analysis has gained considerable attention in recent
years due to its wide applicability to many real-life environmental problems. Clustering is
frequently referred to as “regionalization” in the field because the outcomes are specifications
of locations or regions. Researchers organize environmental units into homogeneous zones
with the goal of establishing local environmental control strategies in different regions. Some
applications in which regionalization has played a significant role are dust storms (Qian et al.,
2004), precipitation (Zhang et al., 2016), and air pollution (Wang et al., 2015). The conven-
tional regionalization methods adopted in the field include empirical orthogonal functions
(EOF) and its rotated version (REOF) (Zhang et al., 2012; Wang et al., 2015), which are
basically spatial principal component analysis and the corresponding component rotations,
respectively. Several new regionalization techniques have emerged, such as self-organizing
maps (SOM) (Li et al., 2000) and k-means (Li et al., 2015). There are clear drawbacks to
these approaches. The EOF and its extensions may be useful for initial data exploration but
are unsuitable for investigation and interpretation of data characteristics. The determination
of cluster boundaries using EOF is subjective. Moreover, it is very challenging to handle
multi-scale data like ours through EOF, SOM or k-means. The most serious drawback of
these approaches is that they either completely ignore or pay little attention to the intrinsic
spatio-temporal structures of data, precluding accurate inferences for data with strong space-
or time-varying features.
Cluster analysis has been well studied in the functional data analysis literature for its
practical applications. For instance, James and Sugar (2003) developed a flexible model-based
procedure for sparsely sampled longitudinal data. Chiou and Li (2007) proposed k-center
functional clustering that is a functional version of k-means. Peng and Müller (2008)
introduced a distance-based method with multi-dimensional scaling. For high-dimensional
functional data, some of the commonly used clustering methods largely rely on penalized
likelihood (Pan and Shen, 2007), high-dimensional data clustering (Bouveyron and Jacques,
2011), an approximation for the density of functional random variables (Jacques and Preda,
2013), or wavelets (Giacofci et al., 2013). Nevertheless, these methods were all designed for
independent curves and are unsuitable for spatially dependent data.
Earlier works on clustering spatial-functional data are relatively scarce. Romano et al.
(2013) considered the spatial dependence among functions based on variogram models. Gi-
raldo et al. (2012) proposed a hierarchical approach based on a dissimilarity matrix among
curves. A recent technique introduced by Jiang and Serban (2012) incorporated an MRF
into the modeling process to characterize spatial correlation and cluster dependence. MRF
originated from the field of statistical physics and is a general version of the Ising model (Kin-
dermann and Snell, 1980). In cluster analysis, especially for model-based methods, cluster
memberships of stations are usually assumed to be random and their probabilities are mod-
eled using a multinomial distribution. In this work, we employ the MRF-based approach
to model both the spatial dependence and cluster memberships, and k-nearest neighbors
combined with geographical information are also included for neighborhood definition.
This work contributes to the literature in the following five dimensions. First, we in-
troduce a unified framework for joint modeling and clustering. The spatial dependence
among the latent emission processes is embedded into the functional mixture model while
the cluster memberships are assigned using an MRF model. Second, our method allows for
heteroscedastic spatial dependence structures for different clusters, a much more realistic
assumption compared to having the same spatial structure for all clusters in Jiang and Ser-
ban (2012), and greatly enhances the flexibility and applicability of our method. Third, this
procedure has numerous practical advantages over other regionalization methods in real-
life applications, including but not limited to, improved interpretability and clear cluster
boundaries with stations connected within the same cluster, easy adaptation to multi-scale
data, possible extension to multi-pollutant regionalization, and more comprehensive statisti-
cal inferences on data features. Fourth, the numerical performance of this method is shown
to be superior compared to others using extensive simulation studies. Last but not least,
we also propose a Monte Carlo EM approach to compute the likelihood in the presence of
multiple latent variables.
The structure of this paper is as follows. In Section 2, we introduce two PM2.5 datasets
from China and the Beijing-Tianjin-Hebei (BTH) region, one of the most populated and polluted
areas in the country, and discuss why regionalization is needed. We describe our method
and the estimation procedures in Sections 3 and 4, respectively. The simulation studies are
presented in Section 5. In Section 6, we demonstrate two examples of the application of the
method using data from China and the BTH region. The paper concludes with a discussion
in Section 7. Technical details of the Monte Carlo EM algorithm are in the Appendices.
Other details including the datasets, R code, and additional results can be found in the
Supplementary Material online.
2 Data Description
We analyze two datasets of different spatio-temporal scales: (1) city-level daily PM2.5
concentration data for the entire country, and (2) station-level monthly concentration data collected
from the BTH region. Performing regionalization for the entire country is highly challeng-
ing due to the widely diverse landscapes and drastically different meteorological conditions
that are critical for modeling air pollutant data. In contrast, the smaller BTH region is
more homogeneous in landscape and meteorological conditions, and this region is adopted
to demonstrate the performance of the proposed methodology.
To smooth data variability and reduce extreme values, we apply a logarithmic transforma-
tion to pollutant measurements. For both datasets, the topographic information (including
longitude, latitude, and elevation) is available. Pairwise distances between locations are cal-
culated using the great circle distance that is defined as the shortest distance between two
points on the surface of a sphere measured along the surface of the sphere (Porcu et al.,
2016).
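As a rough illustration of the pairwise distance computation described above, the following sketch evaluates the great-circle distance with the haversine formula. The function name `great_circle_km` and the mean Earth radius of 6371 km are assumptions made for this example and are not taken from the paper or its supplementary R code.

```python
import math

# Mean Earth radius in km; an assumed constant for this sketch, not from the paper.
EARTH_RADIUS_KM = 6371.0


def great_circle_km(lon1, lat1, lon2, lat2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two (longitude, latitude) points given in degrees.

    Uses the haversine formula on a sphere of the given radius.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above 1 for antipodal points.
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))
```

Elevation is not used here; how topography enters the neighborhood definition is a separate modeling choice in the paper.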
2.1 China
China’s Ministry of Ecology and Environment has established a large monitoring network
for air quality assessment since 2013. This national network had expanded to more than
1,500 monitoring stations in 338 cities in 2015 and 2016. Real-time measurements of major
pollutants are continuously recorded and directly transferred to the China National Environmental
Monitoring Center (CNEMC). Pollutant measurements are collected using continuous
automated methods through either tapered element oscillating microbalance or beta
ray attenuation (Wang et al., 2015). All equipment meets the standards of CNEMC.
Figure 1: Locations of 338 cities (marked by dots) in China with monitoring stations, and
the time series subplots of PM2.5 daily concentrations (µg/m³) of four megacities (Beijing,
Tianjin, Shanghai, and Guangzhou) from January 2015 to December 2016.
Our city-level daily data are obtained by averaging hourly PM2.5 concentrations from all
monitoring stations in each city. A total of 731 measurements are available for each of the
338 cities from January 1, 2015, to December 31, 2016. We also remove 25 cities with low
data quality from Xinjiang and Tibet. For the remaining 313 cities, missing data (< 0.6%)
are imputed using linear interpolation.
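The preprocessing steps mentioned in this section (linear interpolation of missing values and a logarithmic transformation) could be sketched as follows. This is only an illustration under stated assumptions: missing values are encoded as `None`, gaps at the ends of a series are filled with the nearest observed value, and interpolation is applied before the log transform. None of these choices is specified in the text, and the function names are hypothetical.

```python
import math


def interpolate_missing(series):
    """Fill None entries by linear interpolation between the nearest observed values.

    Leading or trailing gaps are filled with the nearest observed value
    (an assumption of this sketch).
    """
    original = list(series)
    observed = [i for i, v in enumerate(original) if v is not None]
    if not observed:
        raise ValueError("series has no observed values")
    filled = list(original)
    for i, v in enumerate(original):
        if v is not None:
            continue
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:
            filled[i] = original[right]
        elif right is None:
            filled[i] = original[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = (1 - w) * original[left] + w * original[right]
    return filled


def log_transform(series):
    """Natural-log transform of strictly positive concentrations."""
    return [math.log(v) for v in series]
```

For example, `log_transform(interpolate_missing(daily_values))` would produce the log-scale series for one city, where `daily_values` is a hypothetical list of daily averages.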
The locations of all 338 cities are presented in Figure 1. As an illustration, we also
highlight four megacities (i.e., Beijing, Tianjin, Shanghai, and Guangzhou) and display their
average daily observations from January 2015 to December 2016. The PM2.5 time series
of Beijing and Tianjin are similar and highly correlated, especially during winter, partially
due to their close geographical proximity. On cold days, particle pollution is severe in
North China as a result of coal-burning for heating, and the large variation is explained
by the frequent and strong northern winds that can blow away air pollutants. The PM2.5
concentrations of Shanghai and Guangzhou are quite stable throughout the year but with
some clear differences in their mean functions and variations. Refer to Liang et al. (2016) for
a detailed discussion on PM2.5 patterns and weather influences in these Chinese megacities.
In this paper, we focus on feature extraction and separation of long-term and large-spatial-
scale patterns of PM2.5 across regions.
2.2 Beijing-Tianjin-Hebei
The Beijing-Tianjin-Hebei (BTH) region is one of the most polluted areas in the world,
mostly due to the emission of primary pollutants and weather conditions (Chen et al., 2018).
It is the national capital region of China, consisting of two of the most populated Chinese
cities (Beijing and Tianjin) and eleven cities in Hebei Province. Serious environmental
concerns were raised during preparations for the 2008 Summer Olympics in Beijing, and
many studies focusing on this area have followed since then (Lin et al., 2008; Wang et al.,
2009; Xu et al., 2011).
There is a clear disparity in the air pollution level in the BTH region. For example,
the northern cities, including Zhangjiakou, Chengde, and Qinhuangdao, are located in a
mountainous area and marginally affected by emissions from factories and plants, whereas
the cities to the south of Beijing are burdened with emissions from steel and cement factories,
coal mines, and coking plants in Hebei Province. Therefore, the regionalization of the BTH
region using air pollutant data would provide significant information for both policy makers
and researchers.
The BTH region consists of a total of 73 national monitoring stations; see Figure 2 for
a map of all stations and cities. These stations are a subset of the stations in the national
network described in Section 2.1 and identical to those examined in earlier studies, e.g., Chen
et al. (2018). Due to the high proportions of missing values in the first half of 2013, we limit
our analysis to the period from June 2013 to December 2016. We also use monthly averaged data
instead of daily measurements to eliminate any local patterns and focus on the long-term
trend. There are 43 months in total.
Figure 2: Locations of 13 cities and 73 monitoring stations (marked by dots) in the Beijing-
Tianjin-Hebei (BTH) region. The lines represent the administrative borders of the cities.
3 Model and Assumptions
In this section, we propose a functional mixture model with spatially correlated random effects to measure the spatio-temporal dependencies of PM2.5 concentrations. We suppose the random functions of time are defined on a time domain $\mathcal{T}$ and sampled from locations in a spatial domain $\mathcal{D}$. Let $Y(\boldsymbol{s}_i, t_{ij})$ be the discrete observation at time $t_{ij} \in \mathcal{T}$ on the random curve of PM2.5 concentration at location $\boldsymbol{s}_i \in \mathcal{D} \subset \mathbb{R}^2$, where $i = 1, \ldots, n$, and $j = 1, \ldots, m_i$. The total number of spatial locations is $n$, and the total number of discrete observations at location $\boldsymbol{s}_i$ is $m_i$.
We denote $Z_i$ as the cluster membership of the random curve at location $\boldsymbol{s}_i$. In particular, the membership $Z_i$ is a random variable following a multinomial distribution with support $\{1, \ldots, K\}$, and $Z_i = k$ if the random curve at $\boldsymbol{s}_i$ belongs to the $k$th cluster, where $K$ is the total number of clusters. An MRF model is employed for the cluster membership to account for spatial dependence intrinsically embedded in the data. For any location $\boldsymbol{s}_i$, we assume the marginal probability $P(Z_i = k) = \pi_k$, where $\pi_k \ge 0$ for all $k$ and $\sum_{k=1}^{K} \pi_k = 1$.
3.1 Reduced-Rank Functional Mixture Model for PM2.5 Data
Conditional on the cluster membership $Z_i = k$ for $k = 1, \ldots, K$, we assume the following functional mixture model for PM2.5 measurements:
$$Y(\boldsymbol{s}_i, t_{ij}) \mid (Z_i = k) = \mu_k(t_{ij}) + \eta_k(\boldsymbol{s}_i, t_{ij}) + \epsilon_{ij}, \tag{1}$$
where $\mu_k(\cdot)$ is the mean function for the $k$th cluster, $\eta_k(\cdot, \cdot)$ is a zero-mean spatio-temporal process on $\mathcal{D} \times \mathcal{T}$ representing a spatially correlated functional random effect for the $k$th cluster, and the $\epsilon_{ij}$'s are iid measurement errors with $E(\epsilon_{ij}) = 0$ and $\mathrm{var}(\epsilon_{ij}) = \sigma^2_{\epsilon}$. We consider $\mu_k(\cdot) + \eta_k(\boldsymbol{s}_i, \cdot)$ as the latent and smooth random function of PM2.5 concentrations at location $\boldsymbol{s}_i$ after removing the measurement errors. This framework allows for spatially correlated random effects, making it more general than the identical random effects for all clusters adopted in James and Sugar (2003). Also, we consider $\eta_k(\boldsymbol{s}, t)$ as a spatial extension of a temporal process with a standard Karhunen-Loève expansion:
$$\eta_k(\boldsymbol{s}, t) = \sum_{q=1}^{\infty} \gamma_{q,k}(\boldsymbol{s})\, \psi_{q,k}(t), \tag{2}$$
where the $\psi_{q,k}(\cdot)$'s are orthonormal eigenfunctions known as the functional principal components (FPC), and the functional principal component score $\gamma_{q,k}(\boldsymbol{s}) = \int_{\mathcal{T}} \eta_k(\boldsymbol{s}, t)\, \psi_{q,k}(t)\, dt$ is the loading of $\eta_k(\boldsymbol{s}, t)$ on the $q$th principal component. We assume the $\gamma_{q,k}(\boldsymbol{s})$'s are zero-mean and second-order stationary random fields that are independent across $q$ and $k$.
The spatial structure of the functional data is modeled through the spatial covariance between the FPC scores. More specifically,
$$\mathrm{cov}\{\gamma_{q,k}(\boldsymbol{s}_1), \gamma_{q',k'}(\boldsymbol{s}_2)\} = \begin{cases} \sigma^2_{\gamma,q,k}\, \rho(\boldsymbol{s}_1 - \boldsymbol{s}_2; \boldsymbol{\phi}_{q,k}), & \text{if } q = q' \text{ and } k = k', \\ 0, & \text{otherwise,} \end{cases}$$
where $\sigma^2_{\gamma,1,k} \ge \sigma^2_{\gamma,2,k} \ge \cdots > 0$, and $\rho(\cdot)$ is a spatial correlation function that depends on the distance measure between two sites and some parameter vector $\boldsymbol{\phi}_{q,k}$. A wide range of spatial correlation functions can be employed to model the spatial dependence structure. For example, when the Matérn function, the most popular choice in geostatistics, is selected, $\boldsymbol{\phi}_{q,k}$ includes a range parameter and a smoothness parameter (Matérn, 1960). It is worth mentioning that the parameters $\sigma^2_{\gamma,q,k}$, $\boldsymbol{\phi}_{q,k}$, and the principal components $\psi_{q,k}(\cdot)$ can vary across $q$ and $k$. This model therefore allows for a heteroscedastic random effect structure, which is a much more realistic assumption for many real-life applications than a homoscedastic random effect structure.
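To make the covariance structure above concrete, the sketch below assembles the covariance matrix of one set of FPC scores across stations. It is only an illustration under simplifying assumptions: the correlation function is the exponential form $\rho(d) = \exp(-d/\text{range})$ (the Matérn family with smoothness 1/2), distances are planar Euclidean rather than great-circle, and the names `exponential_correlation` and `score_covariance` are hypothetical.

```python
import math


def exponential_correlation(distance, range_param):
    """Exponential correlation, i.e. a Matern correlation with smoothness 1/2."""
    return math.exp(-distance / range_param)


def score_covariance(coords, sigma2, range_param):
    """Covariance matrix of the q-th FPC scores within one cluster.

    Entry (a, b) equals sigma2 * rho(distance between station a and station b).
    Euclidean distance is used here purely for simplicity (an assumption).
    """
    n = len(coords)
    cov = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            d = math.dist(coords[a], coords[b])
            cov[a][b] = sigma2 * exponential_correlation(d, range_param)
    return cov
```

Letting `sigma2` and `range_param` differ by component and by cluster is what produces the heteroscedastic structure described in the text.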
In practice, the equation in (2) is often approximated by truncating the series using the first $Q$ leading functional principal components. Then, the reduced-rank version of the functional mixture model can be written as:
$$Y(\boldsymbol{s}_i, t_{ij}) \mid (Z_i = k) = \mu_k(t_{ij}) + \sum_{q=1}^{Q} \gamma_{q,k}(\boldsymbol{s}_i)\, \psi_{q,k}(t_{ij}) + \epsilon_{ij}, \tag{3}$$
where $Q$ is considered as a tuning parameter of interest (Li et al., 2013). A data-driven method for selecting $Q$ is discussed in more detail in Section 4.3.
3.2 Markov Random Field Model for Cluster Membership
Cluster membership is critical for specifying the joint distribution of the complete data in our framework. Following Jiang and Serban (2012) with some modifications, we assume the clustering configuration follows a locally dependent MRF model. The Markov property in space implies that the state (namely the cluster membership) of any given location, $\boldsymbol{s}_i$, depends on the states of its neighboring locations, denoted by $\partial i$. This assumption is reasonable because air pollutant data are usually locally dependent, and the degree of dependence is highly influenced by geographical factors, such as the elevation of mountains. To account for the spatial dependence in the cluster membership, we model the probability mass function of one cluster membership, conditioning on its neighbors, as the Gibbs distribution:
$$P(Z_i = k \mid Z_{\partial i}) = \frac{\exp\{U_{ik}(\nu)\}}{N_i(\nu)}, \tag{4}$$
where $Z_{\partial i}$ is a vector of cluster memberships of the neighbor locations of $\boldsymbol{s}_i$, $U_{ik}(\nu) = \nu \sum_{i' \in \partial i} I(Z_{i'} = k)$ with $\nu \ge 0$ is known as the energy function, $I(\cdot)$ is an indicator function, and $N_i(\nu) = \sum_{k=1}^{K} \exp\{U_{ik}(\nu)\}$ is a normalizing constant. The energy function $U_{ik}(\nu)$ determines the spatial pattern of the entire region. A large value of $U_{ik}(\nu)$ corresponds to a spatial pattern where many spatially connected locations belong to the same cluster, whereas a small $U_{ik}(\nu)$ implies a weak spatial dependence in the cluster membership. The parameter $\nu$ reflects the degree of interaction among the nearby sites in the MRF: a large $\nu$ represents a highly spatially dependent cluster membership, and $\nu = 0$ when there is no spatial dependence at all, in which case each location has an equal chance of belonging to any cluster. This idea of using the Gibbs distribution to model dependence structure originates from statistical physics (Kindermann and Snell, 1980) and has been frequently used in spatial statistics (Clifford, 1990).
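The conditional probability in (4) can be evaluated directly once the neighbors' memberships are known. The sketch below is a minimal illustration, assuming clusters are labeled 1 to K and that the neighbor set has already been determined; the function name `membership_conditional` is hypothetical.

```python
import math


def membership_conditional(neighbor_labels, n_clusters, nu):
    """Return [P(Z_i = k | Z_neighbors) for k = 1..K] under the Gibbs distribution.

    The energy is U_ik(nu) = nu * (number of neighbors currently in cluster k).
    """
    energies = [
        nu * sum(1 for z in neighbor_labels if z == k)
        for k in range(1, n_clusters + 1)
    ]
    # Subtract the maximum energy before exponentiating for numerical stability;
    # this leaves the normalized probabilities unchanged.
    shift = max(energies)
    weights = [math.exp(u - shift) for u in energies]
    total = sum(weights)
    return [w / total for w in weights]
```

With `nu = 0` the function returns equal probabilities for all clusters, matching the remark above about the absence of spatial dependence.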
4 Estimation and Implementation
4.1 Spline Approximation
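We use polynomial splines to approximate and estimate the mean functions and eigenfunctions defined in Section 3. For simplicity, we assume the time domain is $\mathcal{T} = [0, 1]$ and use equally spaced knots in $\mathcal{T}$. Let $\boldsymbol{B}(t) = \{b_1(t), \ldots, b_p(t)\}^T$ be a spline basis with dimension $p$ (de Boor, 2001). We can approximately write $\mu_k(t) = \boldsymbol{B}^T(t)\boldsymbol{\alpha}_k$ and $[\psi_{1,k}(t), \ldots, \psi_{Q,k}(t)] = \boldsymbol{B}^T(t)\boldsymbol{\Theta}_k$, where $\boldsymbol{\alpha}_k$ and $\boldsymbol{\Theta}_k$ are $p \times 1$ and $p \times Q$ matrices of spline coefficients, respectively. Following Zhou et al. (2008) and Zhou et al. (2010), we impose identifiability restrictions on $\boldsymbol{B}(t)$ and $\boldsymbol{\Theta}_k$ such that $\int \boldsymbol{B}(t)\boldsymbol{B}^T(t)\, dt = \boldsymbol{I}_{p \times p}$ and $\boldsymbol{\Theta}_k^T \boldsymbol{\Theta}_k = \boldsymbol{I}_{Q \times Q}$, and further require the first nonzero element of each column of $\boldsymbol{\Theta}_k$ to be positive. See Appendix 1 of Zhou et al. (2008) on how to construct a spline basis that satisfies the orthonormal constraints described above. For each station $i$, we use $\boldsymbol{\gamma}_i = [\gamma_{i1}, \ldots, \gamma_{iQ}]^T$ to represent the spatial random effect. Conditional on $\{Z_i = k, \boldsymbol{\gamma}_i\}$, the reduced-rank model (3) takes the form:
$$Y(\boldsymbol{s}_i, t_{ij}) \mid (Z_i = k, \boldsymbol{\gamma}_i) = \boldsymbol{B}^T(t_{ij})\boldsymbol{\alpha}_k + \boldsymbol{B}^T(t_{ij})\boldsymbol{\Theta}_k \boldsymbol{\gamma}_i + \epsilon_{ij}. \tag{5}$$
A discussion on the selection of the spline basis is given in Section 4.3.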
4.2 Monte Carlo EM Algorithm
As is standard for model-based functional clustering (e.g., James and Sugar, 2003), we use a likelihood-based procedure and treat both the memberships $\boldsymbol{Z}$ and the spatial random effects $\boldsymbol{\gamma}$ as latent variables. To overcome the computational challenges associated with the joint distribution of $(\boldsymbol{Z}, \boldsymbol{\gamma})$, we use the Monte Carlo EM (MCEM) algorithm based on Gibbs sampling for parameter estimation (Wei and Tanner, 1990).
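With the spline approximation, the collection of parameters that need to be estimated becomes $\Omega = \{\sigma^2_{\epsilon}, \nu, \Omega_k;\ k = 1, \ldots, K\}$, where $\Omega_k = \{\boldsymbol{\alpha}_k, \boldsymbol{\Theta}_k, \boldsymbol{\phi}_{q,k}, \sigma^2_{\gamma,q,k}\}$. We represent the model in a hierarchical matrix form:
$$\begin{aligned} \boldsymbol{Y} \mid \boldsymbol{Z}, \boldsymbol{\gamma} &= \tilde{\boldsymbol{B}}^T \tilde{\boldsymbol{\alpha}}_{\boldsymbol{Z}} + \tilde{\boldsymbol{B}}^T \tilde{\boldsymbol{\Theta}}_{\boldsymbol{Z}} \boldsymbol{\gamma} + \boldsymbol{\epsilon}, \\ \boldsymbol{\gamma} \mid \boldsymbol{Z} &\sim \text{Normal}\left(\boldsymbol{0}, \tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}}\right), \\ \boldsymbol{Z} &\sim \text{Markov random fields}\,(\nu). \end{aligned}$$
Here, we use $\boldsymbol{Y}_i = \{Y(\boldsymbol{s}_i, t_{i1}), \ldots, Y(\boldsymbol{s}_i, t_{im_i})\}^T$ to denote the vector of all $m_i$ discrete observations at location $\boldsymbol{s}_i$ and then define $\boldsymbol{Y} = (\boldsymbol{Y}_1^T, \ldots, \boldsymbol{Y}_n^T)^T$. Let $\tilde{\boldsymbol{B}}^T$ be an $\tilde{n} \times np$ block diagonal matrix of spline basis functions $\boldsymbol{B} = \boldsymbol{B}^T(t)$, where $\tilde{n} = \sum_{i=1}^{n} m_i$ is the total sample size. For the cluster memberships, we write $\boldsymbol{Z} = (Z_1, \ldots, Z_n)^T$ as a vector of all cluster memberships with $Z_i$ taking values from 1 to $K$. This means one realization of $\boldsymbol{Z}$ is $(k_1, \ldots, k_n)^T$, where $k_i \in \{1, \ldots, K\}$ for all $i$. We define the spline coefficients $\tilde{\boldsymbol{\alpha}}_{\boldsymbol{Z}} = (\boldsymbol{\alpha}_{Z_1}^T, \ldots, \boldsymbol{\alpha}_{Z_n}^T)^T$ and $\tilde{\boldsymbol{\Theta}}_{\boldsymbol{Z}} = \mathrm{diag}(\boldsymbol{\Theta}_{Z_1}, \ldots, \boldsymbol{\Theta}_{Z_n})$. The FPC scores are $\boldsymbol{\gamma} = (\boldsymbol{\gamma}_1^T, \ldots, \boldsymbol{\gamma}_n^T)^T$, and the errors are $\boldsymbol{\epsilon} = (\boldsymbol{\epsilon}_1^T, \ldots, \boldsymbol{\epsilon}_n^T)^T$, where $\boldsymbol{\epsilon}_i = (\epsilon_{i1}, \ldots, \epsilon_{im_i})^T$. The conditional covariance matrix of $\boldsymbol{\gamma}$ given the cluster membership $\boldsymbol{Z}$ is $\tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}}$, which depends on the parameters $\boldsymbol{\phi}_{q,k}$ and $\sigma^2_{\gamma,q,k}$. More computational details on $\boldsymbol{\gamma}$ and $\tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}}$ are given in Appendix A.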
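Based on $f(\boldsymbol{Y}, \boldsymbol{Z}, \boldsymbol{\gamma}) = f(\boldsymbol{Y} \mid \boldsymbol{Z}, \boldsymbol{\gamma})\, f(\boldsymbol{\gamma} \mid \boldsymbol{Z})\, f(\boldsymbol{Z})$ and the assumptions made in Section 3.1, we write out the log-likelihood for the complete data:
$$\ell(\Omega; \boldsymbol{Y}, \boldsymbol{Z}, \boldsymbol{\gamma}) = \log\left\{f\left(\boldsymbol{Y} \mid \boldsymbol{Z}, \boldsymbol{\gamma}; \tilde{\boldsymbol{\alpha}}_{\boldsymbol{Z}}, \tilde{\boldsymbol{\Theta}}_{\boldsymbol{Z}}, \sigma^2_{\epsilon}\right)\right\} + \log\left\{f\left(\boldsymbol{\gamma} \mid \boldsymbol{Z}; \tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}}\right)\right\} + \log\{f(\boldsymbol{Z}; \nu)\}. \tag{6}$$
We adopt the pseudo-likelihood proposed in Besag (1975) to approximate the last term in (6), i.e., $\log f(\boldsymbol{Z}; \nu) \approx \sum_{i=1}^{n} \log\{f(Z_i \mid Z_{\partial i}; \nu)\}$. Given $\boldsymbol{Z} = (Z_1, \ldots, Z_n)^T$ and $\boldsymbol{\gamma} = (\boldsymbol{\gamma}_1^T, \ldots, \boldsymbol{\gamma}_n^T)^T$, the $\boldsymbol{Y}_i$'s are conditionally independent, each following an $m_i$-dimensional multivariate normal distribution with mean $\boldsymbol{B}^T(\boldsymbol{\alpha}_{Z_i} + \boldsymbol{\Theta}_{Z_i}\boldsymbol{\gamma}_i)$ and variance $\sigma^2_{\epsilon}$. As a result, the complete log-likelihood in (6) can be written as:
$$\begin{aligned} \ell(\Omega; \boldsymbol{Y}, \boldsymbol{Z}, \boldsymbol{\gamma}) &\approx \log\left\{f\left(\boldsymbol{Y} \mid \boldsymbol{Z}, \boldsymbol{\gamma}; \tilde{\boldsymbol{\alpha}}_{\boldsymbol{Z}}, \tilde{\boldsymbol{\Theta}}_{\boldsymbol{Z}}, \sigma^2_{\epsilon}\right)\right\} + \log\left\{f\left(\boldsymbol{\gamma} \mid \boldsymbol{Z}; \tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}}\right)\right\} + \sum_{i=1}^{n} \log\{f(Z_i \mid Z_{\partial i}; \nu)\} \\ &\propto -\frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{K} I(Z_i = k) \left\{ m_i \log \sigma^2_{\epsilon} + \frac{\left\| \boldsymbol{Y}_i - \boldsymbol{B}^T(\boldsymbol{\alpha}_{Z_i} + \boldsymbol{\Theta}_{Z_i}\boldsymbol{\gamma}_i) \right\|_2^2}{\sigma^2_{\epsilon}} \right\} \\ &\quad - \frac{1}{2} I(Z_1 = k_1, \ldots, Z_n = k_n) \left\{ \log\left|\tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}}\right| + \boldsymbol{\gamma}^T \tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}}^{-1} \boldsymbol{\gamma} \right\} \\ &\quad + \sum_{i=1}^{n} \sum_{k=1}^{K} I(Z_i = k) \left[ U_{ik}(\nu) - \log\{N_i(\nu)\} \right]. \end{aligned} \tag{7}$$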
The complete likelihood depends on the latent random variables $\boldsymbol{Z}$ and $\boldsymbol{\gamma}$, so it cannot be maximized directly. Instead, we treat these latent variables as missing data and estimate the unknown parameters using the EM algorithm by iterating between the E-step and the M-step until convergence. To determine the cluster memberships, we assign location $\boldsymbol{s}_i$ to the cluster $k$ that maximizes that location’s conditional probability $\pi_{k|i} = P(Z_i = k \mid \boldsymbol{Y}_i)$.
4.2.1 E-Step
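In the E-step, the expected log-likelihood is first calculated:
$$Q(\Omega \mid \Omega^{\text{prev}}) = E\left\{ \ell(\Omega; \boldsymbol{Y}, \boldsymbol{Z}, \boldsymbol{\gamma}) \mid \boldsymbol{Y}, \Omega^{\text{prev}} \right\}, \tag{8}$$
where $\Omega^{\text{prev}}$ represents the value of the parameters from the previous EM iteration.
To calculate the expectation on the right side of (8), we would need the exact joint conditional distribution $f(\boldsymbol{Z}, \boldsymbol{\gamma} \mid \boldsymbol{Y}, \Omega^{\text{prev}})$. But because $\boldsymbol{Z}$ and $\boldsymbol{\gamma}$ are not conditionally independent, this joint distribution cannot be directly approximated using the product of $f(\boldsymbol{Z} \mid \boldsymbol{Y}, \Omega^{\text{prev}})$ and $f(\boldsymbol{\gamma} \mid \boldsymbol{Y}, \Omega^{\text{prev}})$. This is different from James and Sugar (2003). Instead, we use the MCEM of Wei and Tanner (1990), which approximates the conditional expectation using a Monte Carlo approximation and was shown to converge to the maximum likelihood estimate under some general regularity conditions (Chan and Ledolter, 1995). More specifically, we use Gibbs sampling (Geman and Geman, 1984) based on the full conditional distributions $f(\boldsymbol{\gamma} \mid \boldsymbol{Y}, \boldsymbol{Z}, \Omega^{\text{prev}})$ and $f(\boldsymbol{Z} \mid \boldsymbol{Y}, \boldsymbol{\gamma}, \Omega^{\text{prev}})$ to simulate the joint conditional distribution. Then $Q(\cdot)$ in (8) can be estimated using the Monte Carlo average:
$$\hat{Q}(\Omega \mid \Omega^{\text{prev}}) = \frac{1}{T} \sum_{\tau=1}^{T} \ell\left(\Omega; \boldsymbol{Y}, \boldsymbol{Z}^{(\tau)}, \boldsymbol{\gamma}^{(\tau)}\right), \tag{9}$$
where $T$ is the size of the Monte Carlo samples, and $\boldsymbol{Z}^{(\tau)}$ and $\boldsymbol{\gamma}^{(\tau)}$ are samples from the conditional distribution $(\boldsymbol{Z}, \boldsymbol{\gamma} \mid \boldsymbol{Y}, \Omega^{\text{prev}})$ using Gibbs sampling. More details on the E-step are given in Appendix B.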
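As a rough illustration of the sampling idea behind (9), the sketch below performs a Gibbs sweep over cluster labels and then averages a function over the drawn samples. It is a toy simplification and not the algorithm of Appendix B: the random effects γ are omitted, `loglik[i][k]` is a hypothetical precomputed log-likelihood of station i under cluster k, labels are 0-based, and all function names are invented for this example.

```python
import math
import random


def gibbs_sweep(labels, neighbors, loglik, nu, rng):
    """One in-place Gibbs sweep over all stations.

    Each Z_i is redrawn from a full conditional proportional to
    exp{loglik[i][k] + nu * (number of neighbors of i currently in cluster k)}.
    """
    n_clusters = len(loglik[0])
    for i in range(len(labels)):
        scores = [
            loglik[i][k] + nu * sum(1 for j in neighbors[i] if labels[j] == k)
            for k in range(n_clusters)
        ]
        shift = max(scores)  # stabilize the exponentials
        weights = [math.exp(s - shift) for s in scores]
        labels[i] = rng.choices(range(n_clusters), weights=weights)[0]
    return labels


def monte_carlo_average(func, samples):
    """Average func over Monte Carlo samples, mirroring the form of (9)."""
    return sum(func(s) for s in samples) / len(samples)
```

In a fuller implementation, each sweep would alternate between drawing the memberships and drawing the random effects from their respective full conditionals, and `func` would be the complete-data log-likelihood.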
4.2.2 M-Step
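In the M-step, we update the parameter values to the values maximizing the approximated conditional expectation in (9). According to (7), we obtain:
$$\begin{aligned} \hat{Q}(\Omega \mid \Omega^{\text{prev}}) &= -\frac{1}{2T} \sum_{\tau=1}^{T} \sum_{i=1}^{n} \sum_{k=1}^{K} I\left(Z_i^{(\tau)} = k\right) \left\{ m_i \log \sigma^2_{\epsilon} + \frac{\left\| \boldsymbol{Y}_i - \boldsymbol{B}^T\left(\boldsymbol{\alpha}_k + \boldsymbol{\Theta}_k \boldsymbol{\gamma}_i^{(\tau)}\right) \right\|_2^2}{\sigma^2_{\epsilon}} \right\} \\ &\quad - \frac{1}{2T} \sum_{\tau=1}^{T} I\left(Z_1^{(\tau)} = k_1, \ldots, Z_n^{(\tau)} = k_n\right) \left\{ \log\left|\tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}^{(\tau)}}\right| + \boldsymbol{\gamma}^{(\tau)T} \tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}^{(\tau)}}^{-1} \boldsymbol{\gamma}^{(\tau)} \right\} \\ &\quad + \frac{1}{T} \sum_{\tau=1}^{T} \sum_{i=1}^{n} \sum_{k=1}^{K} I\left(Z_i^{(\tau)} = k\right) \left\{ U_{ik}^{(\tau)}(\nu) - \log N_i^{(\tau)}(\nu) \right\} \\ &= \hat{Q}_1(\Omega \mid \Omega^{\text{prev}}) + \hat{Q}_2(\Omega \mid \Omega^{\text{prev}}) + \hat{Q}_3(\Omega \mid \Omega^{\text{prev}}). \end{aligned} \tag{10}$$
Because $\hat{Q}_1$, $\hat{Q}_2$, and $\hat{Q}_3$ depend on mutually disjoint collections of parameters in $\Omega$, we can maximize them separately. To be more specific, $(\sigma^2_{\epsilon}, \boldsymbol{\alpha}_k, \boldsymbol{\Theta}_k)$ are updated by maximizing $\hat{Q}_1$, $(\boldsymbol{\phi}_{q,k}, \sigma^2_{\gamma,q,k})$ by $\hat{Q}_2$, and $\nu$ by $\hat{Q}_3$, respectively. The detailed M-step algorithm is provided in Appendix C.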
4.3 Tuning Parameter Selection
There are three key tuning parameters in the model: number of clusters (K), number of
FPCs (Q), and dimension of the spline basis (p). We develop a data-driven method to select
them.
The Bayesian Information Criterion (BIC) is one of the most popular methods for model selection. It adds a penalty term for the dimension of the parameter space to the log-likelihood function. Under our modeling framework, exact likelihood calculations using the standard EM algorithm can be quite challenging due to the presence of latent variables ($\boldsymbol{Z}$ and $\boldsymbol{\gamma}$). For a candidate model $M$, we propose to approximate its log-likelihood by the Monte Carlo average $\hat{Q}_M$ defined in (9), which is computed using Gibbs sampling in the final EM iteration. Note that this approximate likelihood coincides with the integrated likelihood introduced in Fraley and Raftery (2002), $f(\boldsymbol{Y} \mid \Omega) = \int f(\boldsymbol{Y} \mid \boldsymbol{Z}, \boldsymbol{\gamma}, \Omega)\, f\left(\boldsymbol{\gamma}, \boldsymbol{Z} \mid \tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}}, \nu\right) d\boldsymbol{\gamma}\, d\boldsymbol{Z}$. Through numerical approximation and Monte Carlo samples, Fraley and Raftery (2002) approximate the integrated likelihood by the BIC. Similarly, we define a Monte Carlo BIC for the model $M$:
$$\text{Monte Carlo BIC}(M) = -2\hat{Q}_M + c_M \cdot \log(\tilde{n}), \tag{11}$$
where $c_M$ is the number of parameters in $M$. The tuning parameters are then selected simultaneously by minimizing (11).
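The selection rule in (11) amounts to a small computation once the Monte Carlo average and the parameter count are available for each candidate. The sketch below is an illustration only; the function names and the dictionary layout of `candidates` are assumptions made for this example, and the values of the Monte Carlo average would come from the final EM iteration of each fitted model.

```python
import math


def monte_carlo_bic(q_hat, n_params, n_obs):
    """Monte Carlo BIC: -2 * q_hat + n_params * log(n_obs)."""
    return -2.0 * q_hat + n_params * math.log(n_obs)


def select_model(candidates, n_obs):
    """Return the key of the candidate with the smallest Monte Carlo BIC.

    `candidates` maps a model label, e.g. a (Q, p) pair, to a tuple
    (q_hat, n_params). This layout is hypothetical.
    """
    return min(
        candidates,
        key=lambda name: monte_carlo_bic(candidates[name][0], candidates[name][1], n_obs),
    )
```

A call such as `select_model({(2, 8): (-1520.3, 40), (3, 8): (-1511.9, 52)}, n_obs=3139)` would compare two hypothetical fits; the numbers shown are placeholders, not results from the paper.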
Among all tuning parameters, the number of clusters (K) is the most critical and also
the most challenging parameter to estimate. In real-life applications of this method, the
BIC score gradually decreases with increasing K (James and Sugar, 2003; Zhou et al., 2010)
and achieves its minimum when K > 30. However, such a large number of regions is not
supported by any scientific evidence, and it becomes impractical to establish and implement
a separate policy for each region. As a result, we use the expert knowledge of environmental
scientists and do not consider K as a tuning parameter in our regionalization studies. Other
implementation issues to expedite the model selection are addressed in the Supplementary
Material S.1.
5 Simulation Studies
We carry out two simulation studies to compare the performance of the proposed
spatial-functional clustering methodology against other methods in the literature. The first
simulation study focuses on the homoscedastic case where different clusters share the same
covariance structure, while the second study considers the heteroscedastic case where
different clusters have different covariance structures. For each study, we first simulate a synthetic
dataset, carry out several clustering methods, and then evaluate their performances. This
procedure is repeated 100 times.
Two metrics are adopted to quantify the accuracy of assigning cluster memberships to
curves. The first is the adjusted Rand index (ARI) (Hubert and Arabie, 1985), which is
an improved version of the Rand index (Rand, 1971) with expected value 0 and bounded
by ±1. It measures the similarity between the true cluster membership and the clustering
result obtained from a clustering method. A larger value of the adjusted Rand index implies a
more accurate clustering method. The second metric is the standardized Root Mean Squared
Error (RMSE), defined as
$\text{RMSE} = \sqrt{\|\mu_k(t) - \hat{\mu}_k(t)\|^2 / \|\mu_k(t)\|^2}$, measuring the accuracy of the estimated
mean pattern $\hat{\mu}_k$ against the truth $\mu_k$. We also evaluate the accuracy of the parameters and
functional principal components estimated from the model.
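The two metrics can be sketched in a few lines. This is a minimal illustration, not the authors' code: `adjusted_rand_index` follows the standard pair-counting definition of Hubert and Arabie (1985), and `standardized_rmse` approximates the function norms by sums over a common evaluation grid, which is an assumption about how the norm is discretized.

```python
import math
from collections import Counter

def comb2(x):
    """Number of unordered pairs among x items."""
    return x * (x - 1) / 2.0

def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same curves."""
    n = len(truth)
    sum_cells = sum(comb2(c) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb2(c) for c in Counter(truth).values())
    sum_cols = sum(comb2(c) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb2(n)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def standardized_rmse(mu_true, mu_hat):
    """sqrt(||mu - mu_hat||^2 / ||mu||^2), with norms taken over a grid."""
    num = sum((a - b) ** 2 for a, b in zip(mu_true, mu_hat))
    den = sum(a ** 2 for a in mu_true)
    return math.sqrt(num / den)
```

Note that the index is invariant to relabeling: swapping the names of the two clusters leaves it at 1 when the partitions agree.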
5.1 Homoscedastic Case
We simulate $n = 156$ points with coordinates $s_1, s_2, \ldots, s_{156} \in \mathbb{R}^2$ over a rectangular region
(107°E–125°E, 28°N–43°N) in North China. The cluster memberships are simulated
by generating a Markov random field using Gibbs sampling with $\nu = 0.5$, $K = 2$, and
a multinomial distribution. The neighbors $Z_{\partial i}$ are chosen using 5 nearest neighbors. The
synthetic data are generated from the following functional model:
$$
Y_i(t) \mid (\gamma_i, Z_i = k) = \mu_k(t) + \sum_{q=1}^{Q} \gamma_{i,q}\,\psi_q(t) + \epsilon_i(t), \tag{12}
$$
for $i \in \{1, \ldots, n\}$, $t \in \{\frac{1}{30}, \frac{2}{30}, \ldots, \frac{29}{30}, 1\}$, $Q = 2$, and $k \in \{1, 2\}$. Following the simulation
setup of the mean functions in Jiang and Serban (2012), we let the two cluster-dependent
mean functions be $\mu_1(t) = \frac{1}{2}\exp(t)\cos(t)$ and $\mu_2(t) = \cos\left(\frac{5\pi}{2}t\right)$. The two functional principal
component functions are orthogonalized by $\psi_1(t) = 2\sin(2\pi t)$ and $\psi_2(t) = b_3(t)$, where
$b_3(t)$ is the third basis function of the 4-dimensional cubic spline basis $B(t)$ defined in Section
4.1. The FPC scores, $\gamma_q$'s, are generated with the isotropic exponential covariance structure
$\text{cov}(\gamma_{i,q}, \gamma_{i',q}) = \sigma^2_{\gamma,q}\exp(-\|s_i - s_{i'}\|)$ with $\phi = 1$ and $(\sigma^2_{\gamma,1}, \sigma^2_{\gamma,2}) = (7, 2)$. The error term
$\epsilon_i$ is a white-noise process with variance $\sigma^2_\epsilon = 0.4$. For each simulated dataset, we use the
proposed Monte Carlo BIC to determine $K$, and the true $K = 2$ is correctly selected 79%
of the time. We closely monitor the convergence of the algorithm and provide additional
trace plots for the MCEM iterations in the online Supplementary Material S.2.
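The data-generating model (12) can be sketched as follows. Several simplifications are assumed and should be read as such: cluster labels are drawn independently rather than from the Markov random field, distances are plain Euclidean distances in degrees, and because the spline basis of Section 4.1 is not reproduced here, `psi2` uses a cosine as a stand-in for $b_3(t)$. All function names are hypothetical.

```python
import math
import random

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def simulate_scores(sites, variance, rng):
    """One FPC score per site with cov = variance * exp(-distance), i.e. phi = 1."""
    n = len(sites)
    cov = [[variance * math.exp(-math.dist(sites[i], sites[j]))
            for j in range(n)] for i in range(n)]
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]

def mu(k, t):
    """Cluster mean functions mu_1 and mu_2 from the simulation setup."""
    if k == 1:
        return 0.5 * math.exp(t) * math.cos(t)
    return math.cos(2.5 * math.pi * t)

def psi1(t):
    return 2.0 * math.sin(2.0 * math.pi * t)

def psi2(t):
    # Stand-in for b_3(t): the cubic spline basis is not reproduced here.
    return math.sqrt(2.0) * math.cos(2.0 * math.pi * t)

def simulate_curves(sites, labels, seed=0):
    """Generate Y_i(t) on the grid t = 1/30, ..., 1 following model (12)."""
    rng = random.Random(seed)
    grid = [j / 30 for j in range(1, 31)]
    g1 = simulate_scores(sites, 7.0, rng)   # sigma^2_{gamma,1} = 7
    g2 = simulate_scores(sites, 2.0, rng)   # sigma^2_{gamma,2} = 2
    sd_eps = math.sqrt(0.4)                 # white-noise variance 0.4
    return [[mu(labels[i], t) + g1[i] * psi1(t) + g2[i] * psi2(t)
             + rng.gauss(0.0, sd_eps) for t in grid]
            for i in range(len(sites))]

# 156 sites over the rectangle described in the text; labels are independent
# draws here, which is a simplification of the Markov random field.
rng = random.Random(1)
sites = [(rng.uniform(107, 125), rng.uniform(28, 43)) for _ in range(156)]
labels = [rng.choice([1, 2]) for _ in range(156)]
curves = simulate_curves(sites, labels, seed=2)
```

Drawing the scores through a Cholesky factor of the exponential covariance matrix is one common way to obtain spatially correlated Gaussian vectors; any other exact sampler would serve equally well.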
We compare our proposed spatial-functional mixture model under a Markov random field
(SFMM-MRF) with the following clustering methods:
(a) k-means clustering,
(b) James’ method (James and Sugar, 2003), a classical functional mixture model
assuming the same covariance structure for the random effects across all clusters,
(c) Jiang’s method (Jiang and Serban, 2012), a spatial clustering method assuming a locally
dependent Markov random field model for memberships,
(d) a functional mixture model assuming independence in both the random effects and
cluster memberships (FMM),
(e) a functional mixture model assuming independence in the random effects and Markov
random fields for the cluster memberships (FMM-MRF), and
(f) a functional mixture model with spatially dependent random effects but independent
cluster memberships (SFMM).
It is worth noting that the last three methods, namely FMM, FMM-MRF, and SFMM,
are special cases of the proposed SFMM-MRF. The FMM approach can also be seen as an
extension of James’ method by using functional principal component analysis.
Table 1 summarizes the means and standard deviations of the adjusted Rand index and
RMSE of all clustering methods based on 100 simulations. Our proposed model,
SFMM-MRF, has the largest adjusted Rand index and the smallest RMSE, which shows that the
proposed method outperforms the others. Table 2 demonstrates that our parameter
estimation outlined in Section 4 performs reasonably well.

Table 1: Means and standard deviations (in parentheses) of the adjusted Rand index (larger
is better) and the RMSE (smaller is better) using different clustering methods, based on 100
simulations for the homoscedastic case.

        k-means   Jiang     James     FMM       FMM-MRF   SFMM      SFMM-MRF
ARI     0.600     0.453     0.673     0.889     0.858     0.903     0.909
        (0.307)   (0.304)   (0.378)   (0.299)   (0.325)   (0.282)   (0.278)
RMSE    0.639     0.959     0.905     0.418     0.433     0.393     0.392
        (0.324)   (0.254)   (0.380)   (0.323)   (0.325)   (0.265)   (0.268)

Table 2: Means and standard deviations of parameter estimates of 100 simulations using the
proposed method (SFMM-MRF) for the homoscedastic case. The first row shows parameters
and their true values.

Parameter            $\phi = 1$   $\nu = 0.5$   $\sigma^2_{\gamma,1} = 7$   $\sigma^2_{\gamma,2} = 2$
Mean                 0.893        0.442         6.443                       2.385
Standard Deviation   0.150        0.075         1.142                       1.468
5.2 Heteroscedastic Case
With the same setup for spatial locations and cluster memberships, we generate another set
of synthetic data from a heteroscedastic functional model:
$$
Y_i(t) \mid (Z_i = k) = \mu_k(t) + \sum_{q=1}^{Q} \gamma_{i,q,k}\,\psi_{q,k}(t) + \epsilon_i(t), \tag{13}
$$
where $\sigma^2_\epsilon = 0.4$, $\mu_1(t) = \frac{1}{2}\exp(t)\cos(t)$, and $\mu_2(t) = \cos\left(\frac{5\pi}{2}t\right)$, same as in Section 5.1. The
heteroscedastic model is different from the homoscedastic model in that the random effects
and FPCs of the former depend on the cluster membership, whereas those of the latter do
not. To include the heteroscedasticity with reasonable complexity, we let $\psi_{1,k=1}(t) = b_2(t)$
and $\psi_{2,k=1}(t) = b_3(t)$ for Cluster 1, and $\psi_{1,k=2}(t) = b_4(t)$ and $\psi_{2,k=2}(t) = b_1(t)$ for Cluster
2. Here, $\{b_1(t), b_2(t), b_3(t), b_4(t)\}^T$ forms a 4-dimensional cubic spline basis as defined in
Section 4.1. The spatial covariance functions of the FPC scores are
$\text{cov}(\gamma_{i,q,k}, \gamma_{i',q,k}) = \sigma^2_{\gamma,q,k}\exp(-\|s_i - s_{i'}\|)$ where $\phi = 1$,
$(\sigma^2_{\gamma,1,1}, \sigma^2_{\gamma,2,1}) = (4, 1)$, and $(\sigma^2_{\gamma,1,2}, \sigma^2_{\gamma,2,2}) = (2, 0.5)$.
For the heteroscedastic case, we compare the heteroscedastic spatial-functional clustering
under a Markov random field (HSFMM-MRF) with others. One special case of
HSFMM-MRF is the heteroscedastic functional spatial clustering with spatial dependence in the
random effects and independence of cluster memberships (HSFMM). For each of the 100
simulations, we compare five of the seven clustering methods described in the previous
subsection and replace the remaining two with their corresponding heteroscedastic
counterparts. The adjusted Rand index and the RMSE are summarized in Table 3. On
average, our proposed method HSFMM-MRF produces the largest adjusted Rand index, and
its RMSE (0.284) is essentially tied with that of HSFMM (0.283) for the lowest among all methods.
Moreover, the last three columns in Table 3 are quite similar in values, indicating that the
framework of the spatial-functional mixture model is generally robust to the assumptions of spatial
structures. In other words, even when the spatial structure of data is misspecified, we can still
obtain relatively good parameter estimates and clustering results. In real data analysis, this
flexibility allows us to choose different methods for different purposes: either the more
complicated one for feature specification or the simpler one for computational advantages.
Table 4 consists of a summary of the parameter estimates obtained from the proposed
method. Again, they perform reasonably well. We also display both the true and the
estimated mean functions and functional principal components in Figure 3, which shows the
estimated functions are very close to the true ones.
[Figure 3 panels: Estimation of Mean Function 1 ($\mu_1(t)$); Estimation of Mean Function 2 ($\mu_2(t)$);
Estimation of FPC1,1 ($\psi_{1,k=1}(t)$); Estimation of FPC1,2 ($\psi_{2,k=1}(t)$); Estimation of FPC2,1
($\psi_{1,k=2}(t)$); Estimation of FPC2,2 ($\psi_{2,k=2}(t)$). Horizontal axis: Time.]
Figure 3: True (solid lines) and estimated (dashed lines) functions of two mean functions
(top panels) and four functional principal components (middle and bottom panels) for the
heteroscedastic case. The shaded areas are bound by the 5th and 95th percentiles of the
estimated functions.
Table 3: Means and standard deviations (in parentheses) of the adjusted Rand index and the
RMSE using different clustering methods based on 100 simulations for the heteroscedastic
case.

        k-means   James     Jiang     FMM       FMM-MRF   HSFMM     HSFMM-MRF
ARI     0.824     0.824     0.853     0.896     0.917     0.931     0.933
        (0.103)   (0.105)   (0.152)   (0.122)   (0.100)   (0.054)   (0.056)
RMSE    0.321     0.629     0.630     0.303     0.298     0.283     0.284
        (0.115)   (0.097)   (0.114)   (0.116)   (0.108)   (0.098)   (0.097)

Table 4: Means and standard deviations of parameter estimates of 100 simulations using the
proposed method (HSFMM-MRF) for the heteroscedastic case.

Parameter            $\phi = 1$   $\nu = 0.5$   $\sigma^2_{\gamma,1,1} = 4$   $\sigma^2_{\gamma,2,1} = 1$   $\sigma^2_{\gamma,1,2} = 2$   $\sigma^2_{\gamma,2,2} = 0.5$
Mean                 0.729        0.437         3.440                         0.875                         1.844                         0.509
Standard Deviation   0.267        0.071         0.774                         0.244                         0.468                         0.170

6 Data Analysis
6.1 China
[Figure 4 legend: North China Plain; Northeast China Plain; Guanzhong Plain; Middle Yangtze
River Plain; Jianghuai Plain; Northwest; Sichuan Basin; Pearl River Delta; Yungui Plateau.]
Figure 4: Regionalization map of China’s PM2.5 using daily average PM2.5 concentrations of
313 cities from January 2015 to December 2016. Each colored symbol represents a different
cluster. The total number of clusters is 9.

For the city-level daily spatio-temporal PM2.5 data of 313 cities from 2015 to 2016 (refer to
Section 2.1), we apply the proposed method, more specifically the SFMM-MRF, to carry out
a regionalization analysis. To model the mean functions and eigenfunctions, we use cubic
B-splines with 16 equally spaced interior knots. To model the cluster membership, for any
given city, we consider all cities within a 500 km radius as its neighbors since the correlation
of air pollution patterns at sites more than 500 km apart is generally weak (Gao et al., 2011).
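The 500 km neighborhood rule can be sketched as below. The text does not state how distances are measured, so the great-circle (haversine) distance used here is an assumption, as are the function names and the approximate city coordinates in the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two (lon, lat) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def radius_neighbors(coords, radius_km=500.0):
    """Map each site index to the indices of all other sites within radius_km."""
    n = len(coords)
    nbrs = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if haversine_km(*coords[i], *coords[j]) <= radius_km:
                nbrs[i].append(j)
                nbrs[j].append(i)
    return nbrs

# Approximate (lon, lat) for Beijing, Tianjin, and Guangzhou, for illustration.
coords = [(116.41, 39.90), (117.20, 39.13), (113.26, 23.13)]
nbrs = radius_neighbors(coords)
```

With a fixed radius, the number of neighbors varies from site to site, unlike the five-nearest-neighbor rule used in the simulations; isolated cities may end up with no neighbors at all.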
China has a topographically diverse landscape, including the highest mountains and the
largest plateau on Earth, which affects the movement of many air pollutants from one place to
another (Bryan and Adams, 2002). Two adjacent monitoring sites separated by a high
mountain may have distinctive pollution patterns and thus belong to two clusters. This
“mountain effect” coupled with other geographical factors can be easily incorporated into
our model by modifying the standard Euclidean distance with a spatial deformation to obtain
a “geographical distance.” For example, the distance between two sites that are separated
by a high mountain can be set to be much greater than their Euclidean distance while
other geometric properties of the Euclidean distance are also retained. The change in the
definition of the distance metric may lead to an alteration of neighbors. This method has
been implemented in earlier works, including Sampson and Guttorp (1992), and Anderes
and Stein (2008), among others.
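Alternatively, in our study, we extend the energy function $U_{ik}(\nu)$ defined in Section
3.2 by introducing a function $g(\cdot, \cdot)$ to model the geographical covariates between a site
and its neighbors. More specifically, we define
$\tilde{U}_{ik}(\nu) = \nu \sum_{i' \in \partial i} g(\boldsymbol{s}_i, \boldsymbol{s}_{i'})\, I(Z_{i'} = k)$ and
$\tilde{N}_i(\nu) = \sum_{k=1}^{K} \exp\{\tilde{U}_{ik}(\nu)\}$. The function $g(\boldsymbol{s}_i, \boldsymbol{s}_{i'})$ captures the geographical information
between site $\boldsymbol{s}_i$ and its neighboring site $\boldsymbol{s}_{i'}$. For instance, consider a simple case where we
set $d$ as the altitude threshold between $\boldsymbol{s}_i$ and $\boldsymbol{s}_{i'}$. If the largest altitude between $\boldsymbol{s}_i$ and $\boldsymbol{s}_{i'}$
is greater than $d$, implying the presence of an extremely high mountain between them, then
$\boldsymbol{s}_i$ and $\boldsymbol{s}_{i'}$ should not be in the same cluster. For this scenario, $g(\cdot, \cdot)$ can be written as
$$
g(\boldsymbol{s}_i, \boldsymbol{s}_{i'}) =
\begin{cases}
0, & \text{if the largest altitude between } \boldsymbol{s}_i \text{ and } \boldsymbol{s}_{i'} \text{ is greater than } d, \\
1, & \text{otherwise.}
\end{cases}
$$
Then, the conditional probability for the cluster membership becomes
$$
P(Z_i = k \mid Z_{\partial i}) = \frac{\exp\{\tilde{U}_{ik}(\nu)\}}{\tilde{N}_i(\nu)}
= \frac{\exp\left\{\nu \sum_{i' \in \partial i} g(\boldsymbol{s}_i, \boldsymbol{s}_{i'})\, I(Z_{i'} = k)\right\}}
{\sum_{k=1}^{K} \exp\left\{\nu \sum_{i' \in \partial i} g(\boldsymbol{s}_i, \boldsymbol{s}_{i'})\, I(Z_{i'} = k)\right\}}.
$$
In our analysis, $d$ is set to be 1 km.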
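The conditional membership probability with the altitude indicator can be sketched as follows. The function names are hypothetical, and the largest altitude between a site and each neighbor is taken as a given input; extracting it from an elevation raster is outside this sketch.

```python
import math

def g_altitude(max_altitude_km, d=1.0):
    """Indicator g: 0 if the largest altitude between two sites exceeds d, else 1."""
    return 0.0 if max_altitude_km > d else 1.0

def membership_probs(neighbor_labels, neighbor_max_alt_km, nu, K, d=1.0):
    """P(Z_i = k | neighbors) for k = 1..K under the extended energy function.

    neighbor_labels[j] is the cluster (1..K) of the j-th neighbor, and
    neighbor_max_alt_km[j] is the largest altitude between site i and it.
    """
    energy = [0.0] * K
    for label, alt in zip(neighbor_labels, neighbor_max_alt_km):
        energy[label - 1] += nu * g_altitude(alt, d)
    top = max(energy)  # subtract the maximum for numerical stability
    weights = [math.exp(u - top) for u in energy]
    total = sum(weights)
    return [w / total for w in weights]

# Two low-altitude neighbors in cluster 1; one neighbor in cluster 2 behind a
# 2.5 km ridge, so it contributes nothing to the energy of cluster 2.
probs = membership_probs([1, 1, 2], [0.2, 0.3, 2.5], nu=0.5, K=2)
```

In this example the neighbor behind the ridge is effectively removed from the neighborhood, which is how the indicator keeps a mountain barrier from pulling a site toward the cluster on the other side.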
The formation of PM2.5 is very complex with many important contributing factors
including meteorological conditions, population, local industry, traffic, instantaneous energy
consumption, secondary chemical reactions in the atmosphere, and others. PM2.5 comprises
a list of primary and secondary components that can also contribute to the study of
emission sources and patterns, but unfortunately their concentrations are not collected in the
monitoring network. In this study, the only accessible data are the PM2.5 concentrations
obtained from the monitoring stations, and they provide partial information about local
emission characteristics. We try to cluster cities based on the spatial-temporal trends
observed from the PM2.5 concentrations so that more effective regional policies and strategies
than the current practices may be established and implemented.
Figure 4 displays the clustering results when the number of clusters ($K$) is set to 9. This
choice of $K$ follows the recommendation of environmental scientists (Wang et al., 2015). We
observe a clear spatial clustering with several distinct geographical regions. For example,
North China Plain, Yangtze River Delta, Pearl River Delta, and Sichuan Basin are all
classified into separate clusters. These regions coincide with the list published by CNEMC
where air pollution is severe, and prevention and control strategies are needed (China’s State
Council, 2013). Another study done by Wang et al. (2015) also reports similar regions, but
our method defines regions with clearer boundaries. In addition, our method successfully
combines sites that are geographically far apart yet show similar PM2.5 patterns into one
cluster. For example, the Northwest cluster includes cities across five provinces: northern
Shanxi, middle Inner Mongolia, Ningxia, northern Gansu, and eastern Qinghai. These cities
are mostly resource-based, and their winter weather conditions are largely influenced by
the northwest monsoons. We also include the clustering results from other methods in the
Supplementary Material S.4 online for comparison. These methods appear to have much
less clear boundaries in their regionalization maps. This “clear spatial boundary” effect is
also one of the merits of the proposed method and can be mainly attributed to the Markov
random field employed in this approach.
The estimated mean functions of all clusters are displayed in Figure 5. Despite the
temporal trends varying across regions, there is a consistent “W” shape in almost all regions.
PM2.5 concentrations are generally higher in winter than in summer, mostly because of coal
burning in many parts of China. This phenomenon becomes less obvious in Pearl River Delta
and Yungui Plateau of southern China where winters are relatively warm. Moreover, the
estimated four leading functional principal components are shown in Figure 6. A spike in
the first component corresponds to an increased PM2.5 level in the winter of 2016 across
all regions that is not completely captured by the mean functions. The remaining
components also show some seasonality, providing further evidence that PM2.5 pollution is milder
in summer and worse in winter.

Figure 5: Estimated mean functions of nine clusters using China’s daily-averaged PM2.5
concentrations of 313 cities from 2015 to 2016. Same legend as in Figure 4. The observed
PM2.5 concentrations are marked in grey.
[Figure 6 panels: 1st principal component: 57.3%; 2nd principal component: 8.5%; 3rd principal
component: 6.4%; 4th principal component: 5.9%. Horizontal axis: Time (Day), 2015/03 to 2016/11.
Vertical axis: Estimated Principal Components.]
Figure 6: Estimated four leading functional principal components for China’s daily PM2.5
data. The first four eigenvalues explain a total of 78.1% of variation (5: 82.1%; 6: 85%; 7:
87.7%; 8: 90.0%; 9: 92.2%; 10: 93.8%).
6.2 Beijing-Tianjin-Hebei
We apply our method to analyze the monthly-averaged PM2.5 concentrations of the 73
stations in the BTH region from June 2013 to December 2016; refer to Section 2.2 for more details
on the data. A cubic spline with eight equally spaced interior knots is considered. Due to the
small area and flat terrain of the BTH region, we use the five nearest neighbors to model the
Markov random fields.
We divide all stations into three regions and present the results in Figure 7. The number
of clusters $K = 3$ follows another study of the BTH region (Chen et al., 2018). Our results
show that stations with the lowest PM2.5 are clustered in the north; these are the stations
from Zhangjiakou, Chengde, and Qinhuangdao. The most severely polluted stations are in
the southern BTH area, including Baoding, Shijiazhuang, Hengshui, Xingtai, and Handan,
which are frequently on the list of most polluted cities in China. Stations with moderate
pollution, including those from Beijing, Langfang, Tianjin, Tangshan, and Cangzhou, are
clustered together. These three regions are in agreement with Chen et al. (2018). The
northern mountainous stations are separated from the rest, demonstrating the significance
of the “mountain effect.” The estimated mean functions of the three regions are plotted in
Figure 8 where the most polluted southern region has the worst pollution in winters. This is
in accord with many other researchers’ findings. Because the three mean functions are well
separated from each other, we recommend separate pollution control strategies in different
regions. All regions show a slowly descending trend over time, indicating the positive effects
of China’s pollution-control efforts in recent years.

[Figure 7: map of the BTH region. Axis ticks: 114 to 120 (horizontal), 37 to 41 (vertical).
Legend: North, Middle, South.]
Figure 7: Regionalization map of the BTH region using monthly-averaged PM2.5
concentrations of 73 stations from June 2013 to December 2016. Each color represents a different
cluster.
6.3 Clustering Results for Demeaned Data
It is worth noting that functional clustering methods group observations not only by
the scales of the data (e.g., the three mean functions in Figure 8 have distinct scales) but,
more importantly, by the shapes of the temporal patterns of time-dependent data.
Following the recommendation of one reviewer, we also demean the data to bring
all stations to the same scale and then apply the proposed methodology to identify clusters
of stations according to their temporal patterns. Both the global mean function $\mu_0(t)$ and
the regional mean functions $\mu_k(t)$ representing the regional deviations from the global mean
function are also estimated.

[Figure 8: Estimated Mean Functions. Horizontal axis: Time, 2013/09 to 2016/09. Vertical axis
ticks: 3.5 to 5.0. Legend: North, Middle, South.]
Figure 8: Estimated mean functions of three clusters using monthly-averaged PM2.5
concentrations of 73 stations from the BTH region between June 2013 and December 2016. Each
color represents a different cluster.
Figure 9 displays the results using China’s demeaned data. The estimated global mean
function shows a clear “W” shape, similar to that in Figure 5, while the regional mean
functions demonstrate significant differences in their scales and patterns despite some
fluctuations. For example, the scales of North China Plain, Sichuan Basin, and Guanzhong Plain
are all higher than the average, yet they are clustered into separate regions because their
temporal patterns show distinct trends, with peaks and troughs at different time points.
The global and regional mean functions of BTH data are presented in Figure 10. There is a
clear seasonal pattern in the global mean function: the PM2.5 measurements are generally
higher in winter than in summer. The overall trend decreases gradually over time, suggesting
that air pollution control in the BTH region was more effective in 2016 than in earlier years.
The stations in the South cluster have the worst pollution in the BTH region and show a
persistently positive deviation from the global mean. The regional mean function of the
Middle cluster is mostly 0 while that of the North cluster is always negative.
Figure 9: Top Panels: Estimated global mean function $\hat{\mu}_0(t)$ using China’s PM2.5
concentrations of 313 cities from 2015 to 2016. Bottom Panels: Estimated regional mean functions
of 9 clusters $\hat{\mu}_k(t)$ for $k = 1, 2, \ldots, 9$ after removing the estimated global mean. The dashed
lines represent the zero lines.

Figure 10: Top Panel: Estimated global mean function $\hat{\mu}_0(t)$ using the PM2.5 data from the
BTH region. Bottom Panel: Estimated regional mean functions of three clusters $\hat{\mu}_k(t)$ for
$k = 1, 2, 3$ after removing the estimated global mean.
7 Concluding Remarks
In this study, we propose a novel approach to jointly model and cluster spatially dependent
functional data with applications to PM2.5 concentrations collected from China and the
BTH region. Our model allows data from different clusters to have different mean functions
and covariance structures, and is able to incorporate spatial dependence through the FPC
scores. Markov random fields are assumed for the cluster memberships to define spatial
boundaries between regions. Our model respects the spatio-temporal characteristics of the
data. It serves as a tool not only for data clustering but also for uncertainty quantification
and results interpretation. We use a spline basis system and a data-driven FPC analysis
approach to strike a balance between model complexity and flexibility. An efficient MCEM
algorithm is used to estimate model parameters, mean functions, and eigenfunctions. The
extensive simulation studies show that the proposed method is superior to other methods in
terms of cluster membership prediction and model parameter estimation.
In the analysis of the PM2.5 data, our regionalization results not only are in accordance
with the findings in the literature (Wang et al., 2015; Chen et al., 2018) but also show much
clearer region boundaries that would be helpful for policy making. In addition, the
estimated mean functions and FPC functions present distinct and interpretable time-varying
patterns, reveal important underlying emission features, and would be useful for pollution
prevention and control. As a result, for more efficient control strategies, we recommend
identifying and implementing separate interventions (e.g., adopting different control
measures and pollution reduction strategies) in the nine regions of China and three regions of
the BTH. Although the focus of our data analysis is on air pollution data, our clustering
method can be easily extended to other environmental science or meteorological datasets
with a similar structure.
As pointed out by one reviewer, PM2.5 is a complex mixture with many constituents.
Including PM2.5 constituents in the study may provide a better understanding of the local
emission patterns than using the particulate matter alone. Currently, the concentrations
of PM2.5 constituents are not being recorded in the monitoring network. In the future,
environmental organizations may consider collecting the constituents and using them as
supplementary assessment indicators for more effective air pollution source control. The
methodology described in this work can be implemented for the PM2.5 constituents as well.
Our approach also opens up some new research questions. For instance, one important
question is how to model and cluster multivariate pollutants from multiple sources
simultaneously. For the application studies in Section 6, apart from the PM2.5 measurements, we
also have the concentrations of other air pollutants (e.g., PM10, O3, and SO2) at the
monitoring stations. Though the particulate matters (PM2.5 and PM10) are the most important
air pollutants in terms of the proportion of variation explained, it is of scientific importance
to have a systematic approach to combine multiple measurements in a framework of joint
modeling and clustering. Another question of interest is how to assess the uncertainty of
cluster assignments. The memberships of cities or stations are determined using the
“posterior” mean of a random variable; thus, we may not have a great deal of confidence in the
assignments of some “transition zones” where PM2.5 patterns are highly variable. These
questions and extensions call for future research.
Acknowledgment
The research was partially supported by the National Natural Science Foundation of China
(Grant No. 11871485) and China’s National Key Research Special Program (Grant No.
2016YFC0207702). The authors are also grateful for the detailed and constructive comments
from an Associate Editor and three referees.
SUPPLEMENTARY MATERIAL
Technical Details: Implementation issues related to model selection, additional results for
the Monte Carlo EM algorithm, model diagnostic results for the data analysis, and
clustering results from other methods (PDF file).
Code: R code for simulation studies and data analysis (zip file).
Dataset: City-level daily PM2.5 concentrations of China's entirety from January 2015 to
December 2016, and station-level monthly PM2.5 concentrations from 73 stations in the
BTH region from June 2013 to December 2016 (CSV file), the topographic information
including the longitude and latitude for corresponding cities and stations (CSV file),
and China’s elevation with 1km resolution (TIF file).
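Appendix A Computational Details for Spatial Random Effects
For the convenience of computation, we first re-group the elements of the spatial random
effects according to the principal components, denoted as $\gamma_1$, and then conditional on
$Z = (Z_1, \ldots, Z_n)$, $\gamma_1$ is re-grouped again into $\gamma^*_Z$ based on the cluster membership. Let $O_1$ and
$O_{2,Z}$ be two $nQ \times nQ$ permutation matrices. We use the following expressions:
$$
\begin{aligned}
\gamma &= (\gamma_{11}, \ldots, \gamma_{1Q}, \ldots, \gamma_{n1}, \ldots, \gamma_{nQ})^T = O_1 \gamma_1, \\
\gamma_1 &= (\gamma_{11}, \ldots, \gamma_{n1}, \ldots, \gamma_{1Q}, \ldots, \gamma_{nQ})^T = O_{2,Z}\, \gamma^*_Z, \\
\gamma^*_Z &= (\gamma_{\cdot 1,1}, \ldots, \gamma_{\cdot 1,K}, \ldots, \gamma_{\cdot Q,1}, \ldots, \gamma_{\cdot Q,K})^T,
\end{aligned}
$$
where for $q = 1, \ldots, Q$ and $k = 1, \ldots, K$, $\gamma_{\cdot q,k}$ collects all $\gamma_{iq}$'s that belong to the same cluster;
in other words, $\gamma_{\cdot q,k} = (\gamma_{k_1 q}, \ldots, \gamma_{k_{n_k} q})^T$, where $n_k$ is the number of locations belonging to
cluster $k$, and $k_1 < \cdots < k_{n_k}$ are the corresponding indices, i.e., $Z_{k_j} = k$ for $j = 1, \ldots, n_k$.
Use $\Gamma_\gamma$ and $\Gamma^*_\gamma$ to represent the covariance matrices of $\gamma$ and $\gamma^*_Z$, respectively.
Let $\Gamma_{\cdot q,k}$ be the covariance matrix of $\gamma_{\cdot q,k}$; then we have $\Gamma_{\cdot q,k} = \sigma^2_{\gamma,q,k} R_k(\phi_{q,k})$ under the
assumptions in Section 3.1, where $R_k(\phi_{q,k})$ is an $n_k \times n_k$ matrix with elements
$R_{k,ii'}(\phi_{q,k}) = \rho(\|s_i - s_{i'}\|; \phi_{q,k})$, $i, i' = 1, \ldots, n_k$. For simplicity, we replace $\phi_{q,k}$ by a set of common
parameters $\phi$ in what follows. Since $\text{cov}(\gamma_{iq,k}, \gamma_{i'q',k'}) = 0$ when $q \neq q'$ or $k \neq k'$, the covariance
matrix of $\gamma^*_Z$ is block diagonal, $\Gamma^*_\gamma = \text{diag}(\Gamma_{\cdot 1}, \ldots, \Gamma_{\cdot Q})$, where
$\Gamma_{\cdot q} = \text{diag}(\Gamma_{\cdot q,1}, \ldots, \Gamma_{\cdot q,K})$ for
$q = 1, \ldots, Q$. Note that $\gamma$ is just a reordering of $\gamma^*_Z$, i.e., $\gamma = O_Z \gamma^*_Z$, where $O_Z = O_1 O_{2,Z}$.
It follows that the covariance matrix of $\gamma$ is $\Gamma_\gamma = O_Z \Gamma^*_\gamma O_Z^T$.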
Appendix B Technical Details for the E-Step
In this section, we provide more details on the Gibbs sampling procedure used in the E-step.
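The posterior distribution of $(\gamma \mid Y, Z)$ can be derived from the following joint Gaussian
distribution:
$$
\begin{pmatrix} Y \\ \gamma \end{pmatrix} \Bigg|\, Z \sim \text{Normal}\left(
\begin{pmatrix} \tilde{\boldsymbol{B}}^T \tilde{\alpha}_Z \\ 0 \end{pmatrix},\;
\begin{pmatrix}
\tilde{\boldsymbol{B}}^T \tilde{\Theta}_Z \tilde{\Gamma}_Z \tilde{\Theta}_Z^T \tilde{\boldsymbol{B}} + \sigma^2_\epsilon I & \tilde{\boldsymbol{B}}^T \tilde{\Theta}_Z \tilde{\Gamma}_Z \\
\tilde{\Gamma}_Z \tilde{\Theta}_Z^T \tilde{\boldsymbol{B}} & \tilde{\Gamma}_Z
\end{pmatrix}
\right).
$$
Therefore, we obtain that $\gamma \mid Y, Z, \Omega_{\text{prev}} \sim \text{Normal}(e, v)$, where
$$
\begin{aligned}
e &= E(\gamma \mid Y, Z) = \tilde{\Gamma}_Z \tilde{\Theta}_Z^T \tilde{\boldsymbol{B}}\, \text{Var}(Y \mid Z)^{-1} (Y - \tilde{\boldsymbol{B}}^T \tilde{\alpha}_Z), \\
v &= \text{Var}(\gamma \mid Y, Z) = \tilde{\Gamma}_Z - \tilde{\Gamma}_Z \tilde{\Theta}_Z^T \tilde{\boldsymbol{B}}\, \text{Var}(Y \mid Z)^{-1} \tilde{\boldsymbol{B}}^T \tilde{\Theta}_Z \tilde{\Gamma}_Z,
\end{aligned}
$$
and $Z_i \mid \gamma_i, Y_i, \Omega_{\text{prev}} \sim \text{Multinomial}(p_{i1}, \ldots, p_{iK})$, where
$$
p_{ik} = \frac{f(Y_i \mid \gamma_i, Z_i = k)\, \pi_k}{\sum_{k=1}^{K} f(Y_i \mid \gamma_i, Z_i = k)\, \pi_k}.
$$
Assume we have $(Z^{(\tau-1)}, \gamma^{(\tau-1)})$ at the $(\tau-1)$th step. Using the above marginal results, we
first generate $\gamma^{(\tau)}$ from $(\gamma^{(\tau)} \mid Y, Z^{(\tau-1)}, \Omega_{\text{prev}})$ and then $Z_i^{(\tau)}$ from
$(Z_i^{(\tau)} \mid Y_i, \gamma_i^{(\tau-1)}, \Omega_{\text{prev}})$.
At each E-step, we repeat these two steps of Gibbs sampling $T_0 + T$ times and omit the first $T_0$
samples. In the simulation studies, we use $T_0 = 50$ and $T = 100$. The
Sherman-Woodbury-Morrison formula is also applied to invert the high-dimensional conditional variance of $Y$
given $Z$ that appears in $e$ and $v$:
$$
\left( \tilde{\boldsymbol{B}}^T \tilde{\Theta}_Z \tilde{\Gamma}_Z \tilde{\Theta}_Z^T \tilde{\boldsymbol{B}} + \sigma^2_\epsilon I \right)^{-1}
= \sigma^{-2}_\epsilon \left\{ I - \tilde{\boldsymbol{B}}^T \tilde{\Theta}_Z \left( \tilde{\Gamma}_Z^{-1} \sigma^2_\epsilon + \tilde{\Theta}_Z^T \tilde{\boldsymbol{B}} \tilde{\boldsymbol{B}}^T \tilde{\Theta}_Z \right)^{-1} \tilde{\Theta}_Z^T \tilde{\boldsymbol{B}} \right\}.
$$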
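The benefit of the Sherman-Woodbury-Morrison step can be checked numerically on a toy problem. In the sketch below, the matrix `A` plays the role of the product of the transposed basis matrix and the FPC coefficient matrix, `Gamma` plays the role of the covariance of the random effects, and `sigma2` is the noise variance; all three are small made-up inputs, and the helper functions are written out in plain Python only to keep the example self-contained.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (fine for small matrices)."""
    n = len(a)
    aug = [list(map(float, row)) + ident for row, ident in zip(a, identity(n))]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def woodbury_inverse(A, Gamma, sigma2):
    """Inverse of A Gamma A^T + sigma2 * I using only q x q inverses.

    A is m x q with q much smaller than m in the intended use.
    """
    m, q = len(A), len(A[0])
    At = transpose(A)
    Ginv = inverse(Gamma)
    AtA = matmul(At, A)
    inner = [[sigma2 * Ginv[i][j] + AtA[i][j] for j in range(q)] for i in range(q)]
    corr = matmul(matmul(A, inverse(inner)), At)
    return [[((1.0 if i == j else 0.0) - corr[i][j]) / sigma2 for j in range(m)]
            for i in range(m)]

# Toy inputs, for illustration only.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
Gamma = [[2.0, 0.5], [0.5, 1.0]]
sigma2 = 0.4
full = matmul(matmul(A, Gamma), transpose(A))
for i in range(len(full)):
    full[i][i] += sigma2
direct = inverse(full)
fast = woodbury_inverse(A, Gamma, sigma2)
max_gap = max(abs(direct[i][j] - fast[i][j]) for i in range(4) for j in range(4))
```

The direct route inverts a matrix whose size grows with the number of observations, while the Woodbury route only inverts matrices whose size equals the number of random effects, which is what makes the E-step tractable.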
Appendix C Technical Details for the M-Step
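Following is the complete procedure for the parameter updates in the M-step.
1. Estimation of $\alpha_k$ and $\sigma^2_\epsilon$, for $k = 1, \ldots, K$. By maximizing $\hat{Q}_1$ in (10), we are able to
update $\alpha_k$ and $\sigma^2_\epsilon$:
$$
\hat{\alpha}_k = \left( \sum_{\tau=1}^{T} \sum_{i=1}^{n} Z_{ik}^{(\tau)} \boldsymbol{B}\boldsymbol{B}^T \right)^{-1} \cdot
\sum_{\tau=1}^{T} \left\{ \sum_{i=1}^{n} Z_{ik}^{(\tau)} \boldsymbol{B} \left( Y_i - \boldsymbol{B}^T \Theta_k \gamma_i^{(\tau)} \right) \right\},
$$
$$
\hat{\sigma}^2_\epsilon = \frac{1}{T\tilde{n}} \sum_{\tau=1}^{T} \left\{ \sum_{i=1}^{n} \sum_{k=1}^{K} Z_{ik}^{(\tau)}
\left\| Y_i - \boldsymbol{B}^T \left( \alpha_k + \Theta_k \gamma_i^{(\tau)} \right) \right\|^2 \right\},
$$
where we use the notation $Z_{ik}^{(\tau)} = I\left(Z_i^{(\tau)} = k\right)$ for simplicity.
2. Update each column of $\Theta_k$ sequentially. For $q = 1, \ldots, Q$,
$$
\hat{\Theta}_{k,q} = \left( \sum_{\tau=1}^{T} \sum_{i=1}^{n} Z_{ik}^{(\tau)} \left(\gamma_{iq}^{(\tau)}\right)^2 \boldsymbol{B}\boldsymbol{B}^T \right)^{-1}
\boldsymbol{B} \sum_{\tau=1}^{T} \left\{ \sum_{i=1}^{n} Z_{ik}^{(\tau)} \gamma_{iq}^{(\tau)}
\left( Y_i - \boldsymbol{B}^T \alpha_k - \sum_{l \neq q} \boldsymbol{B}^T \Theta_{k,l}\, \gamma_{il}^{(\tau)} \right) \right\}.
$$
Then, orthogonalize $\hat{\Theta}_k$ using a QR decomposition.
3. Update $\sigma^2_{\gamma,q,k}$ by maximizing $\hat{Q}_2$. To simplify the computation of partial derivatives,
we use the expressions of $\gamma$, $\gamma^*_Z$, and the block diagonal matrix $\Gamma^*_\gamma$ in Appendix A.
The updating formula is:
$$
\hat{\sigma}^2_{\gamma,q,k} = \frac{1}{T} \sum_{\tau=1}^{T} \frac{1}{n_k}\, \gamma_{\cdot q,k}^{(\tau)T} R_k^{-1}(\phi)\, \gamma_{\cdot q,k}^{(\tau)}, \quad \text{for } q = 1, \ldots, Q.
$$
4. Estimation of $\phi$. Denote the components of $\phi$ as $\phi_r$'s. Given the current estimates of
other parameters, we optimize $\hat{Q}_2$ over $\phi$. The stationarity condition sets to zero
the vector with elements
$$
\frac{1}{T} \sum_{\tau=1}^{T} \left\{ \sum_{q=1}^{Q} \sum_{k=1}^{K} \left[
\text{tr}\left( R_k^{-1} \frac{\partial R_k}{\partial \phi_r} \right)
- \frac{1}{\sigma^2_{\gamma,q,k}}\, \text{tr}\left( R_k^{-1} \frac{\partial R_k}{\partial \phi_r} R_k^{-1}\, \gamma_{\cdot q,k}^{(\tau)} \gamma_{\cdot q,k}^{(\tau)T} \right)
\right] \right\}.
$$
Due to the lack of analytic solutions, we use the Newton-Raphson method to find the
solution as the updated estimate $\hat{\phi}$.
5. Update $\nu$ by maximizing $\hat{Q}_3$. The gradient with respect to $\nu$ is
$$
\sum_{i=1}^{n} \sum_{k=1}^{K} Z_{ik}^{(\tau)} \left[ \sum_{i' \in \partial i} Z_{i'k}^{(\tau)}
- \frac{\sum_{k=1}^{K} \left\{ \sum_{i' \in \partial i} Z_{i'k}^{(\tau)} \right\} \exp\left\{ \nu \sum_{i' \in \partial i} Z_{i'k}^{(\tau)} \right\}}
{\sum_{k=1}^{K} \exp\left\{ \nu \sum_{i' \in \partial i} Z_{i'k}^{(\tau)} \right\}} \right].
$$
The Newton-Raphson method is also used to obtain $\hat{\nu}$.
In the updating formulas given above, we fix the parameters on the right-hand side of each
equation at their current estimates obtained from the last EM iteration.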
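Of the five updates, the variance update in step 3 has the simplest closed form, and a minimal sketch of it follows. The inverse correlation matrix is assumed to be precomputed, the Gibbs draws are passed in as plain lists, and the function names are hypothetical.

```python
def quad_form(x, m):
    """Quadratic form x^T m x for a vector x and a square matrix m."""
    n = len(x)
    return sum(x[i] * m[i][j] * x[j] for i in range(n) for j in range(n))

def update_score_variance(gamma_draws, r_inv):
    """Average over T draws of (1 / n_k) * gamma^T R^{-1} gamma.

    gamma_draws: list of T Gibbs draws, each the n_k scores of one
                 component q for the sites currently in cluster k.
    r_inv:       inverse of the n_k x n_k spatial correlation matrix.
    """
    T = len(gamma_draws)
    n_k = len(r_inv)
    return sum(quad_form(g, r_inv) / n_k for g in gamma_draws) / T

# With an identity correlation matrix the update reduces to the mean
# squared score, averaged across draws.
sigma2_hat = update_score_variance([[1.0, 1.0], [3.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In a full implementation the cluster membership, and hence the set of sites entering each draw, would change across Gibbs iterations; the sketch keeps it fixed for brevity.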
References
Anderes, E. B. and Stein, M. L. (2008). Estimating deformations of isotropic Gaussian
random fields on the plane. The Annals of Statistics, 36(2):719–741.
Besag, J. (1975). Statistical analysis of non-lattice data. Journal of the Royal Statistical
Society: Series D (The Statistician), 24(3):179–195.
Bouveyron, C. and Jacques, J. (2011). Model-based clustering of time series in group-specific
functional subspaces. Advances in Data Analysis and Classification, 5(4):281–300.
Bryan, B. and Adams, J. (2002). Three-dimensional neurointerpolation of annual mean
precipitation and temperature surfaces for China. Geographical Analysis, 34(2):93–111.
Chan, K. S. and Ledolter, J. (1995). Monte Carlo EM estimation for time series models
involving counts. Journal of the American Statistical Association, 90(429):242–252.
Chen, L., Guo, B., Huang, J., He, J., Wang, H., Zhang, S., and Chen, S. X. (2018). Assessing
air-quality in Beijing-Tianjin-Hebei region: The method and mixed tales of PM2.5 and O3.
Atmospheric Environment, 193:290–301.
China’s State Council (2013). The action plan for air pollution prevention and control.
http://www.gov.cn/zwgk/2013-09/12/content_2486773.htm. In Chinese.
Chiou, J.-M. and Li, P.-L. (2007). Functional clustering and identifying substructures of
longitudinal data. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
69(4):679–699.
Clifford, P. (1990). Markov random fields in statistics. In Grimmett, G. and Welsh, D. J.,
editors, Disorder in Physical Systems, Clarendon. Oxford.
Cohen, A. J., Brauer, M., Burnett, R., et al. (2017). Estimates and 25-year trends of the
global burden of disease attributable to ambient air pollution: An analysis of data from
the Global Burden of Diseases Study 2015. The Lancet, 389(10082):1907–1918.
de Boor, C. (2001). A practical guide to splines. Springer-Verlag, New York.
Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and
density estimation. Journal of the American Statistical Association, 97(458):611–631.
Gao, H., Chen, J., Wang, B., Tan, S.-C., Lee, C. M., Yao, X., Yan, H., and Shi, J. (2011).
A study of air pollution of city clusters. Atmospheric Environment, 45(18):3069–3077.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine
Intelligence, PAMI-6(6):721–741.
Giacofci, M., Lambert-Lacroix, S., Marot, G., and Picard, F. (2013). Wavelet-based
clustering for mixed-effects functional models in high dimension. Biometrics, 69(1):31–40.
Giraldo, R., Delicado, P., and Mateu, J. (2012). Hierarchical clustering of spatially correlated
functional data: Clustering of spatial functional data. Statistica Neerlandica,
66(4):403–421.
Hoek, G., Krishnan, R. M., Beelen, R., et al. (2013). Long-term air pollution exposure and
cardio-respiratory mortality: A review. Environmental Health, 12(1):43.
Huang, R.-J., Zhang, Y., Bozzetti, C., et al. (2014). High secondary aerosol contribution to
particulate pollution during haze events in China. Nature, 514(7521):218–222.
Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–
218.
Jacques, J. and Preda, C. (2013). Funclust: A curves clustering method using functional
random variables density approximation. Neurocomputing, 112:164–171.
James, G. M. and Sugar, C. A. (2003). Clustering for sparsely sampled functional data.
Journal of the American Statistical Association, 98(462):397–408.
Jiang, H. and Serban, N. (2012). Clustering random curves under spatial interdependence
with application to service accessibility. Technometrics, 54(2):108–119.
Kindermann, R. and Snell, J. L. (1980). Markov random fields and their applications.
Contemporary Mathematics. American Mathematical Society, Providence, RI.
Lelieveld, J., Evans, J. S., Fnais, M., Giannadaki, D., and Pozzer, A. (2015). The
contribution of outdoor air pollution sources to premature mortality on a global scale. Nature,
525:367–371.
Li, K. (2015). Report on the Work of the Government (2015).
http://english.www.gov.cn/archive/publications/2015/03/05/content_281475066179954.htm.
Delivered at the Third Session of the 12th National People’s Congress on March 5, 2015.
Li, S.-T., Chou, S.-W., and Pan, J.-J. (2000). Multi-resolution spatio-temporal data mining
for the study of air pollutant regionalization. In Proceedings of the 33rd Annual Hawaii
International Conference on System Sciences, pages 1–7.
Li, X., Zhou, W., and Chen, Y. D. (2015). Assessment of regional drought trend and risk over
China: A drought climate division perspective. Journal of Climate, 28(18):7025–7037.
Li, Y., Wang, N., and Carroll, R. J. (2013). Selecting the number of principal components
in functional data. Journal of the American Statistical Association, 108(504):1284–1294.
Liang, X., Li, S., Zhang, S., Huang, H., and Chen, S. X. (2016). PM2.5 data reliability,
consistency, and air quality assessment in five Chinese cities. Journal of Geophysical
Research: Atmospheres, 121(17):10220–10236.
Liang, X., Zou, T., Guo, B., Li, S., Zhang, H., Zhang, S., Huang, H., and Chen, S. X. (2015).
Assessing Beijing’s PM2.5 pollution: Severity, weather impact, APEC and winter heating.
Proceedings of the Royal Society A, 471(2182):20150257.
Lin, W., Xu, X., Zhang, X., and Tang, J. (2008). Contributions of pollutants from North
China Plain to surface ozone at the Shangdianzi GAW station. Atmospheric Chemistry and
Physics, 8(19):5889–5898.
Matérn, B. (1960). Spatial Variation. Springer-Verlag, Berlin.
Pan, W. and Shen, X. (2007). Penalized model-based clustering with application to variable
selection. Journal of Machine Learning Research, 8(May):1145–1164.
Peng, J. and Müller, H.-G. (2008). Distance-based clustering of sparsely observed
stochastic processes, with applications to online auctions. The Annals of Applied Statistics,
2(3):1056–1077.
Pope, C. A., III, Burnett, R. T., Thun, M. J., et al. (2002). Lung cancer, cardiopulmonary
mortality, and long-term exposure to fine particulate air pollution. Journal of the American
Medical Association, 287(9):1132–1141.
Porcu, E., Bevilacqua, M., and Genton, M. G. (2016). Spatio-temporal covariance and
cross-covariance functions of the great circle distance on a sphere. Journal of the American
Statistical Association, 111(514):888–898.
Qian, W., Tang, X., and Quan, L. (2004). Regional characteristics of dust storms in China.
Atmospheric Environment, 38(29):4895–4907.
Ramsay, J. O. and Silverman, B. W. (2005). Functional data analysis. Springer Series in Statistics.
Springer.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of
the American Statistical Association, 66(336):846–850.
Romano, E., Balzanella, A., and Verde, R. (2013). A Regionalization Method for Spatial
Functional Data Based on Variogram Models: An Application on Environmental Data,
pages 99–108. Springer, Berlin, Heidelberg.
Sampson, P. D. and Guttorp, P. (1992). Nonparametric estimation of nonstationary spatial
covariance structure. Journal of the American Statistical Association, 87(417):108–119.
van Donkelaar, A., Martin, R. V., Brauer, M., et al. (2010). Global estimates of ambient fine
particulate matter concentrations from satellite-based aerosol optical depth: Development
and application. Environmental Health Perspectives, 118(6):847–855.
Wang, S., Li, G., Gong, Z., et al. (2015). Spatial distribution, seasonal variation and
regionalization of PM2.5 concentrations in China. Science China Chemistry, 58(9):1435–1443.
Wang, Y., Hao, J., McElroy, M. B., et al. (2009). Ozone air quality during the 2008 Beijing
Olympics: Effectiveness of emission restrictions. Atmospheric Chemistry and Physics,
9(14):5237–5251.
Wei, G. C. G. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm
and the Poor Man’s data augmentation algorithms. Journal of the American Statistical
Association, 85(411):699–704.
Xu, W. Y., Zhao, C. S., Ran, L., et al. (2011). Characteristics of pollutants and their
correlation to meteorological conditions at a suburban site in the North China Plain.
Atmospheric Chemistry and Physics, 11(9):4353–4369.
Zhang, H., Zhu, Z., and Yin, S. (2016). Identifying precipitation regimes in China using
model-based clustering of spatial functional data. In Proceedings of the Sixth International
Workshop on Climate Informatics, pages 117–120.
Zhang, S., Guo, B., Dong, A., He, J., Xu, Z., and Chen, S. X. (2017). Cautionary tales on
air-quality improvement in Beijing. Proceedings of the Royal Society A, 473(2205):20170457.
Zhang, X. Y., Wang, Y. Q., Niu, T., et al. (2012). Atmospheric aerosol compositions in
China: spatial/temporal variability, chemical signature, regional haze distribution and
comparisons with global aerosols. Atmospheric Chemistry and Physics, 12(2):779–799.
Zhou, L., Huang, J. Z., and Carroll, R. J. (2008). Joint modelling of paired sparse functional
data using principal components. Biometrika, 95(3):601–619.
Zhou, L., Huang, J. Z., Martinez, J. G., Maity, A., Baladandayuthapani, V., and Carroll,
R. J. (2010). Reduced rank mixed effects models for spatially correlated hierarchical
functional data. Journal of the American Statistical Association, 105(489):390–400.