All content in this area was uploaded by Xiaohui Chang on May 20, 2020

Modeling and Regionalization of China’s PM2.5 Using Spatial-Functional Mixture Models

Decai Liang, Haozhe Zhang, Xiaohui Chang, and Hui Huang

Abstract

Severe air pollution affects billions of people around the world, particularly in developing countries such as China. Effective emission control policies rely primarily on a proper assessment of air pollutants and accurate spatial clustering outcomes. Unfortunately, emission patterns are difficult to observe as they are highly confounded by many meteorological and geographical factors. In this study, we propose a novel approach for modeling and clustering PM2.5 concentrations across China. We model observed concentrations from monitoring stations as spatially dependent functional data and assume latent emission processes originate from a functional mixture model with each component as a spatio-temporal process. Cluster memberships of monitoring stations are modeled as a Markov random field, in which confounding effects are controlled through energy functions. The superior performance of our approach is demonstrated using extensive simulation studies. Our method is effective in dividing China and the Beijing-Tianjin-Hebei region into several regions based on PM2.5 concentrations, suggesting that separate local emission control policies are needed.

Keywords: Latent emission process; Model-based clustering; Markov random field; Environmental policies.

Decai Liang is a Ph.D. candidate at the School of Mathematical Science and Center for Statistical Science, Peking University, Beijing, P.R. China, 100871 (Email: liangdecai@pku.edu.cn). Haozhe Zhang is a Data & Applied Scientist at Microsoft Corporation, Redmond, WA 98052 (Email: haozhe.zhang@microsoft.com). Xiaohui Chang is an assistant professor of Business Analytics at the College of Business, Oregon State University, Corvallis, OR 97331 (Email: xiaohui.chang@oregonstate.edu). Hui Huang is a professor of Statistics at the School of Mathematics, Sun Yat-sen University, Guangzhou, P.R. China, 510275 (Email: huangh89@mail.sysu.edu.cn). For correspondence, please contact Hui Huang.

1 Introduction

Among all air pollutants, fine particulate matter with an aerodynamic diameter of less than 2.5 µm, also known as PM2.5, is generally regarded as the most health-damaging because it easily penetrates the lung barrier and directly enters the circulatory system. Numerous studies have shown that chronic exposure to high concentrations of PM2.5 contributes to the risk of developing cardiovascular and respiratory diseases and lung cancer (Pope et al., 2002; Hoek et al., 2013; Lelieveld et al., 2015). The Global Burden of Disease study estimated that long-term exposure to PM2.5 caused 4.2 million deaths worldwide in 2015, making it the fifth-ranked global risk factor that year (Cohen et al., 2017).

Due to the rapid industrialization and urbanization in recent decades, many areas of China have experienced the most chronic and severe air pollution in the world with the highest PM2.5 levels (van Donkelaar et al., 2010). In the first quarter of 2013, extremely severe smog affected more than 800 million people in China. About 70% of the days in January registered daily average PM2.5 concentrations that exceeded 75 µg/m³ in numerous cities (Huang et al., 2014), more than seven times the World Health Organization’s (WHO) recommended level of 10 µg/m³. In response to the consistently poor air quality, the Chinese government directed massive efforts to assess air quality and evaluate the health impacts of air pollution for the entire country. For instance, real-time high-quality air pollutant concentration measurements have been collected from a large national monitoring network since 2013. This dataset quickly became one of the key pillars for the development of environmental policies and emission control strategies (Zhang et al., 2017). Unfortunately, the measurements may not provide an accurate depiction of the true characteristics of air pollutant emission, as the distribution and transmission patterns of PM2.5 are highly confounded by factors including meteorological conditions, topography, local emissions, secondary aerosols, and regional transportation (Liang et al., 2015). These uncertainties, along with the large variability of PM2.5 in space, bring challenges for assessing and monitoring PM2.5 in China. Thus, to carry out the “coordinated inter-regional prevention and control efforts” initiated by the Chinese government (Li, 2015), a comprehensive statistical methodology that examines underlying emission patterns and incorporates spatio-temporal variations is urgently needed.

In this work, we propose a novel approach for modeling China’s PM2.5 data collected from the national monitoring network. The observed PM2.5 concentration at each station is modeled as functional data (Ramsay and Bernard, 2005). The unobserved true emission is assumed to be a latent process that employs a functional finite mixture model to account for spatial heteroscedasticity. Each component of the mixture model is a spatio-temporal process with a temporal functional principal component (FPC) expansion and spatially correlated FPC scores from different stations. Under our framework, stations are clustered into various regions based on their emission patterns. The cluster memberships (weights of components) follow a Markov random field (MRF) model, and topological factors are also exploited to define the similarity measures between stations. This approach makes inferences in both space and time while performing a model-based clustering for PM2.5 emission.

In environmental science, cluster analysis has gained considerable attention in recent years due to its wide applicability to many real-life environmental problems. Clustering is frequently referred to as “regionalization” in the field because the outcomes are specifications of locations or regions. Researchers organize environmental units into homogeneous zones with the goal of establishing local environmental control strategies in different regions. Applications in which regionalization has played a significant role include dust storms (Qian et al., 2004), precipitation (Zhang et al., 2016), and air pollution (Wang et al., 2015). The conventional regionalization methods adopted in the field include empirical orthogonal functions (EOF) and their rotated version (REOF) (Zhang et al., 2012; Wang et al., 2015), which are essentially spatial principal component analysis and the corresponding component rotations, respectively. Several newer regionalization techniques have emerged, such as self-organizing maps (SOM) (Li et al., 2000) and k-means (Li et al., 2015). There are clear drawbacks to these approaches. The EOF and its extensions may be useful for initial data exploration but are unsuitable for investigation and interpretation of data characteristics. The determination of cluster boundaries using EOF is subjective. Moreover, it is very challenging to handle multi-scale data like ours through EOF, SOM or k-means. The most serious drawback of these approaches is that they either completely ignore or pay little attention to the intrinsic spatio-temporal structures of data, precluding accurate inferences for data with strong space- or time-varying features.

Cluster analysis has been well studied in the functional data analysis literature for its practical applications. For instance, James and Sugar (2003) developed a flexible model-based procedure for sparsely sampled longitudinal data. Chiou and Li (2007) proposed k-center functional clustering that is a functional version of k-means. Peng and Müller (2008) introduced a distance-based method with multi-dimensional scaling. For high-dimensional functional data, some of the commonly used clustering methods largely rely on penalized likelihood (Pan and Shen, 2007), high-dimensional data clustering (Bouveyron and Jacques, 2011), an approximation for the density of functional random variables (Jacques and Preda, 2013), or wavelets (Giacofci et al., 2013). Nevertheless, these methods were all designed for independent curves and are unsuitable for spatially dependent data.

Earlier works on clustering spatial-functional data are relatively scarce. Romano et al. (2013) considered the spatial dependence among functions based on variogram models. Giraldo et al. (2012) proposed a hierarchical approach based on a dissimilarity matrix among curves. A recent technique introduced by Jiang and Serban (2012) incorporated an MRF into the modeling process to characterize spatial correlation and cluster dependence. MRF originated from the field of statistical physics and is a general version of the Ising model (Kindermann and Snell, 1980). In cluster analysis, especially for model-based methods, cluster memberships of stations are usually assumed to be random and their probabilities are modeled using a multinomial distribution. In this work, we employ the MRF-based approach to model both the spatial dependence and cluster memberships, and k-nearest neighbors combined with geographical information are also included for neighborhood definition.

This work contributes to the literature in five respects. First, we introduce a unified framework for joint modeling and clustering. The spatial dependence among the latent emission processes is embedded into the functional mixture model while the cluster memberships are assigned using an MRF model. Second, our method allows for heteroscedastic spatial dependence structures for different clusters. This is a much more realistic assumption than the common spatial structure for all clusters in Jiang and Serban (2012), and it greatly enhances the flexibility and applicability of our method. Third, this procedure has numerous practical advantages over other regionalization methods in real-life applications, including but not limited to improved interpretability and clear cluster boundaries with stations connected within the same cluster, easy adaptation to multi-scale data, possible extension to multi-pollutant regionalization, and more comprehensive statistical inferences on data features. Fourth, the numerical performance of this method is shown to be superior to that of other methods in extensive simulation studies. Last but not least, we also propose a Monte Carlo EM approach to compute the likelihood in the presence of multiple latent variables.

The structure of this paper is as follows. In Section 2, we introduce two PM2.5 datasets from China and the Beijing-Tianjin-Hebei (BTH) region, one of the most populated and polluted areas in the country, and discuss why regionalization is needed. We describe our method and the estimation procedures in Sections 3 and 4, respectively. The simulation studies are presented in Section 5. In Section 6, we demonstrate two examples of the application of the method using data from China and the BTH region. The paper concludes with a discussion in Section 7. Technical details of the Monte Carlo EM algorithm are in the Appendices. Other details including the datasets, R code, and additional results can be found in the Supplementary Material online.

2 Data Description

We analyze two datasets of different spatio-temporal scales: (1) city-level daily PM2.5 concentration data for the entire country, and (2) station-level monthly concentration data collected from the BTH region. Performing regionalization for the entire country is highly challenging due to the widely diverse landscapes and drastically different meteorological conditions that are critical for modeling air pollutant data. In contrast, the smaller BTH region is more homogeneous in landscape and meteorological conditions, and this region is adopted to demonstrate the performance of the proposed methodology.

To smooth data variability and reduce extreme values, we apply a logarithmic transformation to pollutant measurements. For both datasets, the topographic information (including longitude, latitude, and elevation) is available. Pairwise distances between locations are calculated using the great circle distance, defined as the shortest distance between two points on the surface of a sphere measured along that surface (Porcu et al., 2016).
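The great circle distance described above can be sketched with the haversine formula. This is a minimal illustration rather than the authors' code: the function name `great_circle_km` and the mean Earth radius of 6371 km are assumptions introduced here, and the paper does not state which formula or radius it uses.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; not stated in the paper


def great_circle_km(lon1, lat1, lon2, lat2, radius=EARTH_RADIUS_KM):
    """Haversine great-circle distance between two (lon, lat) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # min() guards against rounding slightly above 1 for near-antipodal points.
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))
```

Applying the function to every pair of station or city coordinates gives the distance matrix that a spatial correlation function would take as input.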

2.1 China

China’s Ministry of Ecology and Environment has operated a large monitoring network for air quality assessment since 2013. By 2015 and 2016, this national network had expanded to more than 1,500 monitoring stations in 338 cities. Real-time measurements of major pollutants are continuously recorded and directly transferred to the China National Environmental Monitoring Center (CNEMC). Pollutant measurements are collected using continuous automated methods through either tapered element oscillating microbalance or beta ray attenuation (Wang et al., 2015). All equipment meets the standards of CNEMC.

Figure 1: Locations of 338 cities (marked by dots) in China with monitoring stations, and the time series subplots of PM2.5 daily concentrations (µg/m³) of four megacities (Beijing, Tianjin, Shanghai, and Guangzhou) from January 2015 to December 2016.

Our city-level daily data are obtained by averaging hourly PM2.5 concentrations from all monitoring stations in each city. A total of 731 measurements are available for each of the 338 cities from January 1, 2015, to December 31, 2016. We also remove 25 cities with low data quality from Xinjiang and Tibet. For the remaining 313 cities, missing data (< 0.6%) are imputed using linear interpolation.
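The preprocessing described here (hourly averages to daily values, linear interpolation of gaps, and the logarithmic transformation from the start of Section 2) might look as follows. This is a sketch under stated assumptions: the function name `daily_log_series` is hypothetical, the input is taken to be one city-level hourly series with `NaN` marking missing hours, and interpolating before taking logs is a choice made here because the text does not give the order. Note that `np.interp` holds the nearest valid value at the ends of the series.

```python
import numpy as np


def daily_log_series(hourly, hours_per_day=24):
    """Log of daily mean concentrations, with missing days filled by linear interpolation.

    hourly: 1-D array of hourly PM2.5 values for one city (NaN = missing),
            whose length is a multiple of hours_per_day.
    """
    days = np.asarray(hourly, dtype=float).reshape(-1, hours_per_day)
    # Daily mean over the hours that are available; a day with no valid hour stays NaN.
    counts = np.sum(~np.isnan(days), axis=1)
    sums = np.nansum(days, axis=1)
    daily = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
    # Fill missing days by linear interpolation between neighbouring valid days.
    idx = np.arange(daily.size)
    ok = ~np.isnan(daily)
    filled = np.interp(idx, idx[ok], daily[ok])
    return np.log(filled)
```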

The locations of all 338 cities are presented in Figure 1. As an illustration, we also highlight four megacities (i.e., Beijing, Tianjin, Shanghai, and Guangzhou) and display their average daily observations from January 2015 to December 2016. The PM2.5 time series of Beijing and Tianjin are similar and highly correlated, especially during winter, partially due to their close geographical proximity. On cold days, particle pollution is severe in North China as a result of coal-burning for heating, and the large variation is explained by the frequent and strong northern winds that can blow away air pollutants. The PM2.5 concentrations of Shanghai and Guangzhou are quite stable throughout the year but with some clear differences in their mean functions and variations. Refer to Liang et al. (2016) for a detailed discussion on PM2.5 patterns and weather influences in these Chinese megacities. In this paper, we focus on feature extraction and separation of long-term and large-spatial-scale patterns of PM2.5 across regions.

2.2 Beijing-Tianjin-Hebei

The Beijing-Tianjin-Hebei (BTH) region is one of the most polluted areas in the world, mostly due to the emission of primary pollutants and weather conditions (Chen et al., 2018). It is the national capital region of China, consisting of two of the most populated Chinese cities (Beijing and Tianjin) and eleven cities in Hebei Province. Serious environmental concerns were raised during preparations for the 2008 Summer Olympics in Beijing, and many studies focusing on this area have followed since then (Lin et al., 2008; Wang et al., 2009; Xu et al., 2011).

There is a clear disparity in air pollution levels within the BTH region. For example, the northern cities, including Zhangjiakou, Chengde, and Qinhuangdao, are located in a mountainous area and are only marginally affected by emissions from factories and plants, whereas the cities to the south of Beijing are burdened with emissions from steel and cement factories, coal mines, and coking plants in Hebei Province. Therefore, the regionalization of the BTH region using air pollutant data would provide significant information for both policy makers and researchers.

The BTH region has a total of 73 national monitoring stations; see Figure 2 for a map of all stations and cities. These stations are a subset of the stations in the national network described in Section 2.1 and identical to those examined in earlier studies, e.g., Chen et al. (2018). Due to the high proportions of missing values in the first half of 2013, we limit our analysis to the period from June 2013 to December 2016. We also use monthly-averaged data instead of daily measurements to eliminate any local patterns and focus on the long-term trend. There are 43 months in total.

Figure 2: Locations of 13 cities and 73 monitoring stations (marked by dots) in the Beijing-Tianjin-Hebei (BTH) region. The lines represent the administrative borders of the cities.

3 Model and Assumptions

In this section, we propose a functional mixture model with spatially correlated random effects to measure the spatio-temporal dependencies of PM2.5 concentrations. We also suppose the random functions of time are defined in a time domain $\mathcal{T}$ and sampled from locations in a spatial domain $\mathcal{D}$. Let $Y(\boldsymbol{s}_i, t_{ij})$ be the discrete observation at time $t_{ij} \in \mathcal{T}$ on the random curve of PM2.5 concentration at location $\boldsymbol{s}_i \in \mathcal{D} \subset \mathbb{R}^2$, where $i = 1, \ldots, n$, and $j = 1, \ldots, m_i$. The total number of spatial locations is $n$, and the total number of discrete observations at location $\boldsymbol{s}_i$ is $m_i$.

We denote $Z_i$ as the cluster membership of the random curve at location $\boldsymbol{s}_i$. In particular, the membership $Z_i$ is a random variable following a multinomial distribution with support $\{1, \ldots, K\}$, and $Z_i = k$ if the random curve at $\boldsymbol{s}_i$ belongs to the $k$th cluster, where $K$ is the total number of clusters. An MRF model is employed for the cluster membership to account for spatial dependence intrinsically embedded in the data. For any location $\boldsymbol{s}_i$, we assume the marginal probability $P(Z_i = k) = \pi_k$, where $\pi_k \ge 0$ for all $k$ and $\sum_{k=1}^{K} \pi_k = 1$.

3.1 Reduced-Rank Functional Mixture Model for PM2.5 Data

Conditional on the cluster membership $Z_i = k$ for $k = 1, \ldots, K$, we assume the following functional mixture model for PM2.5 measurements:

$$Y(\boldsymbol{s}_i, t_{ij}) \mid (Z_i = k) = \mu_k(t_{ij}) + \eta_k(\boldsymbol{s}_i, t_{ij}) + \epsilon_{ij}, \tag{1}$$

where $\mu_k(\cdot)$ is the mean function for the $k$th cluster, $\eta_k(\cdot, \cdot)$ is a zero-mean spatio-temporal process on $\mathcal{D} \times \mathcal{T}$ representing a spatially correlated functional random effect for the $k$th cluster, and the $\epsilon_{ij}$’s are iid measurement errors with $E(\epsilon_{ij}) = 0$ and $\operatorname{var}(\epsilon_{ij}) = \sigma^2_\epsilon$. We consider $\mu_k(\cdot) + \eta_k(\boldsymbol{s}_i, \cdot)$ as the latent and smooth random function of PM2.5 concentrations at location $\boldsymbol{s}_i$ after removing the measurement errors. This framework allows for spatially correlated random effects, making it more general than the identical random effects for all clusters adopted in James and Sugar (2003). Also, we consider $\eta_k(\boldsymbol{s}, t)$ as a spatial extension of a temporal process with a standard Karhunen-Loève expansion:

$$\eta_k(\boldsymbol{s}, t) = \sum_{q=1}^{\infty} \gamma_{q,k}(\boldsymbol{s})\, \psi_{q,k}(t), \tag{2}$$

where the $\psi_{q,k}(\cdot)$’s are orthonormal eigenfunctions known as the functional principal components (FPC), and the functional principal component score $\gamma_{q,k}(\boldsymbol{s}) = \int_{\mathcal{T}} \eta_k(\boldsymbol{s}, t)\, \psi_{q,k}(t)\, dt$ is the loading of $\eta_k(\boldsymbol{s}, t)$ on the $q$th principal component. We assume the $\gamma_{q,k}(\boldsymbol{s})$’s are zero-mean and second-order stationary random fields that are independent across $q$ and $k$.

The spatial structure of the functional data is modeled through the spatial covariance between the FPC scores. More specifically,

$$\operatorname{cov}\{\gamma_{q,k}(\boldsymbol{s}_1), \gamma_{q',k'}(\boldsymbol{s}_2)\} = \begin{cases} \sigma^2_{\gamma,q,k}\, \rho(\boldsymbol{s}_1 - \boldsymbol{s}_2; \boldsymbol{\phi}_{q,k}), & \text{if } q = q' \text{ and } k = k', \\ 0, & \text{otherwise,} \end{cases}$$

where $\sigma^2_{\gamma,1,k} \ge \sigma^2_{\gamma,2,k} \ge \ldots > 0$, and $\rho(\cdot)$ is a spatial correlation function that depends on the distance measure between two sites and some parameter vector $\boldsymbol{\phi}_{q,k}$. A wide range of spatial correlation functions can be employed to model the spatial dependence structure. For example, when the Matérn function, the most popular choice in geostatistics, is selected, $\boldsymbol{\phi}_{q,k}$ includes a range parameter and a smoothness parameter (Matérn, 1960). It is worth mentioning that the parameters $\sigma^2_{\gamma,q,k}$’s, $\boldsymbol{\phi}_{q,k}$’s, and principal components $\psi_{q,k}(\cdot)$’s can vary across $q$ and $k$. This model therefore allows for a heteroscedastic random effect structure, which is a much more realistic assumption for many real-life applications than a homoscedastic random effect structure.
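To make the covariance structure concrete, the sketch below builds the covariance matrix of one score field and draws spatially correlated scores from it. Several choices are assumptions made for illustration: the exponential correlation $\exp(-d/\phi)$ (the Matérn family with smoothness 1/2) stands in for a general $\rho$, distances are Euclidean rather than great-circle, the fields are taken to be Gaussian, and the function names are hypothetical.

```python
import numpy as np


def exponential_score_cov(coords, sigma2, phi):
    """Covariance sigma2 * exp(-d / phi) of one FPC score field (d = Euclidean distance)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return sigma2 * np.exp(-dist / phi)


def sample_scores(coords, sigma2_by_q, phi_by_q, rng):
    """Draw zero-mean Gaussian score fields, independent across components q.

    Returns an (n, Q) array whose column q has covariance exponential_score_cov(...).
    """
    n = len(coords)
    columns = []
    for sigma2, phi in zip(sigma2_by_q, phi_by_q):
        cov = exponential_score_cov(coords, sigma2, phi)
        # Small jitter keeps the Cholesky factorization numerically stable.
        chol = np.linalg.cholesky(cov + 1e-10 * np.eye(n))
        columns.append(chol @ rng.standard_normal(n))
    return np.column_stack(columns)
```

Passing different `sigma2_by_q` and `phi_by_q` values for different clusters is what the heteroscedastic structure in the text amounts to in this sketch.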

In practice, the equation in (2) is often approximated by truncating the series using the first $Q$ leading functional principal components. Then, the reduced-rank version of the functional mixture model can be written as:

$$Y(\boldsymbol{s}_i, t_{ij}) \mid (Z_i = k) = \mu_k(t_{ij}) + \sum_{q=1}^{Q} \gamma_{q,k}(\boldsymbol{s}_i)\, \psi_{q,k}(t_{ij}) + \epsilon_{ij}, \tag{3}$$

where $Q$ is considered as a tuning parameter of interest (Li et al., 2013). A data-driven method for selecting $Q$ is discussed in more detail in Section 4.3.

3.2 Markov Random Field Model for Cluster Membership

Cluster membership is critical for specifying the joint distribution of the complete data in our framework. Following Jiang and Serban (2012) with some modifications, we assume the clustering configuration follows a locally dependent MRF model. The Markov property in space implies that the state (namely the cluster membership) of any given location, $\boldsymbol{s}_i$, depends on the states of its neighboring locations, denoted by $\partial i$. This assumption is reasonable because air pollutant data are usually locally dependent, and the degree of dependence is highly influenced by geographical factors, such as the elevation of mountains. To account for the spatial dependence in the cluster membership, we model the probability mass function of one cluster membership, conditioning on its neighbors, as the Gibbs distribution:

$$P(Z_i = k \mid Z_{\partial i}) = \frac{\exp\{U_{ik}(\nu)\}}{N_i(\nu)}, \tag{4}$$

where $Z_{\partial i}$ is a vector of cluster memberships of the neighbor locations of $\boldsymbol{s}_i$, $U_{ik}(\nu) = \nu \sum_{i' \in \partial i} I(Z_{i'} = k)$ with $\nu \ge 0$ is known as the energy function, $I(\cdot)$ is an indicator function, and $N_i(\nu) = \sum_{k=1}^{K} \exp\{U_{ik}(\nu)\}$ is a normalizing constant. The energy function $U_{ik}(\nu)$ determines the spatial pattern of the entire region. A large value of $U_{ik}(\nu)$ corresponds to a spatial pattern where many spatially connected locations belong to the same cluster, whereas a small $U_{ik}(\nu)$ implies a weak spatial dependence in the cluster membership. The parameter $\nu$ reflects the degree of interaction among the nearby sites in the MRF: a large $\nu$ represents a highly spatially dependent cluster membership, and $\nu = 0$ corresponds to no spatial dependence at all, with an equal chance of belonging to any cluster. This idea of using the Gibbs distribution to model dependence structure originates from statistical physics (Kindermann and Snell, 1980) and has been frequently used in spatial statistics (Clifford, 1990).
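The conditional distribution in (4) is straightforward to evaluate once a neighbor list is available. The sketch below is illustrative only: the function names are hypothetical, clusters are coded 0 to K-1 instead of 1 to K, and `neighbors[i]` is assumed to hold the indices in $\partial i$ (for instance from a k-nearest-neighbor rule). The second function shows one Gibbs pass that resamples each label from (4).

```python
import numpy as np


def membership_conditional(i, labels, neighbors, nu, K):
    """P(Z_i = k | Z_neighbors), k = 0..K-1, with U_ik = nu * #{neighbors of i in cluster k}."""
    counts = np.bincount(labels[neighbors[i]], minlength=K)
    energy = nu * counts
    weights = np.exp(energy - energy.max())  # subtracting the max avoids overflow
    return weights / weights.sum()           # division by the sum plays the role of N_i(nu)


def gibbs_sweep(labels, neighbors, nu, K, rng):
    """One in-place pass that resamples every label from its conditional distribution."""
    for i in range(len(labels)):
        probs = membership_conditional(i, labels, neighbors, nu, K)
        labels[i] = rng.choice(K, p=probs)
    return labels
```

With `nu = 0` every cluster receives probability 1/K, which matches the no-dependence case described above.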

4 Estimation and Implementation

4.1 Spline Approximation

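We use polynomial splines to approximate and estimate the mean functions and eigenfunctions defined in Section 3. For simplicity, we assume the time domain is $\mathcal{T} = [0, 1]$. Let $\boldsymbol{B}(t) = \{b_1(t), \ldots, b_p(t)\}^T$ be a spline basis with dimension $p$ (de Boor, 2001), with equally spaced knots in $\mathcal{T}$. We can approximately write $\mu_k(t) = \boldsymbol{B}^T(t)\boldsymbol{\alpha}_k$ and $[\psi_{1,k}(t), \ldots, \psi_{Q,k}(t)] = \boldsymbol{B}^T(t)\boldsymbol{\Theta}_k$, where $\boldsymbol{\alpha}_k$ is a $p \times 1$ vector and $\boldsymbol{\Theta}_k$ is a $p \times Q$ matrix of spline coefficients. According to Zhou et al. (2008) and Zhou et al. (2010), we impose identifiability restrictions on $\boldsymbol{B}(t)$ and $\boldsymbol{\Theta}_k$ such that $\int \boldsymbol{B}(t)\boldsymbol{B}^T(t)\, dt = \boldsymbol{I}_{p \times p}$ and $\boldsymbol{\Theta}_k^T \boldsymbol{\Theta}_k = \boldsymbol{I}_{Q \times Q}$, and further require the first nonzero element of each column of $\boldsymbol{\Theta}_k$ to be positive. See Appendix 1 of Zhou et al. (2008) on how to construct a spline basis that satisfies the orthonormal constraints described above. For each station $i$, we use $\boldsymbol{\gamma}_i = [\gamma_{i1}, \ldots, \gamma_{iQ}]^T$ to represent the spatial random effect. Conditional on $\{Z_i = k, \boldsymbol{\gamma}_i\}$, the reduced-rank model (3) takes the form:

$$Y(\boldsymbol{s}_i, t_{ij}) \mid (Z_i = k, \boldsymbol{\gamma}_i) = \boldsymbol{B}^T(t_{ij})\boldsymbol{\alpha}_k + \boldsymbol{B}^T(t_{ij})\boldsymbol{\Theta}_k \boldsymbol{\gamma}_i + \epsilon_{ij}. \tag{5}$$

A discussion on the selection of the spline basis is given in Section 4.3.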
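One way to obtain a basis satisfying $\int \boldsymbol{B}(t)\boldsymbol{B}^T(t)\,dt = \boldsymbol{I}$ numerically is to orthonormalize a raw spline basis against a quadrature rule. The sketch below is not the construction of Zhou et al. (2008): it assumes a truncated power basis in place of B-splines, trapezoid quadrature on a fine grid, and a QR decomposition, and the function name is hypothetical.

```python
import numpy as np


def orthonormal_spline_basis(grid, interior_knots, degree=3):
    """Evaluate an orthonormalized spline basis on `grid`.

    Returns (B, w): B has shape (len(grid), p) with p = degree + 1 + len(interior_knots),
    and w holds trapezoid weights such that B.T @ (w[:, None] * B) is close to the identity.
    """
    grid = np.asarray(grid, dtype=float)
    # Truncated power basis: 1, t, ..., t^degree, then (t - knot)_+^degree per interior knot.
    cols = [grid ** d for d in range(degree + 1)]
    cols += [np.clip(grid - k, 0.0, None) ** degree for k in interior_knots]
    raw = np.column_stack(cols)
    # Trapezoid weights approximate the integral over the time domain.
    step = np.diff(grid)
    w = np.empty_like(grid)
    w[0], w[-1] = step[0] / 2, step[-1] / 2
    w[1:-1] = (step[:-1] + step[1:]) / 2
    # QR of the weighted basis yields columns that are orthonormal under the quadrature rule.
    q, _ = np.linalg.qr(np.sqrt(w)[:, None] * raw)
    return q / np.sqrt(w)[:, None], w
```

The columns span the same spline space as the raw basis, so mean functions and eigenfunctions can still be written as $\boldsymbol{B}^T(t)\boldsymbol{\alpha}_k$ and $\boldsymbol{B}^T(t)\boldsymbol{\Theta}_k$.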

4.2 Monte Carlo EM Algorithm

Following the standard approach for model-based functional clustering (e.g., James and Sugar, 2003), we use a likelihood-based procedure and treat both the memberships $\boldsymbol{Z}$ and the spatial random effects $\boldsymbol{\gamma}$ as latent variables. To overcome the computational challenges associated with the joint distribution of $(\boldsymbol{Z}, \boldsymbol{\gamma})$, we use the Monte Carlo EM (MCEM) algorithm based on Gibbs sampling for parameter estimation (Wei and Tanner, 1990).

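With the spline approximation, the collection of parameters that need to be estimated becomes $\boldsymbol{\Omega} = \{\sigma^2_\epsilon, \nu, \boldsymbol{\Omega}_k; k = 1, \ldots, K\}$, where $\boldsymbol{\Omega}_k = \{\boldsymbol{\alpha}_k, \boldsymbol{\Theta}_k, \boldsymbol{\phi}_{q,k}, \sigma^2_{\gamma,q,k}\}$. We represent the model in a hierarchical matrix form:

$$\begin{aligned} \boldsymbol{Y} \mid \boldsymbol{Z}, \boldsymbol{\gamma} &= \tilde{\boldsymbol{B}}^T \tilde{\boldsymbol{\alpha}}_{\boldsymbol{Z}} + \tilde{\boldsymbol{B}}^T \tilde{\boldsymbol{\Theta}}_{\boldsymbol{Z}} \boldsymbol{\gamma} + \boldsymbol{\epsilon}, \\ \boldsymbol{\gamma} \mid \boldsymbol{Z} &\sim \text{Normal}\left(\boldsymbol{0}, \tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}}\right), \\ \boldsymbol{Z} &\sim \text{Markov random fields}\,(\nu). \end{aligned}$$

Here, we use $\boldsymbol{Y}_i = \{Y(\boldsymbol{s}_i, t_{i1}), \ldots, Y(\boldsymbol{s}_i, t_{im_i})\}^T$ to denote the vector of all $m_i$ discrete observations at location $\boldsymbol{s}_i$ and then define $\boldsymbol{Y} = (\boldsymbol{Y}_1^T, \ldots, \boldsymbol{Y}_n^T)^T$. Let $\tilde{\boldsymbol{B}}^T$ be an $\tilde{n} \times np$ block diagonal matrix of spline basis functions $\boldsymbol{B}^T = \boldsymbol{B}^T(t)$, where $\tilde{n} = \sum_{i=1}^{n} m_i$ is the total sample size. For the cluster memberships, we write $\boldsymbol{Z} = (Z_1, \ldots, Z_n)^T$ as a vector of all cluster memberships with $Z_i$ taking values from 1 to $K$. This means one realization of $\boldsymbol{Z}$ is $(k_1, \ldots, k_n)^T$, where $k_i \in \{1, \cdots, K\}$ for all $i$. We define the spline coefficients $\tilde{\boldsymbol{\alpha}}_{\boldsymbol{Z}} = (\boldsymbol{\alpha}_{Z_1}^T, \ldots, \boldsymbol{\alpha}_{Z_n}^T)^T$ and $\tilde{\boldsymbol{\Theta}}_{\boldsymbol{Z}} = \operatorname{diag}(\boldsymbol{\Theta}_{Z_1}, \ldots, \boldsymbol{\Theta}_{Z_n})$. The FPC scores are $\boldsymbol{\gamma} = (\boldsymbol{\gamma}_1^T, \ldots, \boldsymbol{\gamma}_n^T)^T$, and errors are $\boldsymbol{\epsilon} = (\boldsymbol{\epsilon}_1^T, \ldots, \boldsymbol{\epsilon}_n^T)^T$, where $\boldsymbol{\epsilon}_i = (\epsilon_{i1}, \ldots, \epsilon_{im_i})^T$. The conditional covariance matrix of $\boldsymbol{\gamma}$ given the cluster membership $\boldsymbol{Z}$ is $\tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}}$, which depends on the parameters $\boldsymbol{\phi}_{q,k}$ and $\sigma^2_{\gamma,q,k}$. More computational details on $\boldsymbol{\gamma}$ and $\tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}}$ are given in Appendix A.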
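Because the hierarchical form is linear and Gaussian in the scores once the memberships are fixed, the conditional distribution of the scores given the data and memberships is again Gaussian by standard conjugacy. The sketch below computes its mean and covariance. It is a generic illustration, not the procedure of Appendix A or B: the argument names are hypothetical, `design` stands for the matrix multiplying the scores, `mean` for the fixed part of the model, and the noise is assumed to be independent Gaussian with variance `sigma2`.

```python
import numpy as np


def score_full_conditional(y, design, mean, prior_cov, sigma2):
    """Mean and covariance of gamma | Y, Z for the linear Gaussian model

        Y = mean + design @ gamma + noise,  gamma ~ N(0, prior_cov),  noise ~ N(0, sigma2 * I).
    """
    precision = np.linalg.inv(prior_cov) + design.T @ design / sigma2
    post_cov = np.linalg.inv(precision)
    post_mean = post_cov @ design.T @ (y - mean) / sigma2
    return post_mean, post_cov
```

A Gibbs sampler could draw the scores from this Gaussian and then resample the memberships given the scores. For large $nQ$, solving linear systems with a Cholesky factor would be preferable to explicit inverses.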

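Based on $f(\boldsymbol{Y}, \boldsymbol{Z}, \boldsymbol{\gamma}) = f(\boldsymbol{Y} \mid \boldsymbol{Z}, \boldsymbol{\gamma})\, f(\boldsymbol{\gamma} \mid \boldsymbol{Z})\, f(\boldsymbol{Z})$ and the assumptions made in Section 3.1, we write out the log-likelihood for the complete data:

$$\ell(\boldsymbol{\Omega}; \boldsymbol{Y}, \boldsymbol{Z}, \boldsymbol{\gamma}) = \log\left\{ f\left(\boldsymbol{Y} \mid \boldsymbol{Z}, \boldsymbol{\gamma}; \tilde{\boldsymbol{\alpha}}_{\boldsymbol{Z}}, \tilde{\boldsymbol{\Theta}}_{\boldsymbol{Z}}, \sigma^2_\epsilon\right) \right\} + \log\left\{ f\left(\boldsymbol{\gamma} \mid \boldsymbol{Z}; \tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}}\right) \right\} + \log\{ f(\boldsymbol{Z}; \nu) \}. \tag{6}$$

We adopt the pseudo-likelihood proposed in Besag (1975) to approximate the last term in (6), i.e., $\log f(\boldsymbol{Z}; \nu) \approx \sum_{i=1}^{n} \log\{ f(Z_i \mid Z_{\partial i}; \nu) \}$. Given $\boldsymbol{Z} = (Z_1, \ldots, Z_n)^T$ and $\boldsymbol{\gamma} = (\boldsymbol{\gamma}_1^T, \ldots, \boldsymbol{\gamma}_n^T)^T$, the $\boldsymbol{Y}_i$’s are conditionally independent from an $m_i$-dimensional multivariate normal distribution with mean $\boldsymbol{B}^T(\boldsymbol{\alpha}_{Z_i} + \boldsymbol{\Theta}_{Z_i}\boldsymbol{\gamma}_i)$ and variance $\sigma^2_\epsilon$. As a result, the complete log-likelihood in (6) can be written as:

$$\begin{aligned} \ell(\boldsymbol{\Omega}; \boldsymbol{Y}, \boldsymbol{Z}, \boldsymbol{\gamma}) &\approx \log\left\{ f\left(\boldsymbol{Y} \mid \boldsymbol{Z}, \boldsymbol{\gamma}; \tilde{\boldsymbol{\alpha}}_{\boldsymbol{Z}}, \tilde{\boldsymbol{\Theta}}_{\boldsymbol{Z}}, \sigma^2_\epsilon\right) \right\} + \log\left\{ f\left(\boldsymbol{\gamma} \mid \boldsymbol{Z}; \tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}}\right) \right\} + \sum_{i=1}^{n} \log\{ f(Z_i \mid Z_{\partial i}; \nu) \} \\ &\propto -\frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{K} I(Z_i = k) \left\{ m_i \log \sigma^2_\epsilon + \left\| \boldsymbol{Y}_i - \boldsymbol{B}^T(\boldsymbol{\alpha}_{Z_i} + \boldsymbol{\Theta}_{Z_i}\boldsymbol{\gamma}_i) \right\|^2 / \sigma^2_\epsilon \right\} \\ &\quad - \frac{1}{2} I(Z_1 = k_1, \ldots, Z_n = k_n) \left\{ \log\left| \tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}} \right| + \boldsymbol{\gamma}^T \tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}}^{-1} \boldsymbol{\gamma} \right\} \\ &\quad + \sum_{i=1}^{n} \sum_{k=1}^{K} I(Z_i = k) \left[ U_{ik}(\nu) - \log\{ N_i(\nu) \} \right]. \end{aligned} \tag{7}$$

The complete likelihood depends on latent random variables $\boldsymbol{Z}$’s and $\boldsymbol{\gamma}$’s, so it cannot be maximized directly. Instead, we treat these latent variables as missing data and estimate the unknown parameters using the EM algorithm by iterating between the E-step and the M-step until convergence. To determine the cluster memberships, we assign location $\boldsymbol{s}_i$ to the cluster $k$ that maximizes that location’s conditional probability $\pi_{k|i} = P(Z_i = k \mid \boldsymbol{Y}_i)$.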

4.2.1 E-Step

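In the E-step, the expected log-likelihood is first calculated:

$$Q(\boldsymbol{\Omega} \mid \boldsymbol{\Omega}^{\text{prev}}) = E\left\{ \ell(\boldsymbol{\Omega}; \boldsymbol{Y}, \boldsymbol{Z}, \boldsymbol{\gamma}) \mid \boldsymbol{Y}, \boldsymbol{\Omega}^{\text{prev}} \right\}, \tag{8}$$

where $\boldsymbol{\Omega}^{\text{prev}}$ represents the value of the parameters from the previous EM iteration.

To calculate the expectation on the right side of (8), we would need the exact joint conditional distribution $f(\boldsymbol{Z}, \boldsymbol{\gamma} \mid \boldsymbol{Y}, \boldsymbol{\Omega}^{\text{prev}})$. But because $\boldsymbol{Z}$ and $\boldsymbol{\gamma}$ are not conditionally independent, this joint distribution $f(\boldsymbol{Z}, \boldsymbol{\gamma} \mid \boldsymbol{Y}, \boldsymbol{\Omega}^{\text{prev}})$ cannot be directly approximated using the product of $f(\boldsymbol{Z} \mid \boldsymbol{Y}, \boldsymbol{\Omega}^{\text{prev}})$ and $f(\boldsymbol{\gamma} \mid \boldsymbol{Y}, \boldsymbol{\Omega}^{\text{prev}})$. This is different from James and Sugar (2003). Instead, we use the MCEM of Wei and Tanner (1990) that approximates the conditional expectation using a Monte Carlo approximation and was shown to converge to the maximum likelihood estimate under some general regularity conditions (Chan and Ledolter, 1995). More specifically, we use Gibbs sampling (Geman and Geman, 1984) based on the full conditional distributions $f(\boldsymbol{\gamma} \mid \boldsymbol{Y}, \boldsymbol{Z}, \boldsymbol{\Omega}^{\text{prev}})$ and $f(\boldsymbol{Z} \mid \boldsymbol{Y}, \boldsymbol{\gamma}, \boldsymbol{\Omega}^{\text{prev}})$ to simulate the joint conditional distribution. Then $Q(\cdot)$ in (8) can be estimated using the Monte Carlo average:

$$\hat{Q}(\boldsymbol{\Omega} \mid \boldsymbol{\Omega}^{\text{prev}}) = \frac{1}{T} \sum_{\tau=1}^{T} \ell\left(\boldsymbol{\Omega}; \boldsymbol{Y}, \boldsymbol{Z}^{(\tau)}, \boldsymbol{\gamma}^{(\tau)}\right), \tag{9}$$

where $T$ is the size of the Monte Carlo samples, and $\boldsymbol{Z}^{(\tau)}$ and $\boldsymbol{\gamma}^{(\tau)}$ are samples from the conditional distribution $(\boldsymbol{Z}, \boldsymbol{\gamma} \mid \boldsymbol{Y}, \boldsymbol{\Omega}^{\text{prev}})$ using Gibbs sampling. More details on the E-step are given in Appendix B.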
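Gibbs draws of the memberships also give a natural Monte Carlo estimate of each location's membership probabilities, which can then be used for the arg-max assignment rule. The sketch below is an assumption about how this could be done and is not taken from Appendix B: the function name is hypothetical, clusters are coded 0 to K-1, and `z_samples` is assumed to be a $T \times n$ array of sampled labels from the final iteration.

```python
import numpy as np


def classify_from_samples(z_samples, K):
    """Estimate membership probabilities from Gibbs draws and assign each location.

    z_samples: (T, n) integer array of sampled labels in 0..K-1.
    Returns (freqs, assigned): freqs has shape (n, K) and holds the share of draws in
    each cluster; assigned holds the arg-max cluster per location.
    """
    z_samples = np.asarray(z_samples)
    freqs = np.stack([(z_samples == k).mean(axis=0) for k in range(K)], axis=1)
    return freqs, freqs.argmax(axis=1)
```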

4.2.2 M-Step

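In the M-step, we update the parameter values to the values maximizing the approximated conditional expectation in (9). According to (7), we obtain:

$$\begin{aligned} \hat{Q}(\boldsymbol{\Omega} \mid \boldsymbol{\Omega}^{\text{prev}}) &= -\frac{1}{2T} \sum_{\tau=1}^{T} \sum_{i=1}^{n} \sum_{k=1}^{K} I\left(Z_i^{(\tau)} = k\right) \left\{ m_i \log \sigma^2_\epsilon + \left\| \boldsymbol{Y}_i - \boldsymbol{B}^T\left(\boldsymbol{\alpha}_k + \boldsymbol{\Theta}_k \boldsymbol{\gamma}_i^{(\tau)}\right) \right\|^2 / \sigma^2_\epsilon \right\} \\ &\quad - \frac{1}{2T} \sum_{\tau=1}^{T} I\left(Z_1^{(\tau)} = k_1, \ldots, Z_n^{(\tau)} = k_n\right) \left\{ \log\left| \tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}^{(\tau)}} \right| + \boldsymbol{\gamma}^{(\tau)T} \tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}^{(\tau)}}^{-1} \boldsymbol{\gamma}^{(\tau)} \right\} \\ &\quad + \frac{1}{T} \sum_{\tau=1}^{T} \sum_{i=1}^{n} \sum_{k=1}^{K} I\left(Z_i^{(\tau)} = k\right) \left\{ U_{ik}^{(\tau)}(\nu) - \log N_i^{(\tau)}(\nu) \right\} \\ &= \hat{Q}_1(\boldsymbol{\Omega} \mid \boldsymbol{\Omega}^{\text{prev}}) + \hat{Q}_2(\boldsymbol{\Omega} \mid \boldsymbol{\Omega}^{\text{prev}}) + \hat{Q}_3(\boldsymbol{\Omega} \mid \boldsymbol{\Omega}^{\text{prev}}). \end{aligned} \tag{10}$$

Because $\hat{Q}_1$, $\hat{Q}_2$, and $\hat{Q}_3$ depend on mutually disjoint collections of parameters in $\boldsymbol{\Omega}$, we can maximize them separately. To be more specific, $(\sigma^2_\epsilon, \boldsymbol{\alpha}_k, \boldsymbol{\Theta}_k)$ are updated by maximizing $\hat{Q}_1$, $(\boldsymbol{\phi}_{q,k}, \sigma^2_{\gamma,q,k})$ by $\hat{Q}_2$, and $\nu$ by $\hat{Q}_3$, respectively. The detailed M-step algorithm is provided in Appendix C.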
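As one worked illustration of how a component of (10) can be maximized, consider the error variance with the spline coefficients held at their current values. This derivation is a sketch that follows from the first term of (10) alone and is not a reproduction of Appendix C, which may order or combine the updates differently. Since exactly one indicator is nonzero for each location and draw, and $\sum_i m_i = \tilde{n}$, setting the derivative with respect to $\sigma^2_\epsilon$ to zero gives:

```latex
\begin{aligned}
\hat{Q}_1 &= -\frac{1}{2}\left\{ \tilde{n}\,\log\sigma^2_\epsilon
  + \frac{1}{T\,\sigma^2_\epsilon} \sum_{\tau=1}^{T}\sum_{i=1}^{n}
  \left\| \boldsymbol{Y}_i - \boldsymbol{B}^T\!\left(\boldsymbol{\alpha}_{Z_i^{(\tau)}}
  + \boldsymbol{\Theta}_{Z_i^{(\tau)}}\boldsymbol{\gamma}_i^{(\tau)}\right) \right\|^2 \right\}, \\[4pt]
\frac{\partial \hat{Q}_1}{\partial \sigma^2_\epsilon} &= 0
\;\Longrightarrow\;
\hat{\sigma}^2_\epsilon = \frac{1}{T\,\tilde{n}} \sum_{\tau=1}^{T}\sum_{i=1}^{n}
  \left\| \boldsymbol{Y}_i - \boldsymbol{B}^T\!\left(\boldsymbol{\alpha}_{Z_i^{(\tau)}}
  + \boldsymbol{\Theta}_{Z_i^{(\tau)}}\boldsymbol{\gamma}_i^{(\tau)}\right) \right\|^2 .
\end{aligned}
```

In words, the update is the average squared residual over all observations and all Monte Carlo draws.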

4.3 Tuning Parameter Selection

There are three key tuning parameters in the model: number of clusters (K), number of

FPCs (Q), and dimension of the spline basis (p). We develop a data-driven method to select

them.

The Bayesian information criterion (BIC) is one of the most popular methods for model selection. It adds a penalty term for the dimension of the parameter space to the log-likelihood function. Under our modeling framework, exact likelihood calculations using the standard EM algorithm can be quite challenging due to the presence of latent variables ($\boldsymbol{Z}$ and $\boldsymbol{\gamma}$). For a candidate model $M$, we propose to approximate its log-likelihood by the Monte Carlo average $\hat{Q}_M$ defined in (9), which is computed using Gibbs sampling in the final EM iteration. Note that this approximate likelihood coincides with the integrated likelihood introduced in Fraley and Raftery (2002),

$$f(\boldsymbol{Y} \mid \boldsymbol{\Omega}) = \int f(\boldsymbol{Y} \mid \boldsymbol{Z}, \boldsymbol{\gamma}, \boldsymbol{\Omega})\, f\left(\boldsymbol{\gamma}, \boldsymbol{Z} \mid \tilde{\boldsymbol{\Gamma}}_{\boldsymbol{Z}}, \nu\right) d\boldsymbol{\gamma}\, d\boldsymbol{Z}.$$

Through numerical approximation and Monte Carlo samples, Fraley and Raftery (2002) approximate the integrated likelihood by the BIC. Similarly, we define a Monte Carlo BIC for the model $M$:

$$\text{Monte Carlo BIC}(M) = -2\hat{Q}_M + c_M \cdot \log(\tilde{n}), \tag{11}$$

where $c_M$ is the number of parameters in $M$. The tuning parameters are then selected simultaneously by minimizing (11).
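The criterion in (11) and the selection rule can be sketched in a few lines. The function names and the dictionary layout of candidate models are assumptions for illustration; the Monte Carlo average and the parameter count of each fitted model are taken as given inputs, since how the parameters are counted is not spelled out here.

```python
import math


def monte_carlo_bic(q_hat, n_params, n_total):
    """Monte Carlo BIC: -2 * Q_hat + c_M * log(n_tilde), following (11)."""
    return -2.0 * q_hat + n_params * math.log(n_total)


def select_model(candidates, n_total):
    """Pick the candidate with the smallest Monte Carlo BIC.

    candidates: dict mapping a model label, e.g. (Q, p), to (q_hat, n_params).
    Returns (best_label, scores), where scores maps each label to its BIC value.
    """
    scores = {label: monte_carlo_bic(q, c, n_total) for label, (q, c) in candidates.items()}
    return min(scores, key=scores.get), scores
```

A grid search over the number of FPCs and the spline dimension would fit each candidate, record its two inputs, and pass the collection to `select_model`.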

Among all tuning parameters, the number of clusters (K) is the most critical and also the most challenging parameter to estimate. In real-life applications of this method, the BIC score gradually decreases with increasing K (James and Sugar, 2003; Zhou et al., 2010) and achieves its minimum when K > 30. However, such a large number of regions is not supported by any scientific evidence, and it becomes impractical to establish and implement a separate policy for each region. As a result, we use the expert knowledge of environmental scientists and do not consider K as a tuning parameter in our regionalization studies. Other implementation issues to expedite the model selection are addressed in the Supplementary Material S.1.

5 Simulation Studies

We carry out two simulation studies to compare the performance of the proposed spatial-

functional clustering methodology against other methods in the literature. The ﬁrst sim-

ulation study focuses on the homoscedastic case where diﬀerent clusters share the same

covariance structure, while the second study considers the heteroscedastic case where diﬀer-

ent clusters have diﬀerent covariance structures. For each study, we ﬁrst simulate a synthetic

dataset, carry out several clustering methods, and then evaluate their performances. This

procedure is repeated 100 times.

Two metrics are adopted to quantify the accuracy of assigning cluster memberships to

curves. The ﬁrst is the adjusted Rand index (ARI) (Hubert and Arabie, 1985), which is

an improved version of the Rand index (Rand, 1971) with expected value 0 and bounded

by ±1. It measures the similarity between the true cluster membership and the clustering

result obtained from a clustering method. A larger value of the adjusted Rand index implies a

more accurate clustering method. The second metric is the standardized Root Mean Squared Error (RMSE), defined as $\mathrm{RMSE} = \sqrt{\|\mu_k(t) - \hat{\mu}_k(t)\|^2 / \|\mu_k(t)\|^2}$, measuring the accuracy of the estimated mean pattern $\hat{\mu}_k$ against the truth $\mu_k$. We also evaluate the accuracy of the parameters and


functional principal components estimated from the model.
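The two metrics can be sketched as follows, using only the Python standard library. This is an illustration rather than the code used for the reported results: the adjusted Rand index follows the usual contingency-table formula of Hubert and Arabie (1985), and the standardized RMSE assumes that both curves are evaluated on a common grid so that the norms reduce to sums of squares. The example labelings and the 10% inflation of the mean curve are hypothetical.

```python
import math
from collections import Counter


def _pairs(m):
    """Number of unordered pairs among m items."""
    return m * (m - 1) / 2.0


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same n >= 2 items."""
    n = len(truth)
    sum_cells = sum(_pairs(c) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(_pairs(c) for c in Counter(truth).values())
    sum_cols = sum(_pairs(c) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / _pairs(n)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def standardized_rmse(mu_true, mu_hat):
    """sqrt(||mu - mu_hat||^2 / ||mu||^2) for curves on a common grid."""
    num = sum((a - b) ** 2 for a, b in zip(mu_true, mu_hat))
    den = sum(a ** 2 for a in mu_true)
    return math.sqrt(num / den)


truth = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]   # same partition with swapped labels
partial = [0, 0, 1, 1, 1, 1]     # one curve assigned to the wrong cluster
ari_same = adjusted_rand_index(truth, relabeled)
ari_partial = adjusted_rand_index(truth, partial)

grid = [j / 30 for j in range(1, 31)]
mu_true = [0.5 * math.exp(t) * math.cos(t) for t in grid]
mu_hat = [1.1 * v for v in mu_true]  # hypothetical estimate, inflated by 10%
rmse = standardized_rmse(mu_true, mu_hat)
```

Because the index depends only on the partition, swapping the cluster labels leaves it unchanged, which is why no label matching is needed before comparing a clustering result with the truth.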

5.1 Homoscedastic Case

We simulate $n = 156$ points with coordinates $s_1, s_2, \ldots, s_{156} \in \mathbb{R}^2$ over a rectangular region (107°E to 125°E, 28°N to 43°N) in North China. The cluster memberships are simulated by generating a Markov random field using Gibbs sampling with $\nu = 0.5$, $K = 2$, and a multinomial distribution. The neighbors $Z_{\partial i}$ are chosen using the 5 nearest neighbors. The synthetic data are generated from the following functional model:

$$Y_i(t) \mid (\gamma_i, Z_i = k) = \mu_k(t) + \sum_{q=1}^{Q} \gamma_{i,q}\,\psi_q(t) + \epsilon_i(t), \qquad (12)$$

for $i \in \{1, \ldots, n\}$, $t \in \{\frac{1}{30}, \frac{2}{30}, \ldots, \frac{29}{30}, 1\}$, $Q = 2$, and $k \in \{1, 2\}$. Following the simulation setup of the mean functions in Jiang and Serban (2012), we let the two cluster-dependent mean functions be $\mu_1(t) = \frac{1}{2}\exp(t)\cos(t)$ and $\mu_2(t) = \cos(\frac{5\pi}{2}t)$. The two functional principal component functions are orthogonalized by $\psi_1(t) = \sqrt{2}\sin(2\pi t)$ and $\psi_2(t) = b_3(t)$, where $b_3(t)$ is the third basis function of the 4-dimensional cubic spline basis $B(t)$ defined in Section 4.1. The FPC scores, $\gamma_q$'s, are generated with the isotropic exponential covariance structure $\mathrm{cov}(\gamma_{i,q}, \gamma_{i',q}) = \sigma^2_{\gamma,q}\exp(-\|s_i - s_{i'}\|/\phi)$ with $\phi = 1$ and $(\sigma^2_{\gamma,1}, \sigma^2_{\gamma,2}) = (7, 2)$. The error term $\epsilon_i$ is a white-noise process with variance $\sigma^2_\epsilon = 0.4$. For each simulated dataset, we use the proposed Monte Carlo BIC to determine $K$, and the true $K = 2$ is correctly selected 79% of the time. We closely monitor the convergence of the algorithm and provide additional trace plots for the MCEM iterations in the online Supplementary Material S.2.
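The membership-generation step can be sketched as a single-site Gibbs sampler on a 5-nearest-neighbor graph. This is an illustrative sketch, not our simulation code: it assumes the conditional probability $P(Z_i = k \mid Z_{\partial i}) \propto \exp\{\nu \sum_{i' \in \partial i} I(Z_{i'} = k)\}$, a uniform multinomial draw for the starting labels, Euclidean distance on the longitude and latitude coordinates, and a nearest-neighbor relation that is not symmetrized. The seed, the number of sweeps, and all function names are hypothetical.

```python
import math
import random


def k_nearest_neighbors(coords, k):
    """Indices of the k nearest sites (Euclidean distance) for every site."""
    neighbors = []
    for i, s in enumerate(coords):
        others = sorted((j for j in range(len(coords)) if j != i),
                        key=lambda j: math.dist(s, coords[j]))
        neighbors.append(others[:k])
    return neighbors


def gibbs_memberships(neighbors, n_clusters, nu, n_sweeps, rng):
    """Sample labels site by site from the assumed Markov random field."""
    n = len(neighbors)
    z = [rng.randrange(n_clusters) for _ in range(n)]  # uniform multinomial start
    for _ in range(n_sweeps):
        for i in range(n):
            counts = [0] * n_clusters
            for j in neighbors[i]:
                counts[z[j]] += 1
            # Unnormalized conditional probabilities exp(nu * neighbor count).
            weights = [math.exp(nu * c) for c in counts]
            z[i] = rng.choices(range(n_clusters), weights=weights)[0]
    return z


rng = random.Random(2020)
coords = [(rng.uniform(107, 125), rng.uniform(28, 43)) for _ in range(156)]
neighbors = k_nearest_neighbors(coords, k=5)
labels = gibbs_memberships(neighbors, n_clusters=2, nu=0.5, n_sweeps=50, rng=rng)
```

A larger $\nu$ makes neighboring sites more likely to share a label, so the simulated regions become more spatially contiguous.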

We compare our proposed spatial-functional mixture model under a Markov random ﬁeld

(SFMM-MRF) with the following clustering methods:

(a) k-means clustering,

(b) James' method (James and Sugar, 2003), a classical functional mixture model assuming the same covariance structure for the random effects across all clusters,

(c) Jiang's method (Jiang and Serban, 2012), a spatial clustering method assuming a locally dependent Markov random field model for memberships,

(d) a functional mixture model assuming independence in both the random effects and cluster memberships (FMM),

(e) a functional mixture model assuming independence in the random effects and Markov random fields for the cluster memberships (FMM-MRF), and

(f) a functional mixture model with spatially dependent random effects but independent cluster memberships (SFMM).

It is worth noting that the last three methods, namely FMM, FMM-MRF, and SFMM, are special cases of the proposed SFMM-MRF. The FMM approach can also be seen as an extension of James' method by using functional principal component analysis.

Table 1 summarizes the means and standard deviations of the adjusted Rand index and RMSE of all clustering methods based on 100 simulations. Our proposed model, SFMM-MRF, has the largest adjusted Rand index and the smallest RMSE, which shows that the proposed method outperforms the others. Table 2 demonstrates that our parameter estimation outlined in Section 4 performs reasonably well.

Table 1: Means and standard deviations (in parentheses) of the adjusted Rand index (larger is better) and the RMSE (smaller is better) using different clustering methods, based on 100 simulations for the homoscedastic case.

        k-means   Jiang     James     FMM       FMM-MRF   SFMM      SFMM-MRF
ARI     0.600     0.453     0.673     0.889     0.858     0.903     0.909
        (0.307)   (0.304)   (0.378)   (0.299)   (0.325)   (0.282)   (0.278)
RMSE    0.639     0.959     0.905     0.418     0.433     0.393     0.392
        (0.324)   (0.254)   (0.380)   (0.323)   (0.325)   (0.265)   (0.268)

Table 2: Means and standard deviations of parameter estimates of 100 simulations using the proposed method (SFMM-MRF) for the homoscedastic case. The first row shows parameters and their true values.

Parameter            φ = 1    ν = 0.5   σ²_{γ,1} = 7   σ²_{γ,2} = 2
Mean                 0.893    0.442     6.443          2.385
Standard Deviation   0.150    0.075     1.142          1.468

5.2 Heteroscedastic Case

With the same setup for spatial locations and cluster memberships, we generate another set

of synthetic data from a heteroscedastic functional model:

$$Y_i(t) \mid (Z_i = k) = \mu_k(t) + \sum_{q=1}^{Q} \gamma_{i,q,k}\,\psi_{q,k}(t) + \epsilon_i(t), \qquad (13)$$

where $\sigma^2_\epsilon = 0.4$, $\mu_1(t) = \frac{1}{2}\exp(t)\cos(t)$, and $\mu_2(t) = \cos(\frac{5\pi}{2}t)$, the same as in Section 5.1. The heteroscedastic model differs from the homoscedastic model in that the random effects and FPCs of the former depend on the cluster membership, whereas those of the latter do not. To include heteroscedasticity with reasonable complexity, we let $\psi_{1,k=1}(t) = b_2(t)$ and $\psi_{2,k=1}(t) = b_3(t)$ for Cluster 1, and $\psi_{1,k=2}(t) = b_4(t)$ and $\psi_{2,k=2}(t) = b_1(t)$ for Cluster 2. Here, $\{b_1(t), b_2(t), b_3(t), b_4(t)\}^T$ forms a 4-dimensional cubic spline basis as defined in Section 4.1. The spatial covariance functions of the FPC scores are $\mathrm{cov}(\gamma_{i,q,k}, \gamma_{i',q,k}) = \sigma^2_{\gamma,q,k}\exp(-\|s_i - s_{i'}\|/\phi)$, where $\phi = 1$, $(\sigma^2_{\gamma,1,1}, \sigma^2_{\gamma,2,1}) = (4, 1)$, and $(\sigma^2_{\gamma,1,2}, \sigma^2_{\gamma,2,2}) = (2, 0.5)$.
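To illustrate how spatially correlated FPC scores with cluster-specific variances could be drawn, the sketch below builds the exponential covariance matrix for the sites of each cluster and multiplies its Cholesky factor by standard normal noise. It is a simplified illustration under stated assumptions: scores from different clusters and different components are treated as independent, the cluster labels are drawn uniformly rather than from the Markov random field, only 30 sites are used, and the pure-Python Cholesky routine and all names are ours.

```python
import math
import random


def exponential_cov(coords, sigma2, phi):
    """Matrix with entries sigma2 * exp(-||s_i - s_j|| / phi)."""
    return [[sigma2 * math.exp(-math.dist(a, b) / phi) for b in coords]
            for a in coords]


def cholesky(a):
    """Lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][m] * low[j][m] for m in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_scores(coords, labels, sigma2, phi, rng):
    """Draw gamma[i][q]; sigma2[k][q] is the variance of component q in cluster k."""
    n_pc = len(sigma2[0])
    gamma = [[0.0] * n_pc for _ in coords]
    for k, variances in enumerate(sigma2):
        members = [i for i, z in enumerate(labels) if z == k]
        if not members:
            continue
        sub = [coords[i] for i in members]
        for q, var in enumerate(variances):
            low = cholesky(exponential_cov(sub, var, phi))
            white = [rng.gauss(0.0, 1.0) for _ in members]
            for row, i in enumerate(members):
                gamma[i][q] = sum(low[row][m] * white[m] for m in range(row + 1))
    return gamma


rng = random.Random(1)
coords = [(rng.uniform(107, 125), rng.uniform(28, 43)) for _ in range(30)]
labels = [rng.randrange(2) for _ in coords]   # placeholder labels, not an MRF draw
sigma2 = [(4.0, 1.0), (2.0, 0.5)]             # variances for Cluster 1 and Cluster 2
gamma = sample_scores(coords, labels, sigma2, phi=1.0, rng=rng)
```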

For the heteroscedastic case, we compare the heteroscedastic spatial-functional clustering

under a Markov random ﬁeld (HSFMM-MRF) with others. One special case of HSFMM-

MRF is the heteroscedastic functional spatial clustering with spatial dependence in the

random eﬀects and independence of cluster memberships (HSFMM). For each of the 100

simulations, we compare ﬁve of the seven clustering methods described in the previous sub-

section and substitute the remaining two methods using their corresponding heteroscedastic

counterparts. The adjusted Rand index and the RMSE are summarized in Table 3. On average, our proposed method HSFMM-MRF produces the largest adjusted Rand index, and its RMSE (0.284) is essentially tied with the lowest value (0.283, attained by HSFMM), indicating that it matches or outperforms the other methods. Moreover, the last three columns in Table 3 are quite similar in values, indicating that the framework of the spatial-functional mixture model is generally robust to the assumptions of spatial structures. In other words, even when the spatial structure of the data is misspecified, we can still obtain relatively good parameter estimates and clustering results. In real data analysis, this flexibility allows us to choose different methods for different purposes: either the more complicated one for feature specification or the simpler one for computational advantages.

Table 4 consists of a summary of the parameter estimates obtained from the proposed

method. Again, they perform reasonably well. We also display both the true and the

estimated mean functions and functional principal components in Figure 3, which shows the

estimated functions are very close to the true ones.

[Figure 3 panels, each plotted against Time on [0, 1]: Estimation of Mean Function 1, $\mu_1(t)$; Estimation of Mean Function 2, $\mu_2(t)$; Estimation of FPC1,1, $\psi_{1,k=1}(t)$; Estimation of FPC1,2, $\psi_{2,k=1}(t)$; Estimation of FPC2,1, $\psi_{1,k=2}(t)$; Estimation of FPC2,2, $\psi_{2,k=2}(t)$.]

Figure 3: True (solid lines) and estimated (dashed lines) functions of two mean functions

(top panels) and four functional principal components (middle and bottom panels) for the

heteroscedastic case. The shaded areas are bound by the 5th and 95th percentiles of the

estimated functions.

6 Data Analysis

6.1 China

For the city-level daily spatio-temporal PM2.5 data of 313 cities from 2015 to 2016 (refer to

Section 2.1), we apply the proposed method, more speciﬁcally the SFMM-MRF, to carry out

a regionalization analysis. To model the mean functions and eigenfunctions, we use cubic

Table 3: Means and standard deviations (in parentheses) of the adjusted Rand index and the RMSE using different clustering methods based on 100 simulations for the heteroscedastic case.

        k-means   James     Jiang     FMM       FMM-MRF   HSFMM     HSFMM-MRF
ARI     0.824     0.824     0.853     0.896     0.917     0.931     0.933
        (0.103)   (0.105)   (0.152)   (0.122)   (0.100)   (0.054)   (0.056)
RMSE    0.321     0.629     0.630     0.303     0.298     0.283     0.284
        (0.115)   (0.097)   (0.114)   (0.116)   (0.108)   (0.098)   (0.097)

Table 4: Means and standard deviations of parameter estimates of 100 simulations using the proposed method (HSFMM-MRF) for the heteroscedastic case.

Parameter            φ = 1    ν = 0.5   σ²_{γ,1,1} = 4   σ²_{γ,2,1} = 1   σ²_{γ,1,2} = 2   σ²_{γ,2,2} = 0.5
Mean                 0.729    0.437     3.440            0.875            1.844            0.509
Standard Deviation   0.267    0.071     0.774            0.244            0.468            0.170

[Figure 4 map legend: North China Plain, Northeast China Plain, Guanzhong Plain, Middle Yangtze River Plain, Jianghuai Plain, Northwest, Sichuan Basin, Pearl River Delta, Yungui Plateau.]

Figure 4: Regionalization map of China's PM2.5 using daily average PM2.5 concentrations of 313 cities from January 2015 to December 2016. Each colored symbol represents a different cluster. The total number of clusters is 9.

B-spline with 16 equally spaced interior knots. To model the cluster membership, for any

given city, we consider all cities within a 500 km radius as its neighbors since the correlation

of air pollution patterns at sites more than 500 km apart is generally weak (Gao et al., 2011).

China has a topographically diverse landscape, including the highest mountains and the

largest plateau on Earth, which aﬀects the movement of many air pollutants from one place to

another (Bryan and Adams, 2002). Two adjacent monitoring sites separated by a high

mountain may have distinctive pollution patterns and thus belong to two clusters. This

“mountain eﬀect” coupled with other geographical factors can be easily incorporated into

our model by modifying the standard Euclidean distance with a spatial deformation to obtain

a “geographical distance.” For example, the distance between two sites that are separated

by a high mountain can be set to be much greater than their Euclidean distance while

other geometric properties of the Euclidean distance are also retained. The change in the

deﬁnition of the distance metric may lead to an alteration of neighbors. This method has

been implemented in earlier works, including Sampson and Guttorp (1992), and Anderes

and Stein (2008), among others.

Alternatively, in our study, we extend the energy function $U_{ik}(\nu)$ defined in Section 3.2 by introducing a function $g(\cdot, \cdot)$ to model the geographical covariates between a site and its neighbors. More specifically, we define $\tilde{U}_{ik}(\nu) = \nu \sum_{i' \in \partial i} g(\mathbf{s}_i, \mathbf{s}_{i'}) I(Z_{i'} = k)$ and $\tilde{N}_i(\nu) = \sum_{k=1}^{K} \exp\{\tilde{U}_{ik}(\nu)\}$. The function $g(\mathbf{s}_i, \mathbf{s}_{i'})$ captures the geographical information between site $\mathbf{s}_i$ and its neighboring site $\mathbf{s}_{i'}$. For instance, consider a simple case where we set $d$ as the altitude threshold between $\mathbf{s}_i$ and $\mathbf{s}_{i'}$. If the largest altitude between $\mathbf{s}_i$ and $\mathbf{s}_{i'}$ is greater than $d$, implying the presence of an extremely high mountain between them, then $\mathbf{s}_i$ and $\mathbf{s}_{i'}$ should not be in the same cluster. For this scenario, $g(\cdot, \cdot)$ can be written as

$$g(\mathbf{s}_i, \mathbf{s}_{i'}) = \begin{cases} 0, & \text{if the largest altitude between } \mathbf{s}_i \text{ and } \mathbf{s}_{i'} \text{ is greater than } d, \\ 1, & \text{otherwise.} \end{cases}$$

Then, the conditional probability for the cluster membership becomes

$$P(Z_i = k \mid Z_{\partial i}) = \frac{\exp\{\tilde{U}_{ik}(\nu)\}}{\tilde{N}_i(\nu)} = \frac{\exp\{\nu \sum_{i' \in \partial i} g(\mathbf{s}_i, \mathbf{s}_{i'}) I(Z_{i'} = k)\}}{\sum_{l=1}^{K} \exp\{\nu \sum_{i' \in \partial i} g(\mathbf{s}_i, \mathbf{s}_{i'}) I(Z_{i'} = l)\}}.$$

In our analysis, $d$ is set to be 1 km.
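The effect of the altitude indicator on the conditional membership probabilities can be seen in a small sketch. The toy configuration below is hypothetical: site 0 has three neighbors, two in cluster 0 and one in cluster 1, and the highest point between site 0 and its third neighbor exceeds the threshold. The dictionary `max_altitude` stands in for values that would be extracted from an elevation raster; how that extraction is done is not shown here.

```python
import math


def g_altitude(max_altitude_between, threshold_km=1.0):
    """g = 0 if the highest point between two sites exceeds d, otherwise 1."""
    return 0.0 if max_altitude_between > threshold_km else 1.0


def membership_probabilities(i, labels, neighbors, max_altitude, nu, n_clusters):
    """P(Z_i = k | neighbors) under the extended energy function."""
    energy = [0.0] * n_clusters
    for j in neighbors[i]:
        energy[labels[j]] += nu * g_altitude(max_altitude[(i, j)])
    shift = max(energy)  # subtract the maximum for numerical stability
    weights = [math.exp(u - shift) for u in energy]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy configuration (altitudes in km).
labels = [0, 0, 0, 1]
neighbors = {0: [1, 2, 3]}
max_altitude = {(0, 1): 0.2, (0, 2): 0.4, (0, 3): 2.5}
probs = membership_probabilities(0, labels, neighbors, max_altitude,
                                 nu=0.5, n_clusters=2)
```

With $g = 0$ for the pair separated by the ridge, the neighbor in cluster 1 contributes nothing to the energy, so the probability of site 0 joining cluster 1 is driven only by the normalizing constant.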

The formation of PM2.5 is very complex, with many important contributing factors including meteorological conditions, population, local industry, traffic, instantaneous energy consumption, secondary chemical reactions in the atmosphere, and others. PM2.5 comprises a list of primary and secondary components that can also contribute to the study of emission sources and patterns, but unfortunately their concentrations are not collected in the monitoring network. In this study, the only accessible data are the PM2.5 concentrations obtained from the monitoring stations, and they provide partial information about local emission characteristics. We try to cluster cities based on the spatial-temporal trends observed from the PM2.5 concentrations so that more effective regional policies and strategies than the current practices may be established and implemented.

Figure 4 displays the clustering results when the number of clusters (K) is set to 9. This choice of K follows the recommendation of environmental scientists (Wang et al., 2015). We observe a clear spatial clustering with several distinct geographical regions. For example, North China Plain, Yangtze River Delta, Pearl River Delta, and Sichuan Basin are all classified into separate clusters. These regions coincide with the list published by CNEMC where air pollution is severe, and prevention and control strategies are needed (China’s State Council, 2013). Another study done by Wang et al. (2015) also reports similar regions, but our method defines regions with clearer boundaries. In addition, our method successfully combines sites that are geographically far apart yet show similar PM2.5 patterns into one cluster. For example, the Northwest cluster includes cities across five provinces: northern Shanxi, middle Inner Mongolia, Ningxia, northern Gansu, and eastern Qinghai. These cities are mostly resource-based, and their winter weather conditions are largely influenced by the northwest monsoons. We also include the clustering results from other methods in the Supplementary Material S.4 online for comparison. These methods appear to have much less clear boundaries in their regionalization maps. This “clear spatial boundary” effect is also one of the merits of the proposed method and can be mainly attributed to the Markov random field employed in this approach.

The estimated mean functions of all clusters are displayed in Figure 5. Despite the temporal trends varying across regions, there is a consistent “W” shape in almost all regions. PM2.5 concentrations are generally higher in winter than in summer, mostly due to coal burning in many parts of China. This phenomenon becomes less obvious in the Pearl River Delta and Yungui Plateau of southern China, where winters are relatively warm. Moreover, the estimated four leading functional principal components are shown in Figure 6. A spike in the first component corresponds to an increased PM2.5 level in the winter of 2016 across all regions that is not completely captured by the mean functions. The remaining components also show some seasonality, providing further evidence that PM2.5 pollution is milder in summer and worse in winter.

Figure 5: Estimated mean functions of nine clusters using China’s daily-averaged PM2.5 concentrations of 313 cities from 2015 to 2016. Same legend as in Figure 4. The observed PM2.5 concentrations are marked in grey.

[Figure 6 panels, Estimated Principal Components against Time (Day), 2015/03 to 2016/11: 1st principal component: 57.3%; 2nd principal component: 8.5%; 3rd principal component: 6.4%; 4th principal component: 5.9%.]

Figure 6: Estimated four leading functional principal components for China’s daily PM2.5 data. The first four eigenvalues explain a total of 78.1% of variation (5: 82.1%; 6: 85%; 7: 87.7%; 8: 90.0%; 9: 92.2%; 10: 93.8%).

6.2 Beijing-Tianjin-Hebei

We apply our method to analyze the monthly-averaged PM2.5 concentrations of the 73 sta-

tions in the BTH region from June 2013 to December 2016; refer to Section 2.2 for more details

on the data. A cubic spline with eight equally spaced interior knots is considered. Due to the

small area and ﬂat terrain of the BTH region, we use the five nearest neighbors to model the

Markov random ﬁelds.

We divide all stations into three regions and present the results in Figure 7. The number

of clusters K = 3 follows another study of the BTH region (Chen et al., 2018). Our results

show that stations with the lowest PM2.5 are clustered in the north – these are the stations

from Zhangjiakou, Chengde, and Qinhuangdao. The most severely polluted stations are in

the southern BTH area, including Baoding, Shijiazhuang, Hengshui, Xingtai, and Handan

that are frequently on the list of most polluted cities in China. Stations with moderate

pollution, including those from Beijing, Langfang, Tianjin, Tangshan, and Cangzhou, are

clustered together. These three regions are in agreement with Chen et al. (2018). The

northern mountainous stations are separated from the rest, demonstrating the signiﬁcance

of the “mountain eﬀect.” The estimated mean functions of the three regions are plotted in

Figure 8 where the most polluted southern region has the worst pollution in winters. This is in accord with many other researchers’ findings. Because the three mean functions are well separated from each other, we recommend separate pollution control strategies in different regions. All regions show a slowly descending trend over time, indicating the positive effects of China’s pollution-control efforts in recent years.

[Figure 7 map: axis ticks 37 to 41 (vertical) and 114 to 120 (horizontal); legend: North, Middle, South.]

Figure 7: Regionalization map of the BTH region using monthly-averaged PM2.5 concentrations of 73 stations from June 2013 to December 2016. Each color represents a different cluster.

6.3 Clustering Results for Demeaned Data

It is worth noting that functional clustering methods do not only group observations based

on the scales of the data (e.g., the three mean functions in Figure 8 have distinct scales), but

more importantly, they cluster time-dependent data according to the shapes of the temporal

patterns. Following the recommendation of one reviewer, we also demean the data to bring all stations to the same scale and then apply the proposed methodology to identify clusters of stations according to their temporal patterns. Both the global mean function $\mu_0(t)$ and the regional mean functions $\mu_k(t)$, representing the regional deviations from the global mean function, are also estimated.

[Figure 8 plot: Estimated Mean Functions against Time, 2013/09 to 2016/09; vertical axis ticks 3.5 to 5.0; legend: North, Middle, South.]

Figure 8: Estimated mean functions of three clusters using monthly-averaged PM2.5 concentrations of 73 stations from the BTH region between June 2013 and December 2016. Each color represents a different cluster.

Figure 9 displays the results using China’s demeaned data. The estimated global mean

function shows a clear “W” shape, similar to that in Figure 5, while the regional mean

functions demonstrate signiﬁcant diﬀerences in their scales and patterns despite some ﬂuc-

tuations. For example, the scales of North China Plain, Sichuan Basin, and Guanzhong Plain

are all higher than the average, yet they are clustered into separate regions because their

temporal patterns show distinct trends, with peaks and troughs at different time points.

The global and regional mean functions of BTH data are presented in Figure 10. There is a

clear seasonal pattern in the global mean function: the PM2.5 measurements are generally

higher in winter than in summer. The overall trend decreases gradually over time suggesting

air pollution control in the BTH region has been more eﬀective in 2016 than a few years ago.

The stations in the South cluster have the worst pollution in the BTH region and show a

persistently positive deviation from the global mean. The regional mean function of the

Middle cluster is mostly 0 while that of the North cluster is always negative.

Figure 9: Top Panels: Estimated global mean function $\hat{\mu}_0(t)$ using China’s PM2.5 concentrations of 313 cities from 2015 to 2016. Bottom Panels: Estimated regional mean functions of 9 clusters $\hat{\mu}_k(t)$ for $k = 1, 2, \ldots, 9$ after removing the estimated global mean. The dashed lines represent the zero lines.

Figure 10: Top Panel: Estimated global mean function $\hat{\mu}_0(t)$ using the PM2.5 data from the BTH region. Bottom Panel: Estimated regional mean functions of three clusters $\hat{\mu}_k(t)$ for $k = 1, 2, 3$ after removing the estimated global mean.

7 Concluding Remarks

In this study, we propose a novel approach to jointly model and cluster spatially dependent functional data with applications to PM2.5 concentrations collected from China and the

BTH region. Our model allows data from diﬀerent clusters to have diﬀerent mean functions

and covariance structures, and is able to incorporate spatial dependence through the FPC

scores. Markov random ﬁelds are assumed for the cluster memberships to deﬁne spatial

boundaries between regions. Our model respects the spatio-temporal characteristics of the

data. It serves as a tool not only for data clustering but also for uncertainty quantiﬁcation

and results interpretation. We use a spline basis system and a data-driven FPC analysis

approach to strike a balance between model complexity and ﬂexibility. An eﬃcient MCEM

algorithm is used to estimate model parameters, mean functions, and eigenfunctions. The

extensive simulation studies show that the proposed method is superior to other methods in


terms of cluster membership prediction and model parameter estimation.

In the analysis of the PM2.5 data, our regionalization results are not only in accordance with the findings in the literature (Wang et al., 2015; Chen et al., 2018) but also show much clearer regional boundaries, which would be helpful for policy making. In addition, the estimated mean functions and FPC functions present distinct and interpretable time-varying patterns, reveal important underlying emission features, and would be useful for pollution prevention and control. As a result, for more efficient control strategies, we recommend identifying and implementing separate interventions (e.g., adopting different control measures and pollution reduction strategies) in the nine regions of China and three regions of the BTH. Although the focus of our data analysis is on air pollution data, our clustering method can be easily extended to other environmental science or meteorological datasets with a similar structure.

As pointed out by one reviewer, PM2.5 is a complex mixture with many constituents. Including PM2.5 constituents in the study may provide a better understanding of the local emission patterns than using the particulate matter alone. Currently, the concentrations of PM2.5 constituents are not being recorded in the monitoring network. In the future, environmental organizations may consider collecting the constituents and using them as supplementary assessment indicators for more effective air pollution source control. The methodology described in this work can be implemented for the PM2.5 constituents as well.

Our approach also opens up some new research questions. For instance, one important question is how to model and cluster multivariate pollutants from multiple sources simultaneously. For the application studies in Section 6, apart from the PM2.5 measurements, we also have the concentrations of other air pollutants (e.g., PM10, O3, and SO2) at the monitoring stations. Though the particulate matters (PM2.5 and PM10) are the most important air pollutants in terms of the proportion of variation explained, it is of scientific importance to have a systematic approach to combine multiple measurements in a framework of joint modeling and clustering. Another question of interest is how to assess the uncertainty of cluster assignments. The memberships of cities or stations are determined using the “posterior” mean of a random variable; thus, we may not have a great deal of confidence in the assignments of some “transition zones” where PM2.5 patterns are highly variable. These questions and extensions call for future research.


Acknowledgment

The research was partially supported by the National Natural Science Foundation of China

(Grant No. 11871485) and China’s National Key Research Special Program (Grant No.

2016YFC0207702). The authors are also grateful for the detailed and constructive comments

from an Associate Editor and three referees.

SUPPLEMENTARY MATERIAL

Technical Details: Implementation issues related to model selection, additional results for

the Monte Carlo EM algorithm, model diagnostic results for the data analysis, and

clustering results from other methods (PDF ﬁle).

Code: R code for simulation studies and data analysis (zip ﬁle).

Dataset: City-level daily PM2.5 concentrations of China's entirety from January 2015 to

December 2016, and station-level monthly PM2.5 concentrations from 73 stations in the

BTH region from June 2013 to December 2016 (CSV ﬁle), the topographic information

including the longitude and latitude for corresponding cities and stations (CSV ﬁle),

and China’s elevation with 1km resolution (TIF ﬁle).

Appendix A Computational Details for Spatial Random Effects

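For the convenience of computation, we first re-group the elements of the spatial random effects according to the principal components, denoted as $\gamma_1$, and then, conditional on $Z = (Z_1, \ldots, Z_n)$, $\gamma_1$ is re-grouped again into $\gamma^*_Z$ based on the cluster membership. Let $O_1$ and $O_{2,Z}$ be two $nQ \times nQ$ permutation matrices. We use the following expressions:

$$\gamma = (\gamma_{11}, \ldots, \gamma_{1Q}, \ldots, \gamma_{n1}, \ldots, \gamma_{nQ})^T = O_1 \gamma_1,$$

$$\gamma_1 = (\gamma_{11}, \ldots, \gamma_{n1}, \ldots, \gamma_{1Q}, \ldots, \gamma_{nQ})^T = O_{2,Z}\, \gamma^*_Z,$$

$$\gamma^*_Z = (\gamma_{\cdot 1,1}, \ldots, \gamma_{\cdot 1,K}, \ldots, \gamma_{\cdot Q,1}, \ldots, \gamma_{\cdot Q,K})^T,$$

where for $q = 1, \ldots, Q$ and $k = 1, \ldots, K$, $\gamma_{\cdot q,k}$ collects all $\gamma_{iq}$'s that belong to the same cluster; in other words, $\gamma_{\cdot q,k} = (\gamma_{k_1 q}, \ldots, \gamma_{k_{n_k} q})^T$, where $n_k$ is the number of locations belonging to cluster $k$, and $k_1 < \cdots < k_{n_k}$ are the corresponding indices, i.e., $Z_{k_j} = k$ for $j = 1, \ldots, n_k$. Use $\Gamma_\gamma$ and $\Gamma^*_\gamma$ to represent the covariance matrices of $\gamma$ and $\gamma^*_Z$, respectively.

Let $\Gamma_{\cdot q,k}$ be the covariance matrix of $\gamma_{\cdot q,k}$; then we have $\Gamma_{\cdot q,k} = \sigma^2_{\gamma,q,k} R_k(\phi_{q,k})$ under the assumptions in Section 3.1, where $R_k(\phi_{q,k})$ is an $n_k \times n_k$ matrix with elements $R_{k,ii'}(\phi_{q,k}) = \rho(\|s_i - s_{i'}\|; \phi_{q,k})$, $i, i' = 1, \ldots, n_k$. For simplicity, we replace $\phi_{q,k}$ by a set of common parameters $\phi$ in what follows. Since $\mathrm{cov}(\gamma_{iq,k}, \gamma_{i'q',k'}) = 0$ when $q \neq q'$ or $k \neq k'$, the covariance matrix of $\gamma^*_Z$ is block diagonal, $\Gamma^*_\gamma = \mathrm{diag}(\Gamma_{\cdot 1}, \ldots, \Gamma_{\cdot Q})$, where $\Gamma_{\cdot q} = \mathrm{diag}(\Gamma_{\cdot q,1}, \ldots, \Gamma_{\cdot q,K})$ for $q = 1, \ldots, Q$. Note that $\gamma$ is just a reordering of $\gamma^*_Z$, i.e., $\gamma = O_Z \gamma^*_Z$, where $O_Z = O_1 O_{2,Z}$. It follows that the covariance matrix of $\gamma$ is $\Gamma_\gamma = O_Z \Gamma^*_\gamma O_Z^T$.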
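The re-grouping can be illustrated without forming the permutation matrices explicitly, since applying $O_Z$ amounts to re-indexing rows and columns. The sketch below builds the block-diagonal $\Gamma^*_\gamma$ in the order of $\gamma^*_Z$ and then scatters its entries into the site-major order of $\gamma$. It assumes an exponential correlation $\rho(h; \phi) = \exp(-h/\phi)$ and zero-based cluster labels; the four sites, the variances, and the function names are hypothetical.

```python
import math


def star_order(labels, n_pc, n_clusters):
    """(site, component) pairs ordered as in gamma*_Z: component, cluster, site."""
    return [(i, q) for q in range(n_pc) for k in range(n_clusters)
            for i, z in enumerate(labels) if z == k]


def covariance_of_gamma(coords, labels, sigma2, phi):
    """Return (Gamma_gamma, Gamma*_gamma); sigma2[q][k] is the variance of
    component q in cluster k."""
    n_pc, n_clusters = len(sigma2), len(sigma2[0])
    order = star_order(labels, n_pc, n_clusters)
    size = len(order)
    # Block-diagonal covariance: nonzero only within the same (q, k) block.
    cov_star = [[0.0] * size for _ in range(size)]
    for a, (i, q) in enumerate(order):
        for b, (j, r) in enumerate(order):
            if q == r and labels[i] == labels[j]:
                rho = math.exp(-math.dist(coords[i], coords[j]) / phi)
                cov_star[a][b] = sigma2[q][labels[i]] * rho
    # Re-indexing step: entry (i, q) sits at position i * Q + q in gamma.
    position = [i * n_pc + q for (i, q) in order]
    cov = [[0.0] * size for _ in range(size)]
    for a in range(size):
        for b in range(size):
            cov[position[a]][position[b]] = cov_star[a][b]
    return cov, cov_star


coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
labels = [0, 1, 0, 1]
sigma2 = [[4.0, 2.0], [1.0, 0.5]]
cov, cov_star = covariance_of_gamma(coords, labels, sigma2, phi=1.0)
```

Working with $\Gamma^*_\gamma$ is convenient because each diagonal block can be inverted separately, which is the point of the re-grouping.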

Appendix B Technical Details for the E-Step

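In this section, we provide more details on the Gibbs sampling procedure used in the E-step. The posterior distribution of $(\gamma \mid Y, Z)$ can be derived from the following joint Gaussian distribution:

$$\begin{pmatrix} Y \\ \gamma \end{pmatrix} \Big|\, Z \sim \mathrm{Normal}\left( \begin{pmatrix} \tilde{\mathbf{B}}^T \tilde{\alpha}_Z \\ 0 \end{pmatrix}, \begin{pmatrix} \tilde{\mathbf{B}}^T \tilde{\Theta}_Z \tilde{\Gamma}_Z \tilde{\Theta}_Z^T \tilde{\mathbf{B}} + \sigma^2_\epsilon I & \tilde{\mathbf{B}}^T \tilde{\Theta}_Z \tilde{\Gamma}_Z \\ \tilde{\Gamma}_Z \tilde{\Theta}_Z^T \tilde{\mathbf{B}} & \tilde{\Gamma}_Z \end{pmatrix} \right).$$

Therefore, we obtain that $\gamma \mid Y, Z, \Omega_{\text{prev}} \sim \mathrm{Normal}(e, v)$, where

$$e = E(\gamma \mid Y, Z) = \tilde{\Gamma}_Z \tilde{\Theta}_Z^T \tilde{\mathbf{B}}\, \mathrm{Var}(Y \mid Z)^{-1} (Y - \tilde{\mathbf{B}}^T \tilde{\alpha}_Z),$$

$$v = \mathrm{Var}(\gamma \mid Y, Z) = \tilde{\Gamma}_Z - \tilde{\Gamma}_Z \tilde{\Theta}_Z^T \tilde{\mathbf{B}}\, \mathrm{Var}(Y \mid Z)^{-1} \tilde{\mathbf{B}}^T \tilde{\Theta}_Z \tilde{\Gamma}_Z,$$

and $Z_i \mid \gamma_i, Y_i, \Omega_{\text{prev}} \sim \mathrm{Multinomial}(p_{i1}, \ldots, p_{iK})$, where

$$p_{ik} = \frac{f(Y_i \mid \gamma_i, Z_i = k)\, \pi_k}{\sum_{l=1}^{K} f(Y_i \mid \gamma_i, Z_i = l)\, \pi_l}.$$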
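The membership probabilities $p_{ik}$ are ratios of Gaussian densities that can underflow for long curves, so a practical implementation would typically normalize them on the log scale. The following sketch shows one way to do this under simplifying assumptions: the fitted curve for each cluster is passed in directly (in the model it would be $\mathbf{B}^T(\alpha_k + \Theta_k \gamma_i)$ evaluated on the grid), the noise is white with variance $\sigma^2_\epsilon$, and the prior weights $\pi_k$ are given. The example curve and all names are hypothetical.

```python
import math
import random


def log_gaussian_likelihood(y, fitted, sigma2):
    """log f(Y_i | gamma_i, Z_i = k) under white noise with variance sigma2."""
    rss = sum((a - b) ** 2 for a, b in zip(y, fitted))
    return -0.5 * len(y) * math.log(2.0 * math.pi * sigma2) - 0.5 * rss / sigma2


def posterior_membership(y, fitted_by_cluster, prior, sigma2):
    """p_ik proportional to f(Y_i | gamma_i, Z_i = k) * pi_k, via log-sum-exp."""
    logs = [log_gaussian_likelihood(y, fitted, sigma2) + math.log(p)
            for fitted, p in zip(fitted_by_cluster, prior)]
    shift = max(logs)
    weights = [math.exp(v - shift) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]


def draw_membership(probs, rng):
    """One multinomial draw of the cluster label."""
    return rng.choices(range(len(probs)), weights=probs)[0]


grid = [j / 30 for j in range(1, 31)]
mu1 = [0.5 * math.exp(t) * math.cos(t) for t in grid]
mu2 = [math.cos(2.5 * math.pi * t) for t in grid]
y = list(mu1)  # a noiseless curve from the first cluster, for illustration only
probs = posterior_membership(y, [mu1, mu2], prior=[0.5, 0.5], sigma2=0.4)
z = draw_membership(probs, random.Random(0))
```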

Assume we have $(Z^{(\tau-1)}, \gamma^{(\tau-1)})$ at the $(\tau-1)$th step. Using the above conditional distributions, we first generate $\gamma^{(\tau)}$ from $(\gamma^{(\tau)} \mid Y, Z^{(\tau-1)}, \Omega_{\text{prev}})$ and then $Z_i^{(\tau)}$ from $(Z_i^{(\tau)} \mid Y_i, \gamma_i^{(\tau-1)}, \Omega_{\text{prev}})$. At each E-step, we repeat these two steps of Gibbs sampling $T_0 + T$ times and omit the first $T_0$ samples. In the simulation studies, we use $T_0 = 50$ and $T = 100$. The Sherman-Woodbury-Morrison formula is also applied to invert the high-dimensional conditional variance of $Y$ given $Z$ that appears in $e$ and $v$:

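$$\left( \tilde{\mathbf{B}}^T \tilde{\Theta}_Z \tilde{\Gamma}_Z \tilde{\Theta}_Z^T \tilde{\mathbf{B}} + \sigma^2_\epsilon I \right)^{-1} = \sigma^{-2}_\epsilon I - \sigma^{-4}_\epsilon\, \tilde{\mathbf{B}}^T \tilde{\Theta}_Z \left( \tilde{\Gamma}_Z^{-1} + \sigma^{-2}_\epsilon\, \tilde{\Theta}_Z^T \tilde{\mathbf{B}} \tilde{\mathbf{B}}^T \tilde{\Theta}_Z \right)^{-1} \tilde{\Theta}_Z^T \tilde{\mathbf{B}}.$$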
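A quick numerical check of this kind of identity can be written in a few lines. The sketch below uses the standard Woodbury form $(UCU^T + \sigma^2 I)^{-1} = \sigma^{-2} I - \sigma^{-4} U (C^{-1} + \sigma^{-2} U^T U)^{-1} U^T$, where $U$ plays the role of $\tilde{\mathbf{B}}^T \tilde{\Theta}_Z$ and $C$ the role of $\tilde{\Gamma}_Z$. The matrices are small random placeholders, and the pure-Python linear algebra helpers are ours, written only to keep the example self-contained.

```python
import random


def matmul(a, b):
    """Product of two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(a)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def woodbury_inverse(u, c, sigma2):
    """(U C U^T + sigma2 I)^{-1} using only r x r inverses, where U is m x r."""
    m, r = len(u), len(u[0])
    utu = matmul(transpose(u), u)
    c_inv = inverse(c)
    inner = inverse([[c_inv[i][j] + utu[i][j] / sigma2 for j in range(r)]
                     for i in range(r)])
    correction = matmul(matmul(u, inner), transpose(u))
    return [[(float(i == j) - correction[i][j] / sigma2) / sigma2
             for j in range(m)] for i in range(m)]


rng = random.Random(0)
u = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(8)]  # placeholder U
c = [[7.0, 0.0], [0.0, 2.0]]                                     # placeholder C
sigma2 = 0.4
ucu = matmul(matmul(u, c), transpose(u))
direct = inverse([[ucu[i][j] + sigma2 * (i == j) for j in range(8)]
                  for i in range(8)])
fast = woodbury_inverse(u, c, sigma2)
max_gap = max(abs(direct[i][j] - fast[i][j]) for i in range(8) for j in range(8))
```

The gain comes from the dimensions: the direct route inverts an $m \times m$ matrix, whereas the right-hand side only inverts matrices of the (much smaller) size of $C$.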

Appendix C Technical Details for the M-Step

Following is the complete procedure for the parameter updates in the M-step.

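1. Estimation of $\alpha_k$ and $\sigma^2_\epsilon$, for $k = 1, \ldots, K$. By maximizing $\hat{Q}_1$ in (10), we are able to update $\alpha_k$ and $\sigma^2_\epsilon$:

$$\hat{\alpha}_k = \left( \sum_{\tau=1}^{T} \sum_{i=1}^{n} Z_{ik}^{(\tau)} \mathbf{B}\mathbf{B}^T \right)^{-1} \cdot \sum_{\tau=1}^{T} \left\{ \sum_{i=1}^{n} Z_{ik}^{(\tau)} \mathbf{B} \left( Y_i - \mathbf{B}^T \Theta_k \gamma_i^{(\tau)} \right) \right\},$$

$$\hat{\sigma}^2_\epsilon = \frac{1}{T\tilde{n}} \sum_{\tau=1}^{T} \left\{ \sum_{i=1}^{n} \sum_{k=1}^{K} Z_{ik}^{(\tau)} \left\| Y_i - \mathbf{B}^T \left( \alpha_k + \Theta_k \gamma_i^{(\tau)} \right) \right\|^2 \right\},$$

where we use the notation $Z_{ik}^{(\tau)} = I(Z_i^{(\tau)} = k)$ for simplicity.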

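2. Update each column of $\Theta_k$ sequentially. For $q = 1, \ldots, Q$,

$$\hat{\Theta}_{k,q} = \left( \sum_{\tau=1}^{T} \sum_{i=1}^{n} Z_{ik}^{(\tau)} \left(\gamma_{iq}^{(\tau)}\right)^2 \mathbf{B}\mathbf{B}^T \right)^{-1} \mathbf{B} \sum_{\tau=1}^{T} \left\{ \sum_{i=1}^{n} Z_{ik}^{(\tau)} \gamma_{iq}^{(\tau)} \left( Y_i - \mathbf{B}^T \alpha_k - \sum_{l \neq q} \mathbf{B}^T \Theta_{k,l} \gamma_{il}^{(\tau)} \right) \right\}.$$

Then, orthogonalize $\hat{\Theta}_k$ using a QR decomposition.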

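3. Update $\sigma^2_{\gamma,q,k}$ by maximizing $\hat{Q}_2$. To simplify the computation of partial derivatives, we use the expressions of $\gamma$, $\gamma^*_Z$, and the block diagonal matrix $\Gamma^*_\gamma$ in Appendix A. The updating formula is:

$$\hat{\sigma}^2_{\gamma,q,k} = \frac{1}{T} \sum_{\tau=1}^{T} \frac{1}{n_k}\, \gamma_{\cdot q,k}^{(\tau)T} R_k^{-1}(\phi)\, \gamma_{\cdot q,k}^{(\tau)}, \quad \text{for } q = 1, \ldots, Q.$$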

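4. Estimation of $\phi$. Denote the components of $\phi$ as $\phi_r$'s. Given the current estimates of other parameters, we optimize $\hat{Q}_2$ with respect to $\phi$; up to a constant factor, the gradient is a vector with elements

$$\frac{1}{T} \sum_{\tau=1}^{T} \left\{ \sum_{q=1}^{Q} \sum_{k=1}^{K} \left[ \mathrm{tr}\left( R_k^{-1} \frac{\partial R_k}{\partial \phi_r} \right) - \frac{1}{\sigma^2_{\gamma,q,k}}\, \mathrm{tr}\left( R_k^{-1} \frac{\partial R_k}{\partial \phi_r} R_k^{-1} \gamma_{\cdot q,k}^{(\tau)} \gamma_{\cdot q,k}^{(\tau)T} \right) \right] \right\}.$$

Due to the lack of analytic solutions, we use the Newton-Raphson method to find the solution as the updated estimate $\hat{\phi}$.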

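5. Update $\nu$ by maximizing $\hat{Q}_3$. The gradient with respect to $\nu$ is

$$\sum_{i=1}^{n} \sum_{k=1}^{K} Z_{ik}^{(\tau)} \left[ \sum_{i' \in \partial i} Z_{i'k}^{(\tau)} - \frac{\sum_{l=1}^{K} \left\{ \sum_{i' \in \partial i} Z_{i'l}^{(\tau)} \right\} \exp\left\{ \nu \sum_{i' \in \partial i} Z_{i'l}^{(\tau)} \right\}}{\sum_{l=1}^{K} \exp\left\{ \nu \sum_{i' \in \partial i} Z_{i'l}^{(\tau)} \right\}} \right].$$

The Newton-Raphson method is also used to obtain $\hat{\nu}$.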

In the updating formulas given above, we ﬁx the parameters on the right-hand side of each

equation at their current estimates obtained from the last EM iteration.
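As an illustration of the last update, the Newton-Raphson iteration for $\nu$ can be sketched as follows. This is a simplified sketch under stated assumptions: the objective is taken to be the log pseudo-likelihood $\sum_i [\nu m_{i,Z_i} - \log \sum_k \exp(\nu m_{ik})]$ with $m_{ik}$ the number of neighbors of site $i$ in cluster $k$, averaged over the Monte Carlo draws, and the geographical weights are set to $g \equiv 1$. The ring graph and the labels in the example are hypothetical.

```python
import math


def neighbor_counts(labels, neighbors, n_clusters):
    """m[i][k] = number of neighbors of site i currently assigned to cluster k."""
    counts = []
    for nbrs in neighbors:
        row = [0] * n_clusters
        for j in nbrs:
            row[labels[j]] += 1
        counts.append(row)
    return counts


def gradient_and_hessian(nu, draws, neighbors, n_clusters):
    """First and second derivatives of the averaged log pseudo-likelihood."""
    grad = hess = 0.0
    for labels in draws:
        counts = neighbor_counts(labels, neighbors, n_clusters)
        for z, row in zip(labels, counts):
            shift = max(nu * m for m in row)
            w = [math.exp(nu * m - shift) for m in row]
            total = sum(w)
            mean = sum(m * v for m, v in zip(row, w)) / total
            second = sum(m * m * v for m, v in zip(row, w)) / total
            grad += row[z] - mean          # matches the gradient in step 5
            hess -= second - mean ** 2     # minus the variance of the counts
    return grad / len(draws), hess / len(draws)


def update_nu(draws, neighbors, n_clusters, nu=0.0, tol=1e-8, max_iter=50):
    """Newton-Raphson iterations for nu, starting from the supplied value."""
    for _ in range(max_iter):
        grad, hess = gradient_and_hessian(nu, draws, neighbors, n_clusters)
        if hess == 0.0:
            break
        step = grad / hess
        nu -= step
        if abs(step) < tol:
            break
    return nu


# Hypothetical example: 10 sites on a ring, each with its two adjacent neighbors.
neighbors = [[(i - 1) % 10, (i + 1) % 10] for i in range(10)]
draws = [[0, 0, 0, 0, 1, 1, 1, 0, 1, 1]]  # a single Monte Carlo draw of labels
nu_hat = update_nu(draws, neighbors, n_clusters=2)
```

In this toy configuration three sites agree with both of their neighbors and one site disagrees with both, so the stationarity condition can be solved by hand and gives $\nu = \frac{1}{2}\log 3$, which the iteration reproduces. If every site agreed with the majority of its neighbors, the objective would be monotone in $\nu$ and the iteration would not converge to a finite value.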

References

Anderes, E. B. and Stein, M. L. (2008). Estimating deformations of isotropic Gaussian

random ﬁelds on the plane. The Annals of Statistics, 36(2):719–741.

Besag, J. (1975). Statistical analysis of non-lattice data. Journal of the Royal Statistical

Society: Series D (The Statistician), 24(3):179–195.

Bouveyron, C. and Jacques, J. (2011). Model-based clustering of time series in group-speciﬁc

functional subspaces. Advances in Data Analysis and Classiﬁcation, 5(4):281–300.

Bryan, B. and Adams, J. (2002). Three-dimensional neurointerpolation of annual mean

precipitation and temperature surfaces for China. Geographical Analysis, 34(2):93–111.

Chan, K. S. and Ledolter, J. (1995). Monte Carlo EM estimation for time series models

involving counts. Journal of the American Statistical Association, 90(429):242–252.

Chen, L., Guo, B., Huang, J., He, J., Wang, H., Zhang, S., and Chen, S. X. (2018). Assessing air-quality in Beijing-Tianjin-Hebei region: The method and mixed tales of PM2.5 and O3. Atmospheric Environment, 193:290–301.

China’s State Council (2013). The action plan for air pollution prevention and control.

http://www.gov.cn/zwgk/2013-09/12/content_2486773.htm. In Chinese.

Chiou, J.-M. and Li, P.-L. (2007). Functional clustering and identifying substructures of longitudinal data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):679–699.

Cliﬀord, P. (1990). Markov random ﬁelds in statistics. In Grimmett, G. and Welsh, D. J.,

editors, Disorder in Physical Systems, Clarendon. Oxford.

Cohen, A. J., Brauer, M., Burnett, R., et al. (2017). Estimates and 25-year trends of the

global burden of disease attributable to ambient air pollution: An analysis of data from

the Global Burden of Diseases Study 2015. The Lancet, 389(10082):1907–1918.

de Boor, C. (2001). A practical guide to splines. Springer-Verlag, New York.

Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and

density estimation. Journal of the American Statistical Association, 97(458):611–631.

Gao, H., Chen, J., Wang, B., Tan, S.-C., Lee, C. M., Yao, X., Yan, H., and Shi, J. (2011).

A study of air pollution of city clusters. Atmospheric Environment, 45(18):3069–3077.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the

Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine

Intelligence, PAMI-6(6):721–741.

Giacofci, M., Lambert-Lacroix, S., Marot, G., and Picard, F. (2013). Wavelet-based cluster-

ing for mixed-eﬀects functional models in high dimension. Biometrics, 69(1):31–40.

Giraldo, R., Delicado, P., and Mateu, J. (2012). Hierarchical clustering of spatially correlated

functional data: Clustering of spatial functional data. Statistica Neerlandica, 66(4):403–

421.

Hoek, G., Krishnan, R. M., Beelen, R., et al. (2013). Long-term air pollution exposure and

cardio- respiratory mortality: A review. Environmental Health, 12(1):43.

Huang, R.-J., Zhang, Y., Bozzetti, C., et al. (2014). High secondary aerosol contribution to

particulate pollution during haze events in China. Nature, 514(7521):218–222.

34

Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classiﬁcation, 2(1):193–

218.

Jacques, J. and Preda, C. (2013). Funclust: A curves clustering method using functional

random variables density approximation. Neurocomputing, 112:164–171.

James, G. M. and Sugar, C. A. (2003). Clustering for sparsely sampled functional data.

Journal of the American Statistical Association, 98(462):397–408.

Jiang, H. and Serban, N. (2012). Clustering random curves under spatial interdependence

with application to service accessibility. Technometrics, 54(2):108–119.

Kindermann, R. and Snell, J. L. (1980). Markov random ﬁelds and their applications. Con-

temporary Mathematics. American Mathematical Society, Providence, RI.

Lelieveld, J., Evans, J. S., Fnais, M., Giannadaki, D., and Pozzer, A. (2015). The contri-

bution of outdoor air pollution sources to premature mortality on a global scale. Nature,

525:367–371.

Li, K. (2015). Report on the Work of the Government (2015). http://english.www.gov.

cn/archive/publications/2015/03/05/content_281475066179954.htm. Delivered at

Third Session of the 12th National People’s Congress on March 5, 2015.

Li, S.-T., Chou, S.-W., and Pan, J.-J. (2000). Multi-resolution spatio-temporal data mining

for the study of air pollutant regionalization. In Proceedings of the 33rd Annual Hawaii

International Conference on System Sciences, pages 1–7.

Li, X., Zhou, W., and Chen, Y. D. (2015). Assessment of regional drought trend and risk over

China: A drought climate division perspective. Journal of Climate, 28(18):7025–7037.

Li, Y., Wang, N., and Carroll, R. J. (2013). Selecting the number of principal components

in functional data. Journal of the American Statistical Association, 108(504):1284–1294.

Liang, X., Li, S., Zhang, S., Huang, H., and Chen, S. X. (2016). PM2.5data reliability,

consistency, and air quality assessment in ﬁve Chinese cities. Journal of Geophysical

Research: Atmospheres, 121(17):10220–10236.

35

Liang, X., Zou, T., Guo, B., Li, S., Zhang, H., Zhang, S., Huang, H., and Chen, S. X. (2015).

Assessing Beijing’s PM2.5pollution: Severity, weather impact, APEC and winter heating.

Proceedings of the Royal Society A, 471(2182):20150257.

Lin, W., Xu, X., Zhang, X., and Tang, J. (2008). Contributions of pollutants from North

China Plain to surface ozone at the shangdianzi gaw station. Atmospheric Chemistry and

Physics, 8(19):5889–5898.

Mat´ern, B. (1960). Spatial Variation. Springer-Verlag, Berlin.

Pan, W. and Shen, X. (2007). Penalized model-based clustering with application to variable

selection. Journal of Machine Learning Research, 8(May):1145–1164.

Peng, J. and M¨uller, H.-G. (2008). Distance-based clustering of sparsely observed stochas-

tic processes, with applications to online auctions. The Annals of Applied Statistics,

2(3):1056–1077.

Pope, C. A. I., Burnett, R. T., Thun, M. J., and et al (2002). Lung cancer, cardiopulmonary

mortality, and long-term exposure to ﬁne particulate air pollution. Journal of American

Medical Association, 287(9):1132–1141.

Porcu, E., Bevilacqua, M., and Genton, M. G. (2016). Spatio-temporal covariance and cross-

covariance functions of the great circle distance on a sphere. Journal of the American

Statistical Association, 111(514):888–898.

Qian, W., Tang, X., and Quan, L. (2004). Regional characteristics of dust storms in China.

Atmospheric Environment, 38(29):4895–4907.

Ramsay, J. and Bernard, S. (2005). Functional data analysis. Springer series in statistics.

Springer.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of

the American Statistical Association, 66(336):846–850.

Romano, E., Balzanella, A., and Verde, R. (2013). A Regionalization Method for Spatial

Functional Data Based on Variogram Models: An Application on Environmental Data,

pages 99–108. Springer, Berlin, Heidelberg.

36

Sampson, P. D. and Guttorp, P. (1992). Nonparametric estimation of nonstationary spatial

covariance structure. Journal of the American Statistical Association, 87(417):108–119.

van Donkelaar, A., Martin, R. V., Brauer, M., et al. (2010). Global estimates of ambient ﬁne

particulate matter concentrations from satellite-based aerosol optical depth: Development

and application. Environmental Health Perspectives, 118(6):847–855.

Wang, S., Li, G., Gong, Z., et al. (2015). Spatial distribution, seasonal variation and region-

alization of PM2.5 concentrations in China. Science China Chemistry, 58(9):1435–1443.

Wang, Y., Hao, J., McElroy, M. B., et al. (2009). Ozone air quality during the 2008 Beijing

Olympics: Eﬀectiveness of emission restrictions. Atmospheric Chemistry and Physics,

9(14):5237–5251.

Wei, G. C. G. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm

and the Poor Man’s data augmentation algorithms. Journal of the American Statistical

Association, 85(411):699–704.

Xu, W. Y., Zhao, C. S., Ran, L., et al. (2011). Characteristics of pollutants and their

correlation to meteorological conditions at a suburban site in the North China Plain.

Atmospheric Chemistry and Physics, 11(9):4353–4369.

Zhang, H., Zhu, Z., and Yin, S. (2016). Identifying precipitation regimes in China using

model-based clustering of spatial functional data. In Proceedings of the Sixth International

Workshop on Climate Informatics, pages 117–120.

Zhang, S., Guo, B., Dong, A., He, J., Xu, Z., and Chen, S. X. (2017). Cautionary tales on air-

quality improvement in Beijing. Proceedings of the Royal Society A, 473(2205):20170457.

Zhang, X. Y., Wang, Y. Q., Niu, T., et al. (2012). Atmospheric aerosol compositions in

China: spatial/temporal variability, chemical signature, regional haze distribution and

comparisons with global aerosols. Atmospheric Chemistry and Physics, 12(2):779–799.

Zhou, L., Huang, J. Z., and Carroll, R. J. (2008). Joint modelling of paired sparse functional

data using principal components. Biometrika, 95(3):601–619.

37

Zhou, L., Huang, J. Z., Martinez, J. G., Maity, A., Baladandayuthapani, V., and Carroll,

R. J. (2010). Reduced rank mixed eﬀects models for spatially correlated hierarchical

functional data. Journal of the American Statistical Association, 105(489):390–400.

38